[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS

[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=520139&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-520139 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 04/Dec/20 10:38
Start Date: 04/Dec/20 10:38
Worklog Time Spent: 10m

Work Description: klcopp commented on pull request #1716:
URL: https://github.com/apache/hive/pull/1716#issuecomment-738709447

I will close this because HIVE-24403 is making HIVE-23107 etc. backwards compatible, so this change will not be needed.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 520139)
Time Spent: 7h 20m (was: 7h 10m)

> compactor.Cleaner should not set state "mark cleaned" if there are obsolete
> files in the FS
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-24444
>                 URL: https://issues.apache.org/jira/browse/HIVE-24444
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> This is an improvement on HIVE-24314, in which markCleaned() is called only
> if +any+ files are deleted by the cleaner. This could cause a problem in the
> following case:
> Say that for table_1, compaction1's cleaning was blocked by an open txn, and
> compaction is run again on the same table (compaction2). Both compaction1 and
> compaction2 could be in the "ready for cleaning" state at the same time. By
> this time the blocking open txn could be committed. When the cleaner runs,
> one of compaction1 and compaction2 will remain in the "ready for cleaning"
> state: say compaction2 is picked up by the cleaner first. The Cleaner deletes
> all obsolete files. Then compaction1 is picked up by the cleaner; the cleaner
> doesn't remove any files, and compaction1 will stay in the queue in the
> "ready for cleaning" state.
> HIVE-24291 already solves this issue, but if it isn't usable (for example if
> HMS schema changes are out of the question) then HIVE-24314 plus this change
> will fix the issue of the Cleaner not removing all obsolete files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
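The race described in the issue can be sketched as a toy simulation. This is not Hive code: the class `CleanerRace` and helper `cleanerPass` are invented for illustration, and the queue entries are plain strings. It only shows why the HIVE-24314 rule "mark cleaned only if any files were deleted" strands the second compaction entry.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model (not Hive code): two compactions on the same table are both in
// "ready for cleaning". The first cleaner pass deletes every obsolete file,
// so the second pass deletes nothing and -- under "mark cleaned only if any
// files were deleted" -- is never marked cleaned.
public class CleanerRace {
  // Deletes all obsolete files it sees and reports how many it deleted.
  static int cleanerPass(List<String> obsoleteFiles) {
    int deleted = obsoleteFiles.size();
    obsoleteFiles.clear();
    return deleted;
  }

  public static void main(String[] args) {
    List<String> obsolete = new ArrayList<>(List.of("base_1", "delta_2_2"));
    Queue<String> readyForCleaning = new ArrayDeque<>(List.of("compaction2", "compaction1"));

    // compaction2 is picked up first and deletes everything.
    readyForCleaning.poll();
    boolean marked2 = cleanerPass(obsolete) > 0;

    // compaction1 runs next; nothing is left to delete, so it is never marked cleaned.
    String stuck = readyForCleaning.poll();
    boolean marked1 = cleanerPass(obsolete) > 0;

    System.out.println(marked2 + " " + marked1 + " " + stuck);  // true false compaction1
  }
}
```

The change under review flips the condition: instead of "did I delete anything?", the cleaner asks "is anything still eventually obsolete?", so compaction1 in this scenario would also be marked cleaned.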
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS

[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=520140&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-520140 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 04/Dec/20 10:38
Start Date: 04/Dec/20 10:38
Worklog Time Spent: 10m

Work Description: klcopp closed pull request #1716:
URL: https://github.com/apache/hive/pull/1716

Issue Time Tracking
-------------------

Worklog Id: (was: 520140)
Time Spent: 7.5h (was: 7h 20m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS

[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519762&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519762 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 16:12
Start Date: 03/Dec/20 16:12
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r535372633

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

@@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();
+    // save it so that getAcidState() sees it
+    conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+    ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+    Path locPath = new Path(location);
+    AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+        Ref.from(false), false, dirSnapshots);
+    return dir.getObsolete().size();

Review comment: if it's only for versions without HIVE-23107, should it be behind a feature flag / schema version check?

Issue Time Tracking
-------------------

Worklog Id: (was: 519762)
Time Spent: 7h 10m (was: 7h)
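The check in the diff above delegates to AcidUtils.getAcidState under a "no open transactions" view. A simplified, self-contained sketch of that idea follows. The class `ObsoleteDirs` and its shadowing rule are hypothetical stand-ins, not Hive's AcidUtils: here a base or delta is treated as eventually obsolete simply when the newest base covers its high write id, and the Cleaner may "mark cleaned" only when the count reaches zero.

```java
import java.util.List;

// Simplified sketch (not AcidUtils.getAcidState): assuming no open
// transactions anywhere, any base/delta directory shadowed by the newest base
// is "eventually obsolete", so the Cleaner still has work to do until the
// count reaches 0. Directory names follow the ACID layout:
// base_<writeId> and delta_<minWriteId>_<maxWriteId>.
public class ObsoleteDirs {
  static int numEventuallyObsoleteDirs(List<String> dirNames) {
    long bestBase = -1;
    for (String d : dirNames) {
      if (d.startsWith("base_")) {
        bestBase = Math.max(bestBase, Long.parseLong(d.substring("base_".length())));
      }
    }
    int obsolete = 0;
    for (String d : dirNames) {
      if (d.startsWith("base_")) {
        if (Long.parseLong(d.substring("base_".length())) < bestBase) {
          obsolete++;                 // older base shadowed by a newer base
        }
      } else if (d.startsWith("delta_")) {
        long maxWriteId = Long.parseLong(d.split("_")[2]);
        if (maxWriteId <= bestBase) {
          obsolete++;                 // delta fully covered by the newest base
        }
      }
    }
    return obsolete;
  }

  public static void main(String[] args) {
    // base_5 shadows base_2, delta_1_1 and delta_2_5; delta_6_6 is still live.
    List<String> dirs = List.of("base_2", "base_5", "delta_1_1", "delta_2_5", "delta_6_6");
    System.out.println(numEventuallyObsoleteDirs(dirs));  // 3 -> do not mark cleaned yet
  }
}
```

This also makes the review question concrete: a compaction stays in "ready for cleaning" not because its own cleaner pass deleted nothing, but because shadowed directories still exist on the filesystem.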
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS

[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519643&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519643 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 13:37
Start Date: 03/Dec/20 13:37
Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r535232912

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

Review comment: No, this isn't necessary upstream; this change is for versions without HIVE-23107 etc. But I don't want to hurt upstream functionality with it.

Issue Time Tracking
-------------------

Worklog Id: (was: 519643)
Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS

[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519530&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519530 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 10:27
Start Date: 03/Dec/20 10:27
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r535061308

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

Review comment: I haven't checked, but if a folder is considered obsolete when there is a base / minor-compacted delta shadowing it, that's OK. But I don't see the benefit of this patch for upstream (which contains HIVE-23107); am I missing something? Also, it's going to violate the idea of the delayed cleaner, which is coupled with a specific compaction request.

Issue Time Tracking
-------------------

Worklog Id: (was: 519530)
Time Spent: 6h 50m (was: 6h 40m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS

[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519526&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519526 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 10:22
Start Date: 03/Dec/20 10:22
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r535061308

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

Review comment: I haven't checked, but if a folder is considered obsolete when there is a base / minor-compacted delta shadowing it, that's OK. But I don't see the benefit of this patch for upstream (which contains HIVE-23107); am I missing something?

Issue Time Tracking
-------------------

Worklog Id: (was: 519526)
Time Spent: 6h 40m (was: 6.5h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS

[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519523&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519523 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 10:16
Start Date: 03/Dec/20 10:16
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r535052856

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

Review comment: you can merge stuff below the retention period in this case, but that's another story, not relevant to this pull request

Issue Time Tracking
-------------------

Worklog Id: (was: 519523)
Time Spent: 6h 20m (was: 6h 10m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS

[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519524&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519524 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 10:16
Start Date: 03/Dec/20 10:16
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r535052856

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

Review comment: you can merge stuff above the retention period in this case, but that's another story, not relevant to this pull request

Issue Time Tracking
-------------------

Worklog Id: (was: 519524)
Time Spent: 6.5h (was: 6h 20m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=519522=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519522 ] ASF GitHub Bot logged work on HIVE-2: - Author: ASF GitHub Bot Created on: 03/Dec/20 10:14 Start Date: 03/Dec/20 10:14 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1716: URL: https://github.com/apache/hive/pull/1716#discussion_r535052856 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java ## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa } fs.delete(dead, true); } -return true; +// Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this +// number reaches 0. +return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0; + } + + /** + * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning + * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to + * finish, e.g. because of an open transaction at the time of compaction. + * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are + * obsolete directories, then the Cleaner has more work to do. 
+ * @param location location of table + * @return number of dirs left for the cleaner to clean – eventually + * @throws IOException + */ + private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots) + throws IOException { +ValidTxnList validTxnList = new ValidReadTxnList(); +//save it so that getAcidState() sees it +conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString()); +ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList(); +Path locPath = new Path(location); +AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList, +Ref.from(false), false, dirSnapshots); +return dir.getObsolete().size(); Review comment: you can merge stuff below retention period in this case This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519522) Time Spent: 6h 10m (was: 6h) > compactor.Cleaner should not set state "mark cleaned" if there are obsolete > files in the FS > --- > > Key: HIVE-2 > URL: https://issues.apache.org/jira/browse/HIVE-2 > Project: Hive > Issue Type: Bug >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Time Spent: 6h 10m > Remaining Estimate: 0h > > This is an improvement on HIVE-24314, in which markCleaned() is called only > if +any+ files are deleted by the cleaner. This could cause a problem in the > following case: > Say for table_1 compaction1 cleaning was blocked by an open txn, and > compaction is run again on the same table (compaction2). Both compaction1 and > compaction2 could be in "ready for cleaning" at the same time. By this time > the blocking open txn could be committed. 
When the cleaner runs, one of
> compaction1 and compaction2 will remain in the "ready for cleaning" state:
> Say compaction2 is picked up by the cleaner first. The Cleaner deletes all
> obsolete files. Then compaction1 is picked up by the cleaner; the cleaner
> doesn't remove any files and compaction1 will stay in the queue in a "ready
> for cleaning" state.
> HIVE-24291 already solves this issue, but if it isn't usable (for example if
> HMS schema changes are out of the question) then HIVE-24314 + this change will
> fix the issue of the Cleaner not removing all obsolete files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
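The check quoted in the diff hunk above boils down to: after deleting what it can, the Cleaner recounts how many directories would still be obsolete if every transaction were committed, and marks the compaction cleaned only when that count reaches zero. A minimal, self-contained sketch of that idea follows; the directory naming and the base-shadowing rule are simplified assumptions for illustration, not Hive's actual AcidUtils.getAcidState logic.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model: directories are named base_N or delta_X_Y, and a directory
// is "eventually obsolete" once the newest base shadows it (its upper write id
// is <= the base's). The Cleaner may only "mark cleaned" when this count is 0.
public class ObsoleteDirCheck {
    static long upperBound(String dir) {
        String[] parts = dir.split("_");
        return Long.parseLong(parts[parts.length - 1]);
    }

    static long highestBase(List<String> dirs) {
        long max = -1;
        for (String d : dirs) {
            if (d.startsWith("base_")) max = Math.max(max, upperBound(d));
        }
        return max;
    }

    /** Count dirs shadowed by the newest base (the newest base itself stays live). */
    static int numEventuallyObsoleteDirs(List<String> dirs) {
        long base = highestBase(dirs);
        int obsolete = 0;
        for (String d : dirs) {
            if (!d.equals("base_" + base) && upperBound(d) <= base) obsolete++;
        }
        return obsolete;
    }

    public static void main(String[] args) {
        // base_5 shadows delta_1_3 and delta_4_5; delta_6_7 is still live.
        List<String> dirs = Arrays.asList("delta_1_3", "delta_4_5", "base_5", "delta_6_7");
        System.out.println(numEventuallyObsoleteDirs(dirs)); // 2 -> do not mark cleaned yet
    }
}
```

With two shadowed deltas left, the sketch returns 2, so the compaction entry would stay in "ready for cleaning" rather than being marked cleaned.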
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=519505=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519505 ]

ASF GitHub Bot logged work on HIVE-2:
- Author: ASF GitHub Bot
- Created on: 03/Dec/20 09:25
- Start Date: 03/Dec/20 09:25
- Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716 (same Cleaner.java hunk as quoted above):
URL: https://github.com/apache/hive/pull/1716#discussion_r534995678

Review comment: In upstream it would cause problems; for the delayed cleaner it is necessary that every compaction entry cleans only its own obsoletes. By default every compaction is delayed by 15 minutes, so it is much more likely that by the time a cleaning job finally runs, there are already other jobs waiting in the queue. Each of them has its highestWriteId saved, and each will clean only up to that. You should not merge those.

Issue Time Tracking
---
Worklog Id: (was: 519505)
Time Spent: 6h (was: 5h 50m)
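pvargacl's constraint above — each queued compaction saved its own highestWriteId at compaction time and may clean only up to it — can be sketched as a filter over the obsolete directories. The names and the delta_X_Y directory format here are illustrative assumptions, not Hive's actual cleaner code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the delayed-cleaner rule: a queue entry may only remove obsolete
// directories whose upper write id is at or below the entry's highestWriteId.
public class PerEntryCleaning {
    static long upper(String dir) {
        String[] parts = dir.split("_");
        return Long.parseLong(parts[parts.length - 1]);
    }

    /** Directories this entry is allowed to remove: obsolete AND <= its highestWriteId. */
    static List<String> removableBy(long highestWriteId, List<String> obsoleteDirs) {
        return obsoleteDirs.stream()
                .filter(d -> upper(d) <= highestWriteId)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> obsolete = Arrays.asList("delta_1_3", "delta_4_5", "delta_6_7");
        // compaction1 ran earlier (highestWriteId = 5), compaction2 later (7):
        System.out.println(removableBy(5, obsolete)); // [delta_1_3, delta_4_5]
        System.out.println(removableBy(7, obsolete)); // [delta_1_3, delta_4_5, delta_6_7]
    }
}
```

This is why merging queue entries would be wrong under the delayed cleaner: the earlier entry must not delete delta_6_7, which only the later entry's highestWriteId covers.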
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=519489=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519489 ]

ASF GitHub Bot logged work on HIVE-2:
- Author: ASF GitHub Bot
- Created on: 03/Dec/20 08:52
- Start Date: 03/Dec/20 08:52
- Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716 (same Cleaner.java hunk as quoted above):
URL: https://github.com/apache/hive/pull/1716#discussion_r534951189

Review comment: I guess that's doable, and it would fix the issue at hand. The cons I can think of: it might slow down the Worker a bit, since every time a table/partition goes into the "ready for cleaning" state it needs to go through the whole compaction queue. And it might confuse users, since entries would be deleted from the compaction queue.

Issue Time Tracking
---
Worklog Id: (was: 519489)
Time Spent: 5h 50m (was: 5h 40m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=519477=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519477 ]

ASF GitHub Bot logged work on HIVE-2:
- Author: ASF GitHub Bot
- Created on: 03/Dec/20 08:22
- Start Date: 03/Dec/20 08:22
- Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716 (same Cleaner.java hunk as quoted above):
URL: https://github.com/apache/hive/pull/1716#discussion_r534898652

Review comment: Why don't we keep only 1 clean request per table/partition and just replace the existing one?

Issue Time Tracking
---
Worklog Id: (was: 519477)
Time Spent: 5h 40m (was: 5.5h)
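deniskuzZ's suggestion of keeping a single clean request per table/partition could look roughly like the following. This is a hypothetical sketch, not Hive's actual compaction queue (which lives in the HMS backend database); the key format and class names are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: at most one pending clean request per table/partition; a newer
// request simply replaces the older one for the same key.
public class CleanRequestQueue {
    // key: "db.table/partition", value: highestWriteId of the newest request
    private final Map<String, Long> pending = new LinkedHashMap<>();

    /** Enqueue a clean request, replacing any older one for the same key. */
    public void enqueue(String tablePartition, long highestWriteId) {
        pending.put(tablePartition, highestWriteId);
    }

    public int size() {
        return pending.size();
    }

    public long highestWriteIdFor(String key) {
        return pending.get(key);
    }

    public static void main(String[] args) {
        CleanRequestQueue q = new CleanRequestQueue();
        q.enqueue("default.table_1/p=1", 5);  // compaction1
        q.enqueue("default.table_1/p=1", 7);  // compaction2 replaces it
        System.out.println(q.size());                                   // 1
        System.out.println(q.highestWriteIdFor("default.table_1/p=1")); // 7
    }
}
```

The trade-off discussed in the surrounding comments applies directly: this dedup avoids entries lingering in "ready for cleaning", but silently dropping the older entry is exactly what pvargacl argues against for the delayed cleaner, where each entry must clean only up to its own highestWriteId.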
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=519472=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519472 ]

ASF GitHub Bot logged work on HIVE-2:
- Author: ASF GitHub Bot
- Created on: 03/Dec/20 08:13
- Start Date: 03/Dec/20 08:13
- Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716 (same Cleaner.java hunk as quoted above):
URL: https://github.com/apache/hive/pull/1716#discussion_r534883286

Review comment: It becomes obsolete when there is a base / minor-compacted delta shadowing it. Otherwise they are just in the currentDirectories list.

Issue Time Tracking
---
Worklog Id: (was: 519472)
Time Spent: 5.5h (was: 5h 20m)
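The shadowing rule pvargacl describes — a directory becomes obsolete only once a base or a wider (minor-compacted) delta covers it — can be sketched as a range check. This is a simplified model for illustration; Hive's real check lives in AcidUtils.getAcidState, and the write-id-range representation here is an assumption.

```java
// Sketch: delta_[lo]_[hi] is obsolete if shadowed by base_[baseId] or by any
// strictly wider compacted delta range; otherwise it stays in currentDirectories.
public class ShadowCheck {
    /** compactedDeltas holds {lo, hi} pairs produced by minor compaction. */
    static boolean isObsolete(long lo, long hi, long baseId, long[][] compactedDeltas) {
        if (hi <= baseId) {
            return true;                       // covered by the base
        }
        for (long[] d : compactedDeltas) {
            // a strictly wider range fully containing [lo, hi]
            if (d[0] <= lo && hi <= d[1] && (d[1] - d[0] > hi - lo)) {
                return true;                   // covered by a minor-compacted delta
            }
        }
        return false;                          // still live: new deltas are NOT obsolete
    }

    public static void main(String[] args) {
        long[][] compacted = { {4, 6} };       // delta_4_6 from minor compaction
        System.out.println(isObsolete(1, 2, 3, compacted)); // true: base_3 shadows delta_1_2
        System.out.println(isObsolete(4, 4, 3, compacted)); // true: delta_4_6 shadows delta_4_4
        System.out.println(isObsolete(7, 8, 3, compacted)); // false: nothing covers delta_7_8
    }
}
```

This also answers the streaming concern raised later in the thread: a freshly written delta like delta_7_8 is not shadowed by anything, so it never lands in the obsolete list and cannot by itself block marking a compaction cleaned.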
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=519462=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519462 ]

ASF GitHub Bot logged work on HIVE-2:
- Author: ASF GitHub Bot
- Created on: 03/Dec/20 08:02
- Start Date: 03/Dec/20 08:02
- Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716 (same Cleaner.java hunk as quoted above):
URL: https://github.com/apache/hive/pull/1716#discussion_r534864106

Review comment: Why won't new deltas appear in the obsolete list (getNumEventuallyObsoleteDirs) if we consider every transaction valid?

Issue Time Tracking
---
Worklog Id: (was: 519462)
Time Spent: 5h 20m (was: 5h 10m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=519448=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519448 ]

ASF GitHub Bot logged work on HIVE-2:
- Author: ASF GitHub Bot
- Created on: 03/Dec/20 07:41
- Start Date: 03/Dec/20 07:41
- Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716 (same Cleaner.java hunk as quoted above):
URL: https://github.com/apache/hive/pull/1716#discussion_r534827989

Review comment: New deltas by themselves don't stop the compaction from being marked cleaned (they are not in the obsolete list), but overlapping compactions do. I think a problematic scenario could be long-running ETL jobs alongside your streaming writes. That would mean that cleaning jobs would pile up; they would run continuously, but the low values in min_history_level would prevent them from ever cleaning up every obsolete directory and being marked cleaned :(

Issue Time Tracking
---
Worklog Id: (was: 519448)
Time Spent: 5h 10m (was: 5h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=519211=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519211 ]

ASF GitHub Bot logged work on HIVE-2:
- Author: ASF GitHub Bot
- Created on: 02/Dec/20 19:45
- Start Date: 02/Dec/20 19:45
- Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716 (same Cleaner.java hunk as quoted above):
URL: https://github.com/apache/hive/pull/1716#discussion_r534377951

Review comment: Consider Case 2: how does the situation change if instead of aborted txns we have successful ones? Imagine writes arriving continuously via streaming. Won't new deltas prevent us from cleaning up the obsolete ones and marking the cleanup operation as completed for the corresponding compaction, so that cleanup requests just pile up in the queue?

Issue Time Tracking
---
Worklog Id: (was: 519211)
Time Spent: 5h (was: 4h 50m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519172&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519172 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 02/Dec/20 18:11
            Start Date: 02/Dec/20 18:11
    Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r534377951

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();
+    // save it so that getAcidState() sees it
+    conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+    ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+    Path locPath = new Path(location);
+    AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+        Ref.from(false), false, dirSnapshots);
+    return dir.getObsolete().size();

Review comment: Considering Case 2: how does the situation change if, instead of an aborted txn, you have a successful write? Writes are coming in continuously (streaming). Won't a new delta prevent us from marking cleaning as complete for the corresponding compaction request, so that we just pile up cleanup requests in the queue?

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
    Worklog Id: (was: 519172)
    Time Spent: 4h 50m (was: 4h 40m)

> compactor.Cleaner should not set state "mark cleaned" if there are obsolete
> files in the FS
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-24444
>                 URL: https://issues.apache.org/jira/browse/HIVE-24444
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> This is an improvement on HIVE-24314, in which markCleaned() is called only
> if +any+ files are deleted by the cleaner. This could cause a problem in the
> following case:
> Say for table_1, compaction1's cleaning was blocked by an open txn, and
> compaction is run again on the same table (compaction2). Both compaction1 and
> compaction2 could be in "ready for cleaning" at the same time. By this time
> the blocking open txn could be committed. When the cleaner runs, one of
> compaction1 and compaction2 will remain in the "ready for cleaning" state:
> Say compaction2 is picked up by the cleaner first. The Cleaner deletes all
> obsolete files. Then compaction1 is picked up by the cleaner; the cleaner
> doesn't remove any files, and compaction1 will stay in the queue in a "ready
> for cleaning" state.
> HIVE-24291 already solves this issue, but if it isn't usable (for example, if
> HMS schema changes are out of the question) then HIVE-24314 + this change will
> fix the issue of the Cleaner not removing all obsolete files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
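The decision the patched removeFiles() makes can be modeled without any Hive classes. The sketch below is hypothetical and simplified: the `ObsoleteDir` record, the watermark parameter, and the method names are illustrative, not Hive's API. It shows the core idea of the hunk: after deleting the directories this compaction is allowed to remove, re-check the location as if every transaction were committed, and only report "cleaned" when nothing obsolete remains.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the patched removeFiles() decision:
// delete what the current compaction's write-id watermark allows, then
// mark cleaned only if no obsolete directory would remain under an
// "all transactions committed" view of the filesystem.
class CleanerSketch {
    // An obsolete base/delta directory, identified by its max write id.
    record ObsoleteDir(long maxWriteId) {}

    /** Deletes dirs up to the watermark; returns true iff nothing obsolete is left. */
    static boolean removeFiles(List<ObsoleteDir> obsolete, long highestCleanableWriteId) {
        obsolete.removeIf(d -> d.maxWriteId() <= highestCleanableWriteId);
        // getNumEventuallyObsoleteDirs() analogue: anything left means more work later.
        return obsolete.isEmpty();
    }

    public static void main(String[] args) {
        List<ObsoleteDir> dirs = new ArrayList<>(List.of(
                new ObsoleteDir(5),    // cleanable by this compaction
                new ObsoleteDir(10))); // blocked at compaction time by an open txn
        // The second dir survives, so this compaction stays "ready for cleaning".
        System.out.println(removeFiles(dirs, 5)); // prints "false"
    }
}
```

A later cleaner run with a higher watermark would remove the remaining directory and return true, which is exactly when the patch allows markCleaned().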
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519128&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519128 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 02/Dec/20 17:19
            Start Date: 02/Dec/20 17:19
    Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r534342957

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ (same getNumEventuallyObsoleteDirs hunk as above, comment on `return dir.getObsolete().size();`)

Review comment:
Case 1: If HIVE-23107 and its follow-ups are present, I think none of these checks are necessary, because we can be sure the Cleaner was running when it could delete everything it can. Also, if delayed cleaning is enabled, it is guaranteed that it will never delete any more obsolete directories no matter how many times it runs (see validWriteIdList.updateHighWatermark(ci.highestWriteId)). If we must choose, checking whether anything was removed does less damage.
Case 2: If those fixes are not there, I think checking for obsolete files is better than checking whether anything was removed.

Issue Time Tracking
-------------------
    Worklog Id: (was: 519128)
    Time Spent: 4h 40m (was: 4.5h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519051&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519051 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 02/Dec/20 14:50
            Start Date: 02/Dec/20 14:50
    Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r534226306

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ (same getNumEventuallyObsoleteDirs hunk as above, comment on `return dir.getObsolete().size();`)

Review comment:
Okay, I see what you mean. I removed the aborted files from the total. In general, do you think checking for obsolete files is better than checking whether we removed any files?
Case 1: assuming HIVE-23107 etc. are present in the version?
Case 2: assuming HIVE-23107 etc. are _not_ present in the version?

Issue Time Tracking
-------------------
    Worklog Id: (was: 519051)
    Time Spent: 4.5h (was: 4h 20m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518569&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518569 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 01/Dec/20 17:52
            Start Date: 01/Dec/20 17:52
    Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r533608167

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ (same getNumEventuallyObsoleteDirs hunk as above, comment on `return dir.getObsolete().size();`)

Review comment:
I think checking the aborted directories is very problematic upstream: if transactions are coming in continuously and some of them abort, the Cleaner won't be able to mark any compaction cleaned. Example:
- txns 1-10 come in, some of them writing; txnid=8 aborts
- compaction kicks in and compacts the deltas; it will have a next_txn_id of, for example, txnId=13
- other transactions come in; txnid=18 aborts
- the cleaner starts when everything under 13 is done; it can clean everything, but since txnId=14 is open it will not see txnId=18, so when it checks whether it can mark itself cleaned, it will not do so, because the check sees txnId=18 aborted
- a second compaction kicks in and will have a next_txn_id of, for example, txnId=23; other transactions come in; txnid=28 aborts
- again, the cleaner cannot mark compaction 1 or compaction 2 cleaned, as neither of them sees 28 and so cannot clean it

This will pile up, and the Cleaner only marks them cleaned if there is finally an iteration in which there was no new aborted transaction.

If you check only the obsolete files, you only miss marking a compaction cleaned when you should have in the case of two overlapping compactions, but then the second compaction will get marked as cleaned, and on the next cleaner run compaction 1 will be marked as cleaned too. This can also cause problems if the delayed cleaner is enabled, but probably fewer.

Issue Time Tracking
-------------------
    Worklog Id: (was: 518569)
    Time Spent: 4h 20m (was: 4h 10m)
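The pile-up scenario in the comment above can be boiled down to two predicates. The sketch below is a hypothetical model, not Hive code: the txn ids, set-based bookkeeping, and both policy methods are illustrative. It contrasts a "can we mark cleaned?" check that counts aborted deltas as outstanding work (which an abort after compaction keeps failing) with one that counts only obsolete directories.

```java
import java.util.Set;

// Hypothetical model of the two "can we mark cleaned?" policies debated in
// the thread. After a compaction's obsolete deltas are gone, aborted deltas
// from later transactions (txn 18, then txn 28 in the example timeline) may
// still sit on disk, hidden from cleaning by an earlier open txn.
class AbortPileup {
    /** Policy A: aborted deltas count as outstanding cleaner work. */
    static boolean canMarkCleanedCountingAborted(Set<Long> abortedOnDisk, int obsoleteLeft) {
        return obsoleteLeft == 0 && abortedOnDisk.isEmpty();
    }

    /** Policy B: only obsolete directories count as outstanding work. */
    static boolean canMarkCleanedObsoleteOnly(int obsoleteLeft) {
        return obsoleteLeft == 0;
    }

    public static void main(String[] args) {
        // Compaction 1 removed everything obsolete, but txn 18 aborted afterwards
        // and its delta cannot be cleaned yet (txn 14 is still open).
        Set<Long> abortedOnDisk = Set.of(18L);
        // Policy A keeps compaction 1 "ready for cleaning" while aborts keep arriving.
        System.out.println(canMarkCleanedCountingAborted(abortedOnDisk, 0)); // prints "false"
        // Policy B marks it cleaned: no obsolete directories remain.
        System.out.println(canMarkCleanedObsoleteOnly(0));                   // prints "true"
    }
}
```

Under continuous streaming with occasional aborts, `abortedOnDisk` is rarely empty, which is exactly the pile-up described above: Policy A only succeeds in an iteration with no new aborted transaction.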
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518544&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518544 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 01/Dec/20 17:17
            Start Date: 01/Dec/20 17:17
    Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r533584095

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ (same getNumEventuallyObsoleteDirs hunk as above, comment on `ValidTxnList validTxnList = new ValidReadTxnList();`)

Review comment:
> But if HIVE-24291 is present it shouldn't hurt.

This isn't necessarily true, since, as @pvargacl noted, there could be aborts since the compaction happened. The question here is whether this change or HIVE-24314 is worse for users if HIVE-23107 (remove MIN_HISTORY_LEVEL) etc. are not present on their version.

Issue Time Tracking
-------------------
    Worklog Id: (was: 518544)
    Time Spent: 4h 10m (was: 4h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518543&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518543 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 01/Dec/20 17:10
            Start Date: 01/Dec/20 17:10
    Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r533579277

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ (same getNumEventuallyObsoleteDirs hunk as above, comment on `ValidTxnList validTxnList = new ValidReadTxnList();`)

Review comment:
No, we don't need this change if HIVE-24291 is present. But if HIVE-24291 is present, it shouldn't hurt. CQ_NEXT_TXN_ID is stored since HIVE-23107, which also has HMS schema changes.

Issue Time Tracking
-------------------
    Worklog Id: (was: 518543)
    Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518532&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518532 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 01/Dec/20 16:57
            Start Date: 01/Dec/20 16:57
    Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r533570411

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ (same getNumEventuallyObsoleteDirs hunk as above, comment on `ValidTxnList validTxnList = new ValidReadTxnList();`)

Review comment:
Could we limit by CQ_NEXT_TXN_ID? Do we need this change in upstream master if HIVE-24291 is already present?

Issue Time Tracking
-------------------
    Worklog Id: (was: 518532)
    Time Spent: 3h 50m (was: 3h 40m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518523&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518523 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 01/Dec/20 16:35
            Start Date: 01/Dec/20 16:35
    Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r533554033

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ (same getNumEventuallyObsoleteDirs hunk as above, comment on `return dir.getObsolete().size();`)

Review comment:
Aborted directories are deleted by removeFiles() too, and there can also be obsolete directories with a higher txnId than this compaction. The point of this change is to call markCleaned() when the table is completely cleaned, not necessarily when this compaction's cleaning is done. I figured that would be better than calling markCleaned() when some obsolete or aborted file gets deleted. (See the HIVE-24444 description.)

Issue Time Tracking
-------------------
    Worklog Id: (was: 518523)
    Time Spent: 3h 40m (was: 3.5h)
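The distinction drawn in the comment above, between "mark cleaned because this run deleted something" (HIVE-24314) and "mark cleaned because the table is completely clean" (this change), can be sketched for the overlapping-compaction case from the issue description. This is a hypothetical illustration; the method names and counters are not Hive's API.

```java
// Hypothetical contrast of the two mark-cleaned policies for the
// overlapping-compaction case: compaction2 is cleaned first and deletes all
// obsolete files; compaction1 is processed next and finds nothing to do.
class MarkCleanedPolicies {
    /** HIVE-24314 policy: mark cleaned only if this cleaner run deleted something. */
    static boolean markIfDeletedAny(int deletedThisRun) {
        return deletedThisRun > 0;
    }

    /** This change's policy: mark cleaned once nothing obsolete remains on disk. */
    static boolean markIfNothingObsoleteLeft(int obsoleteRemaining) {
        return obsoleteRemaining == 0;
    }

    public static void main(String[] args) {
        // compaction2 is picked up first: deletes 3 obsolete dirs, 0 remain.
        System.out.println(markIfDeletedAny(3));          // prints "true"
        System.out.println(markIfNothingObsoleteLeft(0)); // prints "true"
        // compaction1 is picked up next: nothing left to delete.
        System.out.println(markIfDeletedAny(0));          // false: stuck in "ready for cleaning"
        System.out.println(markIfNothingObsoleteLeft(0)); // true: marked cleaned
    }
}
```

Both policies agree for compaction2; they diverge only for the second queue entry over the same already-clean table, which is exactly the stuck "ready for cleaning" record this change targets.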
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-2?focusedWorklogId=518518=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518518 ] ASF GitHub Bot logged work on HIVE-2: - Author: ASF GitHub Bot Created on: 01/Dec/20 16:29 Start Date: 01/Dec/20 16:29 Worklog Time Spent: 10m Work Description: klcopp commented on a change in pull request #1716: URL: https://github.com/apache/hive/pull/1716#discussion_r533549177 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java ## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa } fs.delete(dead, true); } -return true; +// Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this +// number reaches 0. +return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0; + } + + /** + * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning + * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to + * finish, e.g. because of an open transaction at the time of compaction. + * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are + * obsolete directories, then the Cleaner has more work to do. + * @param location location of table + * @return number of dirs left for the cleaner to clean – eventually + * @throws IOException + */ + private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots) + throws IOException { +ValidTxnList validTxnList = new ValidReadTxnList(); Review comment: Compactor txnid is stored since HIVE-24291. This patch is for versions without that change, or any recent changes to the hms schema. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 518518)
Time Spent: 3.5h (was: 3h 20m)

> compactor.Cleaner should not set state "mark cleaned" if there are obsolete
> files in the FS
> ---------------------------------------------------------------------------
>
> Key: HIVE-24444
> URL: https://issues.apache.org/jira/browse/HIVE-24444
> Project: Hive
> Issue Type: Bug
> Reporter: Karen Coppage
> Assignee: Karen Coppage
> Priority: Major
> Labels: pull-request-available
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> This is an improvement on HIVE-24314, in which markCleaned() is called only
> if +any+ files are deleted by the cleaner. This could cause a problem in the
> following case:
> Say for table_1 compaction1 cleaning was blocked by an open txn, and
> compaction is run again on the same table (compaction2). Both compaction1 and
> compaction2 could be in "ready for cleaning" at the same time. By this time
> the blocking open txn could be committed. When the cleaner runs, one of
> compaction1 and compaction2 will remain in the "ready for cleaning" state:
> Say compaction2 is picked up by the cleaner first. The Cleaner deletes all
> obsolete files. Then compaction1 is picked up by the cleaner; the cleaner
> doesn't remove any files and compaction1 will stay in the queue in a "ready
> for cleaning" state.
> HIVE-24291 already solves this issue, but if it isn't usable (for example if
> HMS schema changes are out of the question) then HIVE-24314 plus this change
> will fix the issue of the Cleaner not removing all obsolete files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
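The mark-cleaned condition under review in this thread can be sketched as a standalone model. Everything below is illustrative (directory naming, watermark arithmetic, method names); it is not Hive's real API, only the shape of the check: delete what the current visibility watermark allows, then only report success when a re-check that assumes every transaction is committed finds no obsolete directories left.

```java
import java.util.*;
import java.util.stream.Collectors;

// Minimal model of the patch: a directory is obsolete once a base with a
// write id >= its own is visible; the cleaner may only act on write ids at
// or below the watermark it was given (an open txn lowers that watermark).
class CleanerModel {
    // Directory names present in the table location, e.g. "delta_1", "base_2".
    final Set<String> dirs = new TreeSet<>();

    static int writeId(String dir) {
        return Integer.parseInt(dir.substring(dir.lastIndexOf('_') + 1));
    }

    // Directories rendered obsolete by the newest base visible at `watermark`.
    List<String> obsoleteDirs(int watermark) {
        int bestBase = dirs.stream()
                .filter(d -> d.startsWith("base_") && writeId(d) <= watermark)
                .mapToInt(CleanerModel::writeId)
                .max().orElse(-1);
        return dirs.stream()
                .filter(d -> writeId(d) < bestBase
                        || (writeId(d) == bestBase && !d.startsWith("base_")))
                .collect(Collectors.toList());
    }

    // Analogue of removeFiles(): delete what the watermark allows, then
    // re-check under an "everything committed" view (Integer.MAX_VALUE)
    // before agreeing to mark the compaction cleaned.
    boolean removeFiles(int watermark) {
        obsoleteDirs(watermark).forEach(dirs::remove);
        return obsoleteDirs(Integer.MAX_VALUE).isEmpty();
    }
}
```

With `{delta_1, delta_2, base_2}` and a watermark of 1 (an open txn hides the base), nothing is deleted and `removeFiles` returns false, so the compaction entry would stay "ready for cleaning"; once the watermark passes 2, the two deltas go and it returns true.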
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518060&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518060 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Nov/20 18:06
Start Date: 30/Nov/20 18:06
Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532794904

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,15 @@ private static String idWatermark(CompactionInfo ci) {
   }

   /**
-   * @return true if any files were removed
+   * @return true if the cleaner has removed all files rendered obsolete by compaction
    */
   private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
       throws IOException, NoSuchObjectException, MetaException {
     Path locPath = new Path(location);
+    FileSystem fs = locPath.getFileSystem(conf);
+    Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots = AcidUtils.getHdfsDirSnapshots(fs, locPath);
     AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(

Review comment: No, not with HIVE-24291. Without HIVE-24291 (which might not be usable, for example if HMS schema changes are out of the question) we still could have a pileup of the same table/partition in "ready for cleaning" in the queue. Without this change (HIVE-24444) some of them might not be deleted. The goal of this change is that, when the table does get cleaned, all of the records will be deleted.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 518060)
Time Spent: 3h 20m (was: 3h 10m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518059&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518059 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Nov/20 18:04
Start Date: 30/Nov/20 18:04
Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532794904

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,15 @@ private static String idWatermark(CompactionInfo ci) {
   }

   /**
-   * @return true if any files were removed
+   * @return true if the cleaner has removed all files rendered obsolete by compaction
    */
   private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
       throws IOException, NoSuchObjectException, MetaException {
     Path locPath = new Path(location);
+    FileSystem fs = locPath.getFileSystem(conf);
+    Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots = AcidUtils.getHdfsDirSnapshots(fs, locPath);
     AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(

Review comment: No, not with HIVE-24291. Without HIVE-24291 (which might not be usable, for example if HMS schema changes are out of the question) and without this change, we still could have a pileup of the same table/partition in "ready for cleaning" in the queue. Without this change (HIVE-24444) some of them might not be deleted. The goal of this change is that, when the table does get cleaned, all of the records will be deleted.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 518059)
Time Spent: 3h 10m (was: 3h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518057&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518057 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Nov/20 18:00
Start Date: 30/Nov/20 18:00
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532776190

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();

Review comment: Will we consider all writes as valid here? Shouldn't we limit at least by compactor txnId?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 518057)
Time Spent: 3h (was: 2h 50m)
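The question in this entry turns on what a no-argument `ValidReadTxnList` means: a validity list with no exceptions sees every transaction as committed, so any directory still obsolete under it counts as work the Cleaner can eventually do. A toy model of that semantics (illustrative only, not Hive's `ValidReadTxnList` implementation):

```java
import java.util.*;

// Toy txn-visibility list: a reader sees txn N iff N is at or below the high
// watermark and not in the open/aborted set. The "optimistic" factory below
// models the no-arg ValidReadTxnList used by the patch: everything committed.
class TxnVisibility {
    final long highWatermark;
    final Set<Long> openOrAborted;

    TxnVisibility(long highWatermark, Set<Long> openOrAborted) {
        this.highWatermark = highWatermark;
        this.openOrAborted = openOrAborted;
    }

    // "Assume no open transactions anywhere" view from the javadoc above.
    static TxnVisibility assumeAllCommitted() {
        return new TxnVisibility(Long.MAX_VALUE, Collections.emptySet());
    }

    boolean isTxnValid(long txnId) {
        return txnId <= highWatermark && !openOrAborted.contains(txnId);
    }
}
```

A delta written by txn 100 is invisible to a reader whose watermark is 90, but `assumeAllCommitted()` treats it as valid, which is exactly why the reviewer asks whether the list should be capped by the compactor's txnId.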
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518048&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518048 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Nov/20 17:35
Start Date: 30/Nov/20 17:35
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532776190

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();

Review comment: Will we consider all writes as valid here?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 518048)
Time Spent: 2h 50m (was: 2h 40m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518046&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518046 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Nov/20 17:31
Start Date: 30/Nov/20 17:31
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532772806

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,15 @@ private static String idWatermark(CompactionInfo ci) {
   }

   /**
-   * @return true if any files were removed
+   * @return true if the cleaner has removed all files rendered obsolete by compaction
    */
   private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
       throws IOException, NoSuchObjectException, MetaException {
     Path locPath = new Path(location);
+    FileSystem fs = locPath.getFileSystem(conf);
+    Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots = AcidUtils.getHdfsDirSnapshots(fs, locPath);
     AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(

Review comment: what would happen if there is a long running read-only txn + multiple compaction attempts? are we going to have multiple records in a queue for the same table/partition pending cleanup?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 518046)
Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518045&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518045 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Nov/20 17:30
Start Date: 30/Nov/20 17:30
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532772806

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,15 @@ private static String idWatermark(CompactionInfo ci) {
   }

   /**
-   * @return true if any files were removed
+   * @return true if the cleaner has removed all files rendered obsolete by compaction
    */
   private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
       throws IOException, NoSuchObjectException, MetaException {
     Path locPath = new Path(location);
+    FileSystem fs = locPath.getFileSystem(conf);
+    Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots = AcidUtils.getHdfsDirSnapshots(fs, locPath);
     AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(

Review comment: what would happen if there is a long running read-only txn + multiple compaction attempts? are we going to have multiple records in a queue for the same table/partition pending cleanup?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 518045)
Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=518043&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518043 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Nov/20 17:18
Start Date: 30/Nov/20 17:18
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532764005

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,15 @@ private static String idWatermark(CompactionInfo ci) {
   }

   /**
-   * @return true if any files were removed
+   * @return true if the cleaner has removed all files rendered obsolete by compaction
    */
   private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
       throws IOException, NoSuchObjectException, MetaException {
     Path locPath = new Path(location);
+    FileSystem fs = locPath.getFileSystem(conf);
+    Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots = AcidUtils.getHdfsDirSnapshots(fs, locPath);
     AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(

Review comment: could be replaced with above fs variable

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 518043)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517915&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517915 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Nov/20 13:01
Start Date: 30/Nov/20 13:01
Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532579637

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,14 @@ private static String idWatermark(CompactionInfo ci) {
   }

   /**
-   * @return true if any files were removed
+   * @return true if the cleaner has removed all files rendered obsolete by compaction
    */
   private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
       throws IOException, NoSuchObjectException, MetaException {
     Path locPath = new Path(location);
+    Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots = null;
     AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(
-        false), false);
+        false), false, dirSnapshots);

Review comment: Oh, sorry, I was talking nonsense. Yes, we should not pass null in. I thought an empty map, but I checked the code and that would not get filled up either. There is a working example in AcidUtils.getAcidFilesForStats; you should create the snapshot beforehand.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 517915)
Time Spent: 2h 10m (was: 2h)
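The suggestion in this entry, creating the snapshot beforehand rather than passing null, is a scan-once/reuse pattern. A minimal sketch of the idea, with a counter standing in for the expensive filesystem listing (all names here are illustrative stand-ins, not `AcidUtils`' real signatures):

```java
import java.util.*;

// The point of the review comment: take the directory snapshot once, before
// calling getAcidState, then hand the same map to every consumer -- here both
// a "getAcidState"-style read and the post-clean recount -- so the (fake)
// filesystem is listed exactly once.
class SnapshotReuse {
    int listings = 0; // how many times the simulated filesystem was scanned

    // Stand-in for an expensive recursive listing of the table location.
    Map<String, List<String>> takeSnapshot() {
        listings++;
        Map<String, List<String>> snap = new HashMap<>();
        snap.put("base_2", Arrays.asList("bucket_00000"));
        snap.put("delta_3_3", Arrays.asList("bucket_00000"));
        return snap;
    }

    // Any number of consumers can now work off the cached snapshot.
    static int countDirs(Map<String, List<String>> snapshot, String prefix) {
        int n = 0;
        for (String dir : snapshot.keySet()) {
            if (dir.startsWith(prefix)) n++;
        }
        return n;
    }
}
```

Two consumers querying the same snapshot leave `listings` at 1, which is the behavior the reviewer is asking for; passing null and letting each call rebuild its own view would double the scans.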
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517913&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517913 ]

ASF GitHub Bot logged work on HIVE-24444:
- Author: ASF GitHub Bot
- Created on: 30/Nov/20 12:59
- Start Date: 30/Nov/20 12:59
- Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532577307

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
 }
 fs.delete(dead, true);
 }
-return true;
+// Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+// number reaches 0.
+return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+ }
+
+ /**
+ * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+ * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+ * finish, e.g. because of an open transaction at the time of compaction.
+ * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+ * obsolete directories, then the Cleaner has more work to do.
+ * @param location location of table
+ * @return number of dirs left for the cleaner to clean – eventually
+ * @throws IOException
+ */
+ private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots)
+ throws IOException {
+ValidTxnList validTxnList = new ValidReadTxnList();
+//save it so that getAcidState() sees it
+conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+Path locPath = new Path(location);
+AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+Ref.from(false), false, dirSnapshots);
+return dir.getObsolete().size();

Review comment: there could be deltas with higher txnId than compaction :) If we won't handle this, we might get uncleaned aborts

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
- Worklog Id: (was: 517913)
- Time Spent: 2h (was: 1h 50m)

> compactor.Cleaner should not set state "mark cleaned" if there are obsolete
> files in the FS
> ---
>
> Key: HIVE-24444
> URL: https://issues.apache.org/jira/browse/HIVE-24444
> Project: Hive
> Issue Type: Bug
> Reporter: Karen Coppage
> Assignee: Karen Coppage
> Priority: Major
> Labels: pull-request-available
> Time Spent: 2h
> Remaining Estimate: 0h
>
> This is an improvement on HIVE-24314, in which markCleaned() is called only
> if +any+ files are deleted by the cleaner. This could cause a problem in the
> following case:
> Say for table_1 compaction1 cleaning was blocked by an open txn, and
> compaction is run again on the same table (compaction2). Both compaction1 and
> compaction2 could be in "ready for cleaning" at the same time. By this time
> the blocking open txn could be committed. When the cleaner runs, one of
> compaction1 and compaction2 will remain in the "ready for cleaning" state:
> Say compaction2 is picked up by the cleaner first. The Cleaner deletes all
> obsolete files. Then compaction1 is picked up by the cleaner; the cleaner
> doesn't remove any files and compaction1 will stay in the queue in a "ready
> for cleaning" state.
> HIVE-24291 already solves this issue but if it isn't usable (for example if
> HMS schema changes are out of the question) then HIVE-24314 + this change
> will fix the issue of the Cleaner not removing all obsolete files.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
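The "mark cleaned only when nothing is left" check discussed in this review can be sketched with a self-contained, simplified model. This is not Hive's actual implementation: the class and method names below are hypothetical, and the directory-name parsing (`base_N`, `delta_minWid_maxWid`) is only an illustration of what `AcidUtils.getAcidState` computes when given a fully open `ValidReadTxnList`, i.e. pretending every transaction has committed and counting what would then be obsolete.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the "eventually obsolete" check:
// assume every transaction has committed and count the directories a
// future Cleaner run could still delete.
class EventuallyObsolete {

  // Returns how many base/delta dirs are covered by the newest base and are
  // therefore still waiting to be cleaned.
  static int countEventuallyObsoleteDirs(List<String> dirNames) {
    long bestBase = -1;
    for (String d : dirNames) {
      if (d.startsWith("base_")) {
        bestBase = Math.max(bestBase, Long.parseLong(d.substring("base_".length())));
      }
    }
    int obsolete = 0;
    for (String d : dirNames) {
      if (d.startsWith("delta_")) {
        long maxWid = Long.parseLong(d.split("_")[2]);
        if (maxWid <= bestBase) {
          obsolete++;            // fully covered by the newest base
        }
      } else if (d.startsWith("base_")
          && Long.parseLong(d.substring("base_".length())) < bestBase) {
        obsolete++;              // an older base is also obsolete
      }
    }
    return obsolete;
  }

  public static void main(String[] args) {
    // base_5 covers delta_1_2 and delta_3_5; delta_6_7 is still live.
    List<String> dirs = Arrays.asList("base_5", "delta_1_2", "delta_3_5", "delta_6_7");
    System.out.println(countEventuallyObsoleteDirs(dirs)); // prints 2
  }
}
```

In the patch this count drives the return value of removeFiles: markCleaned() is called only when the count reaches 0, mirroring `return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;`.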
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517908&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517908 ]

ASF GitHub Bot logged work on HIVE-24444:
- Author: ASF GitHub Bot
- Created on: 30/Nov/20 12:54
- Start Date: 30/Nov/20 12:54
- Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532575138

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,14 @@ private static String idWatermark(CompactionInfo ci) {
 }

 /**
- * @return true if any files were removed
+ * @return true if the cleaner has removed all files rendered obsolete by compaction
 */
 private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
 throws IOException, NoSuchObjectException, MetaException {
 Path locPath = new Path(location);
+Map dirSnapshots = null;
 AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(
-false), false);
+false), false, dirSnapshots);

Review comment: how exactly?

Issue Time Tracking
- Worklog Id: (was: 517908)
- Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517897&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517897 ]

ASF GitHub Bot logged work on HIVE-24444:
- Author: ASF GitHub Bot
- Created on: 30/Nov/20 12:36
- Start Date: 30/Nov/20 12:36
- Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532564972

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
 }
 fs.delete(dead, true);
 }
-return true;
+// Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+// number reaches 0.
+return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+ }
+
+ /**
+ * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+ * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+ * finish, e.g. because of an open transaction at the time of compaction.
+ * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+ * obsolete directories, then the Cleaner has more work to do.
+ * @param location location of table
+ * @return number of dirs left for the cleaner to clean – eventually
+ * @throws IOException
+ */
+ private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots)
+ throws IOException {
+ValidTxnList validTxnList = new ValidReadTxnList();
+//save it so that getAcidState() sees it
+conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+Path locPath = new Path(location);
+AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+Ref.from(false), false, dirSnapshots);
+return dir.getObsolete().size();

Review comment: You should not: there can be aborts with a higher txnId than the compaction. This check will see those, but the cleaner will never see them, so it would never finish its job.

Issue Time Tracking
- Worklog Id: (was: 517897)
- Time Spent: 1.5h (was: 1h 20m)
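The reviewers' concern above can be sketched with a hypothetical filter (names and parsing are illustrative, not Hive's real API): deltas or aborts written after the compaction, i.e. with write ids above its watermark, must not be counted, because this Cleaner run is never allowed to delete them, and counting them would leave the compaction stuck in "ready for cleaning" forever.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: count only leftover delta dirs at or below the
// compaction's highest write id; anything above the watermark belongs to a
// later compaction/cleaning cycle.
class WatermarkFilter {

  static long countBelowWatermark(List<String> obsoleteDirs, long compactionHighestWriteId) {
    return obsoleteDirs.stream()
        .filter(d -> d.startsWith("delta_"))
        // "delta_minWid_maxWid": compare the max write id to the watermark
        .filter(d -> Long.parseLong(d.split("_")[2]) <= compactionHighestWriteId)
        .count();
  }

  public static void main(String[] args) {
    // delta_9_9 was written/aborted after the compaction (watermark 7): ignore it.
    List<String> leftovers = Arrays.asList("delta_3_4", "delta_9_9");
    System.out.println(countBelowWatermark(leftovers, 7)); // prints 1
  }
}
```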
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517891&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517891 ]

ASF GitHub Bot logged work on HIVE-24444:
- Author: ASF GitHub Bot
- Created on: 30/Nov/20 12:34
- Start Date: 30/Nov/20 12:34
- Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532563919

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,14 @@ private static String idWatermark(CompactionInfo ci) {
 }

 /**
- * @return true if any files were removed
+ * @return true if the cleaner has removed all files rendered obsolete by compaction
 */
 private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
 throws IOException, NoSuchObjectException, MetaException {
 Path locPath = new Path(location);
+Map dirSnapshots = null;
 AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(
-false), false);
+false), false, dirSnapshots);

Review comment: This will fill up the snapshot, so it can be used later

Issue Time Tracking
- Worklog Id: (was: 517891)
- Time Spent: 1h 10m (was: 1h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517892&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517892 ]

ASF GitHub Bot logged work on HIVE-24444:
- Author: ASF GitHub Bot
- Created on: 30/Nov/20 12:34
- Start Date: 30/Nov/20 12:34
- Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532556849

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -265,13 +267,14 @@ private static String idWatermark(CompactionInfo ci) {
 }

 /**
- * @return true if any files were removed
+ * @return true if the cleaner has removed all files rendered obsolete by compaction
 */
 private boolean removeFiles(String location, ValidWriteIdList writeIdList, CompactionInfo ci)
 throws IOException, NoSuchObjectException, MetaException {
 Path locPath = new Path(location);
+Map dirSnapshots = null;
 AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, writeIdList, Ref.from(
-false), false);
+false), false, dirSnapshots);

Review comment: What's your expectation here? dirSnapshots would always be null. I think what you wanted to do is: `dirSnapshots = getHdfsDirSnapshots(locPath.getFileSystem(conf), locPath)`

Issue Time Tracking
- Worklog Id: (was: 517892)
- Time Spent: 1h 20m (was: 1h 10m)
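The "dirSnapshots would always be null" remark is plain Java parameter semantics, which the following self-contained sketch (hypothetical names) demonstrates: object references are passed by value, so assigning to a parameter inside the callee never changes the caller's variable. The caller must allocate the map itself (e.g. via `getHdfsDirSnapshots`, as suggested above) and let the callee fill it in place.

```java
import java.util.HashMap;
import java.util.Map;

class PassByValue {

  // Broken: rebinds only the local parameter copy; the caller still sees null.
  static void fillWrong(Map<String, String> snapshots) {
    snapshots = new HashMap<>();
    snapshots.put("delta_1_2", "snapshot");
  }

  // Works: mutates the map object the caller allocated.
  static void fillRight(Map<String, String> snapshots) {
    snapshots.put("delta_1_2", "snapshot");
  }

  public static void main(String[] args) {
    Map<String, String> a = null;
    fillWrong(a);
    System.out.println(a);          // prints null

    Map<String, String> b = new HashMap<>();
    fillRight(b);
    System.out.println(b.size());   // prints 1
  }
}
```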
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517882&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517882 ]

ASF GitHub Bot logged work on HIVE-24444:
- Author: ASF GitHub Bot
- Created on: 30/Nov/20 12:12
- Start Date: 30/Nov/20 12:12
- Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532552240

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
 }
 fs.delete(dead, true);
 }
-return true;
+// Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+// number reaches 0.
+return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+ }
+
+ /**
+ * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+ * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+ * finish, e.g. because of an open transaction at the time of compaction.
+ * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+ * obsolete directories, then the Cleaner has more work to do.
+ * @param location location of table
+ * @return number of dirs left for the cleaner to clean – eventually
+ * @throws IOException
+ */
+ private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots)
+ throws IOException {
+ValidTxnList validTxnList = new ValidReadTxnList();
+//save it so that getAcidState() sees it
+conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+Path locPath = new Path(location);
+AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+Ref.from(false), false, dirSnapshots);
+return dir.getObsolete().size();

Review comment: should we consider aborts as well?

Issue Time Tracking
- Worklog Id: (was: 517882)
- Time Spent: 50m (was: 40m)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517827&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517827 ]

ASF GitHub Bot logged work on HIVE-24444:
- Author: ASF GitHub Bot
- Created on: 30/Nov/20 10:19
- Start Date: 30/Nov/20 10:19
- Worklog Time Spent: 10m

Work Description: klcopp commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532486047

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
## @@ -316,6 +312,29 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
 }
 fs.delete(dead, true);
 }
-return true;
+// Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+// number reaches 0.
+return getNumEventuallyObsoleteDirs(location) == 0;
+ }
+
+ /**
+ * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+ * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+ * finish, e.g. because of an open transaction at the time of compaction.
+ * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+ * obsolete directories, then the Cleaner has more work to do.
+ * @param location location of table
+ * @return number of dirs left for the cleaner to clean – eventually
+ * @throws IOException
+ */
+ private int getNumEventuallyObsoleteDirs(String location) throws IOException {
+ValidTxnList validTxnList = new ValidReadTxnList();
+//save it so that getAcidState() sees it
+conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+Path locPath = new Path(location);
+AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,

Review comment: Good idea, thanks!

Issue Time Tracking
- Worklog Id: (was: 517827)
- Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517825&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517825 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 30/Nov/20 10:13
            Start Date: 30/Nov/20 10:13
    Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r532482152

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

## @@ -316,6 +312,29 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning,
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location) throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();
+    // save it so that getAcidState() sees it
+    conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+    ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+    Path locPath = new Path(location);
+    AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,

Review comment: You could pass an empty dirSnapshot to the first getAcidState call in removeFiles; that will fill up the snapshot, and you can reuse it here, so it won't make a second listing on the FS.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------
    Worklog Id: (was: 517825)
    Time Spent: 0.5h (was: 20m)

> compactor.Cleaner should not set state "mark cleaned" if there are obsolete
> files in the FS
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-24444
>                 URL: https://issues.apache.org/jira/browse/HIVE-24444
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is an improvement on HIVE-24314, in which markCleaned() is called only
> if +any+ files are deleted by the cleaner. This could cause a problem in the
> following case:
> Say for table_1, compaction1's cleaning was blocked by an open txn, and
> compaction is run again on the same table (compaction2). Both compaction1 and
> compaction2 could be in "ready for cleaning" at the same time. By this time
> the blocking open txn could be committed. When the cleaner runs, one of
> compaction1 and compaction2 will remain in the "ready for cleaning" state:
> Say compaction2 is picked up by the cleaner first. The Cleaner deletes all
> obsolete files. Then compaction1 is picked up by the cleaner; the cleaner
> doesn't remove any files, and compaction1 will stay in the queue in a "ready
> for cleaning" state.
> HIVE-24291 already solves this issue, but if it isn't usable (for example if
> HMS schema changes are out of the question) then HIVE-24314 + this change will
> fix the issue of the Cleaner not removing all obsolete files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
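The decision in the diff above ("mark cleaned only when the number of eventually obsolete directories reaches 0") can be sketched as a toy model. This is not Hive code: `Delta`, `deletableNow`, and `numEventuallyObsoleteDirs` are invented stand-ins for the real `AcidUtils.getAcidState` machinery, with open transactions reduced to a set of low water marks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/**
 * Toy model (not Hive code) of the Cleaner decision described above: after a
 * cleaning pass, recompute the directories that would be obsolete if every
 * transaction had committed, and only "mark cleaned" when that count is 0.
 */
public class CleanerSketch {
    /** A delta directory, identified by the write id that produced it. */
    record Delta(long writeId) {}

    /**
     * Dirs the cleaner may delete right now: deltas made obsolete by the
     * compaction (writeId <= compactedWriteId) that no open txn still reads.
     */
    static List<Delta> deletableNow(List<Delta> dirs, long compactedWriteId,
                                    Set<Long> openTxnLowWaterMarks) {
        long minOpen = openTxnLowWaterMarks.stream().min(Long::compare).orElse(Long.MAX_VALUE);
        List<Delta> out = new ArrayList<>();
        for (Delta d : dirs) {
            if (d.writeId() <= compactedWriteId && d.writeId() < minOpen) {
                out.add(d);
            }
        }
        return out;
    }

    /** Dirs that will eventually be obsolete, i.e. assuming no open txns at all. */
    static long numEventuallyObsoleteDirs(List<Delta> dirs, long compactedWriteId) {
        return dirs.stream().filter(d -> d.writeId() <= compactedWriteId).count();
    }

    public static void main(String[] args) {
        List<Delta> dirs = new ArrayList<>(List.of(new Delta(1), new Delta(2), new Delta(3)));
        // An open txn with low water mark 2 pins delta_2 and delta_3.
        Set<Long> open = Set.of(2L);
        List<Delta> deletable = deletableNow(dirs, 3, open);
        dirs.removeAll(deletable);                              // cleaner removes delta_1 only
        boolean markCleaned = numEventuallyObsoleteDirs(dirs, 3) == 0;
        System.out.println("deleted=" + deletable.size() + " markCleaned=" + markCleaned);
        // prints: deleted=1 markCleaned=false
    }
}
```

Because two directories pinned by the open txn remain eventually obsolete, the compaction stays "ready for cleaning" and a later pass can finish the job, which is exactly the behavior the patch is after.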
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517811&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517811 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 30/Nov/20 09:12
            Start Date: 30/Nov/20 09:12
    Worklog Time Spent: 10m

Work Description: klcopp commented on pull request #1716:
URL: https://github.com/apache/hive/pull/1716#issuecomment-735657907

@pvargacl would you mind taking a look too?

Issue Time Tracking
-------------------
    Worklog Id: (was: 517811)
    Time Spent: 20m (was: 10m)
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=517809&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-517809 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 30/Nov/20 09:09
            Start Date: 30/Nov/20 09:09
    Worklog Time Spent: 10m

Work Description: klcopp opened a new pull request #1716:
URL: https://github.com/apache/hive/pull/1716

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
See HIVE-24444

### How was this patch tested?
Unit test

Issue Time Tracking
-------------------
    Worklog Id: (was: 517809)
    Remaining Estimate: 0h
    Time Spent: 10m
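The compaction1/compaction2 race from the issue description can be replayed as a toy simulation. This is not Hive code: `runCleaner` and the string queue are invented stand-ins for the compaction queue and the Cleaner loop, modeling only the HIVE-24314 rule that markCleaned() is called iff the pass deleted any files.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/**
 * Toy replay (not Hive code) of the race in the issue description: two
 * compactions on the same table are both "ready for cleaning"; whichever the
 * cleaner picks first deletes every obsolete file, so the other deletes
 * nothing and, under the "only markCleaned if files were deleted" rule,
 * never leaves the queue.
 */
public class ReadyForCleaningRace {

    /** Runs the cleaner over the queue; returns the compactions left stuck. */
    static List<String> runCleaner(Queue<String> readyForCleaning, List<String> obsoleteFiles) {
        List<String> stuck = new ArrayList<>();
        while (!readyForCleaning.isEmpty()) {
            String compaction = readyForCleaning.poll();
            int deleted = obsoleteFiles.size();
            obsoleteFiles.clear();           // first pass removes every obsolete file
            if (deleted == 0) {              // HIVE-24314 rule: no deletions -> no markCleaned()
                stuck.add(compaction);       // stays "ready for cleaning" indefinitely
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(List.of("compaction2", "compaction1"));
        List<String> files = new ArrayList<>(List.of("base_1", "delta_2_2"));
        System.out.println("stuck=" + runCleaner(queue, files));
        // prints: stuck=[compaction1]
    }
}
```

The patch's extra check breaks this loop: since no eventually obsolete directories remain after compaction2's pass, compaction1's empty pass still ends with zero remaining dirs and can be marked cleaned.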