[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-04-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=222737=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-222737
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 04/Apr/19 00:58
Start Date: 04/Apr/19 00:58
Worklog Time Spent: 10m 
  Work Description: asfgit commented on pull request #2579: [GOBBLIN-712] 
Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 222737)
Time Spent: 2h 40m  (was: 2.5h)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-04-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=222431=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-222431
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 03/Apr/19 17:23
Start Date: 03/Apr/19 17:23
Worklog Time Spent: 10m 
  Work Description: ibuenros commented on issue #2579: [GOBBLIN-712] Add 
version strategy pickup for ConfigBasedDataset distcp workflow
URL: 
https://github.com/apache/incubator-gobblin/pull/2579#issuecomment-479582467
 
 
   +1
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 222431)
Time Spent: 2.5h  (was: 2h 20m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-04-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=221887=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-221887
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 02/Apr/19 18:02
Start Date: 02/Apr/19 18:02
Worklog Time Spent: 10m 
  Work Description: ibuenros commented on pull request #2579: [GOBBLIN-712] 
Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r271429375
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -135,6 +135,7 @@ public ConfigBasedDataset(ReplicationConfiguration rc, 
Properties props, CopyRou
   log.debug("{} has version strategy {}", hEndpoint.getClusterName());
   return Optional.of(strategy);
 } catch (IOException e) {
+  log.error("Version strategy cannot be created due to {}", e.toString());
 
 Review comment:
   Please use `log.error("Version strategy ...", e)`
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 221887)
Time Spent: 2h 20m  (was: 2h 10m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-04-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=221420=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-221420
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 01/Apr/19 18:52
Start Date: 01/Apr/19 18:52
Worklog Time Spent: 10m 
  Work Description: ibuenros commented on pull request #2579: [GOBBLIN-712] 
Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r271003747
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -128,6 +175,17 @@ public String datasetURN() {
   return copyableFiles;
 }
 
+if (!this.srcDataFileVersionStrategy.isPresent() || 
!this.dstDataFileVersionStrategy.isPresent()) {
+  log.warn("Version strategy doesn't exist, cannot handle copy");
+  return copyableFiles;
+}
+
+if (!this.srcDataFileVersionStrategy.get().getClass().getName()
+.equals(this.dstDataFileVersionStrategy.get().getClass().getName())) {
+  log.warn("Version strategy doesn't match, cannot handle copy");
 
 Review comment:
   Let's print out the src and target strategies for easier debuggability.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 221420)
Time Spent: 2h 10m  (was: 2h)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-04-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=221421=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-221421
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 01/Apr/19 18:52
Start Date: 01/Apr/19 18:52
Worklog Time Spent: 10m 
  Work Description: ibuenros commented on pull request #2579: [GOBBLIN-712] 
Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r271003600
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -94,6 +107,40 @@ public ConfigBasedDataset(ReplicationConfiguration rc, 
Properties props, CopyRou
 this.pathFilter = DatasetUtils.instantiatePathFilter(this.props);
 this.applyFilterToDirectories =
 
Boolean.parseBoolean(this.props.getProperty(CopyConfiguration.APPLY_FILTER_TO_DIRECTORIES,
 "false"));
+this.srcDataFileVersionStrategy = 
getDataFileVersionStrategy(this.copyRoute.getCopyFrom(), rc, props);
+this.dstDataFileVersionStrategy = 
getDataFileVersionStrategy(this.copyRoute.getCopyTo(), rc, props);
+this.enforceFileLengthMatch = 
Boolean.parseBoolean(this.props.getProperty(CopyConfiguration.ENFORCE_FILE_LENGTH_MATCH,
 "true"));
+  }
+
+  /**
+   * Get the version strategy that can retrieve the data file version from the 
end point.
+   *
+   * @return the version strategy. Empty value when the version is not 
supported for this end point.
+   */
+  private Optional 
getDataFileVersionStrategy(EndPoint endPoint, ReplicationConfiguration rc, 
Properties props) {
+if (!(endPoint instanceof HadoopFsEndPoint)) {
+  log.warn("Data file version currently only handle the Hadoop Fs EndPoint 
replication");
+  return Optional.absent();
+}
+Configuration conf = HadoopUtils.newConfiguration();
+try {
+  HadoopFsEndPoint hEndpoint = (HadoopFsEndPoint) endPoint;
+  FileSystem fs = FileSystem.get(hEndpoint.getFsURI(), conf);
+
+  // IF configStore doesn't contain the strategy, check from job 
properties.
+  // If no strategy is found, default to the modification time strategy.
+  Optional versionStrategy = 
rc.getVersionStrategyFromConfigStore();
+  Config versionStrategyConfig = ConfigFactory.parseMap(ImmutableMap.of(
+  DataFileVersionStrategy.DATA_FILE_VERSION_STRATEGY_KEY, 
versionStrategy.isPresent()? versionStrategy.get() :
+  
props.getProperty(DataFileVersionStrategy.DATA_FILE_VERSION_STRATEGY_KEY,
+  ModTimeDataFileVersionStrategy.Factory.class.getName(;
+
+  DataFileVersionStrategy strategy = 
DataFileVersionStrategy.instantiateDataFileVersionStrategy(fs, 
versionStrategyConfig);
+  log.debug("{} has version strategy {}", hEndpoint.getClusterName());
+  return Optional.of(strategy);
+} catch (IOException e) {
+  return Optional.absent();
 
 Review comment:
   Should we log the exception?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 221421)
Time Spent: 2h 10m  (was: 2h)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=219031=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-219031
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 27/Mar/19 01:06
Start Date: 27/Mar/19 01:06
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269328551
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
+  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
+log.debug("Copy from timestamp older than copy to timestamp, 
skipped copy {} for dataset with metadata {}",
+originFileStatus.getPath(), this.rc.getMetaData());
+shouldCopy = false;
+  }
+} else {
+  // other version strategy should just compare the version number
+  Comparable srcVer = 
this.srcDataFileVersionStrategy.get().getVersion(originFileStatus.getPath());
+  Comparable dstVer = 
this.dstDataFileVersionStrategy.get().getVersion(copyToFileMap.get(newPath).getPath());
+  if (srcVer.compareTo(dstVer) == 0) {
 
 Review comment:
   Changed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 219031)
Time Spent: 2h  (was: 1h 50m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=219030=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-219030
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 27/Mar/19 01:06
Start Date: 27/Mar/19 01:06
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269373209
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
 
 Review comment:
   Change a little bit.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 219030)
Time Spent: 1h 50m  (was: 1h 40m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=219028=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-219028
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 27/Mar/19 01:05
Start Date: 27/Mar/19 01:05
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269373209
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
 
 Review comment:
   Change a little bit.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 219028)
Time Spent: 1.5h  (was: 1h 20m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=219029=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-219029
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 27/Mar/19 01:05
Start Date: 27/Mar/19 01:05
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269328551
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
+  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
+log.debug("Copy from timestamp older than copy to timestamp, 
skipped copy {} for dataset with metadata {}",
+originFileStatus.getPath(), this.rc.getMetaData());
+shouldCopy = false;
+  }
+} else {
+  // other version strategy should just compare the version number
+  Comparable srcVer = 
this.srcDataFileVersionStrategy.get().getVersion(originFileStatus.getPath());
+  Comparable dstVer = 
this.dstDataFileVersionStrategy.get().getVersion(copyToFileMap.get(newPath).getPath());
+  if (srcVer.compareTo(dstVer) == 0) {
 
 Review comment:
   Changed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 219029)
Time Spent: 1h 40m  (was: 1.5h)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=218981=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-218981
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 26/Mar/19 22:26
Start Date: 26/Mar/19 22:26
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269327266
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
 
 Review comment:
   @ibuenros Let's assume you have the same file copied from source to 
destination. Now what if you apply different purger algorithm on both src and 
dst file? I think then the file size would be different, but you don't want to 
copy again I guess? So should our Purger version strategy exclude the length 
check? In another word, I think the length is also a type of version, which is 
similar to MD5, modTime, etc.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 218981)
Time Spent: 1h 20m  (was: 1h 10m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=218938=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-218938
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 26/Mar/19 21:44
Start Date: 26/Mar/19 21:44
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269328551
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
+  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
+log.debug("Copy from timestamp older than copy to timestamp, 
skipped copy {} for dataset with metadata {}",
+originFileStatus.getPath(), this.rc.getMetaData());
+shouldCopy = false;
+  }
+} else {
+  // other version strategy should just compare the version number
+  Comparable srcVer = 
this.srcDataFileVersionStrategy.get().getVersion(originFileStatus.getPath());
+  Comparable dstVer = 
this.dstDataFileVersionStrategy.get().getVersion(copyToFileMap.get(newPath).getPath());
+  if (srcVer.compareTo(dstVer) == 0) {
 
 Review comment:
   I think so, let me change it.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 218938)
Time Spent: 1h 10m  (was: 1h)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=218937=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-218937
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 26/Mar/19 21:41
Start Date: 26/Mar/19 21:41
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269327266
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
 
 Review comment:
   @ibuenros Let's assume you have the same file copied from source to 
destination. Now what if you apply different purger algorithm on both src and 
dst file? I think then the file size would be different, but you don't want to 
copy again I guess? So should our Purger version strategy exclude the length 
check? In another word, I think the length is also a type of version, which is 
similar to MD5, modTime, etc.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 218937)
Time Spent: 1h  (was: 50m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=218935=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-218935
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 26/Mar/19 21:40
Start Date: 26/Mar/19 21:40
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269327266
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
 
 Review comment:
   @ibuenros Let's assume you have the same file copied from source to 
destination. Now what if you apply different purger algorithm on both src and 
dst file? I think then the file size would be different, but you don't want to 
copy again I guess? So should our Purger version strategy exclude the length 
check?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 218935)
Time Spent: 50m  (was: 40m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=218915=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-218915
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 26/Mar/19 20:47
Start Date: 26/Mar/19 20:47
Worklog Time Spent: 10m 
  Work Description: ibuenros commented on pull request #2579: [GOBBLIN-712] 
Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269307236
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
 
 Review comment:
   I think the file length should be a different comparison altogether, which 
can be enabled or disabled independently.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 218915)
Time Spent: 0.5h  (was: 20m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=218916=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-218916
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 26/Mar/19 20:47
Start Date: 26/Mar/19 20:47
Worklog Time Spent: 10m 
  Work Description: ibuenros commented on pull request #2579: [GOBBLIN-712] 
Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579#discussion_r269307676
 
 

 ##
 File path: 
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/replication/ConfigBasedDataset.java
 ##
 @@ -177,12 +231,30 @@ public String datasetURN() {
 watermarkMetadataCopied = true;
   }
 
-  // skip copy same file
-  if (copyToFileMap.containsKey(newPath) && 
copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
-  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
-log.debug("Copy from timestamp older than copy to timestamp, skipped 
copy {} for dataset with metadata {}",
-originFileStatus.getPath(), this.rc.getMetaData());
-  } else {
+
+  boolean shouldCopy = true;
+  if (copyToFileMap.containsKey(newPath)) {
+if (this.srcDataFileVersionStrategy.get() instanceof 
ModTimeDataFileVersionStrategy) {
+  // default version strategy should also compare data file length
+  if (copyToFileMap.get(newPath).getLen() == originFileStatus.getLen()
+  && copyToFileMap.get(newPath).getModificationTime() > 
originFileStatus.getModificationTime()) {
+log.debug("Copy from timestamp older than copy to timestamp, 
skipped copy {} for dataset with metadata {}",
+originFileStatus.getPath(), this.rc.getMetaData());
+shouldCopy = false;
+  }
+} else {
+  // other version strategy should just compare the version number
+  Comparable srcVer = 
this.srcDataFileVersionStrategy.get().getVersion(originFileStatus.getPath());
+  Comparable dstVer = 
this.dstDataFileVersionStrategy.get().getVersion(copyToFileMap.get(newPath).getPath());
+  if (srcVer.compareTo(dstVer) == 0) {
 
 Review comment:
   You want to copy only if `srcVersion > dstVersion`, right?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 218916)
Time Spent: 40m  (was: 0.5h)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=218358=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-218358
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 25/Mar/19 23:53
Start Date: 25/Mar/19 23:53
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on issue #2579: [GOBBLIN-712] Add 
version strategy pickup for ConfigBasedDataset distcp workflow
URL: 
https://github.com/apache/incubator-gobblin/pull/2579#issuecomment-476421024
 
 
   @ibuenros can you help review?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 218358)
Time Spent: 20m  (was: 10m)

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (GOBBLIN-712) Add version strategy for configbased dataset copy

2019-03-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/GOBBLIN-712?focusedWorklogId=218325=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-218325
 ]

ASF GitHub Bot logged work on GOBBLIN-712:
--

Author: ASF GitHub Bot
Created on: 25/Mar/19 22:40
Start Date: 25/Mar/19 22:40
Worklog Time Spent: 10m 
  Work Description: yukuai518 commented on pull request #2579: 
[GOBBLIN-712] Add version strategy pickup for ConfigBasedDataset distcp workflow
URL: https://github.com/apache/incubator-gobblin/pull/2579
 
 
   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I 
have checked off all the steps below!
   
   
   ### JIRA
   - [x] My PR addresses the following [Gobblin 
JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references 
them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
   - https://issues.apache.org/jira/browse/GOBBLIN-712
   
   
   ### Description
   - [x] Here are some details about my PR, including screenshots (if 
applicable):
 consider to compare the data file version instead of always using data 
file length or modification timestamp.
   
   ### Tests
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   
   ### Commits
   - [x] My commits all reference JIRA issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
   1. Subject is separated from body by a blank line
   2. Subject is limited to 50 characters
   3. Subject does not end with a period
   4. Subject uses the imperative mood ("add", not "adding")
   5. Body wraps at 72 characters
   6. Body explains "what" and "why", not "how"
   
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 218325)
Time Spent: 10m
Remaining Estimate: 0h

> Add version strategy for configbased dataset copy
> -
>
> Key: GOBBLIN-712
> URL: https://issues.apache.org/jira/browse/GOBBLIN-712
> Project: Apache Gobblin
>  Issue Type: Improvement
>Reporter: Kuai Yu
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)