[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r248072334

## File path: core/src/main/java/com/netflix/iceberg/BaseMetastoreTableOperations.java

```
@@ -84,13 +75,24 @@ protected void requestRefresh() {
   }

   protected String writeNewMetadata(TableMetadata metadata, int version) {
-    String newTableMetadataFilePath = newTableMetadataFilePath(metadata, version);
-    OutputFile newMetadataLocation = fileIo.newOutputFile(newTableMetadataFilePath);
+    FileIO initializedFileIo = io();
+    // This would be false if the subclass overrides io() and returns a different kind.
+    if (initializedFileIo instanceof HadoopFileIO) {
+      ((HadoopFileIO) initializedFileIo).updateTableMetadata(metadata);
```

Review comment: I built something similar to the proposed solution. The main difference is that `TableOperations` now has a new `io(TableMetadata)` method, which allows custom file IO implementations to be returned both when the current table metadata is used and when some alternative table metadata is used. We no longer save the `FileIO` object, as discussed earlier.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards, Apache Git Services

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
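A minimal, hypothetical sketch of the `io(TableMetadata)` shape described in this comment (stand-in types and method names, not the merged Iceberg API): the operations object hands out a `FileIO` bound to whatever metadata the caller supplies, so there is no cached instance to keep in sync.

```java
// Hypothetical sketch of io(TableMetadata); the real Iceberg types differ.
public class IoPerMetadata {

  // Stand-in for Iceberg's TableMetadata: only the location matters here.
  public static class TableMetadata {
    final String location;
    public TableMetadata(String location) { this.location = location; }
  }

  // Stand-in for Iceberg's FileIO.
  public interface FileIO {
    String newMetadataFileLocation(String fileName);
  }

  public interface TableOperations {
    TableMetadata current();

    // Existing accessor: io bound to the current metadata.
    default FileIO io() { return io(current()); }

    // Proposed overload: a FileIO built for the given (possibly
    // non-current) table metadata.
    FileIO io(TableMetadata metadata);
  }

  public static class SimpleOps implements TableOperations {
    private final TableMetadata current;
    public SimpleOps(TableMetadata current) { this.current = current; }

    @Override public TableMetadata current() { return current; }

    @Override public FileIO io(TableMetadata metadata) {
      // Fresh instance per call; no mutable state to refresh.
      return fileName -> metadata.location + "/metadata/" + fileName;
    }
  }
}
```

Because each call builds a fresh `FileIO` from the metadata it is given, the "saved `FileIO` goes stale" problem in the diff above does not arise.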
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r248028056

## File path: core/src/main/java/com/netflix/iceberg/BaseMetastoreTableOperations.java

```
@@ -84,13 +75,24 @@ protected void requestRefresh() {
   }

   protected String writeNewMetadata(TableMetadata metadata, int version) {
-    String newTableMetadataFilePath = newTableMetadataFilePath(metadata, version);
-    OutputFile newMetadataLocation = fileIo.newOutputFile(newTableMetadataFilePath);
+    FileIO initializedFileIo = io();
+    // This would be false if the subclass overrides io() and returns a different kind.
+    if (initializedFileIo instanceof HadoopFileIO) {
+      ((HadoopFileIO) initializedFileIo).updateTableMetadata(metadata);
```

Review comment:

> I think we would expose the classes themselves. We'd add an interface to the API and implement that. That gets around the TableOperations problem.

Meaning, add an interface on top of `TableMetadata`?
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r248025128

## File path: core/src/main/java/com/netflix/iceberg/BaseMetastoreTableOperations.java

```
@@ -84,13 +75,24 @@ protected void requestRefresh() {
   }

   protected String writeNewMetadata(TableMetadata metadata, int version) {
-    String newTableMetadataFilePath = newTableMetadataFilePath(metadata, version);
-    OutputFile newMetadataLocation = fileIo.newOutputFile(newTableMetadataFilePath);
+    FileIO initializedFileIo = io();
+    // This would be false if the subclass overrides io() and returns a different kind.
+    if (initializedFileIo instanceof HadoopFileIO) {
+      ((HadoopFileIO) initializedFileIo).updateTableMetadata(metadata);
```

Review comment: I think a better approach would be: instead of caching the `FileIO` object and trying to maintain its state, just make a new one with the most up-to-date metadata every time it's needed to write something.
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r248023345

## File path: core/src/main/java/com/netflix/iceberg/BaseMetastoreTableOperations.java

```
@@ -84,13 +75,24 @@ protected void requestRefresh() {
   }

   protected String writeNewMetadata(TableMetadata metadata, int version) {
-    String newTableMetadataFilePath = newTableMetadataFilePath(metadata, version);
-    OutputFile newMetadataLocation = fileIo.newOutputFile(newTableMetadataFilePath);
+    FileIO initializedFileIo = io();
+    // This would be false if the subclass overrides io() and returns a different kind.
+    if (initializedFileIo instanceof HadoopFileIO) {
+      ((HadoopFileIO) initializedFileIo).updateTableMetadata(metadata);
```

Review comment: Moving `TableMetadata` into `iceberg-api` runs into the problem of `TableOperations` not being in `iceberg-api`.
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r248022851

## File path: core/src/main/java/com/netflix/iceberg/BaseMetastoreTableOperations.java

```
@@ -84,13 +75,24 @@ protected void requestRefresh() {
   }

   protected String writeNewMetadata(TableMetadata metadata, int version) {
-    String newTableMetadataFilePath = newTableMetadataFilePath(metadata, version);
-    OutputFile newMetadataLocation = fileIo.newOutputFile(newTableMetadataFilePath);
+    FileIO initializedFileIo = io();
+    // This would be false if the subclass overrides io() and returns a different kind.
+    if (initializedFileIo instanceof HadoopFileIO) {
+      ((HadoopFileIO) initializedFileIo).updateTableMetadata(metadata);
```

Review comment: The problem is that `TableMetadata` is in `iceberg-core` but this module is in `iceberg-api`, and `FileIO` is returned by `Table`, which is also in `iceberg-api`. So to make this API take `TableMetadata` as input, we'd likely have to move `TableMetadata` into `iceberg-api`. The modules aren't set up correctly for this right now.
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r247745842

## File path: core/src/main/java/com/netflix/iceberg/BaseMetastoreTableOperations.java

```
@@ -52,8 +47,8 @@
   private static final String HIVE_LOCATION_FOLDER_NAME = "empty";

   private final Configuration conf;
-  private final FileIO fileIo;
+  private HadoopFileIO fileIo;
```

Review comment: Now it does, because we specifically refresh the metadata on it (this really should be `defaultFileIo`, though). However, perhaps part of the question is whether `updateMetadata` should be part of the `FileIO` API contract too.
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r247728506

## File path: api/src/main/java/com/netflix/iceberg/io/FileIO.java

```
@@ -47,4 +46,30 @@
   /**
    * Delete the file at the given path.
    */
   void deleteFile(String path);
+
+  /**
+   * Get an {@link InputFile} to get the bytes for this table's metadata file with the given name.
+   */
+  InputFile readMetadataFile(String fileName);
+
+  /**
+   * Get an {@link OutputFile} to write bytes for a new table metadata file with the given name.
+   */
+  OutputFile newMetadataFile(String fileName);
+
+  /**
+   * Get an {@link OutputFile} for writing bytes to a new data file for this table.
+   *
+   * The partition values of the rows in this file may be used to derive the final location of
+   * the file.
+   */
+  OutputFile newPartitionedDataFile(
+      PartitionSpec partitionSpec, StructLike partitionData, String fileName);
```

Review comment: Passing the partitioning information in explicitly, I think, carries a stronger notion that this file will be used for data with those partition values, as opposed to a path parameter that callers can fill with any string. The stronger typing carries stronger semantics. It also signals that the partitioning need not be encoded in the path; the implementation can choose to derive its own paths from the partitioning.
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r247728112

## File path: core/src/main/java/com/netflix/iceberg/hadoop/HadoopTableOperations.java

```
@@ -69,30 +73,33 @@ public TableMetadata current() {
   @Override
   public TableMetadata refresh() {
     int ver = version != null ? version : readVersionHint();
-    Path metadataFile = metadataFile(ver);
-    FileSystem fs = getFS(metadataFile, conf);
+    InputFile metadataFile = io().readMetadataFile(metadataFileName(ver));
+    Path metadataFilePath = new Path(metadataFile.location());
+    FileSystem fs = getFS(metadataFilePath, conf);
     try {
       // don't check if the file exists if version is non-null because it was already checked
-      if (version == null && !fs.exists(metadataFile)) {
+      if (version == null && !fs.exists(metadataFilePath)) {
```

Review comment: Yeah, I also understand not wanting to add `exists`. I'm also hesitant because we don't want these APIs to blow up into a full-fledged file system. Where is the boundary with regard to which file-system-like operations we will require implementations to provide?
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r247682766

## File path: core/src/main/java/com/netflix/iceberg/hadoop/HadoopFileIO.java

```
@@ -38,4 +55,98 @@ public void deleteFile(String path) {
       throw new RuntimeIOException(e, "Failed to delete file: %s", path);
     }
   }
+
+  @Override
+  public InputFile readMetadataFile(String fileName) {
+    return newInputFile(metadataPath(fileName).toString());
+  }
+
+  @Override
+  public OutputFile newMetadataFile(String fileName) {
+    return newOutputFile(metadataPath(fileName).toString());
+  }
+
+  @Override
+  public OutputFile newPartitionedDataFile(
+      PartitionSpec partitionSpec, StructLike filePartition, String fileName) {
+    String location;
+    if (useObjectStorage) {
+      // try to get db and table portions of the path for context in the object store
+      String context = pathContext(new Path(newDataFileLocation));
+      String partitionAndFilename = String.format(
+          "%s/%s", partitionSpec.partitionToPath(filePartition), fileName);
+      int hash = HASH_FUNC.apply(partitionAndFilename);
+      location = String.format(
+          "%s/%08x/%s/%s/%s",
+          objectStorePath,
+          hash,
+          context,
+          partitionSpec.partitionToPath(filePartition),
+          fileName);
+    } else {
+      location = String.format(
+          "%s/%s/%s",
+          newDataFileLocation,
+          partitionSpec.partitionToPath(filePartition),
+          fileName);
+    }
+    return newOutputFile(location);
+  }
+
+  @Override
+  public OutputFile newUnpartitionedDataFile(String fileName) {
+    return newOutputFile(String.format("%s/%s", newDataFileLocation, fileName));
+  }
+
+  private Path metadataPath(String filename) {
+    return new Path(new Path(tableLocation, "metadata"), filename);
+  }
+
+  public void updateProperties(Map<String, String> newTableProperties) {
```

Review comment: `updateMetadata` is a fitting compromise here!
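The object-store branch in the hunk above spreads files across key prefixes by hashing the partition path and file name before building the location. Here is a standalone sketch of the two layouts; the method names and the use of `String.hashCode()` are illustrative assumptions, not what `HadoopFileIO` (which uses its own `HASH_FUNC` and configuration) actually does:

```java
import java.util.Locale;

// Illustrative sketch of the two data-file location layouts above.
public class DataFileLocations {

  // Plain layout: <data location>/<partition path>/<file name>
  public static String partitionedLocation(String dataLocation,
                                           String partitionPath,
                                           String fileName) {
    return String.format("%s/%s/%s", dataLocation, partitionPath, fileName);
  }

  // Object-store layout: an 8-hex-digit hash prefix comes first, so files
  // for one partition do not all land under a single key prefix (which
  // helps S3-style stores balance load across key ranges).
  public static String objectStoreLocation(String objectStorePath,
                                           String context,
                                           String partitionPath,
                                           String fileName) {
    int hash = (partitionPath + "/" + fileName).hashCode();  // illustrative hash
    return String.format(Locale.ROOT, "%s/%08x/%s/%s/%s",
        objectStorePath, hash, context, partitionPath, fileName);
  }
}
```

Note that the hash is derived only from the partition path and file name, so the final location is deterministic for a given file.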
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r247682766

## File path: core/src/main/java/com/netflix/iceberg/hadoop/HadoopFileIO.java

(quotes the same `HadoopFileIO` hunk, `@@ -38,4 +55,98 @@`, shown in full earlier in this thread)

Review comment: Hmm, `TableMetadata` isn't serializable as far as I can tell. Here are its fields:

```
// stored metadata
private final String location;
private final long lastUpdatedMillis;
private final int lastColumnId;
private final Schema schema;
private final int defaultSpecId;
private final List<PartitionSpec> specs;
private final Map<String, String> properties;
private final long currentSnapshotId;
private final List<Snapshot> snapshots;
private final Map<Long, Snapshot> snapshotsById;
private final Map<Integer, PartitionSpec> specsById;
private final List<SnapshotLogEntry> snapshotLog;
```

The problem is `Snapshot`: the interface doesn't extend `Serializable`. Furthermore, `BaseSnapshot`, which I presume the `Snapshot` fields are, contains a reference to `TableOperations`.
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r247639962

## File path: core/src/main/java/com/netflix/iceberg/hadoop/HadoopFileIO.java

(quotes the same `HadoopFileIO` hunk, `@@ -38,4 +55,98 @@`, shown in full earlier in this thread)

Review comment: Actually, `Supplier` isn't great, because the `Supplier` itself isn't serializable: it would be tied to the `HadoopTableOperations` object.
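A common workaround for this kind of problem (a generic Java pattern, not something this PR adds) is an interface that extends both `Supplier` and `Serializable`: a lambda assigned to such a type is serializable as long as everything it captures is. The catch mccheah raises still applies: a supplier that captures `HadoopTableOperations` drags the non-serializable operations object along with it.

```java
import java.io.Serializable;
import java.util.function.Supplier;

public class SerializableSupplierDemo {

  // A Supplier that is also Serializable; a lambda assigned to this type
  // is serializable as long as everything it captures is.
  public interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

  public static SerializableSupplier<String> locationSupplier(String location) {
    // Captures only a String, so the lambda itself is serializable.
    return () -> location;
  }
}
```

So the fix is less about the `Supplier` type and more about capturing only plain, serializable values (strings, configuration maps) rather than live objects like the table operations.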
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r247625807

## File path: core/src/main/java/com/netflix/iceberg/hadoop/HadoopFileIO.java

(quotes the same `HadoopFileIO` hunk, `@@ -38,4 +55,98 @@`, shown in full earlier in this thread)

Review comment: The `Supplier` makes sense to me.
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r246999702

## File path: api/src/main/java/com/netflix/iceberg/io/FileIO.java

```
@@ -47,4 +46,30 @@
   /**
    * Delete the file at the given path.
    */
   void deleteFile(String path);
+
+  /**
+   * Get an {@link InputFile} to get the bytes for this table's metadata file with the given name.
+   */
+  InputFile readMetadataFile(String fileName);
+
+  /**
+   * Get an {@link OutputFile} to write bytes for a new table metadata file with the given name.
+   */
+  OutputFile newMetadataFile(String fileName);
+
+  /**
+   * Get an {@link OutputFile} for writing bytes to a new data file for this table.
+   *
+   * The partition values of the rows in this file may be used to derive the final location of
+   * the file.
+   */
+  OutputFile newPartitionedDataFile(
```

Review comment: I was thinking about this a little more, and in contrast to what was discussed on the ticket, I propose creating the `OutputFile` instance directly instead of just returning the path alone. This comes up because the file doesn't exist yet, which will always be the case for Spark (Spark is always creating new data files). In that case, it is desirable for the plugin not only to know what the path is, but perhaps also to create the path itself, often by asking some external system for a new entry that references the file we're about to write. In other words, the semantics of this API are that we're not only asking for the location of the data file; we're also asking to create the data file as part of this action. In practice, most implementations should just derive this as `newOutputFile(getPath(args))`. Nevertheless, I think this version of the API is richer. Thoughts?
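The "most implementations should just derive this as `newOutputFile(getPath(args))`" point can be sketched as a default method on the interface: simple implementations get the path-based behavior for free, while an implementation backed by an external system can override the method to register the file as part of creating it. Interface and method names here are illustrative stand-ins, not the proposed API:

```java
public class OutputFileDerivation {

  // Minimal stand-in for Iceberg's OutputFile.
  public interface OutputFile {
    String location();
  }

  public interface DataFileIO {
    OutputFile newOutputFile(String path);

    // What a path-only API would have exposed.
    String newDataFilePath(String partitionPath, String fileName);

    // The richer API: by default it just wraps the derived path, but an
    // implementation can override it to eagerly create an entry in an
    // external system for the file about to be written.
    default OutputFile newPartitionedDataFile(String partitionPath, String fileName) {
      return newOutputFile(newDataFilePath(partitionPath, fileName));
    }
  }

  public static class SimpleIO implements DataFileIO {
    private final String dataLocation;
    public SimpleIO(String dataLocation) { this.dataLocation = dataLocation; }

    @Override public OutputFile newOutputFile(String path) { return () -> path; }

    @Override public String newDataFilePath(String partitionPath, String fileName) {
      return dataLocation + "/" + partitionPath + "/" + fileName;
    }
  }
}
```

This keeps the common case trivial while leaving room for the "create on request" semantics the comment argues for.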
[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r246999840

## File path: core/src/main/java/com/netflix/iceberg/hadoop/HadoopFileIO.java

(quotes the same `HadoopFileIO` hunk, `@@ -38,4 +55,98 @@`, shown in full earlier in this thread)

Review comment: This is a concession to the fact that when table properties are updated, it's possible that the data path could change. It's not immediately clear whether we can do better than this.