[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name

2019-01-10 Thread GitBox
mccheah commented on a change in pull request #73: Allow data output streams to 
be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r246999702
 
 

 ##
 File path: api/src/main/java/com/netflix/iceberg/io/FileIO.java
 ##
 @@ -47,4 +46,30 @@
* Delete the file at the given path.
*/
   void deleteFile(String path);
+
+  /**
+   * Get an {@link InputFile} to get the bytes for this table's metadata file with the given name.
+   */
+  InputFile readMetadataFile(String fileName);
+
+  /**
+   * Get an {@link OutputFile} to write bytes for a new table metadata file with the given name.
+   */
+  OutputFile newMetadataFile(String fileName);
+
+  /**
+   * Get an {@link OutputFile} for writing bytes to a new data file for this table.
+   *
+   * The partition values of the rows in this file may be used to derive the final location of the file.
+   */
+  OutputFile newPartitionedDataFile(
 
 Review comment:
   I was thinking about this a little more, and in contrast to what was 
discussed on the ticket, I propose creating the `OutputFile` instance directly 
instead of just returning the path alone.
   
   This comes up when the file doesn't exist yet, which is always the case for 
Spark (Spark always creates new data files). In that case, it is desirable for 
the plugin not only to know the path, but perhaps also to create the path 
itself, often by asking an external system for a new entry that references the 
file we're about to write.
   
   In other words, the semantics of this API are that we're not only asking for 
the location of the data file, but also asking to create the data file as part 
of this action. In practice most implementations should just derive this as 
`newOutputFile(getPath(args))`. Nevertheless I think this version of the API is 
richer. Thoughts?
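   The proposed shape can be sketched with a minimal, self-contained interface. 
All names here are illustrative stand-ins, not the final Iceberg API: a default 
method shows how most implementations would derive the `OutputFile` from a 
path, while a plugin remains free to override it and register the new file with 
an external system first.

```java
// Hypothetical sketch of the proposal: the FileIO plugin returns an OutputFile
// directly (and may create backing state as a side effect) rather than
// returning only a path string.
interface OutputFile {
  String location();
}

interface FileIO {
  OutputFile newOutputFile(String path);

  // Most implementations simply derive the OutputFile from a computed path;
  // an overriding plugin could instead ask an external system for the entry.
  default OutputFile newPartitionedDataFile(String partitionPath, String fileName) {
    return newOutputFile(partitionPath + "/" + fileName);
  }
}

class SimpleFileIO implements FileIO {
  @Override
  public OutputFile newOutputFile(String path) {
    // OutputFile has a single abstract method in this sketch, so a lambda works.
    return () -> path;
  }
}

public class FileIOSketch {
  public static String demo() {
    FileIO io = new SimpleFileIO();
    return io.newPartitionedDataFile("s3://bucket/data/dt=2019-01-10", "part-0.parquet").location();
  }

  public static void main(String[] args) {
    System.out.println(demo()); // s3://bucket/data/dt=2019-01-10/part-0.parquet
  }
}
```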


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] mccheah commented on a change in pull request #73: Allow data output streams to be generated via custom mechanisms when given partitioning and file name

2019-01-10 Thread GitBox
mccheah commented on a change in pull request #73: Allow data output streams to 
be generated via custom mechanisms when given partitioning and file name
URL: https://github.com/apache/incubator-iceberg/pull/73#discussion_r246999840
 
 

 ##
 File path: core/src/main/java/com/netflix/iceberg/hadoop/HadoopFileIO.java
 ##
 @@ -38,4 +55,98 @@ public void deleteFile(String path) {
   throw new RuntimeIOException(e, "Failed to delete file: %s", path);
 }
   }
+
+  @Override
+  public InputFile readMetadataFile(String fileName) {
+return newInputFile(metadataPath(fileName).toString());
+  }
+
+  @Override
+  public OutputFile newMetadataFile(String fileName) {
+return newOutputFile(metadataPath(fileName).toString());
+  }
+
+  @Override
+  public OutputFile newPartitionedDataFile(
+  PartitionSpec partitionSpec, StructLike filePartition, String fileName) {
+String location;
+if (useObjectStorage) {
+  // try to get db and table portions of the path for context in the object store
+  String context = pathContext(new Path(newDataFileLocation));
+  String partitionAndFilename = String.format(
+  "%s/%s", partitionSpec.partitionToPath(filePartition), fileName);
+  int hash = HASH_FUNC.apply(partitionAndFilename);
+  location = String.format(
+  "%s/%08x/%s/%s/%s",
+  objectStorePath,
+  hash,
+  context,
+  partitionSpec.partitionToPath(filePartition),
+  fileName);
+} else {
+  location = String.format(
+  "%s/%s/%s",
+  newDataFileLocation,
+  partitionSpec.partitionToPath(filePartition),
+  fileName);
+}
+return newOutputFile(location);
+  }
+
+  @Override
+  public OutputFile newUnpartitionedDataFile(String fileName) {
+return newOutputFile(String.format("%s/%s", newDataFileLocation, fileName));
+  }
+
+  private Path metadataPath(String filename) {
+return new Path(new Path(tableLocation, "metadata"), filename);
+  }
+
+  public void updateProperties(Map<String, String> newTableProperties) {
 
 Review comment:
   This is a concession to the fact that when table properties are updated, 
it's possible that the data path could change. It's not immediately clear if we 
can do better than this.
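   The object-storage branch of `newPartitionedDataFile` above can be 
illustrated with a self-contained sketch. The hash function here is only a 
placeholder for `HASH_FUNC`, and every path and name is illustrative: the 
point is that a short hash prefix spreads files across the object store's key 
space so that no single partition prefix becomes a hot spot.

```java
public class ObjectStorePathSketch {
  // Placeholder for HASH_FUNC; the real implementation's function may differ.
  // Masking with 0x7fffffff keeps the value non-negative for %08x formatting.
  static int hash(String partitionAndFilename) {
    return partitionAndFilename.hashCode() & 0x7fffffff;
  }

  // Mirrors the object-storage branch: objectStorePath / hash / context
  // (db/table) / partition path / file name.
  static String partitionedLocation(String objectStorePath, String context,
                                    String partitionPath, String fileName) {
    int hash = hash(partitionPath + "/" + fileName);
    return String.format("%s/%08x/%s/%s/%s",
        objectStorePath, hash, context, partitionPath, fileName);
  }

  public static void main(String[] args) {
    System.out.println(
        partitionedLocation("s3://bucket", "db/table", "dt=2019-01-10", "part-0.parquet"));
  }
}
```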





[GitHub] rdblue commented on issue #72: Fix filtering manifests in unpartitioned tables.

2019-01-10 Thread GitBox
rdblue commented on issue #72: Fix filtering manifests in unpartitioned tables.
URL: https://github.com/apache/incubator-iceberg/pull/72#issuecomment-453297507
 
 
   @aokolnychyi, this is a fix for the issue you pointed out. Could you please 
review? Thank you!





[GitHub] rdblue opened a new pull request #70: Allow schema updates in transactions.

2019-01-10 Thread GitBox
rdblue opened a new pull request #70: Allow schema updates in transactions.
URL: https://github.com/apache/incubator-iceberg/pull/70
 
 
   This is safe because schema updates can always read older data. If a
   transaction includes a write followed by a schema update, the new schema
   can read the data that was just written. Also, writers are allowed to
   write with older schemas because the current schema can read any file
   written with an older schema. If a transaction includes a schema update
   followed by a write, using either the original schema or the new schema
   will work.
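   The compatibility argument can be illustrated with a deliberately simplified 
sketch (this is not Iceberg code, and real Iceberg resolves columns by field 
id rather than by name): a reader projecting with the updated schema simply 
sees `null` for columns that did not exist when the file was written, so the 
new schema can always read the older data.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of why schema updates can always read older data:
// columns added by the update are absent (null) in files written before it.
public class SchemaEvolutionSketch {
  static Map<String, Object> project(List<String> readSchema, Map<String, Object> record) {
    Map<String, Object> projected = new HashMap<>();
    for (String column : readSchema) {
      projected.put(column, record.get(column)); // missing in old data -> null
    }
    return projected;
  }

  static String demo() {
    // A record written with the old schema, which only had "id".
    Map<String, Object> oldRecord = new HashMap<>();
    oldRecord.put("id", 1L);
    // The transaction then added "category"; reading old data still works.
    Map<String, Object> row = project(List.of("id", "category"), oldRecord);
    return row.get("id") + ", " + row.get("category");
  }

  public static void main(String[] args) {
    System.out.println(demo()); // 1, null
  }
}
```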





[GitHub] rdblue opened a new pull request #71: Allow passing the unpartitioned spec to DataFiles.builder.

2019-01-10 Thread GitBox
rdblue opened a new pull request #71: Allow passing the unpartitioned spec to 
DataFiles.builder.
URL: https://github.com/apache/incubator-iceberg/pull/71
 
 
   This makes converting tables with SparkTableUtil easier.





[GitHub] rdblue opened a new pull request #72: Fix filtering manifests in unpartitioned tables.

2019-01-10 Thread GitBox
rdblue opened a new pull request #72: Fix filtering manifests in unpartitioned 
tables.
URL: https://github.com/apache/incubator-iceberg/pull/72
 
 
  FilteredManifest only ran filters if there was a row filter AND a
  partition filter, but it should run filters if there is a row filter OR
  a partition filter.
   
   Because a filter may be null, this also updates the functions that
   create evaluators to create an evaluator for alwaysTrue when the
   expression is null.
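  The two parts of the fix can be sketched in a self-contained form 
  (illustrative only, not the actual FilteredManifest code): filtering must 
  run when either filter is present, and a null expression falls back to an 
  always-true evaluator instead of causing a null dereference.

```java
import java.util.function.Predicate;

// Sketch of the described fix: OR instead of AND when deciding whether to
// filter, plus an alwaysTrue fallback for null expressions.
public class ManifestFilterSketch {
  // Create an evaluator for alwaysTrue when the expression is null.
  public static <T> Predicate<T> evaluatorFor(Predicate<T> expr) {
    return expr == null ? t -> true : expr;
  }

  // Before the fix this used &&, so a lone row filter (or a lone partition
  // filter) was silently ignored.
  public static boolean shouldFilter(Object rowFilter, Object partitionFilter) {
    return rowFilter != null || partitionFilter != null;
  }

  public static void main(String[] args) {
    System.out.println(shouldFilter(new Object(), null)); // true: row filter alone now triggers filtering
  }
}
```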

