Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks merged PR #15630: URL: https://github.com/apache/iceberg/pull/15630 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3290144221 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. Review Comment: There are a number of similar comments about this distinction, but `fully-qualified` (or URI with scheme) is actually the terminology that was used for v1-v3. V4 is distinctly different and part of this change is the shift in terminology, which is why we're not attempting to redefine the usage in prior versions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
adutra commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3287263361 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. Review Comment: ```suggestion Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
manuzhang commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3279269940 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path starts with a URI scheme, it is absolute and is used without modification. +* If the path does not start with a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below.) 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. Review Comment: It looks like we use `v4` instead of `V4` in all other places.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
manuzhang commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3279248670 ## format/spec.md: ## @@ -123,9 +131,15 @@ Tables do not require random-access writes. Once written, data and metadata file Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. +### File Locations in Metadata + +All location fields in format versions 3 and prior contain fully-qualified paths. Review Comment: It might be clearer at first glance with `in format versions prior to version 4`.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
manuzhang commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3279214807 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path starts with a URI scheme, it is absolute and is used without modification. +* If the path does not start with a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below.) 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. Review Comment: Me too ;), might need a comma somewhere.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
nastra commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3279167342 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path starts with a URI scheme, it is absolute and is used without modification. +* If the path does not start with a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below.) 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. Review Comment: nit: I had to read this sentence multiple times to understand it
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276829169 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. + +Examples of path resolution: + +| | Format Version | Table Location| File Path | Resolved Path | Description | +|-||---|---||-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs://wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Absolute path is used | +| Duplicate separator | v4 | s3://bucket/db/table/ | data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | +| Duplicate separator | v4 | s3://bucket/db/table | /data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location immediately followed by a separator character, the relative path is the remainder of the string after the separator character. +* If an absolute path does not start with the table location immediately followed by the separator character, it is stored as an absolute path. Review Comment: I'm not sure we need to clarify. I think that's important to say for consuming from metadata, but how the persisted path is arrived at is different. 
If something is producing the path, it can pretty much do whatever it wants with the structure as long as it's absolute when it's first persisted. This is a little nuanced, but I think it would be overreaching in this particular context.
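The two relativization rules quoted in this comment can be sketched in Python as a plain string-prefix check. This is a minimal illustration, not the Iceberg implementation: the function name `relativize_path` is hypothetical, and it assumes the input is already an absolute path, since (as noted above) how a producer arrives at that path is left to the producer.

```python
def relativize_path(table_location: str, absolute_path: str) -> str:
    """Hypothetical helper: strip the table location prefix when possible.

    Follows the two quoted rules:
    - if the path starts with the table location immediately followed by
      the separator "/", return the remainder after the separator;
    - otherwise, keep the path as an absolute path.
    """
    prefix = table_location + "/"
    if absolute_path.startswith(prefix):
        return absolute_path[len(prefix):]
    return absolute_path


if __name__ == "__main__":
    location = "s3://bucket/db/table"
    for path in (
        "s3://bucket/db/table/data/0-0.parquet",   # under the table location
        "hdfs://wh/db/table/data/0-0.parquet",     # different store
        "s3://bucket/db/table2/data/0-0.parquet",  # prefix not followed by "/"
    ):
        print(path, "->", relativize_path(location, path))
```

Because the check requires the separator right after the table location, a sibling location such as `s3://bucket/db/table2` is not treated as being inside `s3://bucket/db/table`.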
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
dramaticlly commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276827702 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: Review Comment: Thanks Dan, found it in https://github.com/apache/iceberg/pull/15630#discussion_r3250983873
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276807847 ## format/spec.md: ## @@ -954,6 +1012,34 @@ Table metadata consists of the following fields: | _optional_ | _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). | ||| _required_ | **`next-row-id`** | A `long` higher than all assigned row IDs; the next snapshot’s `first-row-id`. See [Row Lineage](#row-lineage). | ||| _optional_ | **`encryption-keys`** | A list (optional) of [encryption keys](#encryption-keys) used for table encryption. | +=== "v4" +| v4 | Field | Description | +||-|-| +| _required_ | **`format-version`**| An integer version number for the format. Implementations must throw an exception if a table's version is higher than the supported version. | +| _required_ | **`table-uuid`**| A UUID that identifies the table, generated when the table is created. Implementations must throw an exception if a table's UUID does not match the expected UUID after refreshing metadata. | +| _optional_ | **`location`** | The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files. Must be an absolute path when present. See [Table Locations](#table-location-specification). | Review Comment: Same here, we don't want to restrict to catalogs only. See comment here: https://github.com/apache/iceberg/pull/15630#discussion_r3251040825 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276804104 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. + +Examples of path resolution: + +| | Format Version | Table Location| File Path | Resolved Path | Description | +|-||---|---||-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs://wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Absolute path is used | +| Duplicate separator | v4 | s3://bucket/db/table/ | data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | +| Duplicate separator | v4 | s3://bucket/db/table | /data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location immediately followed by a separator character, the relative path is the remainder of the string after the separator character. +* If an absolute path does not start with the table location immediately followed by the separator character, it is stored as an absolute path. + + Table Location Specification + +When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. 
How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should provide a table's location Review Comment: We don't want to restrict this to catalogs only. Please see this comment: https://github.com/apache/iceberg/pull/15630#discussion_r3251040825
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276812324 ## format/spec.md: ## @@ -1647,6 +1733,30 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. + +Reading v3 metadata for v4: + +* All location fields are fully-qualified paths and interpreted as absolute paths for v4 +* Any location field without a uri scheme prefix must prepend a scheme component consistent with v4 absolute paths + +Writing v4 metadata: + +* Table metadata JSON: +* `location` is now optional and must be absolute when present +* When not present, the table location must be managed externally and provided when loading the metadata Review Comment: Same situation here. We want to leave this open as it's not catalog only. https://github.com/apache/iceberg/pull/15630#discussion_r3251040825 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
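As an illustration of the optional `location` field discussed in this comment, the following Python sketch picks a table's base location when loading v4 metadata. The function name `base_location` and the `provided_location` parameter are hypothetical; the quoted text deliberately leaves open who supplies the location when it is absent from metadata (it is not limited to catalogs), so the second argument simply stands in for "provided externally".

```python
import re
from typing import Any, Mapping, Optional

# RFC 3986, section 3.1 scheme grammar followed by ":", anchored at the start.
_SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def base_location(metadata: Mapping[str, Any],
                  provided_location: Optional[str] = None) -> str:
    """Hypothetical helper: choose the table's base location.

    - If `location` is present in the metadata, it is used directly and
      must be an absolute path (it must start with a URI scheme).
    - If it is absent, the caller must provide the location.
    """
    location = metadata.get("location")
    if location is not None:
        if not _SCHEME_PREFIX.match(location):
            raise ValueError("metadata `location` must be an absolute path")
        return location
    if provided_location is None:
        raise ValueError("table location is absent from metadata and was not provided")
    return provided_location


if __name__ == "__main__":
    print(base_location({"format-version": 4, "location": "s3://bucket/db/table"}))
    print(base_location({"format-version": 4}, "s3://bucket/db/table"))
```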
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276792172 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). 
Review Comment: Others asked for this explicitly to show what is expected if you suffix/prefix with a separator and what the behavior would look like. The point is to show that you do not de-dup or strip them.
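To make the "no de-dup, no strip" point concrete, here is a minimal Python sketch of the join described in the quoted text, checked against the v4 rows of the examples table quoted elsewhere in this thread. The name `resolve_path` is hypothetical and the scheme check is an assumption based on the RFC 3986 section 3.1 grammar; this is not taken from an Iceberg library.

```python
import re

# RFC 3986, section 3.1 scheme grammar followed by ":", anchored at the start.
_SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def resolve_path(table_location: str, path: str) -> str:
    """Hypothetical helper: resolve a metadata path against the table location."""
    if _SCHEME_PREFIX.match(path):
        # Absolute path: used as-is, without modification.
        return path
    # Relative path: joined on "/" with no trimming or de-duplication of
    # separators on either side.
    return table_location + "/" + path


if __name__ == "__main__":
    rows = [
        ("s3://bucket/db/table", "data/0-0.parquet"),
        ("s3://bucket/db/table", "hdfs://wh/db/table/data/0-0.parquet"),
        ("s3://bucket/db/table/", "data/0-0.parquet"),   # trailing separator
        ("s3://bucket/db/table", "/data/0-0.parquet"),   # leading separator
    ]
    for location, path in rows:
        print(location, "+", path, "->", resolve_path(location, path))
```

The last two rows produce `s3://bucket/db/table//data/0-0.parquet`, matching the "Duplicate separator" rows of the table, which is why the recommended convention is a table location that does not end in a separator.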
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276784874 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: Review Comment: We don't want to do this (see other comments on this topic). We don't want to go back and define things that weren't defined for prior versions since it could introduce additional requirements on older versions. The prior spec only referred to "fully-qualified" and "URI with Scheme" for fields and we're not trying to rewrite those versions of the spec.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276758903 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. Review Comment: Updated wording. Technically, they're the same since a URI scheme is defined by the URI spec (it can't be anything else). However, I'll change to starts with for consistency. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
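The "starts with a URI scheme" wording settled on in this comment can be sketched in Python as an anchored match against the RFC 3986 section 3.1 scheme grammar. The function name `starts_with_scheme` is hypothetical and is not part of the spec or of any Iceberg library; the sketch only illustrates the wording.

```python
import re

# RFC 3986, section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# A path "starts with a URI scheme" when that grammar, followed by ":",
# matches at the very beginning of the string.
_SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def starts_with_scheme(path: str) -> bool:
    """Return True if `path` begins with an RFC 3986 scheme and a colon."""
    # re.match only matches at the start of the string, so a colon that
    # appears later in the path does not count.
    return _SCHEME_PREFIX.match(path) is not None


if __name__ == "__main__":
    for candidate in (
        "s3://bucket/db/table/data/0-0.parquet",
        "data/0-0.parquet",
        "/wh/db/table/data/0-0.parquet",
        "data/ts=12:00/0-0.parquet",  # colon later in the path, no scheme
    ):
        print(candidate, starts_with_scheme(candidate))
```

One caveat of this check: a relative path whose first segment itself contains a colon (for example `part:0.parquet`) would be classified as absolute, so writers following this sketch would need to avoid such names.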
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276747481
##
format/spec.md:
##
@@ -123,9 +131,15 @@ Tables do not require random-access writes. Once written, data and metadata file
Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files.
+### File Locations in Metadata
+
+All location fields in format versions 3 and prior contain fully-qualified paths.
Review Comment:
In this sentence 'versions' refers to multiple versions (3 and prior) so the plural form is correct.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276737770 ## format/spec.md: ## @@ -954,6 +1012,34 @@ Table metadata consists of the following fields: | _optional_ | _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). | ||| _required_ | **`next-row-id`** | A `long` higher than all assigned row IDs; the next snapshot’s `first-row-id`. See [Row Lineage](#row-lineage). | ||| _optional_ | **`encryption-keys`** | A list (optional) of [encryption keys](#encryption-keys) used for table encryption. | +=== "v4" +| v4 | Field | Description | +||-|-| +| _required_ | **`format-version`**| An integer version number for the format. Implementations must throw an exception if a table's version is higher than the supported version. | +| _required_ | **`table-uuid`**| A UUID that identifies the table, generated when the table is created. Implementations must throw an exception if a table's UUID does not match the expected UUID after refreshing metadata. | +| _optional_ | **`location`** | The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files. Must be an absolute path when present. See [Table Locations](#table-location-specification). | +| _required_ | **`last-sequence-number`** | The table's highest assigned sequence number, a monotonically increasing long that tracks the order of snapshots in a table. | +| _required_ | **`last-updated-ms`** | Timestamp in milliseconds from the unix epoch when the table was last updated. Each table metadata file should update this field just before writing. | +| _required_ | **`last-column-id`**| An integer; the highest assigned column ID for the table. This is used to ensure columns are always assigned an unused ID when evolving schemas. | +|| **`schema`**| The table’s current schema. 
(**Deprecated**: use `schemas` and `current-schema-id` instead) |
Review Comment:
These fields are actually from the original table carried over. I don't think we should update the wording here (it's not technically part of this spec change).
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
RussellSpitzer commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276588816
##
format/spec.md:
##
@@ -1647,6 +1733,30 @@ The binary single-value serialization can be used to store the lower and upper b
## Appendix E: Format version changes
+### Version 4
+
+Relative path support is added in v4.
+
+Reading v3 metadata for v4:
+
+* All location fields are fully-qualified paths and interpreted as absolute paths for v4
+* Any location field without a uri scheme prefix must prepend a scheme component consistent with v4 absolute paths
+
+Writing v4 metadata:
+
+* Table metadata JSON:
+* `location` is now optional and must be absolute when present
+* When not present, the table location must be managed externally and provided when loading the metadata
Review Comment:
should we just say catalog here? Or do we want to leave this open?
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
RussellSpitzer commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276582694
##
format/spec.md:
##
@@ -954,6 +1012,34 @@ Table metadata consists of the following fields:
| _optional_ | _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). |
| | | _required_ | **`next-row-id`** | A `long` higher than all assigned row IDs; the next snapshot’s `first-row-id`. See [Row Lineage](#row-lineage). |
| | | _optional_ | **`encryption-keys`** | A list (optional) of [encryption keys](#encryption-keys) used for table encryption. |
+=== "v4"
+| v4 | Field | Description |
+|---|---|---|
+| _required_ | **`format-version`**| An integer version number for the format. Implementations must throw an exception if a table's version is higher than the supported version. |
+| _required_ | **`table-uuid`**| A UUID that identifies the table, generated when the table is created. Implementations must throw an exception if a table's UUID does not match the expected UUID after refreshing metadata. |
+| _optional_ | **`location`** | The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files. Must be an absolute path when present. See [Table Locations](#table-location-specification). |
Review Comment:
When absent must be provided by the Catalog
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
RussellSpitzer commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276557392
##
format/spec.md:
##
@@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with
Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.
+
+ Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without modification.
+* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`.
+
+The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below).
+
+Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths.
+
+Examples of path resolution:
+
+| | Format Version | Table Location| File Path | Resolved Path | Description |
+|---|---|---|---|---|---|
+| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` |
+| Absolute Path | v4 | s3://bucket/db/table | hdfs://wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Absolute path is used |
+| Duplicate separator | v4 | s3://bucket/db/table/ | data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` |
+| Duplicate separator | v4 | s3://bucket/db/table | /data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` |
+| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used |
+| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Scheme is prepended for consistency |
+
+ Path Relativization
+
+Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files.
+
+* If an absolute path starts with the table location immediately followed by a separator character, the relative path is the remainder of the string after the separator character.
+* If an absolute path does not start with the table location immediately followed by the separator character, it is stored as an absolute path.
+
+ Table Location Specification
+
+When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should provide a table's location
Review Comment:
```suggestion
When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be maintained and provided by the catalog.
```
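The choice of base location described in this hunk and in the suggestion could be sketched as below. This is only an illustration under assumptions: `base_location`, `metadata_location`, `catalog_location`, and the `_SCHEME` regex are hypothetical names introduced here, not part of the PR or of any Iceberg API, and the error handling is one possible reading of "must be an absolute path when present".

```python
import re
from typing import Optional

# Hypothetical helper: RFC 3986 section 3.1 scheme followed by ":".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def base_location(metadata_location: Optional[str],
                  catalog_location: Optional[str]) -> str:
    """Pick the table's base location (illustrative sketch only)."""
    if metadata_location is not None:
        # When `location` is present in table metadata it is used directly,
        # and the hunk requires it to be an absolute path.
        if not _SCHEME.match(metadata_location):
            raise ValueError("`location` must be an absolute path when present")
        return metadata_location
    # When `location` is absent (v4 and later), something outside the
    # metadata file, such as a catalog, has to supply the location.
    if catalog_location is None:
        raise ValueError("table location must be provided when `location` is absent")
    return catalog_location
```

In this sketch the metadata value wins when both are available; the quoted text does not say how a conflict between the two should be handled, so that precedence is an assumption.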
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
RussellSpitzer commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276523748
##
format/spec.md:
##
@@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with
Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.
+
+ Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without modification.
+* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`.
+
+The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below).
Review Comment:
I'm not sure we need the examples for duplicate separator. I think that's pretty straightforward?
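For readers following the thread, the resolution rule quoted in this hunk can be sketched in a few lines of Python. This is a minimal illustration, not code from the PR or from any Iceberg implementation: `resolve` and `_SCHEME` are names assumed here, and the sample inputs are the rows of the example table in the diff.

```python
import re

# Assumed helper: an RFC 3986 section 3.1 scheme followed by ":".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def resolve(table_location: str, path: str) -> str:
    """Resolve a metadata path against the table location (sketch)."""
    if _SCHEME.match(path):
        # Absolute path: used as-is without modification.
        return path
    # Relative path: plain string join on "/", with no inspection of
    # separators already present on either side.
    return table_location + "/" + path


print(resolve("s3://bucket/db/table", "data/0-0.parquet"))
# s3://bucket/db/table/data/0-0.parquet
print(resolve("s3://bucket/db/table/", "data/0-0.parquet"))
# s3://bucket/db/table//data/0-0.parquet
```

The second call is the duplicate-separator case under discussion: because the join ignores existing separators, a trailing `/` on the table location yields `//` in the result.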
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
RussellSpitzer commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276514487
##
format/spec.md:
##
@@ -168,6 +185,46 @@ All columns must be written to data files even if they
introduce redundancy with
Writers are not allowed to commit files with a partition spec that contains a
field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one
of two types:
+
+* **Absolute path** -- A path string that includes a [URI
scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g.,
`s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without
modification.
+* **Relative path** -- A path string that does not include a URI scheme.
Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with
v4, path fields may contain either absolute or relative paths. [Relative
resolution within a
URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and
`..`) and other file system navigation conventions are not supported in
relative paths.
Review Comment:
We never prohibited this previously but it was basically FileIO dependent on
whether these characters would have any kind of actual meaning. We actually do
follow one set of rules here: we never make a // in our own path building
within the Java API. We have a bunch of "strip trailing /" code to prevent
someone using S3FileIO and HadoopFileIO from getting different results
("bar//baz" resolving to "bar/baz" in S3A and "bar//baz" in S3FileIO).
I agree with @rdblue here that it's probably a good note to have in the spec
that paths are treated as pure strings. No POSIX resolution should be assumed.
I'm also fine with adding a note about absolute paths as well since I don't
think we ever had this well defined but anyone using posix style things in
their paths is probably making a mistake...
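
The "pure strings" reading can be illustrated with the relativization rule from the same PR (strip the table location plus one separator, otherwise keep the absolute path). The sketch below is an assumption-laden illustration, not Iceberg code: `relativize` is a hypothetical name, and the comparison is a literal prefix match with no normalization of `//`, `.`, or `..`.

```python
def relativize(table_location: str, absolute_path: str) -> str:
    """Convert an absolute path to a relative one by literal prefix removal (sketch)."""
    prefix = table_location + "/"
    if absolute_path.startswith(prefix):
        # Remainder of the string after the separator character.
        return absolute_path[len(prefix):]
    # Not under the table location as a string: stored unchanged.
    return absolute_path


print(relativize("s3://bucket/db/table", "s3://bucket/db/table/data/0-0.parquet"))
# data/0-0.parquet
print(relativize("s3://bucket/db/table", "s3://bucket/db//table/data/0-0.parquet"))
# s3://bucket/db//table/data/0-0.parquet
```

The second path would point at the same object under a file system that collapses `//`, but as strings the prefix does not match, so the path stays absolute.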
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
dramaticlly commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276397036 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. + +Examples of path resolution: + +| | Format Version | Table Location| File Path | Resolved Path | Description | +|-||---|---||-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs://wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Absolute path is used | +| Duplicate separator | v4 | s3://bucket/db/table/ | data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | +| Duplicate separator | v4 | s3://bucket/db/table | /data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location immediately followed by a separator character, the relative path is the remainder of the string after the separator character. +* If an absolute path does not start with the table location immediately followed by the separator character, it is stored as an absolute path. 
Review Comment: nit: It might be helpful to explicitly highlight `stored as an absolute path without modification` ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: Review Comment: Nit: There are a few references to "fully qualified path" later in the context of v3 and prior, without it being explicitly defined. Since we're classifying paths into two types below, it might be worth briefly noting that fully qualifi
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
kevinjqliu commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276382032
##
format/spec.md:
##
@@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with
Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.
+
+ Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without modification.
Review Comment:
nit: this says "If the path contains a URI scheme" while L191 uses "starts with a URI scheme". i think we should probably align and use "starts with" here to be more explicit

L191:
> * **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
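The difference between "contains" and "starts with" is easy to see in code. The following is a small sketch, assuming the scheme grammar from RFC 3986 section 3.1; the function names `starts_with_scheme` and `contains_scheme_like_text` are made up for this illustration and do not come from the PR.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def starts_with_scheme(path: str) -> bool:
    # re.match only matches at the beginning of the string.
    return _SCHEME.match(path) is not None


def contains_scheme_like_text(path: str) -> bool:
    # re.search matches anywhere, which is the looser "contains" reading.
    return _SCHEME.search(path) is not None


for p in ["s3://bucket/db/table/data/0-0.parquet",
          "data/0-0.parquet",
          "data/s3:backup/0-0.parquet"]:
    print(p, starts_with_scheme(p), contains_scheme_like_text(p))
```

Only the third path separates the two readings: it has scheme-like text in the middle, so a naive "contains" check would treat it as absolute while "starts with" keeps it relative.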
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
zhjwpku commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3273615804
##
format/spec.md:
##
@@ -123,9 +131,15 @@ Tables do not require random-access writes. Once written, data and metadata file
Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files.
+### File Locations in Metadata
+
+All location fields in format versions 3 and prior contain fully-qualified paths.
+
+Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Relative locations are allowed in all metadata tracked location fields and are resolved against the table's base location. The table's location may be fixed in table metadata or inferred, but is intended to be managed and supplied by a catalog. Requirements for relativization and resolution are in [Relative Paths](#path-resolution)
Review Comment:
+1, that includes Path Relativization and Path Resolution. `Relative Paths` is a little confusing if it does not link to the section with the same name.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
zhjwpku commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3273577668
##
format/spec.md:
##
@@ -123,9 +131,15 @@ Tables do not require random-access writes. Once written, data and metadata file
Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files.
+### File Locations in Metadata
+
+All location fields in format versions 3 and prior contain fully-qualified paths.
Review Comment:
```suggestion
All location fields in format version 3 and prior contain fully-qualified paths.
```
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
zhjwpku commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3273556660 ## format/spec.md: ## @@ -954,6 +1012,34 @@ Table metadata consists of the following fields: | _optional_ | _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). | ||| _required_ | **`next-row-id`** | A `long` higher than all assigned row IDs; the next snapshot’s `first-row-id`. See [Row Lineage](#row-lineage). | ||| _optional_ | **`encryption-keys`** | A list (optional) of [encryption keys](#encryption-keys) used for table encryption. | +=== "v4" +| v4 | Field | Description | +||-|-| +| _required_ | **`format-version`**| An integer version number for the format. Implementations must throw an exception if a table's version is higher than the supported version. | +| _required_ | **`table-uuid`**| A UUID that identifies the table, generated when the table is created. Implementations must throw an exception if a table's UUID does not match the expected UUID after refreshing metadata. | +| _optional_ | **`location`** | The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files. Must be an absolute path when present. See [Table Locations](#table-location-specification). | +| _required_ | **`last-sequence-number`** | The table's highest assigned sequence number, a monotonically increasing long that tracks the order of snapshots in a table. | +| _required_ | **`last-updated-ms`** | Timestamp in milliseconds from the unix epoch when the table was last updated. Each table metadata file should update this field just before writing. | +| _required_ | **`last-column-id`**| An integer; the highest assigned column ID for the table. This is used to ensure columns are always assigned an unused ID when evolving schemas. | +|| **`schema`**| The table’s current schema. 
(**Deprecated**: use `schemas` and `current-schema-id` instead) |
Review Comment:
There is mixed usage of `table’s` and `table's`. Should we stick to one style?
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
manuzhang commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3273112562
##
format/spec.md:
##
@@ -123,9 +131,15 @@ Tables do not require random-access writes. Once written, data and metadata file
Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files.
+### File Locations in Metadata
+
+All location fields in format versions 3 and prior contain fully-qualified paths.
+
+Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Relative locations are allowed in all metadata tracked location fields and are resolved against the table's base location. The table's location may be fixed in table metadata or inferred, but is intended to be managed and supplied by a catalog. Requirements for relativization and resolution are in [Relative Paths](#path-resolution)
Review Comment:
do you want to link to `#paths-in-metadata`?
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
manuzhang commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3273039719
##
format/spec.md:
##
@@ -1647,6 +1733,30 @@ The binary single-value serialization can be used to store the lower and upper b
## Appendix E: Format version changes
+### Version 4
+
+Relative path support is added in v4.
+
+Reading v3 metadata for v4:
+
+* All location fields are fully-qualified paths and interpreted as absolute paths for v4
+* Any location field without a uri scheme prefix must prepend a scheme component consistent with v4 absolute paths
Review Comment:
nit: `URI` to be consistent
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
manuzhang commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3273002193 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. + +Examples of path resolution: + +| | Format Version | Table Location| File Path | Resolved Path | Description | +|-||---|---||-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs://wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Absolute path is used | +| Duplicate separator | v4 | s3://bucket/db/table/ | data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | +| Duplicate separator | v4 | s3://bucket/db/table | /data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location immediately followed by a separator character, the relative path is the remainder of the string after the separator character. +* If an absolute path does not start with the table location immediately followed by the separator character, it is stored as an absolute path. + + Table Location Specification + +When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. 
How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should provide a table's location Review Comment: nit: missing `.` at the end.
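The resolution rules quoted in the review above can be sketched in a few lines. This is a minimal illustration and not the Iceberg implementation: the names `is_absolute` and `resolve` are hypothetical, and the scheme pattern is an assumption taken from the RFC 3986 section 3.1 grammar.

```python
import re

# Assumption: "starts with a URI scheme" is read as the RFC 3986 section 3.1
# grammar, scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")


def is_absolute(path: str) -> bool:
    """Return True when the path string starts with a URI scheme."""
    return _SCHEME.match(path) is not None


def resolve(table_location: str, path: str) -> str:
    """Resolve a metadata path against the table location (hypothetical helper).

    Absolute paths are returned unchanged. Relative paths are joined to the
    table location with a single "/" and nothing is normalized or deduplicated.
    """
    if is_absolute(path):
        return path
    return table_location + "/" + path


# Rows from the quoted examples table:
resolve("s3://bucket/db/table", "data/0-0.parquet")   # "s3://bucket/db/table/data/0-0.parquet"
resolve("s3://bucket/db/table/", "data/0-0.parquet")  # "s3://bucket/db/table//data/0-0.parquet"
resolve("s3://bucket/db/table", "/data/0-0.parquet")  # "s3://bucket/db/table//data/0-0.parquet"
```

The two duplicate-separator rows fall out of the plain string join, which is why the quoted text recommends a table location that does not end in a separator.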
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
manuzhang commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3272982868 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). Review Comment: nit: `character. 
(See example below.)` or `character (see example below).`
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
manuzhang commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3272965517 ## format/spec.md: ## @@ -123,9 +131,15 @@ Tables do not require random-access writes. Once written, data and metadata file Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. +### File Locations in Metadata + +All location fields in format versions 3 and prior contain fully-qualified paths. + +Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Relative locations are allowed in all metadata tracked location fields and are resolved against the table's base location. The table's location may be fixed in table metadata or inferred, but is intended to be managed and supplied by a catalog. Requirements for relativization and resolution are in [Relative Paths](#path-resolution) Review Comment: nit: missing `.` at the end.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3270244746 ## format/spec.md: ## @@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. + +Reading v3 metadata for v4: + +* All location fields are fully-qualified paths and interpreted as absolute paths for v4 +* Any location field without a uri scheme prefix must prepend a scheme component consistent with v4 absolute paths + +Writing v4 metadata: + +* Table metadata JSON: +* `location` is now optional and must be absolute when present +* When not present, the table location must be managed externally and provided when loading the metadata +* Location fields in all metadata structures may contain relative paths +* Writers should produce relative paths by default for files that reside under the table location +* Absolute paths must be used for files that do not share a common prefix with the table location Review Comment: That would make the last bullet dependent upon the prior bullet. While that is technically correct if you read them all together, I think the current wording is clearer because each bullet stands as a complete statement on its own.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3269404884 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. + +Examples of path resolution: + +| | Format Version | Table Location| File Path | Resolved Path | Description | +|-||---|---||-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet| hdfs://wh/db/table/data/0-0.parquet| Absolute path is used | Review Comment: Nice catch, I think that was a typo. Will fix.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3269398684 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). 
Review Comment: No, it means that if there are other separators included in the base location or the relative part, those will not be removed or deduplicated.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
anoopj commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3269254911 ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). 
+ +Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. + +Examples of path resolution: + +| | Format Version | Table Location| File Path | Resolved Path | Description | +|-||---|---||-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet| hdfs://wh/db/table/data/0-0.parquet| Absolute path is used | +| Duplicate separator | v4 | s3://bucket/db/table/ | data/0-0.parquet | s3://bucket/db/table//data/0-0.parquet | Join results in duplicate `//` | Review Comment: Thanks for calling these out explicitly. ## format/spec.md: ## @@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. 
+ + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below). + +Paths in manifests prod
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3262543155 ## format/spec.md: ## @@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem * **Manifest** -- A file that lists data or delete files; a subset of a snapshot. * **Data file** -- A file that contains rows of a table. * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. +* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. +* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. Review Comment: fair enough
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3262540892 ## format/spec.md: ## @@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. + +Reading v3 metadata for v4: + +* All location fields are fully-qualified paths and interpreted as absolute paths for v4 +* Any location field without a uri scheme prefix must prepend a scheme component consistent with v4 absolute paths + +Writing v4 metadata: + +* Table metadata JSON: +* `location` is now optional and must be absolute when present +* When not present, the table location must be managed externally and provided when loading the metadata +* Location fields in all metadata structures may contain relative paths +* Writers should produce relative paths by default for files that reside under the table location +* Absolute paths must be used for files that do not share a common prefix with the table location Review Comment: > Absolute paths must be used for all other files. Line 1751 can be simplified to `for all other files`. We don't even need to mention `that don't share a common prefix ...`, which can lead to misinterpretation. But I will leave it up to you.
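The writer-side bullets quoted above, read together with the relativization rule in the spec text under review (strip the table location only when it is immediately followed by a separator), can be sketched as follows. The function name `relativize` is hypothetical, and this is an illustration under those stated rules, not the Iceberg writer code.

```python
def relativize(table_location: str, absolute_path: str) -> str:
    """Return the path to store in v4 metadata (hypothetical helper).

    If the absolute path starts with the table location immediately followed
    by "/", the remainder after that separator is stored as a relative path.
    Every other path is stored unchanged as an absolute path.
    """
    prefix = table_location + "/"
    if absolute_path.startswith(prefix):
        return absolute_path[len(prefix):]
    return absolute_path


# A file under the table location is stored as a relative path.
relativize("s3://bucket/db/table", "s3://bucket/db/table/data/0-0.parquet")   # "data/0-0.parquet"

# A sibling location shares a string prefix with the table location but is
# not under it, so the path stays absolute.
relativize("s3://bucket/db/table", "s3://bucket/db/table2/data/0-0.parquet")  # unchanged
```

The second call is one way the phrase "share a common prefix" could be misread: `s3://bucket/db/table2/...` begins with the same characters as the table location, yet the file does not reside under it.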
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3262251478 ## format/spec.md: ## @@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. Review Comment: I think we want to focus on the content and not the formatting right now. It's great to have this section here to note changes, but I expect it to be updated quite a bit from how we handled this for the last 2 versions.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3262242368 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | Review Comment: I thought that would end up producing `hdfs:/wh/db/table` and not `/wh/db/table`. It's been a while since I've had to worry about this though!
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3262236298 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | Review Comment: Yes, exactly. What we want is to make sure people understand that not removing the separator will cause a double separator. There is no normalization logic allowed or required to catch and prevent it.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3262210478 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. Review Comment: Yeah, I guess you're right. I thought sure it was changed. I don't like saying that these aren't supported in relative paths, specifically. If you want to leave the possibility open in v3 and earlier, we can do that. But I think we should not leave open the possibility that absolute paths can contain directory navigation or other resolution patterns. I would just say that absolute and relative paths do not support these things. How about a separate paragraph: > [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in absolute or relative paths. That way it is clear you can't use these conventions in either one. 
And making it a separate paragraph avoids implying that they are okay for v3 paths. We didn't specify they are not allowed, but we also chose not to implement any interpretation when issues came up, like whether `//` should be normalized or `./` should be removed.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3262166090 ## format/spec.md: ## @@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem * **Manifest** -- A file that lists data or delete files; a subset of a snapshot. * **Data file** -- A file that contains rows of a table. * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. +* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. +* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. Review Comment: I agree with Dan here. We don't want to introduce more definitions for old versions. We should carry them forward and let them be interpreted as they were before. The only place where this matters is the scheme, which was specifically called out as required. It was never optional, but we know that paths were sometimes created without it. The note about supplying a scheme for v3 and earlier isn't changing the requirement, it is stating that if you produce paths without a scheme when reading, it's going to break in v4 because we are more strict and would decide to handle paths without a scheme as relative. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3261949230 ## format/spec.md: ## @@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. + +Reading v3 metadata for v4: + +* All location fields are fully-qualified paths and interpreted as absolute paths for v4 +* Any location field without a uri scheme prefix must prepend a scheme component consistent with v4 absolute paths + +Writing v4 metadata: + +* Table metadata JSON: +* `location` is now optional and must be absolute when present +* When not present, the table location must be managed externally and provided when loading the metadata +* Location fields in all metadata structures may contain relative paths +* Writers should produce relative paths by default for files that reside under the table location +* Absolute paths must be used for files that do not share a common prefix with the table location Review Comment: This is just a summary of changes not the specification itself. While I think it made sense to add in the spec section, adding it here isn't really helpful and just makes it more complicated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3261937079 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored. +* If an absolute path does not start with the table location, it is stored as an absolute path. + + Table Location Specification + +When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should and provide a table's location Review Comment: No, there are other ways the location can be provided (e.g., a user can provide it directly when calling `register_table`). Additionally, there are engines that reference the metadata JSON path directly, and then the user or engine needs to provide the path, which is why we don't make it a `MUST` requirement and we don't fully tie this to a catalog.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3261924082 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. Review Comment: I cleaned this up a little. 
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3261893999 ## format/spec.md: ## @@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem * **Manifest** -- A file that lists data or delete files; a subset of a snapshot. * **Data file** -- A file that contains rows of a table. * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. +* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. +* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. Review Comment: The definition for a `file_path` is "Full URI for the file with FS scheme", which I believe was intended to be the same as "fully-qualified" (though it's unclear because it wasn't defined). I don't think we should go back and try to define terms we didn't define originally since it's ambiguous and clarifying at this point could cause more confusion. I think it's better to focus on defining what v4 requires moving forward. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3261862069 ## format/spec.md: ## @@ -1777,6 +1886,24 @@ Note that these requirements apply when writing data to a v2 table. Tables that This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification to help maintain a uniform experience. +### Path Construction + +Path construction is the process by which new file locations are created for output files referenced by metadata. While the specific construction logic is not strictly required by the spec, the following guidance is provided for reference implementations to encourage consistency. + +The table properties `write.metadata.path` and `write.data.path` control where metadata and data files are written relative to the table location. When not specified, these default to the values `metadata` and `data` respectively. + +For all metadata files: + +* If `write.metadata.path` is an absolute path, it is used directly as the base for new metadata files. +* If `write.metadata.path` is a relative path, the metadata base is the table location joined to the `write.metadata.path` value with a URI separator `/`. + +For data files: + +* If `write.data.path` is an absolute path, it is used directly as the base for new data files. +* If `write.data.path` is a relative path, the base is the table location joined to the `write.data.path` value with a URI separator `/`. + +When persisting paths into metadata, writers should relativize paths against the table location (see [Path Relativization](#path-relativization)). If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored. Review Comment: Fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3261859399 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. Review Comment: I don't agree with these comments and it's not consistent with what we've discussed. We're trying to treat the location fields as opaque as possible and still support the maximum number of scenarios. 
The `should` requirement here is intentional in that we still want a way to represent a `//` in a path since it is technically valid in some storage solutions. The "without consideration" is also intentional in that the strings should not be interpreted to try to normalize the path values.
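The resolution rule being defended here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reference implementation: `has_scheme` and `resolve` are hypothetical names, and the regular expression is one reading of the RFC 3986 section 3.1 scheme grammar (a letter, then letters, digits, `+`, `-`, or `.`, then `:`).

```python
import re

# Assumed scheme detection, following the RFC 3986 section 3.1 grammar.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def has_scheme(path: str) -> bool:
    """Return True when the string starts with a URI scheme."""
    return _SCHEME.match(path) is not None


def resolve(table_location: str, path: str) -> str:
    """Resolve a metadata path against the table location.

    Absolute paths are returned untouched. Relative paths are joined to the
    table location with a single '/', and neither side is inspected or
    normalized, so any extra separators are preserved.
    """
    if has_scheme(path):
        return path
    return table_location + "/" + path


print(resolve("s3://bucket/db/table", "data/0-0.parquet"))
print(resolve("s3://bucket/db/table", "hdfs:/wh/db/table/data/0-0.parquet"))
# A location ending in '/' is not "fixed"; the double separator is kept.
print(resolve("s3://bucket/db/table/", "data/0-0.parquet"))
```

Treating both strings as opaque keeps a deliberate `//` representable, at the cost of leaving accidental ones in place as well.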
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3252350735 ## format/spec.md: ## @@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem * **Manifest** -- A file that lists data or delete files; a subset of a snapshot. * **Data file** -- A file that contains rows of a table. * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. +* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. +* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. Review Comment: In the end, every comment reflects my opinion because I carefully reviewed, edited, or dropped the comments before publishing. I just checked the current table spec. There is only one mentioning of fully qualified for the data file fields, where the wording matching the definition of absolute path. ``` 143 referenced_data_file | string | Fully qualified location (URI with FS scheme) of a data file that all deletes reference ``` I was saying that it was not explicitly defined in the terms list. does it make sense to define the term of fully qualified path in the bullet list formally? or we can clarify what absolute path can behavior slightly differently in v3 and prior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3252350735 ## format/spec.md: ## @@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem * **Manifest** -- A file that lists data or delete files; a subset of a snapshot. * **Data file** -- A file that contains rows of a table. * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. +* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. +* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. Review Comment: In the end, every comment reflects my opinion because I carefully reviewed, edited, or dropped the comments before publishing. Checked the current spec. There is one mention of fully qualified for the data file fields, where the wording matching the definition of absolute path. ``` 143 referenced_data_file | string | Fully qualified location (URI with FS scheme) of a data file that all deletes reference ``` I was saying that it was not explicitly defined in the terms list. does it make sense to define the term of fully qualified path in the bullet list formally? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3252350735 ## format/spec.md: ## @@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem * **Manifest** -- A file that lists data or delete files; a subset of a snapshot. * **Data file** -- A file that contains rows of a table. * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. +* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. +* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. Review Comment: In the end, every comment is my opinion because I have reviewed or edited the comment before posting. Checked the current spec. There is one mention of fully qualified for the data file fields ``` 143 referenced_data_file | string | Fully qualified location (URI with FS scheme) of a data file that all deletes reference ``` I was saying that it was not explicitly defined in the terms list. does it make sense to define the term of fully qualified path in the terms formally. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3252360655 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored. +* If an absolute path does not start with the table location, it is stored as an absolute path. + + Table Location Specification + +When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should and provide a table's location Review Comment: if location is not in the table metadata, who else can provide the location outside the catalog? is it correct to say that catalog `must` (instead of `should`) provide the location? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3252350735 ## format/spec.md: ## @@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem * **Manifest** -- A file that lists data or delete files; a subset of a snapshot. * **Data file** -- A file that contains rows of a table. * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. +* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. +* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. Review Comment: In the end, every comment is my opinion because I have reviewed or edited the comment before posting. Checked the current spec. There is one mention of fully qualified for the data file fields ``` 143 referenced_data_file | string | Fully qualified location (URI with FS scheme) of a data file that all deletes reference ``` I was saying that it was not explicitly defined in the terms list. does it make sense to define the term of fully qualified path. In v3 and prior, it actually meant absolute path. we just happen to allow the path without scheme and prepend default scheme by the FileIO. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251502225 ## format/spec.md: ## @@ -1777,6 +1886,24 @@ Note that these requirements apply when writing data to a v2 table. Tables that This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification to help maintain a uniform experience. +### Path Construction + +Path construction is the process by which new file locations are created for output files referenced by metadata. While the specific construction logic is not strictly required by the spec, the following guidance is provided for reference implementations to encourage consistency. + +The table properties `write.metadata.path` and `write.data.path` control where metadata and data files are written relative to the table location. When not specified, these default to the values `metadata` and `data` respectively. Review Comment: I've removed the "relative to the table location" wording, as it's addressed in the bulleted references.
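As a rough illustration of the path construction guidance quoted in this thread, the following Python sketch derives the base locations for new metadata and data files. The helper names `base_location`, `metadata_base`, and `data_base` are hypothetical; the property names and the `metadata` and `data` defaults come from the quoted diff, and the scheme check is an assumed reading of RFC 3986 section 3.1.

```python
import re
from typing import Dict, Optional

# Assumed scheme detection, following the RFC 3986 section 3.1 grammar.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def base_location(table_location: str, configured: Optional[str], default: str) -> str:
    """Return the base under which new files are written."""
    value = configured if configured is not None else default
    if _SCHEME.match(value):
        # Absolute property value: used directly as the base.
        return value
    # Relative property value: joined to the table location with '/'.
    return table_location + "/" + value


def metadata_base(table_location: str, properties: Dict[str, str]) -> str:
    return base_location(table_location, properties.get("write.metadata.path"), "metadata")


def data_base(table_location: str, properties: Dict[str, str]) -> str:
    return base_location(table_location, properties.get("write.data.path"), "data")


location = "s3://bucket/db/table"
print(metadata_base(location, {}))
print(data_base(location, {"write.data.path": "s3://other-bucket/data"}))
```

With no properties set, both bases fall under the table location, which is the case where writers can store relative paths in metadata.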
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251496979 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored. +* If an absolute path does not start with the table location, it is stored as an absolute path. + + Table Location Specification + +When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should and provide a table's location Review Comment: This is also probably AI picking up prior revisions of the spec (it's using exact wording from earlier that was already rejected). I fixed the grammatical error, the other suggestions are the opposite of what the goal of this section is. We're not trying to say who provides the location, we also say the catalog should provide the location. 
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251486214 ## format/spec.md: ## @@ -134,8 +149,10 @@ Tables do not require rename, except for tables that use atomic rename to implem * **Manifest** -- A file that lists data or delete files; a subset of a snapshot. * **Data file** -- A file that contains rows of a table. * **Delete file** -- A file that encodes rows of a table that are deleted by position or data values. +* **Absolute path** -- A path string that includes a [URI](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) scheme and can be used directly. +* **Relative path** -- A path string without a URI scheme that must be [resolved](#path-resolution) against the table location. Review Comment: I disagree with this (I'm not sure if this was AI generated or your opinion). Versions prior to V4 were defined in the spec already as either "URI with scheme" or "fully-qualified". Those were the existing terms in the spec. I don't think we should go back to further define those terms as we may introduce new requirements on older versions. The new terms apply to V4 and the behaviors being introduced in this revision. You need to take into consideration backward compatibility, which means that we cannot apply option 2 as it would redefine prior versions of the spec. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251469213 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored. Review Comment: I've updated the wording per @rdblue recommendation and this should make this explicitly clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
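The relativization rule quoted in the hunk above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `relativize` is hypothetical, and the sketch assumes the table location does not end in a path separator (as the quoted text recommends for prefixes). Requiring the `/` separator to follow the prefix is also an assumption of this sketch; it keeps a sibling location such as `.../table2` from matching `.../table`.

```python
def relativize(path: str, table_location: str) -> str:
    """Return the string to store in metadata for `path` (hypothetical helper).

    Assumes `table_location` has no trailing "/".
    """
    prefix = table_location + "/"
    if path.startswith(prefix):
        # Drop the table location and the separator; store the relative remainder.
        return path[len(prefix):]
    # Not under the table location: store the absolute path unchanged.
    return path


print(relativize("s3://bucket/db/table/data/0-0.parquet", "s3://bucket/db/table"))
print(relativize("hdfs://nn/wh/other/0-0.parquet", "s3://bucket/db/table"))
```

The first call prints the relative portion `data/0-0.parquet`; the second path is outside the table location and is returned as-is.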
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251432273 ## format/spec.md: ## @@ -123,9 +131,16 @@ Tables do not require random-access writes. Once written, data and metadata file Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. +### File Locations in Metadata + +All location fields in format versions 3 and prior contain fully-qualified paths. + +Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Relative locations are allowed in all metadata tracked location fields and are resolved against the table's base location. The table's location may be fixed in table metadata or inferred, but is intended to be managed and supplied by a catalog. +Requirements for relativization and resolution are in [Relative Paths](#path-resolution) Review Comment: No, the wrap was unintentional. Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251235744
##
format/spec.md:
##
@@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with
Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.

Review Comment:
We never explicitly prohibited it and I don't feel like we should be adding requirements that weren't there previously. I didn't change the wording, so I'm not sure what you're referring to. I added the reference to the URI that describes these behaviors since other people were confused about what kind of path evaluations we were referring to.
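To make the distinction in the quoted hunk concrete, here is a small sketch of classifying a path string by the presence of a URI scheme, using the scheme grammar from RFC 3986 section 3.1. The names `is_absolute` and `join_to_location` are hypothetical, and the plain string join is an assumption drawn from the quoted spec text; the point of the second function is that `.` and `..` segments are carried through literally rather than resolved.

```python
import re

# RFC 3986, section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def is_absolute(path: str) -> bool:
    """True when the path string starts with a URI scheme (hypothetical helper)."""
    return _SCHEME.match(path) is not None


def join_to_location(table_location: str, relative_path: str) -> str:
    """Plain join on "/"; dot segments are NOT normalized (hypothetical helper)."""
    return table_location + "/" + relative_path


print(is_absolute("s3://bucket/db/table/data/0-0.parquet"))
print(is_absolute("data/0-0.parquet"))
print(join_to_location("s3://bucket/db/table", "../other/0-0.parquet"))
```

Because the join is purely textual, a relative path containing `..` yields a location with a literal `..` segment in it, which is the kind of result the RFC 3986 section 5.2 reference is meant to flag.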
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251198526
##
format/spec.md:
##
@@ -954,6 +1011,34 @@ Table metadata consists of the following fields:
| _optional_ | _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). |
| | | _required_ | **`next-row-id`** | A `long` higher than all assigned row IDs; the next snapshot’s `first-row-id`. See [Row Lineage](#row-lineage). |
| | | _optional_ | **`encryption-keys`** | A list (optional) of [encryption keys](#encryption-keys) used for table encryption. |
+=== "v4"

Review Comment:
The purpose of adding the tabs was to separate V4 as we go to a new structure. Why would we update the old table and add tabs with no additional tabs if we're just going to remove that and add the tab later?
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251187778
##
format/spec.md:
##
@@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with
Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.
+
+ Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without modification.
+* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`.
+
+Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters.
+
+Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths.
+
+Examples of path resolution:
+
+| | Format Version | Table Location | File Path | Resolved Path | Description |
+|---|---|---|---|---|---|
+| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` |
+| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used |

Review Comment:
I'll add some, but we're not explicitly prohibiting it; rather, it would produce an undesirable result.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251182951 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+
+Examples of path resolution:
+
+| | Format Version | Table Location | File Path | Resolved Path | Description |
+|---|---|---|---|---|---|
+| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` |
+| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used |
+| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used |
+| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency |

Review Comment:
No, `file:` is not more common. The typical configuration for hdfs clusters will include the default fs config in the `core-site.xml`. I don't know of any practical situations where people are migrating from local file systems.

The example you give would typically be encoded in the cluster configuration like:
```
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
  <description>The default filesystem URI</description>
</property>
```
References to files would just be encoded as: `/wh/db/table`

The default fs will be handled by the hdfs FileSystem implementation. The real use case we're covering is these types of deployments in hdfs.
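The "Missing scheme" row discussed in the comment above can be sketched as follows. This is only an illustration under stated assumptions: `qualify_legacy_path` is a hypothetical name, the default filesystem value is passed in by the caller rather than read from `core-site.xml`, and in a real deployment the hdfs FileSystem implementation handles the default fs itself.

```python
from urllib.parse import urlsplit


def qualify_legacy_path(path: str, default_fs: str) -> str:
    """Prepend the default filesystem scheme to a pre-v4 path that lacks one.

    Hypothetical helper; `default_fs` stands in for the fs.defaultFS value.
    """
    if urlsplit(path).scheme:
        # Already fully-qualified: use as-is.
        return path
    scheme = urlsplit(default_fs).scheme
    return f"{scheme}:{path}"


print(qualify_legacy_path("/wh/db/table/data/0-0.parquet", "hdfs://namenode-host:8020"))
```

With the default fs from the comment, the scheme-less path from the table row comes back as `hdfs:/wh/db/table/data/0-0.parquet`, matching the "Resolved Path" column; only the scheme is prepended, and the authority from `fs.defaultFS` is not added in this sketch.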
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251066303
##
format/spec.md:
##
@@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to store the lower and upper b
## Appendix E: Format version changes
+### Version 4
+
+Relative path support is added in v4.

Review Comment:
I feel we probably need to introduce sub-headers for Version 4, as we have so many changes scoped for V4. If each feature has multiple paragraphs, it is hard to track.

The alternative is that each feature only has bullet points, e.g. the row lineage part of the V3 section below has a large bullet list.

I slightly favor the former.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251040825 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored. +* If an absolute path does not start with the table location, it is stored as an absolute path. + + Table Location Specification + +When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs should and provide a table's location Review Comment: Two issues in this paragraph: 1. **Grammatical error**: "catalogs should and provide a table's location" is missing a verb (e.g. "manage and provide"), and the sentence has no closing period. 2. **Implicit actor**: "the table location must be provided" doesn't say by whom, then the next clause silently assumes the catalog. 
Suggested rewrite: > When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the catalog must provide the table location. How the catalog persists or determines a table's location is outside the scope of this spec. Changes: - Names the catalog as the actor in the second sentence. - Folds "catalogs should and provide a table's location" into the new third sentence, replacing the broken clause and adding the missing period. - "is not a table-level concern" → "is outside the scope of this spec" — clearer about what the spec is leaving unspec
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
stevenzwu commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250869815
##
format/spec.md:
##
@@ -168,6 +185,46 @@ All columns must be written to data files even if they
introduce redundancy with
Writers are not allowed to commit files with a partition spec that contains a
field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one
of two types:
+
+* **Absolute path** -- A path string that includes a [URI
scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g.,
`s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without
modification.
+* **Relative path** -- A path string that does not include a URI scheme.
Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with
v4, path fields may contain either absolute or relative paths. [Relative
resolution within a
URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and
`..`) and other file system navigation conventions are not supported in
relative paths.
+
+ Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative
path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without
modification.
+* If the path does not contain a URI scheme, the resolved path is the table
location followed by the relative path joined by the URI separator character
`/`.
+
+Paths used as prefixes should not end in a path separator. The relative
portion is joined to the prefix without consideration of any additional
separator characters.
+
+Any path from a manifest produced prior to v4 is a fully-qualified path and
must be produced with a URI scheme if the scheme was omitted to be consistent
with V4 paths.
+
+Examples of path resolution:
+
+| | Format Version | Table Location | File Path
| Resolved Path | Description
|
+|-||--|--|---|-|
+| Relative Path | v4 | s3://bucket/db/table |
data/0-0.parquet |
s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` |
+| Absolute Path | v4 | s3://bucket/db/table |
hdfs:/wh/db/table/data/0-0.parquet |
hdfs://wh/db/table/data/0-0.parquet | Absolute path is used |
+| Fully-qualified | v3 and earlier | s3://bucket/db/table |
s3://bucket/db/table/data/0-0.parquet |
s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used |
+| Missing scheme | v3 and earlier | /wh/db/table |
/wh/db/table/data/0-0.parquet|
hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency
|
+
+ Path Relativization
+
+Path relativization is the process of converting an absolute path to a
relative path by removing the table location prefix. This is used when
persisting paths to metadata files.
+
+* If an absolute path starts with the table location, the table location
prefix should be removed along with the separator character and the remaining
relative portion stored.
Review Comment:
The new wording — "removed along with the separator character" — is the
closest the spec gets to closing the prefix-collision case discussed in
apache/iceberg#16174
(https://github.com/apache/iceberg/pull/16174#discussion_r3228742112), but it's
only implicit. This text can be read in two different ways:
- **Strict**: the rule applies only when the prefix is followed by the
separator. If the next character isn't `/`, the prefix-removal-with-separator
can't be performed, so the absolute path is stored.
- **Lax**: the rule says "strip the prefix, then strip the separator if
present." `relativize("s3://bucket/table", "s3://bucket/table_v2/file")` would
still strip the prefix and produce `_v2/file`.
The lax reading doesn't close the prefix-collision case discussed in
apache/iceberg#16174
(https://github.com/apache/iceberg/pull/16174#discussion_r3228742112). Worth
pinning this down explicitly. Suggested wording:
> An absolute path is considered under the table location if and only if it
starts with the table location followed by the URI separator `/`. In that case,
the table location and the separator are removed, and the remaining portion is
stored as a relative path. Otherwise, the absolute path is stored.
Also worth adding one more row to the resolution example table (around line
212) that pins this case — e.g. a sibling file at
`s3://bucket/db/table_v2/file.parquet` against `s3://bucket/db/table` showing
the absolute path is stored, not relativized.
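To make the strict reading concrete, here is a minimal Python sketch. The `relativize` function is a hypothetical helper written only for illustration; it is not the reference implementation's API, and it assumes the table location is stored without a trailing separator.

```python
def relativize(table_location: str, absolute_path: str) -> str:
    """Strict reading: relativize only when the table location is
    immediately followed by the URI separator '/'.

    Hypothetical helper for illustration; not an Iceberg API.
    """
    prefix = table_location + "/"
    if absolute_path.startswith(prefix):
        # Table location and separator are removed; the remainder is stored.
        return absolute_path[len(prefix):]
    # Not under the table location: the absolute path is stored unchanged.
    return absolute_path


# A file under the table location is relativized.
print(relativize("s3://bucket/table", "s3://bucket/table/data/file"))
# A sibling location that merely shares the prefix is left absolute.
print(relativize("s3://bucket/table", "s3://bucket/table_v2/file"))
```

Under the lax reading the second call would yield `_v2/file`, which would later resolve to a different object than the one originally referenced.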
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250959736 ## format/spec.md: ## @@ -1777,6 +1886,24 @@ Note that these requirements apply when writing data to a v2 table. Tables that This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification to help maintain a uniform experience. +### Path Construction + +Path construction is the process by which new file locations are created for output files referenced by metadata. While the specific construction logic is not strictly required by the spec, the following guidance is provided for reference implementations to encourage consistency. + +The table properties `write.metadata.path` and `write.data.path` control where metadata and data files are written relative to the table location. When not specified, these default to the values `metadata` and `data` respectively. + +For all metadata files: + +* If `write.metadata.path` is an absolute path, it is used directly as the base for new metadata files. +* If `write.metadata.path` is a relative path, the metadata base is the table location joined to the `write.metadata.path` value with a URI separator `/`. + +For data files: + +* If `write.data.path` is an absolute path, it is used directly as the base for new data files. +* If `write.data.path` is a relative path, the base is the table location joined to the `write.data.path` value with a URI separator `/`. + +When persisting paths into metadata, writers should relativize paths against the table location (see [Path Relativization](#path-relativization)). If a file's absolute path shares a common prefix with the table location, the relative portion should be stored. Otherwise, the absolute path should be stored. Review Comment: If allowed by the table version? -- This is an automated message from the Apache Git Service. 
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250951393 ## format/spec.md: ## @@ -954,6 +1011,34 @@ Table metadata consists of the following fields: | _optional_ | _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). | ||| _required_ | **`next-row-id`** | A `long` higher than all assigned row IDs; the next snapshot’s `first-row-id`. See [Row Lineage](#row-lineage). | ||| _optional_ | **`encryption-keys`** | A list (optional) of [encryption keys](#encryption-keys) used for table encryption. | +=== "v4" Review Comment: I didn't expect this to add a new table for v4. Is this the same as the old table, except with a v4 requirement column? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250955046 ## format/spec.md: ## @@ -954,6 +1011,34 @@ Table metadata consists of the following fields: | _optional_ | _optional_ | _optional_ | **`partition-statistics`** | A list (optional) of [partition statistics](#partition-statistics). | ||| _required_ | **`next-row-id`** | A `long` higher than all assigned row IDs; the next snapshot’s `first-row-id`. See [Row Lineage](#row-lineage). | ||| _optional_ | **`encryption-keys`** | A list (optional) of [encryption keys](#encryption-keys) used for table encryption. | +=== "v4" +| v4 | Field | Description | +||-|-| +| _required_ | **`format-version`**| An integer version number for the format. Implementations must throw an exception if a table's version is higher than the supported version. | +| _required_ | **`table-uuid`**| A UUID that identifies the table, generated when the table is created. Implementations must throw an exception if a table's UUID does not match the expected UUID after refreshing metadata. | +| _optional_ | **`location`** | The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files. Must be an absolute path when present. See [Table Locations](#table-location-specification). | Review Comment: I'm assuming that this is the only line changed. Seems good to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250945058 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored. Review Comment: I think we have to be more strict. If a path starts with the table location immediately followed by a separator character, the relative path is the remainder of the string after the separator character. I want to avoid ambiguity in the case of a table location that is a prefix to the middle of a name, like `s3://bucket/db/tab` and `s3://bucket/db/table`. If the separator character doesn't immediately follow the table location, then relativization can't happen.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250946634 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. + +* If an absolute path starts with the table location, the table location prefix should be removed along with the separator character and the remaining relative portion stored. +* If an absolute path does not start with the table location, it is stored as an absolute path. Review Comment: Table location immediately followed by the separator character -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250929752 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | +| Fully-qualified | v3 and earlier | s3://bucket/db/table | s3://bucket/db/table/data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Fully-qualified path is used | +| Missing scheme | v3 and earlier | /wh/db/table | /wh/db/table/data/0-0.parquet| hdfs:/wh/db/table/data/0-0.parquet| Scheme is prepended for consistency | Review Comment: Should this use `file:` or `hdfs:`? I'd lean toward `file:` because I think that was more common. But I could be wrong. I just haven't seen many paths that are `//namenode:8020/wh/db/table`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250932463 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. + +Any path from a manifest produced prior to v4 is a fully-qualified path and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths. 
+ +Examples of path resolution: + +| | Format Version | Table Location | File Path | Resolved Path | Description | +|-||--|--|---|-| +| Relative Path | v4 | s3://bucket/db/table | data/0-0.parquet | s3://bucket/db/table/data/0-0.parquet | Path parts are joined on `/` | +| Absolute Path | v4 | s3://bucket/db/table | hdfs:/wh/db/table/data/0-0.parquet | hdfs://wh/db/table/data/0-0.parquet | Absolute path is used | Review Comment: Should we add examples to show that trailing `/` and starting `/` cause separator duplication? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250919301 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`. + +Paths used as prefixes should not end in a path separator. The relative portion is joined to the prefix without consideration of any additional separator characters. Review Comment: I would use primarily the second sentence because it is clearly defining what happens. 
The first sentence is more of a recommendation to provide context: > The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below).
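The join rule discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `resolve` and `_SCHEME` are hypothetical names, the scheme check follows the RFC 3986 section 3.1 grammar, and nothing here is taken from the reference implementation.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ),
# followed by ":". re.match anchors at the start of the string, so a path
# only counts as absolute when it *starts with* a scheme.
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def resolve(table_location: str, path: str) -> str:
    """Hypothetical helper for illustration; not an Iceberg API."""
    if _SCHEME.match(path):
        # Absolute path: used as-is without modification.
        return path
    # Relative path: joined on '/' with no separator normalization, and
    # '.' / '..' are treated as opaque characters.
    return table_location + "/" + path


print(resolve("s3://bucket/db/table", "data/0-0.parquet"))
# A trailing '/' on the table location or a leading '/' on the relative
# path is not stripped, so the separator is duplicated.
print(resolve("s3://bucket/db/table/", "data/0-0.parquet"))
print(resolve("s3://bucket/db/table", "/data/0-0.parquet"))
```

The last two calls both produce `s3://bucket/db/table//data/0-0.parquet`, which is why the recommended convention is a table location without a trailing separator.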
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250904676 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths. Review Comment: This seems to imply that resolution was previously allowed, which isn't the case. I like the older wording that things like `.` and `..` have no special meaning. I would bring back the older version and then add the link to the RFC as a clarification: The relative resolution components defined by the RFC have no special meaning and are opaque to Iceberg. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250893047 ## format/spec.md: ## @@ -168,6 +185,46 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata location fields are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. Review Comment: Nit: "start with" instead of "include", although I think it is likely clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3250884836 ## format/spec.md: ## @@ -123,9 +131,16 @@ Tables do not require random-access writes. Once written, data and metadata file Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. +### File Locations in Metadata + +All location fields in format versions 3 and prior contain fully-qualified paths. + +Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Relative locations are allowed in all metadata tracked location fields and are resolved against the table's base location. The table's location may be fixed in table metadata or inferred, but is intended to be managed and supplied by a catalog. +Requirements for relativization and resolution are in [Relative Paths](#path-resolution) Review Comment: If you intended for this to be a separate paragraph, I think you have to add an empty newline. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3236839950 ## format/spec.md: ## @@ -168,6 +185,35 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path.
Review Comment:
I feel like this line and the line below clearly state how this is handled. I'm not sure how much we want the spec to focus on how it can be done wrong, and the reference implementation does a lot of trailing slash removal for table locations. Anyone who reads the next line can see that this would be wrong:
```
Paths used as prefixes must not end in a path separator . . .
```
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3236842140 ## format/spec.md: ## @@ -168,6 +185,35 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path.
Review Comment:
I also added a chart of examples below.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3236800460 ## format/spec.md: ## @@ -168,6 +188,35 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: Review Comment: I'm generally referring to paths as the content stored within metadata location fields. This allows us to continue to use standard conventions like "absolute paths" or "relative path", but the fields are location fields. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
anoopj commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3220842567
##
format/spec.md:
##
@@ -168,6 +185,35 @@ All columns must be written to data files even if they
introduce redundancy with
Writers are not allowed to commit files with a partition spec that contains a
field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata files are classified as one of two
types:
+
+* **Absolute path** -- A path string that includes a [URI
scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g.,
`s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without
modification.
+* **Relative path** -- A path string that does not include a URI scheme.
Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain absolute paths. Starting with v4,
path fields may contain either absolute or relative paths. Directory navigation
symbols (`.` and `..`) and other file system conventions are not supported in
relative paths.
+
+ Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative
path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without
modification.
+* If the path does not contain a URI scheme, the resolved path is the table
location followed by the relative path.
Review Comment:
The line clearly defines resolution as string concatenation, not RFC 3986
reference resolution. The RFC has defined the reference resolution algorithm
[here](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2.2).
This sentence is easy to miss and an implementer could reasonably reach for
their language's URL resolver (like Rust's `Url::join()`) and get wrong results
for `/` prefixed paths.
Specifically Rust `Url::join("s3://bucket/table/", "/metadata/foo.parquet")`
hits the `if (R.path starts-with "/") then[...]` branch in the RFC and
incorrectly produces `s3://bucket/metadata/foo.parquet`.
Should we add a warning somewhere in the Path Resolution section?
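The difference can be reproduced with the JDK's `java.net.URI`, whose `resolve` method performs RFC-style reference resolution (the JDK documents it against RFC 2396, but the absolute-path branch behaves the same way as RFC 3986 section 5.2.2). This is a minimal sketch, not code from the PR: the class and method names are hypothetical, and it assumes the table location is stored without a trailing separator, as the quoted spec text requires.

```java
import java.net.URI;

class ResolveVsConcat {
    // Reference resolution as performed by a general-purpose URI library.
    static String rfcResolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    // Resolution as the quoted spec text describes it:
    // the table location followed by the relative path, nothing else.
    static String concatResolve(String tableLocation, String relativePath) {
        return tableLocation + relativePath;
    }

    public static void main(String[] args) {
        // The leading "/" replaces the whole base path, so "table" is lost.
        System.out.println(rfcResolve("s3://bucket/table/", "/metadata/foo.parquet"));
        // prints s3://bucket/metadata/foo.parquet

        // Plain concatenation keeps the table location intact.
        System.out.println(concatResolve("s3://bucket/table", "/metadata/foo.parquet"));
        // prints s3://bucket/table/metadata/foo.parquet
    }
}
```

Any resolver that follows the section 5.2.2 algorithm would be expected to take the same branch, which is the case a warning in the Path Resolution section would cover.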
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3175015502 ## format/spec.md: ## @@ -123,9 +128,22 @@ Tables do not require random-access writes. Once written, data and metadata file Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. +### File Locations in Metadata + +Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Key changes include: + +* Support for relative locations in all metadata tracked path fields, resolved against the table's base location +* The table `location` field becomes optional, allowing the table location to be: + * Provided by an owning catalog + * Inferred from the metadata file location or storage layout + * Supplied directly where necessary +* Formal definitions for path types, path resolution, and path relativization + +The full set of changes are listed in [Appendix E](#version-4). + ## Specification - Terms +### Terms Review Comment: This actually fixes a minor rendering issue with the table of contents, but doesn't affect the indentation or structure. Because it fixes the rendering for the next few sections (which is material to this PR) I'd just include it here. If you really think there's substance to the change, I can split it out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3175015502 ## format/spec.md: ## @@ -123,9 +128,22 @@ Tables do not require random-access writes. Once written, data and metadata file Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. +### File Locations in Metadata + +Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Key changes include: + +* Support for relative locations in all metadata tracked path fields, resolved against the table's base location +* The table `location` field becomes optional, allowing the table location to be: + * Provided by an owning catalog + * Inferred from the metadata file location or storage layout + * Supplied directly where necessary +* Formal definitions for path types, path resolution, and path relativization + +The full set of changes are listed in [Appendix E](#version-4). + ## Specification - Terms +### Terms Review Comment: This actually fixes a minor rendering issue with the table of contents, but doesn't affect the indentation or structure. It's immaterial, so I'd just include it here. If you really think there's substance to the change, I can split it out. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3175006640 ## format/spec.md: ## @@ -1637,6 +1686,30 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. + +Reading v3 metadata for v4: + +* All location fields are treated as absolute paths +* Any location field without a uri scheme prefix must prepend a scheme component consistent with v4 Review Comment: I've updated the description in the spec to clarify that v3 paths are 'fully-qualified' which is the terminology we had for v3- and added minor clarification. We never fully defined 'fully-qualified' but that's part of what we're trying to address here, so I wouldn't lean too hard into trying to now go back and rewrite the intent behind 'fully-qualified' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3174954855 ## format/spec.md: ## @@ -921,33 +970,33 @@ The atomic operation used to commit metadata depends on how tables are tracked a Table metadata consists of the following fields: -| v1 | v2 | v3 | Field | Description | -| -- | -- ||-|--| -| _required_ | _required_ | _required_ | **`format-version`**| An integer version number for the format. Implementations must throw an exception if a table's version is higher than the supported version. | -| _optional_ | _required_ | _required_ | **`table-uuid`**| A UUID that identifies the table, generated when the table is created. Implementations must throw an exception if a table's UUID does not match the expected UUID after refreshing metadata. | -| _required_ | _required_ | _required_ | **`location`** | The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files. | -|| _required_ | _required_ | **`last-sequence-number`** | The table's highest assigned sequence number, a monotonically increasing long that tracks the order of snapshots in a table. | -| _required_ | _required_ | _required_ | **`last-updated-ms`** | Timestamp in milliseconds from the unix epoch when the table was last updated. Each table metadata file should update this field just before writing. | -| _required_ | _required_ | _required_ | **`last-column-id`**| An integer; the highest assigned column ID for the table. This is used to ensure columns are always assigned an unused ID when evolving schemas. | -| _required_ ||| **`schema`**| The table’s current schema. (**Deprecated**: use `schemas` and `current-schema-id` instead) | -| _optional_ | _required_ | _required_ | **`schemas`** | A list of schemas, stored as objects with `schema-id`. | -| _optional_ | _required_ | _required_ | **`current-schema-id`** | ID of the table's cur
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3174557454 ## format/spec.md: ## @@ -168,6 +188,35 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths.
Review Comment:
I will update this to refer to paths in v3 and earlier as 'fully-qualified', but we were not specific about that in the spec, with the exception of the description for the `referenced_data_file` field. Paths that start with `/` like your example may exist for local paths, but they are not limited to local paths and are very common for HDFS. The `core-site/hdfs-site` configuration typically sets a default file system that includes the scheme and namenode address (e.g. `fs.defaultFS=hdfs://namenode-host:8020`). All root-based references that start with `/` are then resolved against the default fs value. This means that there are many scenarios with HDFS where you will have these types of paths in existing metadata. Moving forward to v4, we would require these to be canonicalized to `hdfs://...` (the default namenode can still be omitted so as not to require encoding it in the path). Cloud providers do not have this issue unless similarly configured for hdfs, but that's not typical.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3174519925 ## format/spec.md: ## @@ -123,9 +128,22 @@ Tables do not require random-access writes. Once written, data and metadata file Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. +### File Locations in Metadata + +Version 4 of the Iceberg spec adds support for relative locations in metadata, enabling tables to be relocated without rewriting metadata files. Key changes include:
Review Comment:
This is going to get a little pedantic, but I think that's fair to include and have the following comment for reference. The issue is that there is disagreement about what 'absolute' and 'fully-qualified' actually mean. In many contexts, they're considered equivalent. However, in other contexts, 'absolute' would not require a scheme, while 'fully-qualified' does require a scheme. At the same time, though, 'fully-qualified' also allows for relative references. I think we want to define 'absolute' as requiring a scheme without relative or special references (as we have done here). That allows us to at least make the argument that 'fully-qualified' is what the spec previously expected (though it was only explicitly called out for the `referenced_data_file` field). We were not explicit in the spec about the contents of other fields.
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
danielcweeks commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3174493626 ## format/spec.md: ## @@ -168,6 +188,35 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
Review Comment:
I agree that we should just refer to the RFC. Implementations can make reasonable optimizations. The intent here is to be clear about what the expected content is, not to force a full and rigorous validation anywhere a path is seen. In practice, we haven't really had any problems, but if there is a dispute about whether a representation is valid, we can always refer back to the RFC for clarification.
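For reference, the scheme grammar in RFC 3986 section 3.1 (a letter followed by letters, digits, `+`, `-`, or `.`, then `:`) is small enough to check with a prefix match. The sketch below is illustrative only: `SchemeCheck` and `isAbsolute` are hypothetical names, and it assumes the test is "starts with a scheme" rather than "contains a colon somewhere", in line with the wording nit raised elsewhere in this thread.

```java
import java.util.regex.Pattern;

class SchemeCheck {
    // RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    // The pattern is anchored so that only a scheme at the very start counts.
    private static final Pattern SCHEME_PREFIX =
        Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:");

    // True when the path string starts with a URI scheme (absolute path),
    // false otherwise (relative path).
    static boolean isAbsolute(String path) {
        return SCHEME_PREFIX.matcher(path).find();
    }

    public static void main(String[] args) {
        System.out.println(isAbsolute("s3://bucket/table/data/f.parquet")); // true
        System.out.println(isAbsolute("/data/f.parquet"));                  // false
        // A colon later in the string does not make the path absolute.
        System.out.println(isAbsolute("data/part:1.parquet"));              // false
    }
}
```

A cheap check like this is one of the reasonable optimizations an implementation might make; disputed cases would still be settled by the RFC itself.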
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3157884042 ## format/spec.md: ## @@ -921,33 +970,33 @@ The atomic operation used to commit metadata depends on how tables are tracked a Table metadata consists of the following fields: -| v1 | v2 | v3 | Field | Description | -| -- | -- ||-|--| -| _required_ | _required_ | _required_ | **`format-version`**| An integer version number for the format. Implementations must throw an exception if a table's version is higher than the supported version. | -| _optional_ | _required_ | _required_ | **`table-uuid`**| A UUID that identifies the table, generated when the table is created. Implementations must throw an exception if a table's UUID does not match the expected UUID after refreshing metadata. | -| _required_ | _required_ | _required_ | **`location`** | The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files. | -|| _required_ | _required_ | **`last-sequence-number`** | The table's highest assigned sequence number, a monotonically increasing long that tracks the order of snapshots in a table. | -| _required_ | _required_ | _required_ | **`last-updated-ms`** | Timestamp in milliseconds from the unix epoch when the table was last updated. Each table metadata file should update this field just before writing. | -| _required_ | _required_ | _required_ | **`last-column-id`**| An integer; the highest assigned column ID for the table. This is used to ensure columns are always assigned an unused ID when evolving schemas. | -| _required_ ||| **`schema`**| The table’s current schema. (**Deprecated**: use `schemas` and `current-schema-id` instead) | -| _optional_ | _required_ | _required_ | **`schemas`** | A list of schemas, stored as objects with `schema-id`. | -| _optional_ | _required_ | _required_ | **`current-schema-id`** | ID of the table's current s
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3157880077 ## format/spec.md: ## @@ -168,6 +188,35 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location. If a path is absolute, it is used as-is. If a path is relative, it is concatenated with the table location to produce an absolute path: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path. + +Paths used as prefixes must not end in a path separator. The relative portion is appended to the prefix without introduction of any additional separator characters. + + Path Relativization + +Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files. 
+ +* If an absolute path starts with the table location, the table location prefix should be removed and the remaining relative portion stored. +* If an absolute path does not start with the table location, it is stored as an absolute path. + + Table Location Specification + +When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted/determined when not specified in metadata is outside the scope of the spec and is the responsibility of catalogs to track and provide. Review Comment: ```suggestion When the `location` field is present in table metadata, it is used directly as the table's base location. When the `location` field is not present (v4 and later), the table location must be provided. How the table location is persisted or determined when not specified in metadata is not a table-level concern; catalogs are intended to track and provide a table's location. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
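The Path Relativization rules quoted in the diff above amount to a prefix strip. A minimal sketch follows, assuming plain string comparison against a table location that has no trailing separator; `PathRelativizer` and `relativize` are hypothetical names, not part of the PR or the reference implementation.

```java
class PathRelativizer {
    // Converts an absolute path to the form stored in metadata:
    // the table location prefix is removed when present, otherwise
    // the absolute path is stored unchanged.
    static String relativize(String tableLocation, String absolutePath) {
        if (absolutePath.startsWith(tableLocation)) {
            return absolutePath.substring(tableLocation.length());
        }
        return absolutePath;
    }

    public static void main(String[] args) {
        String location = "s3://bucket/table";
        System.out.println(relativize(location, "s3://bucket/table/data/f.parquet"));
        // prints /data/f.parquet
        System.out.println(relativize(location, "s3://other/archive/f.parquet"));
        // prints s3://other/archive/f.parquet
    }
}
```

Because resolution is described as concatenation, appending the stored value to the same table location reproduces the original absolute path.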
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3157873755 ## format/spec.md: ## @@ -168,6 +188,35 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location. If a path is absolute, it is used as-is. If a path is relative, it is concatenated with the table location to produce an absolute path: + +* If the path contains a URI scheme, it is absolute and is used without modification. +* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path. + +Paths used as prefixes must not end in a path separator. The relative portion is appended to the prefix without introduction of any additional separator characters. Review Comment: +1 for an example. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
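One possible shape for such an example, written as a small resolver. This is a sketch under stated assumptions: the names `PathResolver` and `resolve` are hypothetical, `/` is assumed to be the path separator, and rejecting a table location with a trailing separator is a choice made here for illustration (an implementation could instead normalize the location before use).

```java
import java.util.regex.Pattern;

class PathResolver {
    // A URI scheme at the start of the string marks the path as absolute.
    private static final Pattern SCHEME =
        Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:");

    static String resolve(String tableLocation, String path) {
        if (SCHEME.matcher(path).find()) {
            return path; // absolute: used without modification
        }
        if (tableLocation.endsWith("/")) {
            // Paths used as prefixes must not end in a path separator.
            throw new IllegalArgumentException(
                "table location must not end in a path separator: " + tableLocation);
        }
        // The relative portion is appended with no additional separator.
        return tableLocation + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("s3://bucket/table", "/data/f.parquet"));
        // prints s3://bucket/table/data/f.parquet
        System.out.println(resolve("s3://bucket/table", "gs://other/f.parquet"));
        // prints gs://other/f.parquet
    }
}
```

Under these rules the separator between the table location and the file name comes from the relative path itself, which is why the stored relative value begins with `/` in this sketch.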
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3157872431 ## format/spec.md: ## @@ -168,6 +188,35 @@ All columns must be written to data files even if they introduce redundancy with Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths. + + Path Resolution + +Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location. If a path is absolute, it is used as-is. If a path is relative, it is concatenated with the table location to produce an absolute path: + +* If the path contains a URI scheme, it is absolute and is used without modification. Review Comment: The paragraph above already says "If a path is absolute, it is used as-is", and we have a definition of "absolute" just above. I'd separate the last two sentences from that paragraph and combine them with these points to dedup. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3157868731

## format/spec.md: ##
@@ -168,6 +188,35 @@ All columns must be written to data files even if they introduce redundancy with
 Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
 
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata files are classified as one of two types:
+
+* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths.

Review Comment:
```suggestion
Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in any path or location string.
```
Re: [PR] [SPEC] Add relative paths to v4 spec [iceberg]
rdblue commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3157865952

## format/spec.md: ##
@@ -168,6 +188,35 @@ All columns must be written to data files even if they introduce redundancy with
 Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
 
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata files are classified as one of two types:
+
+* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths.

Review Comment:
This isn't strictly true. I think it said "fully-qualified". Plus, we want to fix the case where local file paths (like `/path/without/scheme`) have leaked into metadata.
