Re: [PR] Add Pagination To List Apis [iceberg]
sachet commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1548998658 ## core/src/main/java/org/apache/iceberg/rest/PaginatedList.java: ## @@ -0,0 +1,269 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.ListIterator; +import java.util.Map; +import java.util.Spliterator; +import java.util.Spliterators; +import java.util.function.Supplier; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.rest.responses.ListNamespacesResponse; +import org.apache.iceberg.rest.responses.ListTablesResponse; +import org.apache.iceberg.rest.responses.Route; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class PaginatedList implements List { Review Comment: +1, Current implementation of paginated list seems wrong as caller can get unexpected results. Two possible options are: 1. 
Pre-fetch all items (as suggested above), or 2. have methods like size(), contains(), isEmpty(), toArray(), get(), etc. first paginate across all items before performing the operation.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
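The second option can be illustrated with a minimal sketch (Python for brevity, not the PR's Java implementation; the `fetch_page` callable and its `(items, next_token)` contract are assumptions): every whole-list operation first drains the remaining pages, so the collection contract holds.

```python
class EagerPaginatedList:
    """Sketch: a list-like view over a paged API. Any whole-list
    operation first exhausts the remaining pages, so methods like
    len() and membership tests return correct results."""

    def __init__(self, fetch_page):
        self._fetch_page = fetch_page  # (token) -> (items, next_token)
        self._items = []
        self._next_token = ""          # "" means "start from the first page"
        self._exhausted = False

    def _fetch_all(self):
        # Paginate across all remaining items before the operation runs.
        while not self._exhausted:
            items, self._next_token = self._fetch_page(self._next_token)
            self._items.extend(items)
            self._exhausted = self._next_token is None

    def __len__(self):
        self._fetch_all()
        return len(self._items)

    def __contains__(self, item):
        self._fetch_all()
        return item in self._items

    def __getitem__(self, index):
        self._fetch_all()
        return self._items[index]
```

The trade-off is that a single `size()` call may trigger many requests, which is exactly the cost the review is weighing.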
Re: [PR] Hive: Fix metadata file not found [iceberg]
pvary commented on PR #10069: URL: https://github.com/apache/iceberg/pull/10069#issuecomment-2033649191 @lurnagao-dahua: If you check https://github.com/apache/iceberg/blob/3caa3a28d07a2d08b9a0e4196634126f1e016d6a/hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCommits.java, you can find plenty of examples for commit errors. Maybe we could do something similar, like throwing an exception without a message. It would be nice to have a test. OTOH, if the test is more than 50 lines, it would cost us more in the upkeep of the test in the long run than what we gain by testing a null check. In that case I would skip adding the extra code, following the example of #701.
Re: [I] PyArrow S3FileSystem doesn't honor the AWS profile config [iceberg-python]
HonahX commented on issue #570: URL: https://github.com/apache/iceberg-python/issues/570#issuecomment-2033588669 @geruh, thanks for highlighting this issue. The confusion largely stems from the naming convention used when the `profile_name`, `region_name`, `aws_access_key_id`, etc., were introduced in [#7781](https://github.com/apache/iceberg/pull/7781). Initially, these configurations were intended solely for GlueCatalog, but their generic names suggest they might influence both Glue and S3 operations. To address this, we can consider renaming these configurations with a `glue.` prefix (e.g., `glue.profile_name`) to clarify their scope. However, to maintain API compatibility, we may need to support both the new and old naming conventions temporarily. > But on the other hand it seems reasonable that the AWS profile config should work uniformly across both the catalog and filesystem levels. +1 for unified configurations. I think it may be convenient to introduce other unified configurations with generic names like `aws-access-key-id`. So the overall order of config resolution would be:
1. Client-specific configs: `glue.access-key-id`, `s3.access-key-id`, etc.
2. Unified AWS configurations like `aws-access-key-id`
3. Environment variables and the default AWS config
> However, we're currently utilizing PyArrow's [S3FileSystem](https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html#pyarrow.fs.S3FileSystem), which doesn't inherently support AWS profiles. This means we'd need to bridge that gap manually.

Regarding `profile_name` support for PyArrow's S3FileSystem, it seems there might not be a direct solution from the pyiceberg side. This functionality appears to be more suitably addressed through enhancements to the PyArrow library itself. WDYT?
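The suggested lookup order can be sketched as follows. This is a hypothetical `resolve` helper; the key names follow the suggestion above and are not part of pyiceberg's actual API:

```python
import os

def resolve(key, client_prefix, props):
    """Sketch of the suggested precedence (assumed key names):
    1. client-specific config, e.g. glue.access-key-id
    2. unified AWS config, e.g. aws-access-key-id
    3. environment variables / the default AWS config
    """
    for candidate in (f"{client_prefix}.{key}", f"aws-{key}"):
        if candidate in props:
            return props[candidate]
    # Fall back to the environment, e.g. AWS_ACCESS_KEY_ID
    return os.environ.get("AWS_" + key.replace("-", "_").upper())

props = {"glue.access-key-id": "glue-key", "aws-access-key-id": "shared-key"}
glue_value = resolve("access-key-id", "glue", props)  # client-specific wins
s3_value = resolve("access-key-id", "s3", props)      # falls back to unified key
```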
Re: [PR] Add Pagination To List Apis [iceberg]
danielcweeks commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1548945175 ## core/src/main/java/org/apache/iceberg/rest/PaginatedList.java: ## @@ -0,0 +1,269 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.iceberg.rest; + +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.ListIterator; +import java.util.Map; +import java.util.Spliterator; +import java.util.Spliterators; +import java.util.function.Supplier; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.rest.responses.ListNamespacesResponse; +import org.apache.iceberg.rest.responses.ListTablesResponse; +import org.apache.iceberg.rest.responses.Route; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class PaginatedList implements List { Review Comment: To clarify my comment, I think there are really two issues: 1) The current implementation does not properly adhere to the List behaviors. The DynamoDB example that's referenced will eagerly fetch if certain methods are called, which is necessary to get the correct behavior out of the List interface. 2) The second issue is more about the utility of a lazy pagination, since it seems the original motivation is more to address server-side limitations (like Glue's 100 results per request limit) as opposed to a client-side memory limitation. While you might be able to optimize something like the `SHOW TABLES LIKE` command client-side, we haven't seen the client side as the limiting factor.
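For contrast, a lazy, iterator-style pagination (rather than a `List`) sidesteps the interface-contract problem entirely, since an iterator promises nothing about size or random access. A minimal sketch in Python, with an assumed `fetch_page(token) -> (items, next_token)` contract:

```python
def paged_items(fetch_page):
    """Stream items page by page without materializing the full
    listing; callers that need List semantics must collect it."""
    token = ""  # "" requests the first page
    while True:
        items, token = fetch_page(token)
        yield from items
        if token is None:  # server signals there are no more pages
            return
```

This keeps memory bounded but, as noted above, only helps if the client side is actually the bottleneck.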
Re: [I] Support creating tags [iceberg-python]
enkidulan commented on issue #573: URL: https://github.com/apache/iceberg-python/issues/573#issuecomment-2033571970 For reference, I was able to make a tag only by using some private properties of the transaction object:

```py
from pyiceberg.table import SetSnapshotRefUpdate, update_table_metadata

with table.transaction() as txn:
    update = SetSnapshotRefUpdate(
        ref_name=tag,
        type="tag",
        snapshot_id=snapshot_id,
        max_ref_age_ms=None,
        max_snapshot_age_ms=None,
        min_snapshots_to_keep=None,
    )
    txn._updates = [update]
    txn.table_metadata = update_table_metadata(txn.table_metadata, [update])
```

This seems to work fine as a temporary workaround for development purposes, but it would be great to have a public method for creating tags.
[I] Support creating tags [iceberg-python]
enkidulan opened a new issue, #573: URL: https://github.com/apache/iceberg-python/issues/573 ### Feature Request / Improvement Historical tags in the Iceberg docs: https://iceberg.apache.org/docs/1.5.0/branching/#historical-tags Not sure if it was intentional behavior, but pyiceberg `v0.6.0` allowed tagging by using the public `set_ref_snapshot` method:

```py
with table.transaction() as transaction:
    transaction.set_ref_snapshot(
        snapshot_id=snapshot_id,
        parent_snapshot_id=snapshot_id,
        ref_name=revision,
        type="tag",
    )
```

The new dev version (the current main branch) has deprecated the `set_ref_snapshot` method, so I can't find a way to create a tag using public methods on the transaction object.
Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]
liurenjie1024 commented on code in PR #295: URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548900850 ## crates/iceberg/src/arrow.rs: ## @@ -113,6 +143,405 @@ impl ArrowReader { // TODO: full implementation ProjectionMask::all() } + +fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result> { +if let Some(predicates) = &self.predicates { +let field_id_map = self.build_field_id_map(parquet_schema)?; + +// Collect Parquet column indices from field ids +let column_indices = predicates +.iter() +.map(|predicate| { +let mut collector = CollectFieldIdVisitor { field_ids: vec![] }; +collector.visit_predicate(predicate).unwrap(); +collector +.field_ids +.iter() +.map(|field_id| { +field_id_map.get(field_id).cloned().ok_or_else(|| { +Error::new(ErrorKind::DataInvalid, "Field id not found in schema") +}) +}) +.collect::>>() +}) +.collect::>>()?; + +// Convert BoundPredicates to ArrowPredicates +let mut arrow_predicates = vec![]; +for (predicate, columns) in predicates.iter().zip(column_indices.iter()) { +let mut converter = PredicateConverter { +columns, +projection_mask: ProjectionMask::leaves(parquet_schema, columns.clone()), +parquet_schema, +column_map: &field_id_map, +}; +let arrow_predicate = converter.visit_predicate(predicate)?; +arrow_predicates.push(arrow_predicate); +} +Ok(Some(RowFilter::new(arrow_predicates))) +} else { +Ok(None) +} +} + +/// Build the map of field id to Parquet column index in the schema. +fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> Result> { +let mut column_map = HashMap::new(); +for (idx, field) in parquet_schema.columns().iter().enumerate() { +let field_type = field.self_type(); +match field_type { +ParquetType::PrimitiveType { basic_info, .. } => { +if !basic_info.has_id() { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column {:?} in schema doesn't have field id", +field_type +), +)); +} +column_map.insert(basic_info.id(), idx); +} +ParquetType::GroupType { .. 
} => { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column in schema should be primitive type but got {:?}", +field_type +), +)); +} +}; +} + +Ok(column_map) +} +} + +/// A visitor to collect field ids from bound predicates. +struct CollectFieldIdVisitor { +field_ids: Vec, +} + +impl BoundPredicateVisitor for CollectFieldIdVisitor { +type T = (); +type U = (); + +fn and(&mut self, _predicates: Vec) -> Result { +Ok(()) +} + +fn or(&mut self, _predicates: Vec) -> Result { +Ok(()) +} + +fn not(&mut self, _predicate: Self::T) -> Result { +Ok(()) +} + +fn visit_always_true(&mut self) -> Result { +Ok(()) +} + +fn visit_always_false(&mut self) -> Result { +Ok(()) +} + +fn visit_unary(&mut self, predicate: &UnaryExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn visit_binary(&mut self, predicate: &BinaryExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn visit_set(&mut self, predicate: &SetExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn bound_reference(&mut self, reference: &BoundReference) -> Result { +self.field_ids.push(reference.field().id); +Ok(()) +} +} + +struct PredicateConverter<'a> { +pub columns: &'a Vec, +pub projection_mask: ProjectionMask, +pub parquet_schema: &'a SchemaDescriptor, +pub column_map: &'a HashMap, +} + +fn get_arrow_datum(datum: &Datum) -> Box { +match datum.literal() { +PrimitiveLiteral::Boolean(value) => Box::new(BooleanArray::new_scalar(*value)), +PrimitiveLiteral::Int(value)
Re: [PR] refine: seperate parquet reader and arrow convert [iceberg-rust]
viirya commented on code in PR #313: URL: https://github.com/apache/iceberg-rust/pull/313#discussion_r1548890936 ## crates/iceberg/src/reader.rs: ## Review Comment: +1
Re: [PR] refine: seperate parquet reader and arrow convert [iceberg-rust]
liurenjie1024 commented on PR #313: URL: https://github.com/apache/iceberg-rust/pull/313#issuecomment-2033470850 cc @viirya Would you also take a look?
Re: [PR] refine: seperate parquet reader and arrow convert [iceberg-rust]
liurenjie1024 commented on code in PR #313: URL: https://github.com/apache/iceberg-rust/pull/313#discussion_r1548868232 ## crates/iceberg/src/arrow/from.rs: ## Review Comment: How about just naming it `schema.rs`, so we can put all schema-related code there? ## crates/iceberg/src/reader.rs: ## Review Comment: Move this to the `arrow` module?
Re: [I] int64() is converted to float for nullable field when None is provided + nan returned when None is expected [iceberg-python]
Co0olCat commented on issue #572: URL: https://github.com/apache/iceberg-python/issues/572#issuecomment-2033459923 It looks like an issue in PyArrow: scan() returns the correct data and types, while pyarrow.project_table() does the conversion...
Re: [I] int64() is converted to float for nullable field when None is provided + nan returned when None is expected [iceberg-python]
Co0olCat closed issue #572: int64() is converted to float for nullable field when None is provided + nan returned when None is expected URL: https://github.com/apache/iceberg-python/issues/572
[I] int64() is converted to float for nullable field when None is provided + nan returned when None is expected [iceberg-python]
Co0olCat opened a new issue, #572: URL: https://github.com/apache/iceberg-python/issues/572 ### Apache Iceberg version 0.6.0 (latest release) ### Please describe the bug 🐞 For reproduction using https://github.com/apache/iceberg-python/blob/main/tests/catalog/test_glue.py Here is the failing test:

```python
@mock_aws
def test_create_table_with_pyarrow_schema(
    _bucket_initialize: None,
    moto_endpoint_url: str,
    database_name: str,
    table_name: str,
) -> None:
    catalog_name = "glue"
    identifier = (database_name, table_name)
    test_catalog = GlueCatalog(catalog_name, **{"s3.endpoint": moto_endpoint_url})
    test_catalog.create_namespace(namespace=database_name)
    pa_schema = pa.schema([
        pa.field('year', pa.int64(), nullable=False),
        pa.field('n_legs', pa.int64(), nullable=True),
        pa.field('animals', pa.string(), nullable=True),
    ])
    table = test_catalog.create_table(
        identifier=identifier,
        schema=pa_schema,
        location=f"s3://{BUCKET_NAME}/{database_name}.db/{table_name}",
    )
    assert table.identifier == (catalog_name,) + identifier
    assert TABLE_METADATA_LOCATION_REGEX.match(table.metadata_location)
    assert test_catalog._parse_metadata_version(table.metadata_location) == 0
    table.append(
        pa.Table.from_pylist(
            [
                {"year": 2001, "n_legs": 2, "animals": None},
                {"year": 2002, "n_legs": None, "animals": "Horse"},
            ],
            schema=pa_schema,
        )
    )
    assert len(table.scan().to_arrow()) == 2
    table.append(
        pa.Table.from_pylist(
            [
                {"year": 2003, "n_legs": 6, "animals": "Cicada"},
                {"year": 2004, "n_legs": 8, "animals": "Spider"},
            ],
            schema=pa_schema,
        )
    )
    assert len(table.scan().to_arrow()) == 4
    assert table.scan().to_pandas().to_dict("records") == [
        {"animals": "Cicada", "n_legs": 6, "year": 2003},
        {"animals": "Spider", "n_legs": 8, "year": 2004},
        {"animals": None, "n_legs": 2, "year": 2001},
        {"animals": "Horse", "n_legs": None, "year": 2002},
    ]
```

Error part:

```
E Full diff:
E   [
E -  {'animals': 'Cicada', 'n_legs': 6, 'year': 2003},
E +  {'animals': 'Cicada', 'n_legs': 6.0, 'year': 2003},
E ?                                   ++
E -  {'animals': 'Spider', 'n_legs': 8, 'year': 2004},
E +  {'animals': 'Spider', 'n_legs': 8.0, 'year': 2004},
E ?                                   ++
E -  {'animals': None, 'n_legs': 2, 'year': 2001},
E +  {'animals': None, 'n_legs': 2.0, 'year': 2001},
E ?                               ++
E -  {'animals': 'Horse', 'n_legs': None, 'year': 2002},
E ?                                 -- ^
E +  {'animals': 'Horse', 'n_legs': nan, 'year': 2002},
E ?                                  ^^
E   ]
```
[PR] [WIP] Migrate TableTestBase related classes to JUnit5 [iceberg]
tomtongue opened a new pull request, #10080: URL: https://github.com/apache/iceberg/pull/10080 Migrate the following test classes to delete `TableTestBase` for https://github.com/apache/iceberg/issues/9085. This PR is for the migration of `TableTestBase`-related classes in https://github.com/apache/iceberg/pull/10063.

## Current Progress

Core:
- [x] `TestContentFileParser` (skipped)
- [x] `TestFileScanTaskParser` (skipped)
- [x] `util/TestTableScanUtil`
- [ ] `WriterTestBase`
- [x] `TestFileWriterFactory`
- [x] `TestGenericFileWriterFactory`
- [ ] `TestSparkFileWriterFactory` for the versions: v3.3, v3.4, v3.5
- [ ] `TestFlinkFileWriterFactory` for the versions: v1.15, v1.16, v1.17, v1.18
- [x] `TestPartitioningWriters`
- [ ] `TestSparkPartitioningWriters` for the versions: v3.3, v3.4, v3.5
- [ ] `TestFlinkPartitioningWriters` for the versions: v1.15, v1.16, v1.17, v1.18
- [x] `TestPositionDeltaWriters`
- [ ] `TestSparkPositionDeltaWriters` for the versions: v3.3, v3.4, v3.5
- [ ] `TestFlinkPositionDeltaWriters` for the versions: v1.15, v1.16, v1.17, v1.18
- [x] `TestRollingFileWriters`
- [ ] `TestSparkRollingFileWriters` for the versions: v3.3, v3.4, v3.5
- [ ] `TestFlinkRollingFileWriters` for the versions: v1.15, v1.16, v1.17, v1.18

`iceberg-flink` for the versions: v1.15, v1.16, v1.17, v1.18
- [ ] `TestStreamingReaderOperator` latest ok
- [ ] `TestStreamingMonitorFunction` latest ok
- [ ] `TestIcebergFilesCommitter` latest ok
- [ ] `TestDeltaTaskWriter` latest ok
Re: [PR] Hive: Fix metadata file not found [iceberg]
lurnagao-dahua commented on PR #10069: URL: https://github.com/apache/iceberg/pull/10069#issuecomment-2033424821 Thank you for your response! In my case, Flink was streaming writes to Iceberg:
1. The Hive metastore had been in continuous full GC, so it threw SocketTimeoutException: Read timed out (`hive.metastore.client.socket.timeout` defaults to 600s).
2. The HiveTableOperations commit thread called `Thread.sleep(retryDelaySeconds * 1000)` to retry.
3. The Flink checkpoint timeout was less than 600s and interrupted it, which threw an InterruptedException with no message.

I have been thinking about this for a while and have some doubts about the UT. Can you give me some advice?
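The failure mode described above, an exception raised without a message later breaking string handling, can be mimicked in a few lines. This is an illustrative Python analog of the null check being discussed, not the actual Java patch:

```python
def failure_reason(exc: BaseException) -> str:
    """Return a safe, non-empty description of an exception,
    guarding against exceptions constructed with no message."""
    message = str(exc)
    # An exception raised with no arguments stringifies to "",
    # so fall back to the exception type name instead.
    return message if message else type(exc).__name__
```

An interrupted sleep can surface exactly this way: the exception carries a type but no message, so any code that assumes a non-null message breaks.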
Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]
viirya commented on code in PR #295: URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548817290 ## crates/iceberg/src/arrow.rs: ## @@ -113,6 +143,405 @@ impl ArrowReader { // TODO: full implementation ProjectionMask::all() } + +fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result> { +if let Some(predicates) = &self.predicates { +let field_id_map = self.build_field_id_map(parquet_schema)?; + +// Collect Parquet column indices from field ids +let column_indices = predicates +.iter() +.map(|predicate| { +let mut collector = CollectFieldIdVisitor { field_ids: vec![] }; +collector.visit_predicate(predicate).unwrap(); +collector +.field_ids +.iter() +.map(|field_id| { +field_id_map.get(field_id).cloned().ok_or_else(|| { +Error::new(ErrorKind::DataInvalid, "Field id not found in schema") +}) +}) +.collect::>>() +}) +.collect::>>()?; + +// Convert BoundPredicates to ArrowPredicates +let mut arrow_predicates = vec![]; +for (predicate, columns) in predicates.iter().zip(column_indices.iter()) { +let mut converter = PredicateConverter { +columns, +projection_mask: ProjectionMask::leaves(parquet_schema, columns.clone()), +parquet_schema, +column_map: &field_id_map, +}; +let arrow_predicate = converter.visit_predicate(predicate)?; +arrow_predicates.push(arrow_predicate); +} +Ok(Some(RowFilter::new(arrow_predicates))) +} else { +Ok(None) +} +} + +/// Build the map of field id to Parquet column index in the schema. +fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> Result> { +let mut column_map = HashMap::new(); +for (idx, field) in parquet_schema.columns().iter().enumerate() { +let field_type = field.self_type(); +match field_type { +ParquetType::PrimitiveType { basic_info, .. } => { +if !basic_info.has_id() { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column {:?} in schema doesn't have field id", +field_type +), +)); +} +column_map.insert(basic_info.id(), idx); +} +ParquetType::GroupType { .. 
} => { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column in schema should be primitive type but got {:?}", +field_type +), +)); +} +}; +} + +Ok(column_map) +} +} + +/// A visitor to collect field ids from bound predicates. +struct CollectFieldIdVisitor { +field_ids: Vec, +} + +impl BoundPredicateVisitor for CollectFieldIdVisitor { +type T = (); +type U = (); + +fn and(&mut self, _predicates: Vec) -> Result { +Ok(()) +} + +fn or(&mut self, _predicates: Vec) -> Result { +Ok(()) +} + +fn not(&mut self, _predicate: Self::T) -> Result { +Ok(()) +} + +fn visit_always_true(&mut self) -> Result { +Ok(()) +} + +fn visit_always_false(&mut self) -> Result { +Ok(()) +} + +fn visit_unary(&mut self, predicate: &UnaryExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn visit_binary(&mut self, predicate: &BinaryExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn visit_set(&mut self, predicate: &SetExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn bound_reference(&mut self, reference: &BoundReference) -> Result { +self.field_ids.push(reference.field().id); +Ok(()) +} +} + +struct PredicateConverter<'a> { +pub columns: &'a Vec, +pub projection_mask: ProjectionMask, +pub parquet_schema: &'a SchemaDescriptor, +pub column_map: &'a HashMap, +} + +fn get_arrow_datum(datum: &Datum) -> Box { +match datum.literal() { +PrimitiveLiteral::Boolean(value) => Box::new(BooleanArray::new_scalar(*value)), +PrimitiveLiteral::Int(value) => Bo
Re: [PR] Open-api: update prefix param description [iceberg]
ajantha-bhat commented on code in PR #9870: URL: https://github.com/apache/iceberg/pull/9870#discussion_r1548815679 ## open-api/rest-catalog-open-api.yaml: ## @@ -1444,7 +1444,7 @@ components: schema: type: string required: true - description: An optional prefix in the path + description: Prefix in the path Review Comment: The validator doesn't allow setting `required` to `false` for a path parameter. What we may need is two paths defined, one with the prefix and one without.
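One way to express that suggestion is to declare the route twice, once without the prefix and once with a required prefix segment. This is only a sketch of the idea; the path and operation names below are assumptions, not the actual `rest-catalog-open-api.yaml` change:

```yaml
paths:
  # Route without a prefix
  /v1/namespaces:
    get:
      operationId: listNamespaces
  # Same route with a prefix; path parameters must be required per OpenAPI
  /v1/{prefix}/namespaces:
    get:
      operationId: listNamespacesWithPrefix
      parameters:
        - name: prefix
          in: path
          required: true
          schema:
            type: string
```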
[PR] Move writes to Transaction [iceberg-python]
syun64 opened a new pull request, #571: URL: https://github.com/apache/iceberg-python/pull/571 As a follow-up from @HonahX's suggestion on https://github.com/apache/iceberg-python/pull/498
Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]
viirya commented on code in PR #295: URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548813765 ## crates/iceberg/src/arrow.rs: ## @@ -113,6 +143,405 @@ impl ArrowReader {

```rust
        // TODO: full implementation
        ProjectionMask::all()
    }

    fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result<Option<RowFilter>> {
        if let Some(predicates) = &self.predicates {
            let field_id_map = self.build_field_id_map(parquet_schema)?;

            // Collect Parquet column indices from field ids
            let column_indices = predicates
                .iter()
                .map(|predicate| {
                    let mut collector = CollectFieldIdVisitor { field_ids: vec![] };
                    collector.visit_predicate(predicate).unwrap();
                    collector
                        .field_ids
                        .iter()
                        .map(|field_id| {
                            field_id_map.get(field_id).cloned().ok_or_else(|| {
                                Error::new(ErrorKind::DataInvalid, "Field id not found in schema")
                            })
                        })
                        .collect::<Result<Vec<usize>>>()
                })
                .collect::<Result<Vec<Vec<usize>>>>()?;

            // Convert BoundPredicates to ArrowPredicates
            let mut arrow_predicates = vec![];
            for (predicate, columns) in predicates.iter().zip(column_indices.iter()) {
                let mut converter = PredicateConverter {
                    columns,
                    projection_mask: ProjectionMask::leaves(parquet_schema, columns.clone()),
                    parquet_schema,
                    column_map: &field_id_map,
                };
                let arrow_predicate = converter.visit_predicate(predicate)?;
                arrow_predicates.push(arrow_predicate);
            }
            Ok(Some(RowFilter::new(arrow_predicates)))
        } else {
            Ok(None)
        }
    }

    /// Build the map of field id to Parquet column index in the schema.
    fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> Result<HashMap<i32, usize>> {
        let mut column_map = HashMap::new();
        for (idx, field) in parquet_schema.columns().iter().enumerate() {
            let field_type = field.self_type();
            match field_type {
                ParquetType::PrimitiveType { basic_info, .. } => {
                    if !basic_info.has_id() {
                        return Err(Error::new(
                            ErrorKind::DataInvalid,
                            format!(
                                "Leave column {:?} in schema doesn't have field id",
                                field_type
                            ),
                        ));
                    }
                    column_map.insert(basic_info.id(), idx);
                }
                ParquetType::GroupType { .. } => {
                    return Err(Error::new(
                        ErrorKind::DataInvalid,
                        format!(
                            "Leave column in schema should be primitive type but got {:?}",
                            field_type
                        ),
                    ));
                }
            };
        }

        Ok(column_map)
    }
}

/// A visitor to collect field ids from bound predicates.
struct CollectFieldIdVisitor {
    field_ids: Vec<i32>,
}

impl BoundPredicateVisitor for CollectFieldIdVisitor {
    type T = ();
    type U = ();

    fn and(&mut self, _predicates: Vec<Self::T>) -> Result<Self::T> {
        Ok(())
    }

    fn or(&mut self, _predicates: Vec<Self::T>) -> Result<Self::T> {
        Ok(())
    }

    fn not(&mut self, _predicate: Self::T) -> Result<Self::T> {
        Ok(())
    }

    fn visit_always_true(&mut self) -> Result<Self::T> {
        Ok(())
    }

    fn visit_always_false(&mut self) -> Result<Self::T> {
        Ok(())
    }

    fn visit_unary(&mut self, predicate: &UnaryExpression) -> Result<Self::T> {
        self.bound_reference(predicate.term())?;
        Ok(())
    }

    fn visit_binary(&mut self, predicate: &BinaryExpression) -> Result<Self::T> {
        self.bound_reference(predicate.term())?;
        Ok(())
    }

    fn visit_set(&mut self, predicate: &SetExpression) -> Result<Self::T> {
        self.bound_reference(predicate.term())?;
        Ok(())
    }

    fn bound_reference(&mut self, reference: &BoundReference) -> Result<Self::T> {
        self.field_ids.push(reference.field().id);
        Ok(())
    }
}

struct PredicateConverter<'a> {
    pub columns: &'a Vec<usize>,
    pub projection_mask: ProjectionMask,
    pub parquet_schema: &'a SchemaDescriptor,
    pub column_map: &'a HashMap<i32, usize>,
}

fn get_arrow_datum(datum: &Datum) -> Box<dyn ArrowDatum> {
    match datum.literal() {
        PrimitiveLiteral::Boolean(value) => Box::new(BooleanArray::new_scalar(*value)),
        PrimitiveLiteral::Int(value) => Bo
```
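The `build_field_id_map` step quoted above — walk the Parquet leaf columns, record field id → column index, and fail on any leaf without an id — can be sketched in a few lines of Python. This is an illustrative stand-in only: the function and the `(field_id, name)` tuples are hypothetical, not the iceberg-rust or Parquet API.

```python
def build_field_id_map(columns):
    """Map field id -> column index; error if a leaf column has no id.

    `columns` is a list of (field_id_or_None, name) tuples standing in
    for Parquet leaf columns (a hypothetical stand-in for SchemaDescriptor).
    """
    column_map = {}
    for idx, (field_id, name) in enumerate(columns):
        if field_id is None:
            # Mirrors the error path for leaf columns without a field id
            raise ValueError(f"Leaf column {name!r} in schema doesn't have a field id")
        column_map[field_id] = idx
    return column_map

print(build_field_id_map([(1, "id"), (2, "name")]))  # {1: 0, 2: 1}
```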
Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]
viirya commented on code in PR #295: URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548809949 ## crates/iceberg/src/arrow.rs: ## @@ -20,24 +20,38 @@

```
 use async_stream::try_stream;
 use futures::stream::StreamExt;
 use parquet::arrow::{ParquetRecordBatchStreamBuilder, ProjectionMask};
+use std::collections::HashMap;

 use crate::io::FileIO;
 use crate::scan::{ArrowRecordBatchStream, FileScanTask, FileScanTaskStream};
-use crate::spec::SchemaRef;
+use crate::spec::{Datum, PrimitiveLiteral, SchemaRef};

 use crate::error::Result;
+use crate::expr::{
+    BinaryExpression, BoundPredicate, BoundReference, PredicateOperator, SetExpression,
+    UnaryExpression,
+};
 use crate::spec::{
     ListType, MapType, NestedField, NestedFieldRef, PrimitiveType, Schema, StructType, Type,
 };
 use crate::{Error, ErrorKind};
+use arrow_arith::boolean::{and, is_not_null, is_null, not, or};
+use arrow_array::{
+    BooleanArray, Datum as ArrowDatum, Float32Array, Float64Array, Int32Array, Int64Array,
+};
+use arrow_ord::cmp::{eq, gt, gt_eq, lt, lt_eq, neq};
 use arrow_schema::{DataType, Field, Fields, Schema as ArrowSchema, TimeUnit};
+use bitvec::macros::internal::funty::Fundamental;
+use parquet::arrow::arrow_reader::{ArrowPredicate, ArrowPredicateFn, RowFilter};
+use parquet::schema::types::{SchemaDescriptor, Type as ParquetType};
 use std::sync::Arc;

 /// Builder to create ArrowReader
 pub struct ArrowReaderBuilder {
     batch_size: Option<usize>,
     file_io: FileIO,
     schema: SchemaRef,
+    predicates: Option<Vec<BoundPredicate>>,
```

Review Comment: This is because the Parquet API `RowFilter` takes `predicates: Vec<Box<dyn ArrowPredicate>>`. This is `RowFilter`'s doc:

```
/// A [`RowFilter`] allows pushing down a filter predicate to skip IO and decode
///
/// This consists of a list of [`ArrowPredicate`] where only the rows that satisfy all
/// of the predicates will be returned.
```

So I think it is a conjunction relationship between these predicates.

> Since we already have And/Or as part of BoundPredicate variant, how about just BoundPredicate?
Yea, I can use just one `BoundPredicate`. So users can define conjunctions in one single predicate using `And`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
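The conjunction semantics described in the `RowFilter` doc above can be illustrated with a small Python sketch (the names here are hypothetical, not the parquet crate API): a row is returned only if every predicate in the list accepts it.

```python
def apply_row_filter(rows, predicates):
    """Keep only rows that satisfy ALL predicates (conjunction), mirroring
    the documented behavior of a list of ArrowPredicates in a RowFilter."""
    return [row for row in rows if all(p(row) for p in predicates)]

rows = [{"x": 1}, {"x": 5}, {"x": 9}]
kept = apply_row_filter(rows, [lambda r: r["x"] > 1, lambda r: r["x"] < 9])
print(kept)  # [{'x': 5}]
```

This is why expressing the filter as one `BoundPredicate` with explicit `And`/`Or` nodes is equivalent in expressive power to a list of implicitly ANDed predicates.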
Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]
viirya commented on code in PR #295: URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548810072 ## crates/iceberg/src/arrow.rs: ## @@ -113,6 +143,405 @@ impl ArrowReader {

```rust
        // TODO: full implementation
        ProjectionMask::all()
    }

    fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result<Option<RowFilter>> {
        if let Some(predicates) = &self.predicates {
            let field_id_map = self.build_field_id_map(parquet_schema)?;

            // Collect Parquet column indices from field ids
            let column_indices = predicates
                .iter()
                .map(|predicate| {
                    let mut collector = CollectFieldIdVisitor { field_ids: vec![] };
                    collector.visit_predicate(predicate).unwrap();
                    collector
                        .field_ids
                        .iter()
                        .map(|field_id| {
                            field_id_map.get(field_id).cloned().ok_or_else(|| {
                                Error::new(ErrorKind::DataInvalid, "Field id not found in schema")
                            })
                        })
                        .collect::<Result<Vec<usize>>>()
                })
                .collect::<Result<Vec<Vec<usize>>>>()?;

            // Convert BoundPredicates to ArrowPredicates
            let mut arrow_predicates = vec![];
            for (predicate, columns) in predicates.iter().zip(column_indices.iter()) {
                let mut converter = PredicateConverter {
                    columns,
                    projection_mask: ProjectionMask::leaves(parquet_schema, columns.clone()),
                    parquet_schema,
                    column_map: &field_id_map,
                };
                let arrow_predicate = converter.visit_predicate(predicate)?;
                arrow_predicates.push(arrow_predicate);
            }
            Ok(Some(RowFilter::new(arrow_predicates)))
        } else {
            Ok(None)
        }
    }

    /// Build the map of field id to Parquet column index in the schema.
    fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> Result<HashMap<i32, usize>> {
        let mut column_map = HashMap::new();
        for (idx, field) in parquet_schema.columns().iter().enumerate() {
            let field_type = field.self_type();
            match field_type {
                ParquetType::PrimitiveType { basic_info, .. } => {
                    if !basic_info.has_id() {
                        return Err(Error::new(
                            ErrorKind::DataInvalid,
                            format!(
                                "Leave column {:?} in schema doesn't have field id",
                                field_type
                            ),
                        ));
                    }
                    column_map.insert(basic_info.id(), idx);
                }
                ParquetType::GroupType { .. } => {
                    return Err(Error::new(
                        ErrorKind::DataInvalid,
                        format!(
                            "Leave column in schema should be primitive type but got {:?}",
                            field_type
                        ),
                    ));
                }
            };
        }

        Ok(column_map)
    }
}

/// A visitor to collect field ids from bound predicates.
struct CollectFieldIdVisitor {
```

Review Comment: Okay

## crates/iceberg/src/arrow.rs: ## @@ -113,6 +143,405 @@ impl ArrowReader {

```rust
        // TODO: full implementation
        ProjectionMask::all()
    }

    fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result<Option<RowFilter>> {
        if let Some(predicates) = &self.predicates {
            let field_id_map = self.build_field_id_map(parquet_schema)?;

            // Collect Parquet column indices from field ids
            let column_indices = predicates
                .iter()
                .map(|predicate| {
                    let mut collector = CollectFieldIdVisitor { field_ids: vec![] };
                    collector.visit_predicate(predicate).unwrap();
                    collector
                        .field_ids
                        .iter()
                        .map(|field_id| {
                            field_id_map.get(field_id).cloned().ok_or_else(|| {
                                Error::new(ErrorKind::DataInvalid, "Field id not found in schema")
                            })
                        })
                        .collect::<Result<Vec<usize>>>()
                })
                .collect::<Result<Vec<Vec<usize>>>>()?;

            // Convert BoundPredicates to ArrowPredicates
            let mut arrow_predicates = vec![];
            for (predicate, columns) in predicates.iter().zip(column_indices.iter()) {
                let mut converter = PredicateConverter {
                    columns,
```
Re: [PR] refine: seperate parquet reader and arrow convert [iceberg-rust]
ZENOTME commented on PR #313: URL: https://github.com/apache/iceberg-rust/pull/313#issuecomment-2033373684 cc @liurenjie1024 @Xuanwo @Fokko
[PR] refine: seperate parquet reader and arrow convert [iceberg-rust]
ZENOTME opened a new pull request, #313: URL: https://github.com/apache/iceberg-rust/pull/313 This PR separates out the parquet reader from the arrow module, and makes the arrow module a directory so that we can separate `from_arrow` and `to_arrow`.
Re: [PR] [WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) [iceberg-rust]
liurenjie1024 commented on PR #241: URL: https://github.com/apache/iceberg-rust/pull/241#issuecomment-2033371130

> > which is somehow motivated by @viirya 's [pr](https://github.com/apache/iceberg-rust/pull/295/files#diff-a59622727cd67153abdf02031475bf8a1b1921738df4ca9903a685ff6970b7aaR472), but moves the traversing flow out of the trait body.
>
> @liurenjie1024 ...so the traversal flow could then be implemented on e.g. the `ManifestEvaluator` itself. For example, `eval()` could call the corresponding 'visit_xx' on the visitor that implements `BoundPredicateVisitor`. Is this what you mean?

I was thinking about the following structure:

```rust
pub trait BoundPredicateVisitor {
    type T;
    fn visit_and(&mut self, values: [Self::T; 2]) -> Result<Self::T>;
    fn visit_or(&mut self, values: [Self::T; 2]) -> Result<Self::T>;
    ...
}

pub fn visit_bound_predicate<V: BoundPredicateVisitor>(
    visitor: &mut V,
    predicate: &BoundPredicate,
) -> Result<V::T> {
    match predicate {
        BoundPredicate::And(children) => {
            let ret = [
                visit_bound_predicate(visitor, children[0]),
                visit_bound_predicate(visitor, children[1]),
            ];
            visitor.visit_and(ret)
        }
        ...
    }
}

pub struct ManifestEvaluator {}

impl BoundPredicateVisitor for ManifestEvaluator {}

impl ManifestEvaluator {
    pub fn eval(&mut self, predicate: &BoundPredicate) -> bool {
        visit_bound_predicate(self, predicate)?
    }
}
```
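The traversal-outside-the-trait shape proposed above can be illustrated with a small, self-contained Python analogue (all class and method names here are illustrative, not the iceberg-rust API): a free `visit_bound_predicate` function owns the tree walk, and a visitor such as the evaluator stand-in only implements the per-node hooks.

```python
class And:
    """Minimal stand-in for BoundPredicate::And with exactly two children."""
    def __init__(self, left, right):
        self.left, self.right = left, right

class Literal:
    """Minimal stand-in for a leaf predicate with a fixed truth value."""
    def __init__(self, value):
        self.value = value

def visit_bound_predicate(visitor, predicate):
    """Free traversal function: recurses into children, then calls the hook."""
    if isinstance(predicate, And):
        ret = (visit_bound_predicate(visitor, predicate.left),
               visit_bound_predicate(visitor, predicate.right))
        return visitor.visit_and(ret)
    return visitor.visit_literal(predicate)

class ManifestEvaluator:
    """Visitor implementing only the per-node hooks; no traversal logic."""
    def visit_and(self, values):
        return all(values)
    def visit_literal(self, lit):
        return lit.value
    def eval(self, predicate):
        return visit_bound_predicate(self, predicate)

print(ManifestEvaluator().eval(And(Literal(True), Literal(False))))  # False
```

The design point is the same as in the Rust sketch: new visitors (evaluators, field-id collectors, converters) reuse one traversal instead of each re-implementing the recursion.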
Re: [PR] feat: support uri redirect in rest client [iceberg-rust]
liurenjie1024 merged PR #310: URL: https://github.com/apache/iceberg-rust/pull/310
Re: [PR] Support CreateTableTransaction in Glue and Rest [iceberg-python]
syun64 commented on PR #498: URL: https://github.com/apache/iceberg-python/pull/498#issuecomment-2033344103 > Shall we move "append", "overwrite", and "add_files" to `Transaction` class? This change would enable us to seamlessly chain these operations with other table updates in a single commit. This adjustment could be particularly beneficial in the context of `CreateTableTransaction`, as it would enable users to not only create a table but also populate it with initial data in one go. I think this is a great question. I think we have two options here: 1. We move these actions into the Transaction class, and remove them from Table class 2. We move them into the Transaction class, and also keep an implementation in the Table class I'm not sure which of the above two are better, but I keep asking myself whether there's a 'good' reason why we have two separate APIs that achieve similar results. For example, we have **update_spec**, **update_schema** that can be created from the **Transaction** or the **Table**, and I feel like we might be creating work for ourselves by duplicating the feature in both classes. What if we consolidated all of our actions into the Transaction class, and removed them from the Table class? I think the upside of that would be that API would convey a very clear message to the developer that a _transaction is committed to a table_, and that a series of _actions_ can be chained onto the _same transaction_, as a single commit. In addition, we can avoid [issues like this](https://github.com/apache/iceberg-python/pull/508) where we roll out a feature to one API implementation, but not the other. 
```
with given_table.update_schema() as tx:
    tx.add_column(path="new_column1", field_type=IntegerType())
```

```
with given_table.transaction() as tx:
    with tx.update_schema() as update:
        update.add_column(path="new_column1", field_type=IntegerType())
```

To me, the bottom pattern feels more explicit than the above option, and I'm curious to hear others' opinions on this topic
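The "all actions hang off a transaction" design being discussed can be sketched minimally in Python (class and method names are hypothetical simplifications, not the pyiceberg API): actions accumulate on the transaction and are applied as one commit when the context exits.

```python
class Transaction:
    """Accumulates updates; commits them to the table in one shot."""
    def __init__(self, table):
        self.table = table
        self.updates = []

    def append(self, rows):
        self.updates.append(("append", rows))
        return self  # allow chaining

    def add_column(self, path):
        self.updates.append(("add-column", path))
        return self

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.table.commit(self.updates)  # single commit for all updates

class Table:
    def __init__(self):
        self.commits = []
    def commit(self, updates):
        self.commits.append(list(updates))
    def transaction(self):
        return Transaction(self)

table = Table()
with table.transaction() as tx:
    tx.add_column("new_column1").append([{"new_column1": 1}])
print(table.commits)  # [[('add-column', 'new_column1'), ('append', [{'new_column1': 1}])]]
```

This mirrors the appeal of the second pattern quoted above: the API makes it explicit that a series of actions is chained onto the same transaction and lands as a single commit.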
Re: [I] Please remove old releases [iceberg]
github-actions[bot] commented on issue #2414: URL: https://github.com/apache/iceberg/issues/2414#issuecomment-2033310864 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] Flink cdc events with update or delete doesn't work in 0.11.0 branch [iceberg]
github-actions[bot] commented on issue #2409: URL: https://github.com/apache/iceberg/issues/2409#issuecomment-2033310842 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] how to fix org.apache.spark.shuffle.FetchFailedException: [iceberg]
github-actions[bot] closed issue #2211: how to fix org.apache.spark.shuffle.FetchFailedException: URL: https://github.com/apache/iceberg/issues/2211
Re: [I] how to fix org.apache.spark.shuffle.FetchFailedException: [iceberg]
github-actions[bot] commented on issue #2211: URL: https://github.com/apache/iceberg/issues/2211#issuecomment-2033310616 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Need help inserting data into hadoop table with flink sql in java [iceberg]
github-actions[bot] closed issue #2209: Need help inserting data into hadoop table with flink sql in java URL: https://github.com/apache/iceberg/issues/2209
Re: [I] Need help inserting data into hadoop table with flink sql in java [iceberg]
github-actions[bot] commented on issue #2209: URL: https://github.com/apache/iceberg/issues/2209#issuecomment-2033310589 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Hive: got error while joining iceberg table and hive table [iceberg]
github-actions[bot] commented on issue #2198: URL: https://github.com/apache/iceberg/issues/2198#issuecomment-2033310562 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Hive: got error while joining iceberg table and hive table [iceberg]
github-actions[bot] closed issue #2198: Hive: got error while joining iceberg table and hive table URL: https://github.com/apache/iceberg/issues/2198
Re: [I] Cannot write incompatible dataset to table with schema error for list types [iceberg]
github-actions[bot] commented on issue #2192: URL: https://github.com/apache/iceberg/issues/2192#issuecomment-2033310543 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Cannot write incompatible dataset to table with schema error for list types [iceberg]
github-actions[bot] closed issue #2192: Cannot write incompatible dataset to table with schema error for list types URL: https://github.com/apache/iceberg/issues/2192
Re: [PR] Iceberg/Comet integration POC [iceberg]
huaxingao commented on code in PR #9841: URL: https://github.com/apache/iceberg/pull/9841#discussion_r1548708375 ## spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergColumnReader.java: ## @@ -0,0 +1,164 @@

```
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.data.vectorized.comet;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.comet.parquet.AbstractColumnReader;
+import org.apache.comet.parquet.ColumnReader;
+import org.apache.comet.parquet.TypeUtil;
+import org.apache.comet.parquet.Utils;
+import org.apache.comet.vector.CometVector;
+import org.apache.iceberg.parquet.VectorizedReader;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Types;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.column.page.PageReader;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+
+/**
+ * A Iceberg Parquet column reader backed by a Boson {@link ColumnReader}. This class should be used
```

Review Comment: Oops. Will change.
Re: [PR] 5 dremio blog march 2024 [iceberg]
AlexMercedCoder commented on code in PR #10067: URL: https://github.com/apache/iceberg/pull/10067#discussion_r1548619662 ## site/docs/blogs.md: ## @@ -23,6 +23,37 @@ title: "Blogs" Here is a list of company blogs that talk about Iceberg. The blogs are ordered from most recent to oldest. + +### [End-to-End Basic Data Engineering Tutorial (Apache Spark, Apache Iceberg Dremio, Apache Superset, Nessie)](https://medium.com/data-engineering-with-dremio/end-to-end-basic-data-engineering-tutorial-apache-spark-apache-iceberg-dremio-apache-superset-a896ecab46f6) Review Comment: also fixed on medium
Re: [PR] 5 dremio blog march 2024 [iceberg]
AlexMercedCoder commented on code in PR #10067: URL: https://github.com/apache/iceberg/pull/10067#discussion_r1548618808 ## site/docs/blogs.md: ## @@ -23,6 +23,37 @@ title: "Blogs" Here is a list of company blogs that talk about Iceberg. The blogs are ordered from most recent to oldest. + +### [End-to-End Basic Data Engineering Tutorial (Apache Spark, Apache Iceberg Dremio, Apache Superset, Nessie)](https://medium.com/data-engineering-with-dremio/end-to-end-basic-data-engineering-tutorial-apache-spark-apache-iceberg-dremio-apache-superset-a896ecab46f6) Review Comment: no, that's a typo, missing comma, just added a new commit adding the comma.
Re: [PR] Iceberg/Comet integration POC [iceberg]
RussellSpitzer commented on code in PR #9841: URL: https://github.com/apache/iceberg/pull/9841#discussion_r1548617271 ## spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergColumnReader.java: ## @@ -0,0 +1,164 @@

```
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.data.vectorized.comet;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.comet.parquet.AbstractColumnReader;
+import org.apache.comet.parquet.ColumnReader;
+import org.apache.comet.parquet.TypeUtil;
+import org.apache.comet.parquet.Utils;
+import org.apache.comet.vector.CometVector;
+import org.apache.iceberg.parquet.VectorizedReader;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Types;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.column.page.PageReader;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+
+/**
+ * A Iceberg Parquet column reader backed by a Boson {@link ColumnReader}. This class should be used
```

Review Comment: What is boson :P
[I] PyArrow S3FileSystem doesn't honor the AWS profile config [iceberg-python]
geruh opened a new issue, #570: URL: https://github.com/apache/iceberg-python/issues/570

### Apache Iceberg version

main (development)

### Please describe the bug 🐞

When initializing the GlueCatalog with a specific AWS profile, everything works as it should with catalog operations. But we've hit an issue when it comes to working with S3 via the PyArrow S3FileSystem. Users can specify a profile for initiating a boto connection; however, this preference doesn't carry over to the S3FileSystem. Instead of using the specified AWS profile, it will check the catalog configs for the s3 configs like `s3.access-key-id`, `s3.region`, ... If those aren't passed in, PyArrow's S3FileSystem has its own strategy of inferring credentials, such as:

1. the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables.
2. the default profile credentials in your ~/.aws/credentials and ~/.aws/config.

This workflow leads to some inconsistencies. For example, while Glue operations might be using a user-specified profile, S3 operations could end up using a different set of credentials or even a different region from what's set in the environment variables or the AWS config files. This is seen in issue #515, where one region (like us-west-2) unexpectedly switches to another (like us-east-1), causing a 301 exception.

For example:

1. Set up an AWS profile in ~/.aws/config with an incorrect region:

```
[default]
region = us-east-1

[test]
region = us-west-2
```

2. Initialize the GlueCatalog with the correct region you want to use:

```
catalog = pyiceberg.catalog.load_catalog(
    catalog_name,
    **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
)
```

3.
Load a table:

```
catalog.load_table("default.test")

File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'test/metadata/0-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
```

On one hand, we could argue that this profile configuration should only work at the catalog level, and for filesystems the user must specify the aforementioned configs like `s3.region`. But on the other hand, it seems reasonable that the AWS profile config should work uniformly across both the catalog and filesystem levels. This unified approach would certainly simplify configuration management for users. I'm leaning towards this perspective. However, we're currently utilizing PyArrow's S3FileSystem, which doesn't inherently support AWS profiles. This means we'd need to bridge that gap manually.

cc: @HonahX @Fokko @kevinjqliu
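As a rough illustration of "bridging that gap manually": the region for a named profile can be resolved from the AWS config file with the standard library alone, and the result could then be passed explicitly to PyArrow's S3FileSystem. The parsing below is a simplified sketch over an inline config string; note that in a real ~/.aws/config, non-default profiles are conventionally written as `[profile <name>]`.

```python
import configparser

# Example config text standing in for ~/.aws/config
AWS_CONFIG = """
[default]
region = us-east-1

[profile test]
region = us-west-2
"""

def region_for_profile(config_text, profile):
    """Resolve the region for a named profile from AWS-config-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # In ~/.aws/config, non-default profiles live under "profile <name>"
    section = "default" if profile == "default" else f"profile {profile}"
    return parser.get(section, "region")

print(region_for_profile(AWS_CONFIG, "test"))  # us-west-2
```

A resolved region like this could be fed to the filesystem configuration (e.g. as the `s3.region` property mentioned above) so that catalog and filesystem agree on the profile's settings.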
Re: [PR] Add PrePlanTable and PlanTable Endpoints to open api spec [iceberg]
rahil-c commented on PR #9695: URL: https://github.com/apache/iceberg/pull/9695#issuecomment-2033006567 @nastra @rdblue @danielcweeks @jackye1995 @amogh-jahagirdar When looking again at the `capabilities` pr: https://github.com/apache/iceberg/pull/9940, are we sure we want to add scan-planning as part of the `capabilities` list in the `ConfigResponse`? One question I want to raise: if the server returns `scan-planning` in the `capabilities`, would this mean that all tables under `RestCatalog` support scan planning? I believe we need a way to tell clients that REST scan planning is supported, but I think this is something that should be done at the table level, in the `LoadTableResponse` as a new property? Let me know what you all think
Re: [I] Snowflake Iceberg Partitioned data read issue [iceberg]
findinpath commented on issue #9404: URL: https://github.com/apache/iceberg/issues/9404#issuecomment-2032964716 @sfc-gh-rortloff I went through the Snowflake documentation https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table and don't see any reference related to partitioning. Could you please sketch here how to create a partitioned Iceberg table via Snowflake SQL syntax?
Re: [I] Unable to load an iceberg table from aws glue catalog [iceberg-python]
geruh commented on issue #515: URL: https://github.com/apache/iceberg-python/issues/515#issuecomment-2032913100 No problem! This could potentially be a bug if we assume that the catalog and FileIO (S3) share the same AWS profile configs. On one side, having a single profile configuration is convenient for the user's boto client, as it allows initializing all AWS clients with the correct credentials. On the other hand, we could argue that this configuration should only work at the catalog level, and for filesystems, separate configurations might be required. I'm inclined towards the first option. However, we are using pyarrow's S3FileSystem implementation, which has no concept of an AWS profile. Therefore, we will need to initialize these values through boto's session.get_credentials() and pass them to the filesystem. I'll raise an issue for this
Re: [PR] Partitioned Append on Identity Transform [iceberg-python]
jqin61 commented on code in PR #555: URL: https://github.com/apache/iceberg-python/pull/555#discussion_r1548140319 ## pyiceberg/table/__init__.py: ## @@ -2526,25 +2537,44 @@ def _dataframe_to_data_files( """ from pyiceberg.io.pyarrow import bin_pack_arrow_table, write_file -if len([spec for spec in table_metadata.partition_specs if spec.spec_id != 0]) > 0: -raise ValueError("Cannot write to partitioned tables") - counter = itertools.count(0) write_uuid = write_uuid or uuid.uuid4() - target_file_size = PropertyUtil.property_as_int( properties=table_metadata.properties, property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES, default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT, ) +if target_file_size is None: +raise ValueError( +"Fail to get neither TableProperties.WRITE_TARGET_FILE_SIZE_BYTES nor WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT for writing target data file." Review Comment: I just found that the default value itself could be None: ```PARQUET_COMPRESSION_LEVEL_DEFAULT = None``` so this None check is not unnecessary? The original code for this target_file_size check just `type: ignore`s it
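The concern can be illustrated with a minimal stand-in for `PropertyUtil.property_as_int` (a sketch, not the pyiceberg implementation): when the default itself is `None`, the helper can return `None`, so the explicit check in the diff above is indeed not redundant.

```python
from typing import Dict, Optional


def property_as_int(properties: Dict[str, str], property_name: str, default: Optional[int]) -> Optional[int]:
    # Mirrors the fallback logic under discussion: an explicitly set table
    # property wins; otherwise the (possibly None) default is returned.
    value = properties.get(property_name)
    if value is not None:
        return int(value)
    return default


# With a None default (cf. PARQUET_COMPRESSION_LEVEL_DEFAULT = None),
# the result can be None and callers must handle it.
print(property_as_int({}, "write.target-file-size-bytes", None))  # None
print(property_as_int({"write.target-file-size-bytes": "512"}, "write.target-file-size-bytes", None))  # 512
```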
Re: [PR] Improve CLI Text by Adding Verbose Text for Commands [iceberg-go]
zeroshade commented on code in PR #68: URL: https://github.com/apache/iceberg-go/pull/68#discussion_r1548127844 ## cmd/iceberg/main.go: ## @@ -34,16 +34,21 @@ import ( const usage = `iceberg. Usage: - iceberg list [options] [PARENT] - iceberg describe [options] [namespace | table] IDENTIFIER - iceberg (schema | spec | uuid | location) [options] TABLE_ID - iceberg drop [options] (namespace | table) IDENTIFIER - iceberg files [options] TABLE_ID [--history] - iceberg rename [options] - iceberg properties [options] get (namespace | table) IDENTIFIER [PROPNAME] - iceberg properties [options] set (namespace | table) IDENTIFIER PROPNAME VALUE - iceberg properties [options] remove (namespace | table) IDENTIFIER PROPNAME - iceberg -h | --help | --version + iceberg [command] [options] [arguments] Review Comment: does this actually work? The library being used, `docopt`, actually uses the `usage` string to perform the parsing and processing. `[options]` is specially handled by docopt, but I don't think that `[command]` and `[arguments]` are. So I think this might break the CLI here
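For reference, docopt derives the parser from the usage text itself, so subcommands generally have to be spelled out as literal words on their own usage lines rather than through a generic placeholder; a sketch of what a docopt-parsable usage block looks like (the command set here is abbreviated from the diff above):

```
Usage:
  iceberg list [options] [PARENT]
  iceberg describe [options] (namespace | table) IDENTIFIER
  iceberg drop [options] (namespace | table) IDENTIFIER
  iceberg -h | --help | --version
```

Words like `list` and `describe` are matched literally, `UPPERCASE` names become positional arguments, and only bracketed elements defined in the usage/options sections (such as `[options]`) get special handling.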
[I] Spark configuration for amazon access key and secret key with glue catalog for apache Iceberg is not honoring [iceberg]
AwasthiSomesh opened a new issue, #10078: URL: https://github.com/apache/iceberg/issues/10078 Hi Team, We are using the code below to access an Iceberg table from the Glue catalog, with data stored in S3: var spark = SparkSession.builder().master("local[*]") .config("spark.sql.defaultCatalog", "AwsDataCatalog") .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog") .config("spark.sql.catalog.AwsDataCatalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") .config("spark.sql.catalog.AwsDataCatalog.io-imp", "org.apache.iceberg.aws.s3.S3FileIO") .config("spark.hadoop.fs.s3a.access.key", "XXxxx") .config("spark.hadoop.fs.s3a.secret.key", "XXXx") .config("spark.hadoop.fs.s3a.aws.region", "us-west-2") .getOrCreate(); val df1 = spark.sql("select * from default.iceberg_table_exercise1"); Error- Exception in thread "main" software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(profilesAndSectionsMap=[])), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : [SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., EnvironmentVariableCredentialsProvider(): Unable to load credentials from system settings.
Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., WebIdentityTokenCredentialsProvider(): Either the environment variable AWS_WEB_IDENTITY_TOKEN_FILE or the java property aws.webIdentityTokenFile must be set., ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(profilesAndSectionsMap=[])): Profile file contained no credentials for profile 'default': ProfileFile(profilesAndSectionsMap=[]), ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set., InstanceProfileCredentialsProvider(): Failed to load credentials from IMDS.] This code fails with "unable to load credentials", but when we pass the same information with System.setProperty it works. However, our requirement is to set it at the Spark level, not the system level. Jars we used: iceberg-spark-runtime-3.5_2.12-1.5.0, iceberg-aws-bundle-1.5.0 Please can anyone help? Thanks,
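One likely factor (a sketch, based on Iceberg's documented AWS properties, not a confirmed diagnosis of this report): `S3FileIO` does not read the Hadoop `fs.s3a.*` keys at all; its credentials and region are supplied as catalog-level properties. In spark-defaults form, for the catalog name used above (note also that the report's `io-imp` key differs from the documented `io-impl`):

```properties
# Credentials for Iceberg's S3FileIO are catalog properties, not fs.s3a.* keys
spark.sql.catalog.AwsDataCatalog.io-impl              org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.AwsDataCatalog.s3.access-key-id     XXxxx
spark.sql.catalog.AwsDataCatalog.s3.secret-access-key XXXx
spark.sql.catalog.AwsDataCatalog.client.region        us-west-2
```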
Re: [PR] Partitioned Append on Identity Transform [iceberg-python]
jqin61 commented on code in PR #555: URL: https://github.com/apache/iceberg-python/pull/555#discussion_r1548068306 ## pyiceberg/typedef.py: ## @@ -199,3 +199,7 @@ def __repr__(self) -> str: def record_fields(self) -> List[str]: """Return values of all the fields of the Record class except those specified in skip_fields.""" return [self.__getattribute__(v) if hasattr(self, v) else None for v in self._position_to_field_name] + +def __hash__(self) -> int: +"""Return hash value of the Record class.""" +return hash(str(self)) Review Comment: I think since `__repr__` is defined, the str() might still work? I tested: ``` r1 = Record(1,2) r2 = Record(x=1, y="string value") print("") print(str(r1), hash(str(r1))) print(str(r2), hash(str(r2))) ``` prints: ``` Record[field1=1, field2=2] -7504199255027864703 Record[x=1, y='string value'] -4897332691101137012 ```
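A minimal, self-contained stand-in for the behaviour being tested (a sketch; the real `Record` lives in `pyiceberg.typedef` and also supports keyword fields): because `__hash__` delegates to `hash(str(self))`, any instance with a working `__repr__` is hashable, and instances with equal reprs hash equally.

```python
class Record:
    """Simplified stand-in for pyiceberg's Record: positional fields only."""

    def __init__(self, *values):
        self._values = values

    def __repr__(self):
        fields = ", ".join(f"field{i + 1}={v!r}" for i, v in enumerate(self._values))
        return f"Record[{fields}]"

    def __hash__(self):
        # The pattern under review: derive the hash from the string form.
        return hash(str(self))


r1 = Record(1, 2)
r2 = Record(1, 2)
print(repr(r1))  # Record[field1=1, field2=2]
print(hash(r1) == hash(r2))  # True: equal reprs give equal hashes
```

The trade-off is that hashing becomes only as reliable as `__repr__`: two records with distinct field values but identical reprs would collide by construction.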
[PR] Add option to delete datafiles [iceberg-python]
Fokko opened a new pull request, #569: URL: https://github.com/apache/iceberg-python/pull/569 This is done through the Iceberg metadata, resulting in efficient deletes if the data is partitioned correctly
Re: [PR] [WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) [iceberg-rust]
marvinlanhenke commented on PR #241: URL: https://github.com/apache/iceberg-rust/pull/241#issuecomment-2032233828 > which is somehow motivated by @viirya 's [pr](https://github.com/apache/iceberg-rust/pull/295/files#diff-a59622727cd67153abdf02031475bf8a1b1921738df4ca9903a685ff6970b7aaR472), but moves the traversing flow out of the trait body. @liurenjie1024 ...so the traversal flow could then be implemented on e.g. the `ManifestEvaluator` itself. For example, `eval()` could call the corresponding `visit`s on the visitor that implements `BoundPredicateVisitor`. Is this what you mean?
Re: [PR] OpenAPI: Express server capabilities via /config endpoint [iceberg]
snazy commented on PR #9940: URL: https://github.com/apache/iceberg/pull/9940#issuecomment-2032202321 > > I've got strong concerns about using `enum` here - special handling here and there, I think, that complicates things for adopters of any OpenAPI spec. > > @snazy we use `enum` in the OpenAPI spec to list possible values of a string, which is also what's being documented in https://swagger.io/docs/specification/data-models/enums/. The underlying type of a capability is still a `string`. Do you have an alternative in mind for the issue you're seeing? Also I believe when code is generated from the OpenAPI spec, there should be an option to generate enums as literals The issue at hand is that an enum cannot (by default) handle unknown values - which is generally fine. But here we have to expect that endpoints return values that are _not_ known by a client. (Not only generated) clients that do not handle this rather special case will fail to parse the result, which is a problem. I'm still in favor of just using a string.
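The two options under discussion can be written out side by side (a sketch of the spec fragment; the schema names and the capability values are illustrative, not from the PR): an `enum` pins the value set at spec time, while a plain `string` lets clients pass through capabilities they do not yet recognize.

```yaml
components:
  schemas:
    # Option 1: closed set - a client generated from this spec may fail
    # validation when the server returns a capability added later
    CapabilityEnum:
      type: string
      enum: ["scan-planning"]
    # Option 2: open set - unknown capabilities parse fine and can be ignored
    CapabilityString:
      type: string
```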
Re: [PR] Hive: Fix metadata file not found [iceberg]
lurnagao-dahua commented on PR #10069: URL: https://github.com/apache/iceberg/pull/10069#issuecomment-2032193393 > @lurnagao-dahua please check styles. > > > The reason is that in some cases e.getMessage() returns null and it will throw a NullPointerException, then skip checkCommitStatus; it may delete metadataLocation when, actually, the metadata commit succeeded. > > Is it possible to add a UT for this case? Hi, I found the original pr similar to this one [701](https://github.com/apache/iceberg/pull/701), then I will try to write a UT
Re: [PR] Hive: Fix metadata file not found [iceberg]
lurnagao-dahua commented on PR #10069: URL: https://github.com/apache/iceberg/pull/10069#issuecomment-2032181736 > @lurnagao-dahua please check styles. > > > The reason is that in some cases e.getMessage() returns null and throws a NullPointerException, then skips checkCommitStatus; it may delete metadataLocation when, actually, the metadata commit succeeded. > > Is it possible to add a UT for this case?
Re: [PR] Hive: Fix metadata file not found [iceberg]
lurnagao-dahua closed pull request #10069: Hive: Fix metadata file not found URL: https://github.com/apache/iceberg/pull/10069
Re: [PR] OpenAPI: Express server capabilities via /config endpoint [iceberg]
nastra commented on PR #9940: URL: https://github.com/apache/iceberg/pull/9940#issuecomment-2032092212 > I've got strong concerns about using `enum` here - special handling here and there, I think, that complicates things for adopters of any OpenAPI spec. @snazy we use `enum` in the OpenAPI spec to list possible values of a string, which is also what's being documented in https://swagger.io/docs/specification/data-models/enums/. The underlying type of a capability is still a `string`. Do you have an alternative in mind for the issue you're seeing?
Re: [PR] Open-api: update prefix param description [iceberg]
jbonofre commented on code in PR #9870: URL: https://github.com/apache/iceberg/pull/9870#discussion_r1547920214 ## open-api/rest-catalog-open-api.yaml: ## @@ -1444,7 +1444,7 @@ components: schema: type: string required: true - description: An optional prefix in the path + description: Prefix in the path Review Comment: I think it makes sense to have `prefix` optional. Specifically for the `prefix`, I would set `required: false` and check in the validator.
Re: [PR] Open-api: update prefix param description [iceberg]
ajantha-bhat commented on code in PR #9870: URL: https://github.com/apache/iceberg/pull/9870#discussion_r1547852220 ## open-api/rest-catalog-open-api.yaml: ## @@ -1444,7 +1444,7 @@ components: schema: type: string required: true - description: An optional prefix in the path + description: Prefix in the path Review Comment: I think if we need optional params in the path, we need to define a separate path without the prefix.
Re: [PR] Partitioned Append on Identity Transform [iceberg-python]
Fokko commented on code in PR #555: URL: https://github.com/apache/iceberg-python/pull/555#discussion_r1547810906 ## pyiceberg/manifest.py: ## @@ -289,10 +286,7 @@ def partition_field_to_data_file_partition_field(partition_field_type: IcebergTy @partition_field_to_data_file_partition_field.register(LongType) -@partition_field_to_data_file_partition_field.register(DateType) Review Comment: This single-dispatch is there only for the `TimeType` it seems. Probably we should also convert those into a native type. ## tests/conftest.py: ## @@ -2000,7 +2000,11 @@ def spark() -> "SparkSession": 'float': [0.0, None, 0.9], 'double': [0.0, None, 0.9], 'timestamp': [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)], -'timestamptz': [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)], +'timestamptz': [ Review Comment: Nice one! ## pyiceberg/table/__init__.py: ## @@ -3111,3 +3147,112 @@ def snapshots(self) -> "pa.Table": snapshots, schema=snapshots_schema, ) + + +@dataclass(frozen=True) +class TablePartition: +partition_key: PartitionKey +arrow_table_partition: pa.Table + + +def _get_partition_sort_order(partition_columns: list[str], reverse: bool = False) -> dict[str, Any]: +order = 'ascending' if not reverse else 'descending' +null_placement = 'at_start' if reverse else 'at_end' +return {'sort_keys': [(column_name, order) for column_name in partition_columns], 'null_placement': null_placement} + + +def group_by_partition_scheme(arrow_table: pa.Table, partition_columns: list[str]) -> pa.Table: +"""Given a table, sort it by current partition scheme.""" +# only works for identity for now +sort_options = _get_partition_sort_order(partition_columns, reverse=False) +sorted_arrow_table = arrow_table.sort_by(sorting=sort_options['sort_keys'], null_placement=sort_options['null_placement']) +return sorted_arrow_table + + +def get_partition_columns( +spec: PartitionSpec, +schema: Schema, +) -> list[str]: +partition_cols = [] +for
partition_field in spec.fields: +column_name = schema.find_column_name(partition_field.source_id) +if not column_name: +raise ValueError(f"{partition_field=} could not be found in {schema}.") +partition_cols.append(column_name) +return partition_cols + + +def _get_table_partitions( +arrow_table: pa.Table, +partition_spec: PartitionSpec, +schema: Schema, +slice_instructions: list[dict[str, Any]], +) -> list[TablePartition]: +sorted_slice_instructions = sorted(slice_instructions, key=lambda x: x['offset']) + +partition_fields = partition_spec.fields + +offsets = [inst["offset"] for inst in sorted_slice_instructions] +projected_and_filtered = { +partition_field.source_id: arrow_table[schema.find_field(name_or_id=partition_field.source_id).name] +.take(offsets) +.to_pylist() +for partition_field in partition_fields +} + +table_partitions = [] +for idx, inst in enumerate(sorted_slice_instructions): +partition_slice = arrow_table.slice(**inst) +fieldvalues = [ +PartitionFieldValue(partition_field, projected_and_filtered[partition_field.source_id][idx]) +for partition_field in partition_fields +] +partition_key = PartitionKey(raw_partition_field_values=fieldvalues, partition_spec=partition_spec, schema=schema) +table_partitions.append(TablePartition(partition_key=partition_key, arrow_table_partition=partition_slice)) +return table_partitions + + +def partition(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> Iterable[TablePartition]: Review Comment: It would be good to have somewhat more descriptive function names. I also think we should hide this from the outside user. ```suggestion def _determine_partitions(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> List[TablePartition]: ``` I think we can also return a list, so folks know that it is already materialized.
## tests/conftest.py: ## @@ -2045,3 +2049,19 @@ def arrow_table_with_null(pa_schema: "pa.Schema") -> "pa.Table": """PyArrow table with all kinds of columns""" return pa.Table.from_pydict(TEST_DATA_WITH_NULL, schema=pa_schema) + + +@pytest.fixture(scope="session") +def arrow_table_without_data(pa_schema: "pa.Schema") -> "pa.Table": +import pyarrow as pa + +"""PyArrow table with all kinds of columns.""" Review Comment: ```suggestion """PyArrow table with all kinds of columns.""" import pyarrow as pa ``` ## pyiceberg/table/__init__.py: ## @@ -1131,8 +1133,11 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT) if not isinstance(df, pa.Table): raise Valu
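The sort-then-slice flow being reviewed in the diff above can be sketched with stdlib-only stand-ins (an illustration, assuming identity transforms only; `rows` is a list of dicts standing in for the Arrow table, and `determine_partitions` is a hypothetical analogue of the PR's helper): sort by the partition columns, then group adjacent rows with equal partition values into one slice per partition key.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, List, Tuple


def determine_partitions(
    rows: List[Dict[str, Any]], partition_columns: List[str]
) -> Dict[Tuple, List[Dict[str, Any]]]:
    """Sort rows by the partition columns, then slice into per-partition groups."""
    key = itemgetter(*partition_columns)
    # Mirrors pa.Table.sort_by on the partition scheme; after sorting,
    # rows with equal partition values are adjacent, so groupby suffices.
    ordered = sorted(rows, key=key)
    return {
        k if isinstance(k, tuple) else (k,): list(group)
        for k, group in groupby(ordered, key=key)
    }


rows = [
    {"region": "eu", "n": 1},
    {"region": "us", "n": 2},
    {"region": "eu", "n": 3},
]
parts = determine_partitions(rows, ["region"])
print(sorted(parts))  # [('eu',), ('us',)]
```

Returning a materialized dict (rather than a lazy iterable) matches the review suggestion that callers should know the partitions are already computed.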
Re: [PR] feat: Project transform [iceberg-rust]
liurenjie1024 commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547819858 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { Review Comment: Yeah, it looks much better now, thanks! I'll take a careful review later.
Re: [PR] [WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) [iceberg-rust]
liurenjie1024 commented on PR #241: URL: https://github.com/apache/iceberg-rust/pull/241#issuecomment-2031946776 > @liurenjie1024 @ZENOTME @sdd @Xuanwo I'd really appreciate your thoughts on this: > > I took a closer look at the work @sdd has already done - and I think in order to proceed it would make sense to split the `ManifestEvaluator`, `InclusiveProjection` and `PartitionEvaluator` into separate modules. > > I am thinking about putting each visitor into its own file within `/iceberg/src/expr` (perhaps even another subfolder /visitors) - for example: `/iceberg/src/expr/manifest_evaluator.rs.` > > This way, we could create individual issues on each implementation - and work better in parallel. > > Perhaps, it also makes sense to provide a trait to make sure each visitor adheres to the same interface (although I'm not sure this is necessary right now...) > > ```rust > // Pseudo-Example > pub trait BooleanVisitor: > fn eval() -> boolean > ``` +1 for splitting these into separate modules. Instead of a boolean visitor, I'm thinking about a post-order predicate visitor: ```rust pub trait BoundPredicateVisitor { } ``` which is somehow motivated by @viirya 's [pr](https://github.com/apache/iceberg-rust/pull/295/files#diff-a59622727cd67153abdf02031475bf8a1b1921738df4ca9903a685ff6970b7aaR472), but moves the traversing flow out of the trait body.
Re: [PR] feat: Project transform [iceberg-rust]
marvinlanhenke commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547785953 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { Review Comment: @liurenjie1024 I did a refactor changing the structure (matching order). I also extracted common functionality, renamed those helpers and updated the docs. I hope not only the structure but the overall design is more readable and understandable with those changes applied?
Re: [PR] feat: Project transform [iceberg-rust]
marvinlanhenke commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547681855 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result<Option<Predicate>> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { Review Comment: Sure, I'm already implementing it, since I wanted to compare for myself.
Re: [PR] feat: Project transform [iceberg-rust]
liurenjie1024 commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547676828 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result<Option<Predicate>> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { Review Comment: I think some code duplication is worth it so that we can have better readability?
Re: [PR] feat: Project transform [iceberg-rust]
marvinlanhenke commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547667177 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result<Option<Predicate>> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { Review Comment: I had the structure you suggested in an earlier version. I changed it the other way around since `predicate` has the smaller cardinality, which allows me to group more transforms into a single predicate match arm. I can change it back; however, this would introduce more match arms and some code duplication?
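The trade-off being discussed can be shown with toy enums (hypothetical names standing in for `Transform` and `BoundPredicate`): nesting on the predicate kind first keeps the outer match small and lets transforms that behave identically for a given predicate kind collapse into one `|` pattern, whereas transform-first nesting would repeat those arms once per transform.

```rust
// Toy stand-ins for iceberg-rust's Transform and BoundPredicate kinds
// (hypothetical, for illustration only).
#[derive(Clone, Copy)]
enum Transform {
    Identity,
    Bucket,
    Truncate,
    Void,
}

#[derive(Clone, Copy)]
enum Pred {
    Unary,
    Binary,
}

// Predicate-first nesting (the structure chosen in the PR): the outer
// match has the smaller cardinality, so several transforms that behave
// identically for a given predicate kind share a single `|` arm.
fn project(transform: Transform, predicate: Pred) -> Option<&'static str> {
    match predicate {
        Pred::Unary => match transform {
            // All projectable transforms produce the same unary predicate,
            // so one grouped arm covers them.
            Transform::Identity | Transform::Bucket | Transform::Truncate => Some("unary"),
            Transform::Void => None,
        },
        Pred::Binary => match transform {
            // Binary predicates need per-transform handling, so these arms
            // cannot be grouped; transform-first nesting would instead
            // duplicate the unary arm across every one of these transforms.
            Transform::Identity => Some("binary: identity"),
            Transform::Bucket => Some("binary: bucket"),
            Transform::Truncate => Some("binary: truncate"),
            Transform::Void => None,
        },
    }
}
```

The duplication only moves, it never disappears: whichever enum is matched first, the other's per-case logic must still be written out, so grouping on the smaller-cardinality enum minimizes the total number of arms.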
Re: [PR] Spark 3.3: drop_namespace with CASCADE support [iceberg]
supsupsap commented on PR #7275: URL: https://github.com/apache/iceberg/pull/7275#issuecomment-2031741342 @abmo-x do you plan to merge this PR?
Re: [PR] feat: Project transform [iceberg-rust]
liurenjie1024 commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547649993 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { +Transform::Identity +| Transform::Bucket(_) +| Transform::Truncate(_) +| Transform::Year +| Transform::Month +| Transform::Day +| Transform::Hour => Some(Predicate::Unary(UnaryExpression::new( +expr.op(), +Reference::new(name), +))), +_ => None, +}, +BoundPredicate::Binary(expr) => match self { +Transform::Identity => Some(Predicate::Binary(BinaryExpression::new( +expr.op(), +Reference::new(name), +expr.literal().to_owned(), +))), +Transform::Bucket(_) => { +if expr.op() != PredicateOperator::Eq || !self.can_transform(expr.literal()) { +return Ok(None); +} + +Some(Predicate::Binary(BinaryExpression::new( +expr.op(), +Reference::new(name), +func.transform_literal_result(expr.literal())?, +))) +} +Transform::Truncate(width) => { +if !self.can_transform(expr.literal()) { +return Ok(None); +} + +self.transform_projected_boundary( +name, +expr.literal(), +&expr.op(), +&func, +Some(*width), +)? 
+} +Transform::Year | Transform::Month | Transform::Day | Transform::Hour => { +if !self.can_transform(expr.literal()) { +return Ok(None); +} + +self.transform_projected_boundary( +name, +expr.literal(), +&expr.op(), +&func, +None, +)? +} +_ => None, +}, +BoundPredicate::Set(expr) => match self { +Transform::Identity => Some(Predicate::Set(SetExpression::new( +expr.op(), +Reference::new(name), +expr.literals().to_owned(), +))), +Transform::Bucket(_) +| Transform::Truncate(_) +| Transform::Year +| Transform::Month +| Transform::Day +| Transform::Hour => { +if expr.op() != PredicateOperator::In +|| expr.literals().iter().any(|d| !self.can_transform(d)) +{ +return Ok(None); +} + +Some(Predicate::Set(SetExpression::new( +expr.op(), +Reference::new(name), +self.transform_set(expr.literals(), &func)?, +))) +} +_ => None, +}, +_ => None, +}; + +Ok(projection) +} + +/// Check if `Transform` is applicable on datum's `PrimitiveType` +fn can_transform(&self, datum: &Datum) -> bool { +let input_type = datum.data_type().clone(); +self.result_type(&Type::Primitive(input_type)).is_ok() +} + +/// Transform each literal value of `FnvHashSet` +fn transform_set( +&self, +literals: &FnvHashSet, +func: &BoxedTransformFunction, +) -> Result> { +let mut new_set = FnvHashSet::default(); + +for lit in literals { +let datum = fu
Re: [PR] feat: Project transform [iceberg-rust]
marvinlanhenke commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547653811 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { +Transform::Identity +| Transform::Bucket(_) +| Transform::Truncate(_) +| Transform::Year +| Transform::Month +| Transform::Day +| Transform::Hour => Some(Predicate::Unary(UnaryExpression::new( +expr.op(), +Reference::new(name), +))), +_ => None, +}, +BoundPredicate::Binary(expr) => match self { +Transform::Identity => Some(Predicate::Binary(BinaryExpression::new( +expr.op(), +Reference::new(name), +expr.literal().to_owned(), +))), +Transform::Bucket(_) => { +if expr.op() != PredicateOperator::Eq || !self.can_transform(expr.literal()) { +return Ok(None); +} + +Some(Predicate::Binary(BinaryExpression::new( +expr.op(), +Reference::new(name), +func.transform_literal_result(expr.literal())?, +))) +} +Transform::Truncate(width) => { +if !self.can_transform(expr.literal()) { +return Ok(None); +} + +self.transform_projected_boundary( +name, +expr.literal(), +&expr.op(), +&func, +Some(*width), +)? 
+} +Transform::Year | Transform::Month | Transform::Day | Transform::Hour => { +if !self.can_transform(expr.literal()) { +return Ok(None); +} + +self.transform_projected_boundary( +name, +expr.literal(), +&expr.op(), +&func, +None, +)? +} +_ => None, +}, +BoundPredicate::Set(expr) => match self { +Transform::Identity => Some(Predicate::Set(SetExpression::new( +expr.op(), +Reference::new(name), +expr.literals().to_owned(), +))), +Transform::Bucket(_) +| Transform::Truncate(_) +| Transform::Year +| Transform::Month +| Transform::Day +| Transform::Hour => { +if expr.op() != PredicateOperator::In +|| expr.literals().iter().any(|d| !self.can_transform(d)) +{ +return Ok(None); +} + +Some(Predicate::Set(SetExpression::new( +expr.op(), +Reference::new(name), +self.transform_set(expr.literals(), &func)?, +))) +} +_ => None, +}, +_ => None, +}; + +Ok(projection) +} + +/// Check if `Transform` is applicable on datum's `PrimitiveType` +fn can_transform(&self, datum: &Datum) -> bool { +let input_type = datum.data_type().clone(); +self.result_type(&Type::Primitive(input_type)).is_ok() +} + +/// Transform each literal value of `FnvHashSet` +fn transform_set( +&self, +literals: &FnvHashSet, +func: &BoxedTransformFunction, +) -> Result> { +let mut new_set = FnvHashSet::default(); + +for lit in literals { +let datum = f
Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]
nastra merged PR #10063: URL: https://github.com/apache/iceberg/pull/10063
Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]
tomtongue commented on code in PR #10063: URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547650695 ## data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java: ## @@ -20,43 +20,42 @@ import java.io.File; import java.io.IOException; +import java.nio.file.Files; +import java.util.Arrays; import java.util.List; import org.apache.iceberg.DataFile; import org.apache.iceberg.DeleteFile; import org.apache.iceberg.FileFormat; +import org.apache.iceberg.Parameter; +import org.apache.iceberg.ParameterizedTestExtension; +import org.apache.iceberg.Parameters; import org.apache.iceberg.PartitionSpec; import org.apache.iceberg.RowDelta; import org.apache.iceberg.expressions.Expressions; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; import org.apache.iceberg.util.StructLikeSet; import org.junit.Assert; Review Comment: Thank you. I'll migrate them in a separate PR. At least through this one, we can delete `TableTestBase`.
Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]
nastra commented on code in PR #10063: URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547646211 ## data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java: ## @@ -20,43 +20,42 @@ import java.io.File; import java.io.IOException; +import java.nio.file.Files; +import java.util.Arrays; import java.util.List; import org.apache.iceberg.DataFile; import org.apache.iceberg.DeleteFile; import org.apache.iceberg.FileFormat; +import org.apache.iceberg.Parameter; +import org.apache.iceberg.ParameterizedTestExtension; +import org.apache.iceberg.Parameters; import org.apache.iceberg.PartitionSpec; import org.apache.iceberg.RowDelta; import org.apache.iceberg.expressions.Expressions; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; import org.apache.iceberg.util.StructLikeSet; import org.junit.Assert; Review Comment: Yeah, indeed, that would lead to many changes. I'm ok doing the conversion from JUnit4 asserts to AssertJ on these files in a separate PR then.
Re: [PR] feat: Project transform [iceberg-rust]
marvinlanhenke commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547639860 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { +Transform::Identity +| Transform::Bucket(_) +| Transform::Truncate(_) +| Transform::Year +| Transform::Month +| Transform::Day +| Transform::Hour => Some(Predicate::Unary(UnaryExpression::new( +expr.op(), +Reference::new(name), +))), +_ => None, +}, +BoundPredicate::Binary(expr) => match self { +Transform::Identity => Some(Predicate::Binary(BinaryExpression::new( +expr.op(), +Reference::new(name), +expr.literal().to_owned(), +))), +Transform::Bucket(_) => { +if expr.op() != PredicateOperator::Eq || !self.can_transform(expr.literal()) { +return Ok(None); +} + +Some(Predicate::Binary(BinaryExpression::new( +expr.op(), +Reference::new(name), +func.transform_literal_result(expr.literal())?, +))) +} +Transform::Truncate(width) => { +if !self.can_transform(expr.literal()) { +return Ok(None); +} + +self.transform_projected_boundary( +name, +expr.literal(), +&expr.op(), +&func, +Some(*width), +)? 
+} +Transform::Year | Transform::Month | Transform::Day | Transform::Hour => { +if !self.can_transform(expr.literal()) { +return Ok(None); +} + +self.transform_projected_boundary( +name, +expr.literal(), +&expr.op(), +&func, +None, +)? +} +_ => None, +}, +BoundPredicate::Set(expr) => match self { +Transform::Identity => Some(Predicate::Set(SetExpression::new( +expr.op(), +Reference::new(name), +expr.literals().to_owned(), +))), +Transform::Bucket(_) +| Transform::Truncate(_) +| Transform::Year +| Transform::Month +| Transform::Day +| Transform::Hour => { +if expr.op() != PredicateOperator::In +|| expr.literals().iter().any(|d| !self.can_transform(d)) +{ +return Ok(None); +} + +Some(Predicate::Set(SetExpression::new( +expr.op(), +Reference::new(name), +self.transform_set(expr.literals(), &func)?, +))) +} +_ => None, +}, +_ => None, +}; + +Ok(projection) +} + +/// Check if `Transform` is applicable on datum's `PrimitiveType` +fn can_transform(&self, datum: &Datum) -> bool { +let input_type = datum.data_type().clone(); +self.result_type(&Type::Primitive(input_type)).is_ok() +} + +/// Transform each literal value of `FnvHashSet` +fn transform_set( +&self, +literals: &FnvHashSet, +func: &BoxedTransformFunction, +) -> Result> { +let mut new_set = FnvHashSet::default(); + +for lit in literals { +let datum = f
Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]
tomtongue commented on code in PR #10063: URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547639555 ## data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java: ## @@ -20,43 +20,42 @@ import java.io.File; import java.io.IOException; +import java.nio.file.Files; +import java.util.Arrays; import java.util.List; import org.apache.iceberg.DataFile; import org.apache.iceberg.DeleteFile; import org.apache.iceberg.FileFormat; +import org.apache.iceberg.Parameter; +import org.apache.iceberg.ParameterizedTestExtension; +import org.apache.iceberg.Parameters; import org.apache.iceberg.PartitionSpec; import org.apache.iceberg.RowDelta; import org.apache.iceberg.expressions.Expressions; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; import org.apache.iceberg.util.StructLikeSet; import org.junit.Assert; Review Comment: @nastra Thanks for the review. This time I only migrated the classes related to `TableTestBase` so that the class can be deleted, and I kept these classes on JUnit4 because updating each class to JUnit5 would mean a lot of changes. I'm thinking that after deleting `TableTestBase`, I'll migrate those classes to JUnit 5. Should I update all classes in this PR to JUnit 5, or only partially?
Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]
nastra commented on code in PR #10063: URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547636721 ## flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergFilesCommitter.java: ## @@ -73,44 +77,39 @@ import org.apache.iceberg.util.ThreadPools; import org.junit.Assert; import org.junit.Assume; Review Comment: same as in the other files
Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]
nastra commented on code in PR #10063: URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547634234 ## flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/TestDeltaTaskWriter.java: ## @@ -65,31 +68,28 @@ import org.apache.iceberg.util.StructLikeSet; import org.assertj.core.api.Assertions; import org.junit.Assert; Review Comment: still uses JUnit4-Assert
Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]
nastra commented on code in PR #10063: URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547631706 ## data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java: ## @@ -20,43 +20,42 @@ import java.io.File; import java.io.IOException; +import java.nio.file.Files; +import java.util.Arrays; import java.util.List; import org.apache.iceberg.DataFile; import org.apache.iceberg.DeleteFile; import org.apache.iceberg.FileFormat; +import org.apache.iceberg.Parameter; +import org.apache.iceberg.ParameterizedTestExtension; +import org.apache.iceberg.Parameters; import org.apache.iceberg.PartitionSpec; import org.apache.iceberg.RowDelta; import org.apache.iceberg.expressions.Expressions; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; import org.apache.iceberg.util.StructLikeSet; import org.junit.Assert; Review Comment: still a JUnit4-Assert left here ## data/src/test/java/org/apache/iceberg/io/TestPartitioningWriters.java: ## @@ -35,33 +40,27 @@ import org.apache.iceberg.util.StructLikeSet; import org.assertj.core.api.Assertions; import org.junit.Assert; Review Comment: should be converted to AssertJ
Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]
nastra commented on code in PR #10063: URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547632925 ## data/src/test/java/org/apache/iceberg/io/TestRollingFileWriters.java: ## @@ -20,60 +20,60 @@ import java.io.File; import java.io.IOException; +import java.nio.file.Files; +import java.util.Arrays; import java.util.List; import org.apache.iceberg.FileFormat; +import org.apache.iceberg.Parameter; +import org.apache.iceberg.ParameterizedTestExtension; +import org.apache.iceberg.Parameters; import org.apache.iceberg.PartitionSpec; import org.apache.iceberg.Schema; import org.apache.iceberg.StructLike; import org.apache.iceberg.deletes.PositionDelete; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; import org.apache.iceberg.relocated.com.google.common.collect.Lists; import org.junit.Assert; Review Comment: same as in the other tests
Re: [PR] feat: Project transform [iceberg-rust]
marvinlanhenke commented on PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#issuecomment-2031700958 > Hi, @marvinlanhenke Thanks for the PR, it looks great! I have some small suggestions to restructure the code to make it easier for review. Really grateful for these tests! Thanks for the review, I'll get to your suggestions - those should be easy to fix.
Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]
liurenjie1024 commented on code in PR #295: URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1547582654 ## crates/iceberg/src/arrow.rs: ## @@ -113,6 +143,405 @@ impl ArrowReader { // TODO: full implementation ProjectionMask::all() } + +fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result> { +if let Some(predicates) = &self.predicates { +let field_id_map = self.build_field_id_map(parquet_schema)?; + +// Collect Parquet column indices from field ids +let column_indices = predicates +.iter() +.map(|predicate| { +let mut collector = CollectFieldIdVisitor { field_ids: vec![] }; +collector.visit_predicate(predicate).unwrap(); +collector +.field_ids +.iter() +.map(|field_id| { +field_id_map.get(field_id).cloned().ok_or_else(|| { +Error::new(ErrorKind::DataInvalid, "Field id not found in schema") +}) +}) +.collect::>>() +}) +.collect::>>()?; + +// Convert BoundPredicates to ArrowPredicates +let mut arrow_predicates = vec![]; +for (predicate, columns) in predicates.iter().zip(column_indices.iter()) { +let mut converter = PredicateConverter { +columns, +projection_mask: ProjectionMask::leaves(parquet_schema, columns.clone()), +parquet_schema, +column_map: &field_id_map, +}; +let arrow_predicate = converter.visit_predicate(predicate)?; +arrow_predicates.push(arrow_predicate); +} +Ok(Some(RowFilter::new(arrow_predicates))) +} else { +Ok(None) +} +} + +/// Build the map of field id to Parquet column index in the schema. +fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> Result> { +let mut column_map = HashMap::new(); +for (idx, field) in parquet_schema.columns().iter().enumerate() { +let field_type = field.self_type(); +match field_type { +ParquetType::PrimitiveType { basic_info, .. } => { +if !basic_info.has_id() { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column {:?} in schema doesn't have field id", +field_type +), +)); +} +column_map.insert(basic_info.id(), idx); +} +ParquetType::GroupType { .. 
} => { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column in schema should be primitive type but got {:?}", +field_type +), +)); +} +}; +} + +Ok(column_map) +} +} + +/// A visitor to collect field ids from bound predicates. +struct CollectFieldIdVisitor { +field_ids: Vec, +} + +impl BoundPredicateVisitor for CollectFieldIdVisitor { +type T = (); +type U = (); + +fn and(&mut self, _predicates: Vec) -> Result { +Ok(()) +} + +fn or(&mut self, _predicates: Vec) -> Result { +Ok(()) +} + +fn not(&mut self, _predicate: Self::T) -> Result { +Ok(()) +} + +fn visit_always_true(&mut self) -> Result { +Ok(()) +} + +fn visit_always_false(&mut self) -> Result { +Ok(()) +} + +fn visit_unary(&mut self, predicate: &UnaryExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn visit_binary(&mut self, predicate: &BinaryExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn visit_set(&mut self, predicate: &SetExpression) -> Result { +self.bound_reference(predicate.term())?; +Ok(()) +} + +fn bound_reference(&mut self, reference: &BoundReference) -> Result { +self.field_ids.push(reference.field().id); +Ok(()) +} +} + +struct PredicateConverter<'a> { +pub columns: &'a Vec, +pub projection_mask: ProjectionMask, +pub parquet_schema: &'a SchemaDescriptor, +pub column_map: &'a HashMap, +} + +fn get_arrow_datum(datum: &Datum) -> Box { +match datum.literal() { +PrimitiveLiteral::Boolean(value) => Box::new(BooleanArray::new_scalar(*value)), +PrimitiveLiteral::Int(value)
Re: [PR] Spark 3.5: Parallelize reading files in snapshot and migrate procedures [iceberg]
manuzhang commented on code in PR #10037: URL: https://github.com/apache/iceberg/pull/10037#discussion_r1547578684 ## data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java: ## @@ -171,6 +176,10 @@ public static List listPartition( } } + public static boolean isFileReadingParallelized() { Review Comment: @nastra do you have an idea how to test whether the parallelism setting is working?
Re: [PR] feat: Project transform [iceberg-rust]
liurenjie1024 commented on code in PR #309: URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547488931 ## crates/iceberg/src/spec/transform.rs: ## @@ -261,6 +269,300 @@ impl Transform { _ => self == other, } } + +/// Projects a given predicate according to the transformation +/// specified by the `Transform` instance. +/// +/// This allows predicates to be effectively applied to data +/// that has undergone transformation, enabling efficient querying +/// and filtering based on the original, untransformed data. +/// +/// # Example +/// Suppose, we have row filter `a = 10`, and a partition spec +/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition +/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to +/// `bs = bucket(10, 37)` +pub fn project(&self, name: String, predicate: &BoundPredicate) -> Result> { +let func = create_transform_function(self)?; + +let projection = match predicate { +BoundPredicate::Unary(expr) => match self { +Transform::Identity +| Transform::Bucket(_) +| Transform::Truncate(_) +| Transform::Year +| Transform::Month +| Transform::Day +| Transform::Hour => Some(Predicate::Unary(UnaryExpression::new( +expr.op(), +Reference::new(name), +))), +_ => None, +}, +BoundPredicate::Binary(expr) => match self { +Transform::Identity => Some(Predicate::Binary(BinaryExpression::new( +expr.op(), +Reference::new(name), +expr.literal().to_owned(), +))), +Transform::Bucket(_) => { +if expr.op() != PredicateOperator::Eq || !self.can_transform(expr.literal()) { +return Ok(None); +} + +Some(Predicate::Binary(BinaryExpression::new( +expr.op(), +Reference::new(name), +func.transform_literal_result(expr.literal())?, +))) +} +Transform::Truncate(width) => { +if !self.can_transform(expr.literal()) { +return Ok(None); +} + +self.transform_projected_boundary( +name, +expr.literal(), +&expr.op(), +&func, +Some(*width), +)? 
+} +Transform::Year | Transform::Month | Transform::Day | Transform::Hour => { +if !self.can_transform(expr.literal()) { +return Ok(None); +} + +self.transform_projected_boundary( +name, +expr.literal(), +&expr.op(), +&func, +None, +)? +} +_ => None, +}, +BoundPredicate::Set(expr) => match self { +Transform::Identity => Some(Predicate::Set(SetExpression::new( +expr.op(), +Reference::new(name), +expr.literals().to_owned(), +))), +Transform::Bucket(_) +| Transform::Truncate(_) +| Transform::Year +| Transform::Month +| Transform::Day +| Transform::Hour => { +if expr.op() != PredicateOperator::In +|| expr.literals().iter().any(|d| !self.can_transform(d)) +{ +return Ok(None); +} + +Some(Predicate::Set(SetExpression::new( +expr.op(), +Reference::new(name), +self.transform_set(expr.literals(), &func)?, +))) +} +_ => None, +}, +_ => None, +}; + +Ok(projection) +} + +/// Check if `Transform` is applicable on datum's `PrimitiveType` +fn can_transform(&self, datum: &Datum) -> bool { +let input_type = datum.data_type().clone(); +self.result_type(&Type::Primitive(input_type)).is_ok() +} + +/// Transform each literal value of `FnvHashSet` +fn transform_set( +&self, +literals: &FnvHashSet, +func: &BoxedTransformFunction, +) -> Result> { +let mut new_set = FnvHashSet::default(); + +for lit in literals { +let datum = fu
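The projection idea in the doc comment above (rewrite `a = 10` under `bucket(a, 37)` to `bs = bucket(10, 37)`) can be sketched roughly as follows. This is an illustrative sketch only: `BucketProjectionSketch` is a hypothetical name, and the hash used here is a stand-in for Iceberg's murmur3-based bucket function.

```java
// Sketch: projecting an equality predicate through a bucket partition transform.
// NOTE: Math.floorMod(Integer.hashCode(v), n) is a stand-in hash; Iceberg's real
// bucket transform hashes values with 32-bit murmur3 before taking the modulus.
public class BucketProjectionSketch {

    // bucket(v, n): map a value into one of n buckets
    public static int bucket(int value, int numBuckets) {
        return Math.floorMod(Integer.hashCode(value), numBuckets);
    }

    // Project "col = literal" under "bucket(col, n) as partitionCol"
    // to the partition-level predicate "partitionCol = bucket(literal, n)".
    public static String projectEq(String partitionCol, int literal, int numBuckets) {
        return partitionCol + " = " + bucket(literal, numBuckets);
    }

    public static void main(String[] args) {
        // row filter a = 10, partition spec bucket(a, 37) as bs
        System.out.println(projectEq("bs", 10, 37));
    }
}
```

Note that, as in the PR, only equality (and `In`) predicates can be projected through a bucket transform: buckets do not preserve ordering, so range predicates cannot be rewritten this way.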
Re: [PR] #9073 Junit 4 tests switched to JUnit 5 [iceberg]
nastra commented on code in PR #9793: URL: https://github.com/apache/iceberg/pull/9793#discussion_r1547543252 ## data/src/test/java/org/apache/iceberg/data/DataTestHelpers.java: ## @@ -84,16 +86,16 @@ private static void assertEquals(Type type, Object expected, Object actual) { case UUID: case BINARY: case DECIMAL: -Assert.assertEquals( -"Primitive value should be equal to expected for type " + type, expected, actual); +org.junit.jupiter.api.Assertions.assertEquals( +expected, actual, "Primitive value should be equal to expected for type " + type); break; case FIXED: Assertions.assertThat(expected) .as("Expected should be a byte[]") .isInstanceOf(byte[].class); Assertions.assertThat(expected).as("Actual should be a byte[]").isInstanceOf(byte[].class); -Assert.assertArrayEquals( -"Array contents should be equal", (byte[]) expected, (byte[]) actual); +org.junit.jupiter.api.Assertions.assertArrayEquals( Review Comment: this should use AssertJ's `assertThat()`
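For illustration, the AssertJ form nastra suggests would look roughly like the commented line below. Since assertj-core is a third-party dependency, the runnable part of this sketch checks the same array-content semantics with `java.util.Arrays`; `ByteArrayAssertSketch` is a hypothetical name.

```java
import java.util.Arrays;

public class ByteArrayAssertSketch {
    public static void check(byte[] expected, byte[] actual) {
        // AssertJ equivalent of Assert.assertArrayEquals (sketch, requires assertj-core):
        // assertThat(actual).as("Array contents should be equal").isEqualTo(expected);
        if (!Arrays.equals(expected, actual)) {
            throw new AssertionError("Array contents should be equal");
        }
    }

    public static void main(String[] args) {
        check(new byte[] {1, 2, 3}, new byte[] {1, 2, 3}); // same contents: passes
    }
}
```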
Re: [PR] Spark 3.5: Parallelize reading files in snapshot and migrate procedures [iceberg]
nastra commented on code in PR #10037: URL: https://github.com/apache/iceberg/pull/10037#discussion_r1547529529 ## data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java: ## @@ -171,6 +176,10 @@ public static List listPartition( } } + public static boolean isFileReadingParallelized() { Review Comment: I don't think it's a good idea to introduce this, since this reflects whatever was set after calling `listPartition(...)` and depending on the order of calls, it could return something unexpected
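The hazard nastra describes, a static flag that only reflects the most recent call, can be seen in a minimal sketch (names here are hypothetical, not the PR's actual code):

```java
// Sketch of why a static "was parallelized" flag is order-dependent:
// the accessor reports whatever the *last* caller used, not what any
// particular caller requested.
public class MigrationUtilSketch {
    private static boolean lastCallParallelized = false;

    public static void listPartition(int parallelism) {
        lastCallParallelized = parallelism > 1;
        // ... actual file listing would happen here ...
    }

    public static boolean isFileReadingParallelized() {
        return lastCallParallelized;
    }

    public static void main(String[] args) {
        listPartition(4); // caller A: parallel
        listPartition(1); // caller B: sequential
        // Caller A now observes "false" even though its own call was parallel.
        System.out.println(isFileReadingParallelized());
    }
}
```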
Re: [PR] Support identifier warehouses [iceberg-rust]
liurenjie1024 commented on code in PR #308: URL: https://github.com/apache/iceberg-rust/pull/308#discussion_r1547526634 ## crates/catalog/rest/src/catalog.rs: ## @@ -617,7 +617,13 @@ impl RestCatalog { props.extend(config); } -let file_io = match self.config.warehouse.as_deref().or(metadata_location) { +let warehouse_path = match self.config.warehouse.as_deref() { +Some(url) if url.contains("://") => Some(url), Review Comment: Yes, we don't need to raise an exception here, just match `if Url::parse().is_ok()` would be enough.
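The difference between the substring check in the diff and a parse-based check (the `Url::parse(..).is_ok()` liurenjie1024 suggests) can be illustrated with `java.net.URI`; `WarehouseSchemeCheck` is a hypothetical name for this sketch:

```java
import java.net.URI;

public class WarehouseSchemeCheck {
    // Substring check from the PR under review: anything containing "://" looks like a URL.
    public static boolean naiveIsUrl(String s) {
        return s.contains("://");
    }

    // Parse-based check: accept only strings that parse with an explicit scheme,
    // analogous to matching on Url::parse(s).is_ok() in Rust.
    public static boolean parsesAsUrl(String s) {
        try {
            return new URI(s).getScheme() != null;
        } catch (Exception e) {
            return false; // malformed input fails to parse
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsUrl("s3://bucket/warehouse"));   // real URL: accepted
        System.out.println(parsesAsUrl("my_warehouse_identifier")); // bare identifier: rejected
        System.out.println(parsesAsUrl("not a url ://"));           // malformed: rejected
        System.out.println(naiveIsUrl("not a url ://"));            // substring check accepts it
    }
}
```

A bare warehouse identifier (no scheme) falls through to the catalog's configured warehouse, which is exactly the behavior the PR wants.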
Re: [PR] Support identifier warehouses [iceberg-rust]
Fokko commented on code in PR #308: URL: https://github.com/apache/iceberg-rust/pull/308#discussion_r1547503192 ## crates/catalog/rest/src/catalog.rs: ## @@ -617,7 +617,13 @@ impl RestCatalog { props.extend(config); } -let file_io = match self.config.warehouse.as_deref().or(metadata_location) { +let warehouse_path = match self.config.warehouse.as_deref() { +Some(url) if url.contains("://") => Some(url), Review Comment: This is also happening [below in the `FileIO::from_path`](https://github.com/apache/iceberg-rust/blob/d57d91b9a72c516c6665d5faef349f52ebe59585/crates/iceberg/src/io.rs#L155), but it raises an exception we want to avoid.
Re: [PR] Support identifier warehouses [iceberg-rust]
liurenjie1024 commented on PR #308: URL: https://github.com/apache/iceberg-rust/pull/308#issuecomment-2031513654 > Hi, @Fokko Thanks for this fix. It also reminds me that should we append the warehouse parameter to `getConfig` call? Seems we already have that.
Re: [PR] Spark 3.5: Parallelize reading files in snapshot and migrate procedures [iceberg]
manuzhang commented on code in PR #10037: URL: https://github.com/apache/iceberg/pull/10037#discussion_r1547447584 ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java: ## @@ -232,4 +232,25 @@ public void testMigrateEmptyTable() throws Exception { Object result = scalarSql("CALL %s.system.migrate('%s')", catalogName, tableName); assertThat(result).isEqualTo(0L); } + + @TestTemplate + public void testMigrateWithParallelism() throws IOException { +assumeThat(catalogName).isEqualToIgnoringCase("spark_catalog"); +String location = Files.createTempDirectory(temp, "junit").toFile().toString(); +for (int p = -1; p <= 2; p++) { Review Comment: Thanks @nastra , I've updated as @RussellSpitzer suggests.
Re: [I] Disable checking links for Blogs section [iceberg]
manuzhang commented on issue #10060: URL: https://github.com/apache/iceberg/issues/10060#issuecomment-2031439818 I also see errors when checking maven repo. ``` ERROR: 21 dead links found! [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.13.2/iceberg-spark-runtime-0.13.2.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.14/0.13.2/iceberg-flink-runtime-1.14-0.13.2.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.13/0.13.2/iceberg-flink-runtime-1.13-0.13.2.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/0.13.2/iceberg-flink-runtime-1.12-0.13.2.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.13.2/iceberg-hive-runtime-0.13.2.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.13.1/iceberg-spark3-runtime-0.13.1.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.13.1/iceberg-spark-runtime-0.13.1.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.13/0.13.1/iceberg-flink-runtime-1.13-0.13.1.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/0.13.1/iceberg-flink-runtime-1.12-0.13.1.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.13.0/iceberg-spark-runtime-0.13.0.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/0.13.0/iceberg-flink-runtime-1.12-0.13.0.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.12.1/iceberg-spark3-runtime-0.12.1.jar → Status: 0 [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.12.1/iceberg-spark-runtime-0.12.1.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.12.0/iceberg-spark3-runtime-0.12.0.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.12.0/iceberg-hive-runtime-0.12.0.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.11.1/iceberg-hive-runtime-0.11.1.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime/0.11.0/iceberg-flink-runtime-0.11.0.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.10.0/iceberg-spark3-runtime-0.10.0.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.9.0/iceberg-spark3-runtime-0.9.0.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.9.0/iceberg-spark-runtime-0.9.0.jar → Status: 0 [✖] https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.8.0-incubating/iceberg-spark-runtime-0.8.0-incubating.jar → Status: 0 ```
Re: [PR] Open-api: update prefix param description [iceberg]
ajantha-bhat commented on code in PR #9870: URL: https://github.com/apache/iceberg/pull/9870#discussion_r1547408417 ## open-api/rest-catalog-open-api.yaml: ## @@ -1444,7 +1444,7 @@ components: schema: type: string required: true - description: An optional prefix in the path + description: Prefix in the path Review Comment: If I change to `required: false` for `prefix` ``` Spec is invalid. Issues: components.parameters.For path parameter prefix the required value should be true Execution failed for task ':iceberg-open-api:validateRESTCatalogSpec'. > Validation failed. ```
Re: [PR] Spark 3.5: Parallelize reading files in snapshot and migrate procedures [iceberg]
nastra commented on code in PR #10037: URL: https://github.com/apache/iceberg/pull/10037#discussion_r1547402202 ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java: ## @@ -232,4 +232,25 @@ public void testMigrateEmptyTable() throws Exception { Object result = scalarSql("CALL %s.system.migrate('%s')", catalogName, tableName); assertThat(result).isEqualTo(0L); } + + @TestTemplate + public void testMigrateWithParallelism() throws IOException { +assumeThat(catalogName).isEqualToIgnoringCase("spark_catalog"); +String location = Files.createTempDirectory(temp, "junit").toFile().toString(); +for (int p = -1; p <= 2; p++) { Review Comment: > I was trying that initially, but there's no way to add parameters for just one test. This is because the test is already parameterized at the class level.
Re: [PR] Open-api: update prefix param description [iceberg]
Fokko commented on code in PR #9870: URL: https://github.com/apache/iceberg/pull/9870#discussion_r1547395200 ## open-api/rest-catalog-open-api.yaml: ## @@ -1444,7 +1444,7 @@ components: schema: type: string required: true - description: An optional prefix in the path + description: Prefix in the path Review Comment: What's the error you're seeing? The prefix is optional, but it depends on the definition. We can also update the default to be an empty string.
Re: [PR] Open-api: update prefix param description [iceberg]
ajantha-bhat commented on PR #9870: URL: https://github.com/apache/iceberg/pull/9870#issuecomment-2031310056 cc: @jbonofre
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547300740 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testHttpClientGetConnectionConfig() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testHttpClientWithSocketTimeout() throws IOException { +long socketTimeoutMs = 2000L; +Map properties = +ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); +String path = "socket/timeout/path"; + +try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) { + HttpRequest mockRequest = + request() + .withPath("/" + path) + .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of + // 2 seconds + HttpResponse mockResponse = + response() + .withStatusCode(200) + .withBody("Delayed response") + .withDelay(TimeUnit.MILLISECONDS, 5000); + mockServer.when(mockRequest).respond(mockResponse); + + Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {})) + .cause() + .isInstanceOf(SocketTimeoutException.class); +} + } + + @Test + public void testHttpClientInvalidConnectionTimeout() { Review Comment: please also 
add a separate test method where the socket timeout is invalid
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547305188 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testHttpClientGetConnectionConfig() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testHttpClientWithSocketTimeout() throws IOException { Review Comment: should there also be a test verifying the connection timeout?
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547304325 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testHttpClientGetConnectionConfig() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testHttpClientWithSocketTimeout() throws IOException { +long socketTimeoutMs = 2000L; +Map properties = +ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); +String path = "socket/timeout/path"; + +try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) { + HttpRequest mockRequest = + request() + .withPath("/" + path) + .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of + // 2 seconds + HttpResponse mockResponse = + response() + .withStatusCode(200) + .withBody("Delayed response") + .withDelay(TimeUnit.MILLISECONDS, 5000); + mockServer.when(mockRequest).respond(mockResponse); + + Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {})) + .cause() + .isInstanceOf(SocketTimeoutException.class); +} + } + + @Test + public void testHttpClientInvalidConnectionTimeout() { Review Comment: also no need 
for the `HttpClient` prefix in the test names as this is already obvious, because the test class itself is called `TestHTTPClient`
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547301639 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testHttpClientGetConnectionConfig() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testHttpClientWithSocketTimeout() throws IOException { +long socketTimeoutMs = 2000L; +Map properties = +ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); +String path = "socket/timeout/path"; + +try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) { + HttpRequest mockRequest = + request() + .withPath("/" + path) + .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of + // 2 seconds + HttpResponse mockResponse = + response() + .withStatusCode(200) + .withBody("Delayed response") + .withDelay(TimeUnit.MILLISECONDS, 5000); + mockServer.when(mockRequest).respond(mockResponse); + + Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {})) + .cause() + .isInstanceOf(SocketTimeoutException.class); +} + } + + @Test + public void testHttpClientInvalidConnectionTimeout() { +// We expect a Runtime 
exception Review Comment: this is obvious from the check further above, so no need for the comment
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547301269 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testHttpClientGetConnectionConfig() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testHttpClientWithSocketTimeout() throws IOException { +long socketTimeoutMs = 2000L; +Map properties = +ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); +String path = "socket/timeout/path"; + +try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) { + HttpRequest mockRequest = + request() + .withPath("/" + path) + .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of + // 2 seconds + HttpResponse mockResponse = + response() + .withStatusCode(200) + .withBody("Delayed response") + .withDelay(TimeUnit.MILLISECONDS, 5000); + mockServer.when(mockRequest).respond(mockResponse); + + Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {})) + .cause() + .isInstanceOf(SocketTimeoutException.class); +} + } + + @Test + public void testHttpClientInvalidConnectionTimeout() { +// We expect a Runtime 
exception +String connectionTimeoutMsStr = "invalidMs"; +Map properties = +ImmutableMap.of(HTTPClient.REST_CONNECTION_TIMEOUT_MS, connectionTimeoutMsStr); + +Assertions.assertThatThrownBy(() -> HTTPClient.builder(properties).uri(URI).build()) +.isInstanceOf(RuntimeException.class); Review Comment: this should have a `.hasMessage(...)` check
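The `.hasMessage(...)` check requested above asserts on the exception's message, not only its type, so the test still fails if a *different* failure happens to throw the same exception class. A plain-Java sketch of the idea (no AssertJ; `TimeoutParseSketch` and its `Long.parseLong` failure are hypothetical stand-ins for the `HTTPClient.builder(...)` call):

```java
public class TimeoutParseSketch {
    // Stand-in for the config parsing that rejects "invalidMs".
    public static long parseTimeout(String value) {
        return Long.parseLong(value); // throws NumberFormatException for non-numeric input
    }

    public static void main(String[] args) {
        try {
            parseTimeout("invalidMs");
            throw new AssertionError("expected a NumberFormatException");
        } catch (NumberFormatException e) {
            // The .hasMessage(...)-style check: verify the message mentions the bad value.
            if (!e.getMessage().contains("invalidMs")) {
                throw new AssertionError("unexpected message: " + e.getMessage());
            }
        }
    }
}
```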
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547300740 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testHttpClientGetConnectionConfig() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testHttpClientWithSocketTimeout() throws IOException { +long socketTimeoutMs = 2000L; +Map properties = +ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); +String path = "socket/timeout/path"; + +try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) { + HttpRequest mockRequest = + request() + .withPath("/" + path) + .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of + // 2 seconds + HttpResponse mockResponse = + response() + .withStatusCode(200) + .withBody("Delayed response") + .withDelay(TimeUnit.MILLISECONDS, 5000); + mockServer.when(mockRequest).respond(mockResponse); + + Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {})) + .cause() + .isInstanceOf(SocketTimeoutException.class); +} + } + + @Test + public void testHttpClientInvalidConnectionTimeout() { Review Comment: what about an 
invalid socket timeout?