Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-02 Thread via GitHub


sachet commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1548998658


##
core/src/main/java/org/apache/iceberg/rest/PaginatedList.java:
##
@@ -0,0 +1,269 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest;
+
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.Map;
+import java.util.Spliterator;
+import java.util.Spliterators;
+import java.util.function.Supplier;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.rest.responses.ListNamespacesResponse;
+import org.apache.iceberg.rest.responses.ListTablesResponse;
+import org.apache.iceberg.rest.responses.Route;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class PaginatedList implements List {

Review Comment:
   +1, the current implementation of the paginated list seems wrong, as the caller can get unexpected results. Two possible options are:
   1. Pre-fetch all items (as suggested above).
   2. Have methods like size(), contains(), isEmpty(), toArray(), get(), etc. first paginate across all items before performing the operation (see the sketch below).
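   
   A minimal sketch of option 2, in Python purely to illustrate the idea (the actual PR is Java, and `fetch_page` plus the class name here are hypothetical):
   ```py
   class EagerOnDemandList:
       """Paginates lazily, but fetches every remaining page before any whole-collection operation."""
   
       def __init__(self, fetch_page):
           # fetch_page(token) -> (items, next_token); next_token is None when exhausted
           self._fetch_page = fetch_page
           self._items = []
           self._next_token = ""
           self._exhausted = False
   
       def _fetch_all(self):
           # Paginate across all remaining pages before answering.
           while not self._exhausted:
               items, self._next_token = self._fetch_page(self._next_token)
               self._items.extend(items)
               if self._next_token is None:
                   self._exhausted = True
   
       def __len__(self):             # size()
           self._fetch_all()
           return len(self._items)
   
       def __contains__(self, item):  # contains()
           self._fetch_all()
           return item in self._items
   
       def __getitem__(self, idx):    # get()
           self._fetch_all()
           return self._items[idx]
   ```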
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Fix metadata file not found [iceberg]

2024-04-02 Thread via GitHub


pvary commented on PR #10069:
URL: https://github.com/apache/iceberg/pull/10069#issuecomment-2033649191

   @lurnagao-dahua: If you check 
https://github.com/apache/iceberg/blob/3caa3a28d07a2d08b9a0e4196634126f1e016d6a/hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCommits.java, you can find plenty of examples of commit errors. Maybe we could do something similar, like throwing an exception without a message. It would be nice to have a test.
   
   OTOH, if the test is more than 50 lines, it would cost us more to maintain in the long run than what we gain from testing a null check. In that case I would skip adding the extra code, following the example of #701.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] PyArrow S3FileSystem doesn't honor the AWS profile config [iceberg-python]

2024-04-02 Thread via GitHub


HonahX commented on issue #570:
URL: https://github.com/apache/iceberg-python/issues/570#issuecomment-2033588669

   @geruh, thanks for highlighting this issue. The confusion largely stems from 
the naming convention used when the `profile_name`, `region_name`, 
`aws_access_key_id`, etc., were introduced in 
[#7781](https://github.com/apache/iceberg/pull/7781). Initially, these 
configurations were intended solely for GlueCatalog, but their generic names 
suggest they might influence both Glue and S3 operations. To address this, we 
can consider renaming these configurations with a `glue.` prefix (e.g., 
`glue.profile_name`) to clarify their scope. However, to maintain API 
compatibility, we may need to support both the new and old naming conventions 
temporarily.
   
   > But on the other hand it seems reasonable that the AWS profile config 
should work uniformly across both the catalog and filesystem levels. 
   
   +1 for unified configurations. I think it may be convenient to introduce other unified configurations with generic names like `aws-access-key-id`. The overall configuration resolution order would then be (see the sketch below):
   
   1. Client-specific configs: glue.access-key-id, s3.access-key-id, etc.
   2. Unified AWS configurations like aws-access-key-id
   3. Environment variables and the default AWS config
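   
   A rough sketch of that lookup order (hedged: `resolve_aws_property` and the unified `aws-*` keys are hypothetical and do not exist in pyiceberg today):
   ```py
   import os
   
   def resolve_aws_property(props: dict, client: str, key: str, env_var: str):
       # 1. Client-specific config, e.g. glue.access-key-id or s3.access-key-id
       if f"{client}.{key}" in props:
           return props[f"{client}.{key}"]
       # 2. Unified AWS config, e.g. aws-access-key-id
       if f"aws-{key}" in props:
           return props[f"aws-{key}"]
       # 3. Environment variables / the default AWS config chain
       return os.environ.get(env_var)
   
   props = {"aws-access-key-id": "unified-key", "s3.access-key-id": "s3-key"}
   resolve_aws_property(props, "s3", "access-key-id", "AWS_ACCESS_KEY_ID")    # -> "s3-key"
   resolve_aws_property(props, "glue", "access-key-id", "AWS_ACCESS_KEY_ID")  # -> "unified-key"
   ```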
   
   > However, we're currently utilizing PyArrow's 
[S3FileSystem](https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html#pyarrow.fs.S3FileSystem),
 which doesn't inherently support AWS profiles. This means we'd need to bridge 
that gap manually.
   
   Regarding the `profile_name` support for PyArrow's S3FileSystem, it seems 
there might not be a direct solution from the pyiceberg side. This 
functionality appears to be more suitably addressed through enhancements to the 
PyArrow library itself. WDYT?
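   
   For reference, one way that gap can be bridged manually today is to resolve the profile with boto3 and pass explicit credentials to PyArrow (only a sketch of a possible workaround, not something pyiceberg currently does):
   ```py
   import boto3
   from pyarrow import fs
   
   # Resolve the named profile with boto3, since pyarrow.fs.S3FileSystem has no profile support.
   session = boto3.Session(profile_name="my-profile")
   creds = session.get_credentials().get_frozen_credentials()
   s3 = fs.S3FileSystem(
       access_key=creds.access_key,
       secret_key=creds.secret_key,
       session_token=creds.token,
       region=session.region_name,
   )
   ```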
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-02 Thread via GitHub


danielcweeks commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1548945175


##
core/src/main/java/org/apache/iceberg/rest/PaginatedList.java:
##
@@ -0,0 +1,269 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest;
+
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.Map;
+import java.util.Spliterator;
+import java.util.Spliterators;
+import java.util.function.Supplier;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.rest.responses.ListNamespacesResponse;
+import org.apache.iceberg.rest.responses.ListTablesResponse;
+import org.apache.iceberg.rest.responses.Route;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class PaginatedList implements List {

Review Comment:
   To clarify my comment, I think there are really two issues:
   
   1) The current implementation does not properly adhere to the List 
behaviors.  The dynamodb example that's referenced will eagerly fetch if 
certain methods are called, which is necessary to get the correct behavior out 
of the List interface.  
   
   2) The second issue is more about the utility of lazy pagination, since it seems the original motivation is to address server-side limitations (like Glue's 100-results-per-request limit) as opposed to a client-side memory limitation.  While you might be able to optimize something like the `SHOW TABLES LIKE` command client-side, we haven't seen the client side as the limiting factor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support creating tags [iceberg-python]

2024-04-02 Thread via GitHub


enkidulan commented on issue #573:
URL: https://github.com/apache/iceberg-python/issues/573#issuecomment-2033571970

   For the reference, I was able to make a tag only by using some private 
properties of the transaction object:
   
   ```py
   from pyiceberg.table import SetSnapshotRefUpdate, update_table_metadata
   
   with table.transaction() as txn:
       update = SetSnapshotRefUpdate(
           ref_name=tag,
           type="tag",
           snapshot_id=snapshot_id,
           max_ref_age_ms=None,
           max_snapshot_age_ms=None,
           min_snapshots_to_keep=None,
       )
       txn._updates = [update]
       txn.table_metadata = update_table_metadata(txn.table_metadata, [update])
   ```
   This seems to work fine as a temporary workaround for development purposes, but it would be great to have a public method for creating tags.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Support creating tags [iceberg-python]

2024-04-02 Thread via GitHub


enkidulan opened a new issue, #573:
URL: https://github.com/apache/iceberg-python/issues/573

   ### Feature Request / Improvement
   
   Historical tags in iceberg docs - 
https://iceberg.apache.org/docs/1.5.0/branching/#historical-tags
   
   Not sure if it was intentional behavior, but pyiceberg `v0.6.0` allowed tagging by using the public `set_ref_snapshot` method:
   ```py
   with table.transaction() as transaction:
       transaction.set_ref_snapshot(
           snapshot_id=snapshot_id,
           parent_snapshot_id=snapshot_id,
           ref_name=revision,
           type="tag",
       )
   ```
   
   The new dev version (the current main branch) has deprecated the 
`set_ref_snapshot` method, so I can't find a way to create a tag using public 
methods on the transaction object. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548900850


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+if let Some(predicates) = &self.predicates {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: &field_id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and(&mut self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or(&mut self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not(&mut self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true(&mut self) -> Result {
+Ok(())
+}
+
+fn visit_always_false(&mut self) -> Result {
+Ok(())
+}
+
+fn visit_unary(&mut self, predicate: &UnaryExpression) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary(&mut self, predicate: &BinaryExpression) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set(&mut self, predicate: &SetExpression) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference(&mut self, reference: &BoundReference) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+struct PredicateConverter<'a> {
+pub columns: &'a Vec,
+pub projection_mask: ProjectionMask,
+pub parquet_schema: &'a SchemaDescriptor,
+pub column_map: &'a HashMap,
+}
+
+fn get_arrow_datum(datum: &Datum) -> Box {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Box::new(BooleanArray::new_scalar(*value)),
+PrimitiveLiteral::Int(value)

Re: [PR] refine: seperate parquet reader and arrow convert [iceberg-rust]

2024-04-02 Thread via GitHub


viirya commented on code in PR #313:
URL: https://github.com/apache/iceberg-rust/pull/313#discussion_r1548890936


##
crates/iceberg/src/reader.rs:
##


Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] refine: seperate parquet reader and arrow convert [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on PR #313:
URL: https://github.com/apache/iceberg-rust/pull/313#issuecomment-2033470850

   cc @viirya Would you also take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] refine: seperate parquet reader and arrow convert [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on code in PR #313:
URL: https://github.com/apache/iceberg-rust/pull/313#discussion_r1548868232


##
crates/iceberg/src/arrow/from.rs:
##


Review Comment:
   How about just naming it `schema.rs`, so we can put all schema-related code here?



##
crates/iceberg/src/reader.rs:
##


Review Comment:
   Move this to `arrow` module?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] int64() is converted to float for nullable field when None is provided + nan returned when None is expected [iceberg-python]

2024-04-02 Thread via GitHub


Co0olCat commented on issue #572:
URL: https://github.com/apache/iceberg-python/issues/572#issuecomment-2033459923

   It looks like an issue with pyarrow: scan() returns the correct data and type, and pyarrow.project_table() does the conversion...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] int64() is converted to float for nullable field when None is provided + nan returned when None is expected [iceberg-python]

2024-04-02 Thread via GitHub


Co0olCat closed issue #572: int64() is converted to float for nullable field 
when None is provided + nan returned when None is expected
URL: https://github.com/apache/iceberg-python/issues/572


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] int64() is converted to float for nullable field when None is provided + nan returned when None is expected [iceberg-python]

2024-04-02 Thread via GitHub


Co0olCat opened a new issue, #572:
URL: https://github.com/apache/iceberg-python/issues/572

   ### Apache Iceberg version
   
   0.6.0 (latest release)
   
   ### Please describe the bug 🐞
   
   For reproduction using 
https://github.com/apache/iceberg-python/blob/main/tests/catalog/test_glue.py
   
   Here is the failing test:
   ```python
   @mock_aws
   def test_create_table_with_pyarrow_schema(
       _bucket_initialize: None,
       moto_endpoint_url: str,
       database_name: str,
       table_name: str,
   ) -> None:
       catalog_name = "glue"
       identifier = (database_name, table_name)
       test_catalog = GlueCatalog(catalog_name, **{"s3.endpoint": moto_endpoint_url})
       test_catalog.create_namespace(namespace=database_name)
   
       pa_schema = pa.schema([
           pa.field('year', pa.int64(), nullable=False),
           pa.field('n_legs', pa.int64(), nullable=True),
           pa.field('animals', pa.string(), nullable=True)
       ])
   
       table = test_catalog.create_table(
           identifier=identifier,
           schema=pa_schema,
           location=f"s3://{BUCKET_NAME}/{database_name}.db/{table_name}",
       )
       assert table.identifier == (catalog_name,) + identifier
       assert TABLE_METADATA_LOCATION_REGEX.match(table.metadata_location)
       assert test_catalog._parse_metadata_version(table.metadata_location) == 0
   
       table.append(
           pa.Table.from_pylist(
               [
                   {"year": 2001, "n_legs": 2, "animals": None},
                   {"year": 2002, "n_legs": None, "animals": "Horse"},
               ], schema=pa_schema
           )
       )
   
       assert len(table.scan().to_arrow()) == 2
   
       table.append(
           pa.Table.from_pylist(
               [
                   {"year": 2003, "n_legs": 6, "animals": "Cicada"},
                   {"year": 2004, "n_legs": 8, "animals": "Spider"},
               ], schema=pa_schema
           )
       )
   
       assert len(table.scan().to_arrow()) == 4
   
       assert table.scan().to_pandas().to_dict("records") == [
           {"animals": "Cicada", "n_legs": 6, "year": 2003},
           {"animals": "Spider", "n_legs": 8, "year": 2004},
           {"animals": None, "n_legs": 2, "year": 2001},
           {"animals": "Horse", "n_legs": None, "year": 2002},
       ]
   ```
   
   Error part:
   ```
   E Full diff:
   E   [
   E -  {'animals': 'Cicada', 'n_legs': 6, 'year': 2003},
   E +  {'animals': 'Cicada', 'n_legs': 6.0, 'year': 2003},
   E ?   ++
   E -  {'animals': 'Spider', 'n_legs': 8, 'year': 2004},
   E +  {'animals': 'Spider', 'n_legs': 8.0, 'year': 2004},
   E ?   ++
   E -  {'animals': None, 'n_legs': 2, 'year': 2001},
   E +  {'animals': None, 'n_legs': 2.0, 'year': 2001},
   E ?   ++
   E -  {'animals': 'Horse', 'n_legs': None, 'year': 2002},
   E ? -- ^
   E +  {'animals': 'Horse', 'n_legs': nan, 'year': 2002},
   E ?  ^^
   E   ]
   ```
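   
   For context, the int-to-float/NaN behavior in the diff above can be reproduced with plain pyarrow/pandas, independent of pyiceberg (a minimal sketch):
   ```python
   import pyarrow as pa
   
   t = pa.table({"n_legs": pa.array([2, None], type=pa.int64())})
   df = t.to_pandas()            # default conversion, no types_mapper
   print(df.dtypes)              # n_legs    float64  (nulls force the int64 column to float)
   print(df.to_dict("records"))  # [{'n_legs': 2.0}, {'n_legs': nan}]
   
   # Keeping integers would need pandas' nullable dtype, e.g.:
   #   import pandas as pd
   #   t.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
   ```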


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] [WIP] Migrate TableTestBase related classes to JUnit5 [iceberg]

2024-04-02 Thread via GitHub


tomtongue opened a new pull request, #10080:
URL: https://github.com/apache/iceberg/pull/10080

   Migrate the following test classes so that `TableTestBase` can be deleted, for https://github.com/apache/iceberg/issues/9085.
   
   This PR covers the migration of the `TableTestBase`-related classes from https://github.com/apache/iceberg/pull/10063.
   
   ## Current Progress
   
   Core:
   - [x] `TestContentFileParser` (skipped)
   - [x] `TestFileScanTaskParser` (skipped)
   - [x] `util/TestTableScanUtil`
   
   
   - [ ] `WriterTestBase`
 - [x] `TestFileWriterFactory`
   - [x] `TestGenericFileWriterFactory`
   - [ ] `TestSparkFileWriterFactory` for the versions: v3.3, v3.4, v3.5
   - [ ] `TestFlinkFileWriterFactory` for the versions: v1.15, v1.16, 
v1.17, v1.18
 - [x] `TestPartitioningWriters`
   - [ ] `TestSparkPartitioningWriters` for the versions: v3.3, v3.4, v3.5
   - [ ] `TestFlinkPartitioningWriters` for the versions: v1.15, v1.16, 
v1.17, v1.18
 - [x] `TestPositionDeltaWriters`
   - [ ] `TestSparkPositionDeltaWriters` for the versions: v3.3, v3.4, v3.5
   - [ ] `TestFlinkPositionDeltaWriters` for the versions: v1.15, v1.16, 
v1.17, v1.18
 - [x] `TestRollingFileWriters`
   - [ ] `TestSparkRollingFileWriters` for the versions: v3.3, v3.4, v3.5
   - [ ] `TestFlinkRollingFileWriters` for the versions: v1.15, v1.16, 
v1.17, v1.18
   
   `iceberg-flink` for the versions: v1.15, v1.16, v1.17, v1.18
   - [ ] `TestStreamingReaderOperator` latest ok
   - [ ] `TestStreamingMonitorFunction` latest ok
   - [ ] `TestIcebergFilesCommitter` latest ok
   - [ ] `TestDeltaTaskWriter` latest ok
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Fix metadata file not found [iceberg]

2024-04-02 Thread via GitHub


lurnagao-dahua commented on PR #10069:
URL: https://github.com/apache/iceberg/pull/10069#issuecomment-2033424821

   Thank you for your response!
   In my case, Flink streaming writes to Iceberg:
   1. The Hive metastore has been in continuous full GC, so it throws SocketTimeoutException: Read timed out (`hive.metastore.client.socket.timeout` defaults to 600s).
   2. The HiveTableOperations commit thread calls `Thread.sleep(retryDelaySeconds * 1000)` to retry.
   3. The Flink checkpoint timeout is less than 600s and interrupts it, which then throws an InterruptedException with no message.
   
   I have been thinking about this for a while and still have some doubts about this UT. Can you give me some advice?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-02 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548817290


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+if let Some(predicates) = &self.predicates {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: &field_id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and(&mut self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or(&mut self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not(&mut self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true(&mut self) -> Result {
+Ok(())
+}
+
+fn visit_always_false(&mut self) -> Result {
+Ok(())
+}
+
+fn visit_unary(&mut self, predicate: &UnaryExpression) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary(&mut self, predicate: &BinaryExpression) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set(&mut self, predicate: &SetExpression) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference(&mut self, reference: &BoundReference) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+struct PredicateConverter<'a> {
+pub columns: &'a Vec,
+pub projection_mask: ProjectionMask,
+pub parquet_schema: &'a SchemaDescriptor,
+pub column_map: &'a HashMap,
+}
+
+fn get_arrow_datum(datum: &Datum) -> Box {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Box::new(BooleanArray::new_scalar(*value)),
+PrimitiveLiteral::Int(value) => 
Bo

Re: [PR] Open-api: update prefix param description [iceberg]

2024-04-02 Thread via GitHub


ajantha-bhat commented on code in PR #9870:
URL: https://github.com/apache/iceberg/pull/9870#discussion_r1548815679


##
open-api/rest-catalog-open-api.yaml:
##
@@ -1444,7 +1444,7 @@ components:
   schema:
 type: string
   required: true
-  description: An optional prefix in the path
+  description: Prefix in the path

Review Comment:
   The validator doesn't allow setting `required` to `false` for a path parameter.
   
   Maybe what we need is to define two paths, one with the prefix and one without?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Move writes to Transaction [iceberg-python]

2024-04-02 Thread via GitHub


syun64 opened a new pull request, #571:
URL: https://github.com/apache/iceberg-python/pull/571

   As a followup from @HonahX 's suggestion on 
https://github.com/apache/iceberg-python/pull/498


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-02 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548813765


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+if let Some(predicates) = &self.predicates {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: &field_id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and(&mut self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or(&mut self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not(&mut self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true(&mut self) -> Result {
+Ok(())
+}
+
+fn visit_always_false(&mut self) -> Result {
+Ok(())
+}
+
+fn visit_unary(&mut self, predicate: &UnaryExpression) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary(&mut self, predicate: &BinaryExpression) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set(&mut self, predicate: &SetExpression) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference(&mut self, reference: &BoundReference) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+struct PredicateConverter<'a> {
+pub columns: &'a Vec,
+pub projection_mask: ProjectionMask,
+pub parquet_schema: &'a SchemaDescriptor,
+pub column_map: &'a HashMap,
+}
+
+fn get_arrow_datum(datum: &Datum) -> Box {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Box::new(BooleanArray::new_scalar(*value)),
+PrimitiveLiteral::Int(value) => 
Bo

Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-02 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548809949


##
crates/iceberg/src/arrow.rs:
##
@@ -20,24 +20,38 @@
 use async_stream::try_stream;
 use futures::stream::StreamExt;
 use parquet::arrow::{ParquetRecordBatchStreamBuilder, ProjectionMask};
+use std::collections::HashMap;
 
 use crate::io::FileIO;
 use crate::scan::{ArrowRecordBatchStream, FileScanTask, FileScanTaskStream};
-use crate::spec::SchemaRef;
+use crate::spec::{Datum, PrimitiveLiteral, SchemaRef};
 
 use crate::error::Result;
+use crate::expr::{
+BinaryExpression, BoundPredicate, BoundReference, PredicateOperator, 
SetExpression,
+UnaryExpression,
+};
 use crate::spec::{
 ListType, MapType, NestedField, NestedFieldRef, PrimitiveType, Schema, 
StructType, Type,
 };
 use crate::{Error, ErrorKind};
+use arrow_arith::boolean::{and, is_not_null, is_null, not, or};
+use arrow_array::{
+BooleanArray, Datum as ArrowDatum, Float32Array, Float64Array, Int32Array, 
Int64Array,
+};
+use arrow_ord::cmp::{eq, gt, gt_eq, lt, lt_eq, neq};
 use arrow_schema::{DataType, Field, Fields, Schema as ArrowSchema, TimeUnit};
+use bitvec::macros::internal::funty::Fundamental;
+use parquet::arrow::arrow_reader::{ArrowPredicate, ArrowPredicateFn, 
RowFilter};
+use parquet::schema::types::{SchemaDescriptor, Type as ParquetType};
 use std::sync::Arc;
 
 /// Builder to create ArrowReader
 pub struct ArrowReaderBuilder {
 batch_size: Option,
 file_io: FileIO,
 schema: SchemaRef,
+predicates: Option>,

Review Comment:
   This is because the Parquet API `RowFilter` takes `predicates: Vec<Box<dyn ArrowPredicate>>`.
   
   This is `RowFilter`'s doc:
   
   ```
   /// A [`RowFilter`] allows pushing down a filter predicate to skip IO and 
decode
   ///
   /// This consists of a list of [`ArrowPredicate`] where only the rows that 
satisfy all
   /// of the predicates will be returned. 
   ```
   
   So I think the relationship between these predicates is conjunction.
   
   > Since we already have And/Or as part of BoundPredicate variant, how about 
just BoundPredicate?
   
   Yeah, I can use just one `BoundPredicate`, so users can define conjunctions within a single predicate using `And`.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-02 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1548810072


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+if let Some(predicates) = &self.predicates {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: &field_id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {

Review Comment:
   Okay



##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> 
Result> {
+if let Some(predicates) = &self.predicates {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+  

Re: [PR] refine: seperate parquet reader and arrow convert [iceberg-rust]

2024-04-02 Thread via GitHub


ZENOTME commented on PR #313:
URL: https://github.com/apache/iceberg-rust/pull/313#issuecomment-2033373684

   cc @liurenjie1024 @Xuanwo @Fokko 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] refine: seperate parquet reader and arrow convert [iceberg-rust]

2024-04-02 Thread via GitHub


ZENOTME opened a new pull request, #313:
URL: https://github.com/apache/iceberg-rust/pull/313

   This PR separates the Parquet reader out of the arrow module, and makes the arrow module a directory so that we can separate `from_arrow` and `to_arrow`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on PR #241:
URL: https://github.com/apache/iceberg-rust/pull/241#issuecomment-2033371130

   > > which is somehow motivated by @viirya 's 
[pr](https://github.com/apache/iceberg-rust/pull/295/files#diff-a59622727cd67153abdf02031475bf8a1b1921738df4ca9903a685ff6970b7aaR472),
 but move the travsering flow out of trait body.
   > 
   > @liurenjie1024 ...so the traversal flow could then be implemented on e.g. 
the `ManifestEvaluator` itself. For example, `eval()` could call the 
corresponding 'visit_xx' on the visitor that implements 
`BoundPredicateVisitor`. Is this what you mean?
   
   I was thinking about the following structure:
   ```rust
   pub trait BoundPredicateVisitor {
       type T;
   
       fn visit_and(&mut self, values: [Self::T; 2]) -> Result<Self::T>;
       fn visit_or(&mut self, values: [Self::T; 2]) -> Result<Self::T>;
   
       ...
   }
   
   pub fn visit_bound_predicate<V: BoundPredicateVisitor>(visitor: &mut V, predicate: &BoundPredicate) -> Result<V::T> {
       match predicate {
           BoundPredicate::And(children) => {
               let ret = [visit_bound_predicate(visitor, children[0]), visit_bound_predicate(visitor, children[1])];
               visitor.visit_and(ret)
           },
           ...
       }
   }
   
   pub struct ManifestEvaluator {}
   
   impl BoundPredicateVisitor for ManifestEvaluator {}
   
   impl ManifestEvaluator {
   
       pub fn eval(&mut self, predicate: &BoundPredicate) -> bool {
           visit_bound_predicate(self, predicate)?
       }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support uri redirect in rest client [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 merged PR #310:
URL: https://github.com/apache/iceberg-rust/pull/310


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Support CreateTableTransaction in Glue and Rest [iceberg-python]

2024-04-02 Thread via GitHub


syun64 commented on PR #498:
URL: https://github.com/apache/iceberg-python/pull/498#issuecomment-2033344103

   > Shall we move "append", "overwrite", and "add_files" to `Transaction` 
class? This change would enable us to seamlessly chain these operations with 
other table updates in a single commit. This adjustment could be particularly 
beneficial in the context of `CreateTableTransaction`, as it would enable users 
to not only create a table but also populate it with initial data in one go.
   
   I think this is a great question.
   
   I think we have two options here:
   1. We move these actions into the Transaction class, and remove them from 
Table class
   2. We move them into the Transaction class, and also keep an implementation 
in the Table class
   
   I'm not sure which of the two is better, but I keep asking myself whether there's a good reason why we have two separate APIs that achieve similar results.
   
   For example, we have **update_spec**, **update_schema** that can be created 
from the **Transaction** or the **Table**, and I feel like we might be creating 
work for ourselves by duplicating the feature in both classes. What if we 
consolidated all of our actions into the Transaction class, and removed them 
from the Table class?
   
   I think the upside of that would be that the API conveys a very clear message to the developer: a _transaction is committed to a table_, and a series of _actions_ can be chained onto the _same transaction_ as a single commit.
   
   In addition, we can avoid [issues like 
this](https://github.com/apache/iceberg-python/pull/508) where we roll out a 
feature to one API implementation, but not the other.
   
   ```
   with given_table.update_schema() as tx:
       tx.add_column(path="new_column1", field_type=IntegerType())
   ```
   
   ```
   with given_table.transaction() as tx:
       with tx.update_schema() as update:
           update.add_column(path="new_column1", field_type=IntegerType())
   ```
   
   To me, the bottom pattern feels more explicit than the above option, and I'm 
curious to hear others' opinions on this topic


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Please remove old releases [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] commented on issue #2414:
URL: https://github.com/apache/iceberg/issues/2414#issuecomment-2033310864

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Flink cdc events with update or delete doesn't work in 0.11.0 branch [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] commented on issue #2409:
URL: https://github.com/apache/iceberg/issues/2409#issuecomment-2033310842

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] how to fix org.apache.spark.shuffle.FetchFailedException: [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] closed issue #2211: how to fix 
org.apache.spark.shuffle.FetchFailedException:
URL: https://github.com/apache/iceberg/issues/2211


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] how to fix org.apache.spark.shuffle.FetchFailedException: [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] commented on issue #2211:
URL: https://github.com/apache/iceberg/issues/2211#issuecomment-2033310616

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Need help inserting data into hadoop table with flink sql in java [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] closed issue #2209: Need help inserting data into hadoop 
table with flink sql in java
URL: https://github.com/apache/iceberg/issues/2209


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Need help inserting data into hadoop table with flink sql in java [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] commented on issue #2209:
URL: https://github.com/apache/iceberg/issues/2209#issuecomment-2033310589

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Hive: got error while joining iceberg table and hive table [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] commented on issue #2198:
URL: https://github.com/apache/iceberg/issues/2198#issuecomment-2033310562

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Hive: got error while joining iceberg table and hive table [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] closed issue #2198: Hive: got error while joining iceberg 
table and hive table
URL: https://github.com/apache/iceberg/issues/2198


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Cannot write incompatible dataset to table with schema error for list types [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] commented on issue #2192:
URL: https://github.com/apache/iceberg/issues/2192#issuecomment-2033310543

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Cannot write incompatible dataset to table with schema error for list types [iceberg]

2024-04-02 Thread via GitHub


github-actions[bot] closed issue #2192: Cannot write incompatible dataset to 
table with schema error for list types
URL: https://github.com/apache/iceberg/issues/2192


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Iceberg/Comet integration POC [iceberg]

2024-04-02 Thread via GitHub


huaxingao commented on code in PR #9841:
URL: https://github.com/apache/iceberg/pull/9841#discussion_r1548708375


##
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergColumnReader.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.data.vectorized.comet;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.comet.parquet.AbstractColumnReader;
+import org.apache.comet.parquet.ColumnReader;
+import org.apache.comet.parquet.TypeUtil;
+import org.apache.comet.parquet.Utils;
+import org.apache.comet.vector.CometVector;
+import org.apache.iceberg.parquet.VectorizedReader;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Types;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.column.page.PageReader;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+
+/**
+ * A Iceberg Parquet column reader backed by a Boson {@link ColumnReader}. 
This class should be used

Review Comment:
   Oops. Will change. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] 5 dremio blog march 2024 [iceberg]

2024-04-02 Thread via GitHub


AlexMercedCoder commented on code in PR #10067:
URL: https://github.com/apache/iceberg/pull/10067#discussion_r1548619662


##
site/docs/blogs.md:
##
@@ -23,6 +23,37 @@ title: "Blogs"
 Here is a list of company blogs that talk about Iceberg. The blogs are ordered 
from most recent to oldest.
 
 
+
+### [End-to-End Basic Data Engineering Tutorial (Apache Spark, Apache Iceberg 
Dremio, Apache Superset, 
Nessie)](https://medium.com/data-engineering-with-dremio/end-to-end-basic-data-engineering-tutorial-apache-spark-apache-iceberg-dremio-apache-superset-a896ecab46f6)

Review Comment:
   Also fixed on Medium.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] 5 dremio blog march 2024 [iceberg]

2024-04-02 Thread via GitHub


AlexMercedCoder commented on code in PR #10067:
URL: https://github.com/apache/iceberg/pull/10067#discussion_r1548618808


##
site/docs/blogs.md:
##
@@ -23,6 +23,37 @@ title: "Blogs"
 Here is a list of company blogs that talk about Iceberg. The blogs are ordered 
from most recent to oldest.
 
 
+
+### [End-to-End Basic Data Engineering Tutorial (Apache Spark, Apache Iceberg 
Dremio, Apache Superset, 
Nessie)](https://medium.com/data-engineering-with-dremio/end-to-end-basic-data-engineering-tutorial-apache-spark-apache-iceberg-dremio-apache-superset-a896ecab46f6)

Review Comment:
   No, that's a typo (missing comma); I just added a new commit adding the comma.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Iceberg/Comet integration POC [iceberg]

2024-04-02 Thread via GitHub


RussellSpitzer commented on code in PR #9841:
URL: https://github.com/apache/iceberg/pull/9841#discussion_r1548617271


##
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergColumnReader.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.data.vectorized.comet;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.comet.parquet.AbstractColumnReader;
+import org.apache.comet.parquet.ColumnReader;
+import org.apache.comet.parquet.TypeUtil;
+import org.apache.comet.parquet.Utils;
+import org.apache.comet.vector.CometVector;
+import org.apache.iceberg.parquet.VectorizedReader;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Types;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.column.page.PageReader;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+
+/**
+ * A Iceberg Parquet column reader backed by a Boson {@link ColumnReader}. 
This class should be used

Review Comment:
   What is boson :P



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] PyArrow S3FileSystem doesn't honor the AWS profile config [iceberg-python]

2024-04-02 Thread via GitHub


geruh opened a new issue, #570:
URL: https://github.com/apache/iceberg-python/issues/570

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   When initializing the GlueCatalog with a specific AWS profile, everything 
works as it should for catalog operations. But we've hit an issue when it comes 
to working with S3 via the PyArrow S3FileSystem. Users can specify a profile 
for initiating a boto connection; however, this preference doesn't carry over 
to the S3FileSystem. Instead of using the specified AWS profile, it will check 
the catalog configs for the S3 settings like `s3.access-key-id`, `s3.region`, 
etc. If those aren't passed in, PyArrow's S3FileSystem has its own strategy of 
inferring credentials, such as:
   1. the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN 
environment variables. 
   2. the default profile credentials in your ~/.aws/credentials and 
~/.aws/config.
   
   This workflow leads to some inconsistencies. For example, while Glue 
operations might be using a user-specified profile, S3 operations could end up 
using a different set of credentials or even a different region from what’s set 
in the environment variables or the AWS config files. This is seen in issue 
#515, where one region (like us-west-2) unexpectedly switches to another (like 
us-east-1), causing a 301 exception.
   
   For example:
   
   1. Set up an AWS profile in ~/.aws/config with an incorrect region:
   
   ```
   [default]
   region = us-east-1
   
   [test]
   region = us-west-2
   ```
   
   2. Initialize the GlueCatalog with the correct region you want to use:
   ```
   catalog = pyiceberg.catalog.load_catalog(
       catalog_name, **{"type": "glue", "profile_name": "test", "region_name": "us-west-2"}
   )
   ```
   
   3. Load a table:
   ```
   catalog.load_table("default.test")
   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   OSError: When reading information for key 
'test/metadata/0-c0fc4e45-d79d-41a1-ba92-a4122c09171c.metadata.json' in 
bucket 'test_bucket': AWS Error UNKNOWN (HTTP status 301) during HeadObject 
operation: No response body.
   ```
   
   On one hand, we could argue that this profile configuration should only work 
at the catalog level, and for filesystems, the user must specify the 
aforementioned configs like `s3.region`. But on the other hand it seems 
reasonable that the AWS profile config should work uniformly across both the 
catalog and filesystem levels. This unified approach would certainly simplify 
configuration management for users. I’m leaning towards this perspective. 
However, we're currently utilizing PyArrow's S3FileSystem, which doesn't 
inherently support AWS profiles. This means we'd need to bridge that gap 
manually. 
   
   cc: @HonahX @Fokko @kevinjqliu 
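
   For reference, a minimal sketch of what that manual bridge could look like 
(illustration only, not a proposed implementation; it assumes boto3 and pyarrow 
are installed and that a "test" profile exists):
   
   ```python
   # Resolve credentials from the named profile via boto3 and hand them to
   # PyArrow's S3FileSystem explicitly.
   import boto3
   from pyarrow import fs
   
   session = boto3.Session(profile_name="test")
   creds = session.get_credentials().get_frozen_credentials()
   
   s3 = fs.S3FileSystem(
       access_key=creds.access_key,
       secret_key=creds.secret_key,
       session_token=creds.token,   # may be None for long-lived keys
       region=session.region_name,  # picks up the profile's region
   )
   ```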


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add PrePlanTable and PlanTable Endpoints to open api spec [iceberg]

2024-04-02 Thread via GitHub


rahil-c commented on PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#issuecomment-2033006567

   @nastra @rdblue @danielcweeks @jackye1995 @amogh-jahagirdar 
   
   When looking again at the `capabilities` PR: 
https://github.com/apache/iceberg/pull/9940, are we sure we want to add 
scan-planning as part of the `capabilities` list in the `ConfigResponse`? One 
question I want to raise: if the server returns `scan-planning` in the 
`capabilities`, would this mean that all tables under `RestCatalog` support 
`scan planning`?
   
   I believe we need a way to tell clients that REST scan planning is 
supported, but I think this is something that should be done at the table 
level, in the `LoadTableResponse` as a new property. Let me know what you all 
think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Snowflake Iceberg Partitioned data read issue [iceberg]

2024-04-02 Thread via GitHub


findinpath commented on issue #9404:
URL: https://github.com/apache/iceberg/issues/9404#issuecomment-2032964716

   @sfc-gh-rortloff I went through the Snowflake documentation 
https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table and don't 
see any reference related to partitioning.
   
   Could you please sketch here how to create a partitioned Iceberg table via 
Snowflake SQL syntax?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Unable to load an iceberg table from aws glue catalog [iceberg-python]

2024-04-02 Thread via GitHub


geruh commented on issue #515:
URL: https://github.com/apache/iceberg-python/issues/515#issuecomment-2032913100

   No Problem!! 
   
   This could potentially be a bug if we assume that the catalog and FileIO 
(S3) share the same AWS profile configs. On one hand, having a single profile 
configuration is convenient for the user's boto client, as it allows 
initializing all AWS clients with the correct credentials. On the other hand, 
we could argue that this configuration should only work at the catalog level, 
and for filesystems, separate configurations might be required. I'm inclined 
towards the first option. However, we are using pyarrow's S3FileSystem 
implementation, which has no concept of an AWS profile. Therefore, we will need 
to initialize these values through boto's session.get_credentials() and pass 
them to the filesystem.
   
   I'll raise an issue for this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Partitioned Append on Identity Transform [iceberg-python]

2024-04-02 Thread via GitHub


jqin61 commented on code in PR #555:
URL: https://github.com/apache/iceberg-python/pull/555#discussion_r1548140319


##
pyiceberg/table/__init__.py:
##
@@ -2526,25 +2537,44 @@ def _dataframe_to_data_files(
 """
 from pyiceberg.io.pyarrow import bin_pack_arrow_table, write_file
 
-if len([spec for spec in table_metadata.partition_specs if spec.spec_id != 
0]) > 0:
-raise ValueError("Cannot write to partitioned tables")
-
 counter = itertools.count(0)
 write_uuid = write_uuid or uuid.uuid4()
-
 target_file_size = PropertyUtil.property_as_int(
 properties=table_metadata.properties,
 property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
 default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT,
 )
+if target_file_size is None:
+raise ValueError(
+"Fail to get neither TableProperties.WRITE_TARGET_FILE_SIZE_BYTES 
nor WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT for writing target data file."

Review Comment:
   I just found that the default value itself could be None: 
   ```PARQUET_COMPRESSION_LEVEL_DEFAULT = None```
   so this None check is not unnecessary after all?
   
   The original code for this target_file_size check just `type: ignore`s it.
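
   For context, a toy sketch of the lookup pattern under discussion (a 
hypothetical helper mirroring `PropertyUtil.property_as_int`, not the actual 
code): when the fallback default is itself `None`, the helper can return 
`None`, so whether the check is dead code depends entirely on the default that 
is passed in.
   
   ```python
   from typing import Dict, Optional
   
   def property_as_int(properties: Dict[str, str], name: str, default: Optional[int]) -> Optional[int]:
       # Fall back to `default`, which may itself be None.
       value = properties.get(name)
       return int(value) if value is not None else default
   
   # A non-None default can never yield None ...
   assert property_as_int({}, "write.target-file-size-bytes", 536870912) == 536870912
   # ... but a None default (e.g. a compression level) propagates None.
   assert property_as_int({}, "write.parquet.compression-level", None) is None
   ```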



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Improve CLI Text by Adding Verbose Text for Commands [iceberg-go]

2024-04-02 Thread via GitHub


zeroshade commented on code in PR #68:
URL: https://github.com/apache/iceberg-go/pull/68#discussion_r1548127844


##
cmd/iceberg/main.go:
##
@@ -34,16 +34,21 @@ import (
 const usage = `iceberg.
 
 Usage:
-  iceberg list [options] [PARENT]
-  iceberg describe [options] [namespace | table] IDENTIFIER
-  iceberg (schema | spec | uuid | location) [options] TABLE_ID
-  iceberg drop [options] (namespace | table) IDENTIFIER
-  iceberg files [options] TABLE_ID [--history]
-  iceberg rename [options]  
-  iceberg properties [options] get (namespace | table) IDENTIFIER [PROPNAME]
-  iceberg properties [options] set (namespace | table) IDENTIFIER PROPNAME 
VALUE
-  iceberg properties [options] remove (namespace | table) IDENTIFIER PROPNAME
-  iceberg -h | --help | --version
+  iceberg [command] [options] [arguments]

Review Comment:
   Does this actually work? The library being used, `docopt`, actually uses the 
`usage` string to perform the parsing and processing. `[options]` is handled 
specially by docopt, but I don't think `[command]` and `[arguments]` are. So I 
think this might break the CLI here.
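
   For illustration, the usage-string-driven behavior can be sketched with the 
original Python docopt package (a standalone example of docopt semantics, not 
the iceberg-go code; the Go port follows the same model):
   
   ```python
   from docopt import docopt, DocoptExit
   
   USAGE = """iceberg.
   
   Usage:
     iceberg [command] [options] [arguments]
   
   Options:
     -h --help  Show this screen.
   """
   
   try:
       # Lowercase words such as "command" are literal tokens in docopt patterns,
       # not placeholders, so a realistic invocation fails to match.
       docopt(USAGE, argv=["list", "default"])
   except DocoptExit:
       print("rejected: 'list' does not match the literal word 'command'")
   ```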



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Spark configuration for amazon access key and secret key with glue catalog for apache Iceberg is not honoring [iceberg]

2024-04-02 Thread via GitHub


AwasthiSomesh opened a new issue, #10078:
URL: https://github.com/apache/iceberg/issues/10078

   Hi Team,
   
   We are using the code below to access an Iceberg table from the Glue 
catalog, with S3 as the data storage:
   
   var spark = SparkSession.builder().master("local[*]")
 .config("spark.sql.defaultCatalog", "AwsDataCatalog")
 .config("spark.sql.catalog.AwsDataCatalog", 
"org.apache.iceberg.spark.SparkCatalog")
 .config("spark.sql.catalog.AwsDataCatalog.catalog-impl", 
"org.apache.iceberg.aws.glue.GlueCatalog")
 .config("spark.sql.catalog.AwsDataCatalog.io-imp", 
"org.apache.iceberg.aws.s3.S3FileIO")
 .config("spark.hadoop.fs.s3a.access.key", "XXxxx")
 .config("spark.hadoop.fs.s3a.secret.key", 
"XXXx")
 .config("spark.hadoop.fs.s3a.aws.region", "us-west-2")
 .getOrCreate();
   
   val df1 = spark.sql("select * from default.iceberg_table_exercise1");
   
   Error:
   Exception in thread "main" 
software.amazon.awssdk.core.exception.SdkClientException: Unable to load 
credentials from any of the providers in the chain 
AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(),
 EnvironmentVariableCredentialsProvider(), 
WebIdentityTokenCredentialsProvider(), 
ProfileCredentialsProvider(profileName=default, 
profileFile=ProfileFile(profilesAndSectionsMap=[])), 
ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : 
[SystemPropertyCredentialsProvider(): Unable to load credentials from system 
settings. Access key must be specified either via environment variable 
(AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., 
EnvironmentVariableCredentialsProvider(): Unable to load credentials from 
system settings. Access key must be specified either via environment variable 
(AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., 
WebIdentityTokenCredentialsProvider(): Either the environment variable 
AWS_WEB_IDENT
 ITY_TOKEN_FILE or the javaproperty aws.webIdentityTokenFile must be set., 
ProfileCredentialsProvider(profileName=default, 
profileFile=ProfileFile(profilesAndSectionsMap=[])): Profile file contained no 
credentials for profile 'default': ProfileFile(profilesAndSectionsMap=[]), 
ContainerCredentialsProvider(): Cannot fetch credentials from container - 
neither AWS_CONTAINER_CREDENTIALS_FULL_URI or 
AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set., 
InstanceProfileCredentialsProvider(): Failed to load credentials from IMDS.]
   
   This code fails with "unable to load credentials" for the access and secret 
key, but when we pass this information with System.setProperty it works.
   
   But our requirement is to set it at the Spark level, not the system level.
   
   Jars we used: iceberg-spark-runtime-3.5_2.12-1.5.0, iceberg-aws-bundle-1.5.0
   
   Please, can anyone help ASAP?
   
   Thanks,


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Partitioned Append on Identity Transform [iceberg-python]

2024-04-02 Thread via GitHub


jqin61 commented on code in PR #555:
URL: https://github.com/apache/iceberg-python/pull/555#discussion_r1548068306


##
pyiceberg/typedef.py:
##
@@ -199,3 +199,7 @@ def __repr__(self) -> str:
 def record_fields(self) -> List[str]:
 """Return values of all the fields of the Record class except those 
specified in skip_fields."""
 return [self.__getattribute__(v) if hasattr(self, v) else None for v 
in self._position_to_field_name]
+
+def __hash__(self) -> int:
+"""Return hash value of the Record class."""
+return hash(str(self))

Review Comment:
   I think since `__repr__` is defined, the str() might still work? I tested:
   ```
   r1 = Record(1,2)
   r2 = Record(x=1, y="string value")
   print("")
   print(str(r1), hash(str(r1)))
   print(str(r2), hash(str(r2)))
   ```
   
   prints:
   ```
   Record[field1=1, field2=2] -7504199255027864703
   Record[x=1, y='string value'] -4897332691101137012
   ```
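
   As a standalone illustration of that pattern (a hypothetical class, not 
pyiceberg's `Record`), deriving `__hash__` from the `__repr__` string works 
because equal reprs hash to the same value:
   
   ```python
   class Pair:
       def __init__(self, x, y):
           self.x, self.y = x, y
   
       def __repr__(self) -> str:
           return f"Pair[x={self.x!r}, y={self.y!r}]"
   
       def __hash__(self) -> int:
           # Same idea as the diff above: hash the repr string.
           return hash(str(self))
   
   print(hash(Pair(1, "a")) == hash(Pair(1, "a")))  # True
   ```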



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Add option to delete datafiles [iceberg-python]

2024-04-02 Thread via GitHub


Fokko opened a new pull request, #569:
URL: https://github.com/apache/iceberg-python/pull/569

   This is done through the Iceberg metadata, resulting in efficient deletes if 
the data is partitioned correctly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) [iceberg-rust]

2024-04-02 Thread via GitHub


marvinlanhenke commented on PR #241:
URL: https://github.com/apache/iceberg-rust/pull/241#issuecomment-2032233828

   > which is somehow motivated by @viirya 's 
[pr](https://github.com/apache/iceberg-rust/pull/295/files#diff-a59622727cd67153abdf02031475bf8a1b1921738df4ca9903a685ff6970b7aaR472),
 but moves the traversing flow out of the trait body.
   
   @liurenjie1024 
   ...so the traversal flow could then be implemented on e.g. the 
`ManifestEvaluator` itself. 
   For example, `eval()` could call the corresponding `visit` methods on the 
visitor that implements `BoundPredicateVisitor`. Is this what you mean?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] OpenAPI: Express server capabilities via /config endpoint [iceberg]

2024-04-02 Thread via GitHub


snazy commented on PR #9940:
URL: https://github.com/apache/iceberg/pull/9940#issuecomment-2032202321

   > > I've got strong concerns about using `enum` here - special handling here 
and there, I think, that complicates things for adopters of any OpenAPI spec.
   > 
   > @snazy we use `enum` in the OpenAPI spec to list possible values of a 
string, which is also what's being documented in 
https://swagger.io/docs/specification/data-models/enums/. The underlying type 
of a capability is still a `string`. Do you have an alternative in mind for the 
issue you're seeing? Also I believe when code is generated from the OpenAPI 
spec, there should be an option to generate enums as literals
   
   The issue at hand is that an enum cannot (by default) handle unknown values, 
which is generally fine. But here we have to expect endpoints to return values 
that are _not_ known by a client. (Not only generated) clients that do not 
handle this rather special case will fail to parse the result, which is a 
problem. I'm still in favor of just using a string.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Fix metadata file not found [iceberg]

2024-04-02 Thread via GitHub


lurnagao-dahua commented on PR #10069:
URL: https://github.com/apache/iceberg/pull/10069#issuecomment-2032193393

   > @lurnagao-dahua please check styles.
   > 
   > > The reason is that in some cases, the e.getMessage() return null and it 
will throw NullPointerException, then skip checkCommitStatus, it may be delete 
metadataLocation, actually, metadata commit succeed.
   > 
   > Is it possible to add a UT for this case?
   
   Hi, I found the original PR similar to this one, 
[701](https://github.com/apache/iceberg/pull/701), and I will try to write a 
UT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Fix metadata file not found [iceberg]

2024-04-02 Thread via GitHub


lurnagao-dahua commented on PR #10069:
URL: https://github.com/apache/iceberg/pull/10069#issuecomment-2032181736

   > @lurnagao-dahua please check styles.
   > 
   > > The reason is that in some cases, e.getMessage() returns null and it 
will throw a NullPointerException, then checkCommitStatus is skipped and the 
metadataLocation may be deleted, even though the metadata commit actually 
succeeded.
   > 
   > Is it possible to add a UT for this case?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Fix metadata file not found [iceberg]

2024-04-02 Thread via GitHub


lurnagao-dahua closed pull request #10069: Hive: Fix metadata file not found
URL: https://github.com/apache/iceberg/pull/10069


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] OpenAPI: Express server capabilities via /config endpoint [iceberg]

2024-04-02 Thread via GitHub


nastra commented on PR #9940:
URL: https://github.com/apache/iceberg/pull/9940#issuecomment-2032092212

   > I've got strong concerns about using `enum` here - special handling here 
and there, I think, that complicates things for adopters of any OpenAPI spec.
   
   
   @snazy we use `enum` in the OpenAPI spec to list possible values of a 
string, which is also what's being documented in 
https://swagger.io/docs/specification/data-models/enums/. The underlying type 
of a capability is still a `string`. Do you have an alternative in mind for the 
issue you're seeing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Open-api: update prefix param description [iceberg]

2024-04-02 Thread via GitHub


jbonofre commented on code in PR #9870:
URL: https://github.com/apache/iceberg/pull/9870#discussion_r1547920214


##
open-api/rest-catalog-open-api.yaml:
##
@@ -1444,7 +1444,7 @@ components:
   schema:
 type: string
   required: true
-  description: An optional prefix in the path
+  description: Prefix in the path

Review Comment:
   I think it makes sense to have `prefix` optional.  Specifically for the 
`prefix` I would set `required: false` and check in the validator.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Open-api: update prefix param description [iceberg]

2024-04-02 Thread via GitHub


ajantha-bhat commented on code in PR #9870:
URL: https://github.com/apache/iceberg/pull/9870#discussion_r1547852220


##
open-api/rest-catalog-open-api.yaml:
##
@@ -1444,7 +1444,7 @@ components:
   schema:
 type: string
   required: true
-  description: An optional prefix in the path
+  description: Prefix in the path

Review Comment:
   I think if we need optional params in path, we need to define a separate 
path without prefix.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Partitioned Append on Identity Transform [iceberg-python]

2024-04-02 Thread via GitHub


Fokko commented on code in PR #555:
URL: https://github.com/apache/iceberg-python/pull/555#discussion_r1547810906


##
pyiceberg/manifest.py:
##
@@ -289,10 +286,7 @@ def 
partition_field_to_data_file_partition_field(partition_field_type: IcebergTy
 
 
 @partition_field_to_data_file_partition_field.register(LongType)
-@partition_field_to_data_file_partition_field.register(DateType)

Review Comment:
   This single-dispatch is there only for the `TimeType`, it seems. Probably we 
should also convert those into a native type.



##
tests/conftest.py:
##
@@ -2000,7 +2000,11 @@ def spark() -> "SparkSession":
 'float': [0.0, None, 0.9],
 'double': [0.0, None, 0.9],
 'timestamp': [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 
19, 25, 00)],
-'timestamptz': [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 
1, 19, 25, 00)],
+'timestamptz': [

Review Comment:
   Nice one!



##
pyiceberg/table/__init__.py:
##
@@ -3111,3 +3147,112 @@ def snapshots(self) -> "pa.Table":
 snapshots,
 schema=snapshots_schema,
 )
+
+
+@dataclass(frozen=True)
+class TablePartition:
+partition_key: PartitionKey
+arrow_table_partition: pa.Table
+
+
+def _get_partition_sort_order(partition_columns: list[str], reverse: bool = 
False) -> dict[str, Any]:
+order = 'ascending' if not reverse else 'descending'
+null_placement = 'at_start' if reverse else 'at_end'
+return {'sort_keys': [(column_name, order) for column_name in 
partition_columns], 'null_placement': null_placement}
+
+
+def group_by_partition_scheme(arrow_table: pa.Table, partition_columns: 
list[str]) -> pa.Table:
+"""Given a table, sort it by current partition scheme."""
+# only works for identity for now
+sort_options = _get_partition_sort_order(partition_columns, reverse=False)
+sorted_arrow_table = 
arrow_table.sort_by(sorting=sort_options['sort_keys'], 
null_placement=sort_options['null_placement'])
+return sorted_arrow_table
+
+
+def get_partition_columns(
+spec: PartitionSpec,
+schema: Schema,
+) -> list[str]:
+partition_cols = []
+for partition_field in spec.fields:
+column_name = schema.find_column_name(partition_field.source_id)
+if not column_name:
+raise ValueError(f"{partition_field=} could not be found in 
{schema}.")
+partition_cols.append(column_name)
+return partition_cols
+
+
+def _get_table_partitions(
+arrow_table: pa.Table,
+partition_spec: PartitionSpec,
+schema: Schema,
+slice_instructions: list[dict[str, Any]],
+) -> list[TablePartition]:
+sorted_slice_instructions = sorted(slice_instructions, key=lambda x: 
x['offset'])
+
+partition_fields = partition_spec.fields
+
+offsets = [inst["offset"] for inst in sorted_slice_instructions]
+projected_and_filtered = {
+partition_field.source_id: 
arrow_table[schema.find_field(name_or_id=partition_field.source_id).name]
+.take(offsets)
+.to_pylist()
+for partition_field in partition_fields
+}
+
+table_partitions = []
+for idx, inst in enumerate(sorted_slice_instructions):
+partition_slice = arrow_table.slice(**inst)
+fieldvalues = [
+PartitionFieldValue(partition_field, 
projected_and_filtered[partition_field.source_id][idx])
+for partition_field in partition_fields
+]
+partition_key = PartitionKey(raw_partition_field_values=fieldvalues, 
partition_spec=partition_spec, schema=schema)
+table_partitions.append(TablePartition(partition_key=partition_key, 
arrow_table_partition=partition_slice))
+return table_partitions
+
+
+def partition(spec: PartitionSpec, schema: Schema, arrow_table: pa.Table) -> 
Iterable[TablePartition]:

Review Comment:
   It would be good to have a slightly more descriptive name. I also think we 
should hide this from the outside user.
   ```suggestion
   def _determine_partitions(spec: PartitionSpec, schema: Schema, arrow_table: 
pa.Table) -> List[TablePartition]:
   ```
   I think we can also return a list, so folks know that it is already 
materialized.



##
tests/conftest.py:
##
@@ -2045,3 +2049,19 @@ def arrow_table_with_null(pa_schema: "pa.Schema") -> 
"pa.Table":
 
 """PyArrow table with all kinds of columns"""
 return pa.Table.from_pydict(TEST_DATA_WITH_NULL, schema=pa_schema)
+
+
+@pytest.fixture(scope="session")
+def arrow_table_without_data(pa_schema: "pa.Schema") -> "pa.Table":
+import pyarrow as pa
+
+"""PyArrow table with all kinds of columns."""

Review Comment:
   ```suggestion
   """PyArrow table with all kinds of columns."""
   
   import pyarrow as pa
   ```



##
pyiceberg/table/__init__.py:
##
@@ -1131,8 +1133,11 @@ def append(self, df: pa.Table, snapshot_properties: 
Dict[str, str] = EMPTY_DICT)
 if not isinstance(df, pa.Table):
 raise Valu

Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547819858


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {

Review Comment:
   Yeah, it looks much better now, thanks! I'll take a careful review later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [WIP] Add `ManifestEvaluator` to allow filtering of files in a table scan (Issue #152) [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on PR #241:
URL: https://github.com/apache/iceberg-rust/pull/241#issuecomment-2031946776

   > @liurenjie1024 @ZENOTME @sdd @Xuanwo I'd really appreciate your thoughts 
on this:
   > 
   > I took a closer look at the work @sdd has already done - and I think in 
order to proceed it would make sense to split the `ManifestEvaluator`, 
`InclusiveProjection` and `PartitionEvaluator` into separate modules.
   > 
   > I am thinking about putting each visitor into its own file within 
`/iceberg/src/expr` (perhaps even another subfolder /visitors) - for example: 
`/iceberg/src/expr/manifest_evaluator.rs.`
   > 
   > This way, we could create individual issues on each implementation - and 
work better in parallel.
   > 
   > Perhaps, it also makes sense to provide a trait to make sure each visitor 
adheres to the same interface (although I'm not sure this is neccessary right 
now...)
   > 
   > ```rust
   > // Pseudo-Example
   > pub trait BooleanVisitor:
   >   fn eval() -> boolean
   > ```
   
   +1 for splitting these into separate modules. Instead of a boolean visitor, 
I'm thinking about a post-order predicate visitor:
   ```rust
   pub trait BoundPredicateVisitor {
   
   }
   ```
   which is somehow motivated by @viirya 's 
[pr](https://github.com/apache/iceberg-rust/pull/295/files#diff-a59622727cd67153abdf02031475bf8a1b1921738df4ca9903a685ff6970b7aaR472),
 but moves the traversing flow out of the trait body.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


marvinlanhenke commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547785953


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {

Review Comment:
   @liurenjie1024
   I did a refactor changing the structure (matching order). I also extracted 
common functionality, renamed those helpers, and updated the docs. I hope that 
not only the structure but also the overall design is more readable and 
understandable with these changes applied.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


marvinlanhenke commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547681855


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {

Review Comment:
   sure, I'm already implementing it - since I wanted to compare for myself.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547676828


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {

Review Comment:
   I think some code duplication is worth it if it gives us better 
readability?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


marvinlanhenke commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547667177


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {

Review Comment:
   I had the structure you suggested in an earlier version. I changed it the 
other way around since `predicate` has the smaller cardinality, which allows me 
to group more transforms into a single predicate match arm. I can change it 
back; however, this would introduce more match arms and some code duplication.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.3: drop_namespace with CASCADE support [iceberg]

2024-04-02 Thread via GitHub


supsupsap commented on PR #7275:
URL: https://github.com/apache/iceberg/pull/7275#issuecomment-2031741342

   @abmo-x do you plan to merge this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547649993


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {
+Transform::Identity
+| Transform::Bucket(_)
+| Transform::Truncate(_)
+| Transform::Year
+| Transform::Month
+| Transform::Day
+| Transform::Hour => 
Some(Predicate::Unary(UnaryExpression::new(
+expr.op(),
+Reference::new(name),
+))),
+_ => None,
+},
+BoundPredicate::Binary(expr) => match self {
+Transform::Identity => 
Some(Predicate::Binary(BinaryExpression::new(
+expr.op(),
+Reference::new(name),
+expr.literal().to_owned(),
+))),
+Transform::Bucket(_) => {
+if expr.op() != PredicateOperator::Eq || 
!self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+Some(Predicate::Binary(BinaryExpression::new(
+expr.op(),
+Reference::new(name),
+func.transform_literal_result(expr.literal())?,
+)))
+}
+Transform::Truncate(width) => {
+if !self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+self.transform_projected_boundary(
+name,
+expr.literal(),
+&expr.op(),
+&func,
+Some(*width),
+)?
+}
+Transform::Year | Transform::Month | Transform::Day | 
Transform::Hour => {
+if !self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+self.transform_projected_boundary(
+name,
+expr.literal(),
+&expr.op(),
+&func,
+None,
+)?
+}
+_ => None,
+},
+BoundPredicate::Set(expr) => match self {
+Transform::Identity => Some(Predicate::Set(SetExpression::new(
+expr.op(),
+Reference::new(name),
+expr.literals().to_owned(),
+))),
+Transform::Bucket(_)
+| Transform::Truncate(_)
+| Transform::Year
+| Transform::Month
+| Transform::Day
+| Transform::Hour => {
+if expr.op() != PredicateOperator::In
+|| expr.literals().iter().any(|d| 
!self.can_transform(d))
+{
+return Ok(None);
+}
+
+Some(Predicate::Set(SetExpression::new(
+expr.op(),
+Reference::new(name),
+self.transform_set(expr.literals(), &func)?,
+)))
+}
+_ => None,
+},
+_ => None,
+};
+
+Ok(projection)
+}
+
+/// Check if `Transform` is applicable on datum's `PrimitiveType`
+fn can_transform(&self, datum: &Datum) -> bool {
+let input_type = datum.data_type().clone();
+self.result_type(&Type::Primitive(input_type)).is_ok()
+}
+
+/// Transform each literal value of `FnvHashSet`
+fn transform_set(
+&self,
+literals: &FnvHashSet,
+func: &BoxedTransformFunction,
+) -> Result> {
+let mut new_set = FnvHashSet::default();
+
+for lit in literals {
+let datum = fu

Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


marvinlanhenke commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547653811


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {
+Transform::Identity
+| Transform::Bucket(_)
+| Transform::Truncate(_)
+| Transform::Year
+| Transform::Month
+| Transform::Day
+| Transform::Hour => 
Some(Predicate::Unary(UnaryExpression::new(
+expr.op(),
+Reference::new(name),
+))),
+_ => None,
+},
+BoundPredicate::Binary(expr) => match self {
+Transform::Identity => 
Some(Predicate::Binary(BinaryExpression::new(
+expr.op(),
+Reference::new(name),
+expr.literal().to_owned(),
+))),
+Transform::Bucket(_) => {
+if expr.op() != PredicateOperator::Eq || 
!self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+Some(Predicate::Binary(BinaryExpression::new(
+expr.op(),
+Reference::new(name),
+func.transform_literal_result(expr.literal())?,
+)))
+}
+Transform::Truncate(width) => {
+if !self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+self.transform_projected_boundary(
+name,
+expr.literal(),
+&expr.op(),
+&func,
+Some(*width),
+)?
+}
+Transform::Year | Transform::Month | Transform::Day | 
Transform::Hour => {
+if !self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+self.transform_projected_boundary(
+name,
+expr.literal(),
+&expr.op(),
+&func,
+None,
+)?
+}
+_ => None,
+},
+BoundPredicate::Set(expr) => match self {
+Transform::Identity => Some(Predicate::Set(SetExpression::new(
+expr.op(),
+Reference::new(name),
+expr.literals().to_owned(),
+))),
+Transform::Bucket(_)
+| Transform::Truncate(_)
+| Transform::Year
+| Transform::Month
+| Transform::Day
+| Transform::Hour => {
+if expr.op() != PredicateOperator::In
+|| expr.literals().iter().any(|d| 
!self.can_transform(d))
+{
+return Ok(None);
+}
+
+Some(Predicate::Set(SetExpression::new(
+expr.op(),
+Reference::new(name),
+self.transform_set(expr.literals(), &func)?,
+)))
+}
+_ => None,
+},
+_ => None,
+};
+
+Ok(projection)
+}
+
+/// Check if `Transform` is applicable on datum's `PrimitiveType`
+fn can_transform(&self, datum: &Datum) -> bool {
+let input_type = datum.data_type().clone();
+self.result_type(&Type::Primitive(input_type)).is_ok()
+}
+
+/// Transform each literal value of `FnvHashSet<Datum>`
+fn transform_set(
+&self,
+literals: &FnvHashSet<Datum>,
+func: &BoxedTransformFunction,
+) -> Result<FnvHashSet<Datum>> {
+let mut new_set = FnvHashSet::default();
+
+for lit in literals {
+let datum = f

Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]

2024-04-02 Thread via GitHub


nastra merged PR #10063:
URL: https://github.com/apache/iceberg/pull/10063


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]

2024-04-02 Thread via GitHub


tomtongue commented on code in PR #10063:
URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547650695


##
data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java:
##
@@ -20,43 +20,42 @@
 
 import java.io.File;
 import java.io.IOException;
+import java.nio.file.Files;
+import java.util.Arrays;
 import java.util.List;
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.Parameter;
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.Parameters;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.RowDelta;
 import org.apache.iceberg.expressions.Expressions;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.util.StructLikeSet;
 import org.junit.Assert;

Review Comment:
   Thank you. I'll migrate them in a separate PR. At least, through this one, we can delete `TableTestBase`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10063:
URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547646211


##
data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java:
##
@@ -20,43 +20,42 @@
 
 import java.io.File;
 import java.io.IOException;
+import java.nio.file.Files;
+import java.util.Arrays;
 import java.util.List;
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.Parameter;
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.Parameters;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.RowDelta;
 import org.apache.iceberg.expressions.Expressions;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.util.StructLikeSet;
 import org.junit.Assert;

Review Comment:
   yeah indeed, that would lead to many changes. I'm ok doing the conversion 
from JUnit4 asserts to AssertJ on these files in a separate PR then



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


marvinlanhenke commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547639860


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result<Option<Predicate>> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {
+Transform::Identity
+| Transform::Bucket(_)
+| Transform::Truncate(_)
+| Transform::Year
+| Transform::Month
+| Transform::Day
+| Transform::Hour => 
Some(Predicate::Unary(UnaryExpression::new(
+expr.op(),
+Reference::new(name),
+))),
+_ => None,
+},
+BoundPredicate::Binary(expr) => match self {
+Transform::Identity => 
Some(Predicate::Binary(BinaryExpression::new(
+expr.op(),
+Reference::new(name),
+expr.literal().to_owned(),
+))),
+Transform::Bucket(_) => {
+if expr.op() != PredicateOperator::Eq || 
!self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+Some(Predicate::Binary(BinaryExpression::new(
+expr.op(),
+Reference::new(name),
+func.transform_literal_result(expr.literal())?,
+)))
+}
+Transform::Truncate(width) => {
+if !self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+self.transform_projected_boundary(
+name,
+expr.literal(),
+&expr.op(),
+&func,
+Some(*width),
+)?
+}
+Transform::Year | Transform::Month | Transform::Day | 
Transform::Hour => {
+if !self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+self.transform_projected_boundary(
+name,
+expr.literal(),
+&expr.op(),
+&func,
+None,
+)?
+}
+_ => None,
+},
+BoundPredicate::Set(expr) => match self {
+Transform::Identity => Some(Predicate::Set(SetExpression::new(
+expr.op(),
+Reference::new(name),
+expr.literals().to_owned(),
+))),
+Transform::Bucket(_)
+| Transform::Truncate(_)
+| Transform::Year
+| Transform::Month
+| Transform::Day
+| Transform::Hour => {
+if expr.op() != PredicateOperator::In
+|| expr.literals().iter().any(|d| 
!self.can_transform(d))
+{
+return Ok(None);
+}
+
+Some(Predicate::Set(SetExpression::new(
+expr.op(),
+Reference::new(name),
+self.transform_set(expr.literals(), &func)?,
+)))
+}
+_ => None,
+},
+_ => None,
+};
+
+Ok(projection)
+}
+
+/// Check if `Transform` is applicable on datum's `PrimitiveType`
+fn can_transform(&self, datum: &Datum) -> bool {
+let input_type = datum.data_type().clone();
+self.result_type(&Type::Primitive(input_type)).is_ok()
+}
+
+/// Transform each literal value of `FnvHashSet<Datum>`
+fn transform_set(
+&self,
+literals: &FnvHashSet<Datum>,
+func: &BoxedTransformFunction,
+) -> Result<FnvHashSet<Datum>> {
+let mut new_set = FnvHashSet::default();
+
+for lit in literals {
+let datum = f

Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]

2024-04-02 Thread via GitHub


tomtongue commented on code in PR #10063:
URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547639555


##
data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java:
##
@@ -20,43 +20,42 @@
 
 import java.io.File;
 import java.io.IOException;
+import java.nio.file.Files;
+import java.util.Arrays;
 import java.util.List;
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.Parameter;
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.Parameters;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.RowDelta;
 import org.apache.iceberg.expressions.Expressions;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.util.StructLikeSet;
 import org.junit.Assert;

Review Comment:
   @nastra Thanks for the review. In this PR I only migrated the classes related to `TableTestBase` so that the class can be deleted, and I kept them on JUnit 4 because updating each of them to JUnit 5 would involve a lot of changes. After `TableTestBase` is deleted, I'm planning to migrate those classes to JUnit 5. Should I update all classes to JUnit 5 in this PR, or only partially?
   



##
data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java:
##
@@ -20,43 +20,42 @@
 
 import java.io.File;
 import java.io.IOException;
+import java.nio.file.Files;
+import java.util.Arrays;
 import java.util.List;
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.Parameter;
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.Parameters;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.RowDelta;
 import org.apache.iceberg.expressions.Expressions;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.util.StructLikeSet;
 import org.junit.Assert;

Review Comment:
   @nastra Thanks for the review. This time I only migrated the classes related to `TableTestBase` so that the class can be deleted, and I kept them on JUnit 4 because updating each of them to JUnit 5 would involve a lot of changes. After `TableTestBase` is deleted, I'm planning to migrate those classes to JUnit 5. Should I update all classes to JUnit 5 in this PR, or only partially?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10063:
URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547636721


##
flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/TestIcebergFilesCommitter.java:
##
@@ -73,44 +77,39 @@
 import org.apache.iceberg.util.ThreadPools;
 import org.junit.Assert;
 import org.junit.Assume;

Review Comment:
   same as in the other files



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10063:
URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547634234


##
flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/TestDeltaTaskWriter.java:
##
@@ -65,31 +68,28 @@
 import org.apache.iceberg.util.StructLikeSet;
 import org.assertj.core.api.Assertions;
 import org.junit.Assert;

Review Comment:
   still uses JUnit4-Assert



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10063:
URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547631706


##
data/src/test/java/org/apache/iceberg/io/TestPositionDeltaWriters.java:
##
@@ -20,43 +20,42 @@
 
 import java.io.File;
 import java.io.IOException;
+import java.nio.file.Files;
+import java.util.Arrays;
 import java.util.List;
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.Parameter;
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.Parameters;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.RowDelta;
 import org.apache.iceberg.expressions.Expressions;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.util.StructLikeSet;
 import org.junit.Assert;

Review Comment:
   still a JUnit4-Assert left here



##
data/src/test/java/org/apache/iceberg/io/TestPartitioningWriters.java:
##
@@ -35,33 +40,27 @@
 import org.apache.iceberg.util.StructLikeSet;
 import org.assertj.core.api.Assertions;
 import org.junit.Assert;

Review Comment:
   should be converted to AssertJ
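
   For anyone picking up that follow-up conversion, a minimal illustration of the JUnit4-to-AssertJ pattern (the class and values below are made up for the example, not taken from this PR):

   ```java
   import static org.assertj.core.api.Assertions.assertThat;

   import java.util.Arrays;
   import java.util.List;
   import org.junit.jupiter.api.Test;

   class AssertJConversionExample {

     @Test
     void convertedAssertions() {
       List<String> rows = Arrays.asList("a", "b");

       // JUnit4 style (before):
       //   Assert.assertEquals("row count should match", 2, rows.size());
       //   Assert.assertTrue("rows should contain 'a'", rows.contains("a"));

       // AssertJ style (after): the message moves into as(...), the expectation into the fluent matcher
       assertThat(rows).as("row count should match").hasSize(2);
       assertThat(rows).as("rows should contain 'a'").contains("a");
     }
   }
   ```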



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate TableTestBase related classes to JUnit5 and delete TableTestBase [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10063:
URL: https://github.com/apache/iceberg/pull/10063#discussion_r1547632925


##
data/src/test/java/org/apache/iceberg/io/TestRollingFileWriters.java:
##
@@ -20,60 +20,60 @@
 
 import java.io.File;
 import java.io.IOException;
+import java.nio.file.Files;
+import java.util.Arrays;
 import java.util.List;
 import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.Parameter;
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.Parameters;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.StructLike;
 import org.apache.iceberg.deletes.PositionDelete;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.junit.Assert;

Review Comment:
   same as in the other tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


marvinlanhenke commented on PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#issuecomment-2031700958

   > Hi, @marvinlanhenke Thanks for the PR, it looks great! I have some small suggestions to restructure the code to make it easier to review. Really grateful for these tests!
   
   Thanks for the review, I'll get to your suggestions - those should be easy 
to fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1547582654


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> 
Result<Option<RowFilter>> {
+if let Some(predicates) = &self.predicates {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: &field_id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> 
Result<HashMap<i32, usize>> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec<i32>,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and(&mut self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or(&mut self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not(&mut self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true(&mut self) -> Result {
+Ok(())
+}
+
+fn visit_always_false(&mut self) -> Result {
+Ok(())
+}
+
+fn visit_unary(&mut self, predicate: &UnaryExpression) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary(&mut self, predicate: &BinaryExpression) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set(&mut self, predicate: &SetExpression) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference(&mut self, reference: &BoundReference) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+struct PredicateConverter<'a> {
+pub columns: &'a Vec<usize>,
+pub projection_mask: ProjectionMask,
+pub parquet_schema: &'a SchemaDescriptor,
+pub column_map: &'a HashMap<i32, usize>,
+}
+
+fn get_arrow_datum(datum: &Datum) -> Box {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Box::new(BooleanArray::new_scalar(*value)),
+PrimitiveLiteral::Int(value)

Re: [PR] Spark 3.5: Parallelize reading files in snapshot and migrate procedures [iceberg]

2024-04-02 Thread via GitHub


manuzhang commented on code in PR #10037:
URL: https://github.com/apache/iceberg/pull/10037#discussion_r1547578684


##
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##
@@ -171,6 +176,10 @@ public static List listPartition(
 }
   }
 
+  public static boolean isFileReadingParallelized() {

Review Comment:
   @nastra do you have an idea how to test whether the parallelism setting is working?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Project transform [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on code in PR #309:
URL: https://github.com/apache/iceberg-rust/pull/309#discussion_r1547488931


##
crates/iceberg/src/spec/transform.rs:
##
@@ -261,6 +269,300 @@ impl Transform {
 _ => self == other,
 }
 }
+
+/// Projects a given predicate according to the transformation
+/// specified by the `Transform` instance.
+///
+/// This allows predicates to be effectively applied to data
+/// that has undergone transformation, enabling efficient querying
+/// and filtering based on the original, untransformed data.
+///
+/// # Example
+/// Suppose, we have row filter `a = 10`, and a partition spec
+/// `bucket(a, 37) as bs`, if one row matches `a = 10`, then its partition
+/// value should match `bucket(10, 37) as bs`, and we project `a = 10` to
+/// `bs = bucket(10, 37)`
+pub fn project(&self, name: String, predicate: &BoundPredicate) -> 
Result<Option<Predicate>> {
+let func = create_transform_function(self)?;
+
+let projection = match predicate {
+BoundPredicate::Unary(expr) => match self {
+Transform::Identity
+| Transform::Bucket(_)
+| Transform::Truncate(_)
+| Transform::Year
+| Transform::Month
+| Transform::Day
+| Transform::Hour => 
Some(Predicate::Unary(UnaryExpression::new(
+expr.op(),
+Reference::new(name),
+))),
+_ => None,
+},
+BoundPredicate::Binary(expr) => match self {
+Transform::Identity => 
Some(Predicate::Binary(BinaryExpression::new(
+expr.op(),
+Reference::new(name),
+expr.literal().to_owned(),
+))),
+Transform::Bucket(_) => {
+if expr.op() != PredicateOperator::Eq || 
!self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+Some(Predicate::Binary(BinaryExpression::new(
+expr.op(),
+Reference::new(name),
+func.transform_literal_result(expr.literal())?,
+)))
+}
+Transform::Truncate(width) => {
+if !self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+self.transform_projected_boundary(
+name,
+expr.literal(),
+&expr.op(),
+&func,
+Some(*width),
+)?
+}
+Transform::Year | Transform::Month | Transform::Day | 
Transform::Hour => {
+if !self.can_transform(expr.literal()) {
+return Ok(None);
+}
+
+self.transform_projected_boundary(
+name,
+expr.literal(),
+&expr.op(),
+&func,
+None,
+)?
+}
+_ => None,
+},
+BoundPredicate::Set(expr) => match self {
+Transform::Identity => Some(Predicate::Set(SetExpression::new(
+expr.op(),
+Reference::new(name),
+expr.literals().to_owned(),
+))),
+Transform::Bucket(_)
+| Transform::Truncate(_)
+| Transform::Year
+| Transform::Month
+| Transform::Day
+| Transform::Hour => {
+if expr.op() != PredicateOperator::In
+|| expr.literals().iter().any(|d| 
!self.can_transform(d))
+{
+return Ok(None);
+}
+
+Some(Predicate::Set(SetExpression::new(
+expr.op(),
+Reference::new(name),
+self.transform_set(expr.literals(), &func)?,
+)))
+}
+_ => None,
+},
+_ => None,
+};
+
+Ok(projection)
+}
+
+/// Check if `Transform` is applicable on datum's `PrimitiveType`
+fn can_transform(&self, datum: &Datum) -> bool {
+let input_type = datum.data_type().clone();
+self.result_type(&Type::Primitive(input_type)).is_ok()
+}
+
+/// Transform each literal value of `FnvHashSet<Datum>`
+fn transform_set(
+&self,
+literals: &FnvHashSet<Datum>,
+func: &BoxedTransformFunction,
+) -> Result<FnvHashSet<Datum>> {
+let mut new_set = FnvHashSet::default();
+
+for lit in literals {
+let datum = fu

Re: [PR] #9073 Junit 4 tests switched to JUnit 5 [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #9793:
URL: https://github.com/apache/iceberg/pull/9793#discussion_r1547543252


##
data/src/test/java/org/apache/iceberg/data/DataTestHelpers.java:
##
@@ -84,16 +86,16 @@ private static void assertEquals(Type type, Object 
expected, Object actual) {
   case UUID:
   case BINARY:
   case DECIMAL:
-Assert.assertEquals(
-"Primitive value should be equal to expected for type " + type, 
expected, actual);
+org.junit.jupiter.api.Assertions.assertEquals(
+expected, actual, "Primitive value should be equal to expected for 
type " + type);
 break;
   case FIXED:
 Assertions.assertThat(expected)
 .as("Expected should be a byte[]")
 .isInstanceOf(byte[].class);
 Assertions.assertThat(expected).as("Actual should be a 
byte[]").isInstanceOf(byte[].class);
-Assert.assertArrayEquals(
-"Array contents should be equal", (byte[]) expected, (byte[]) 
actual);
+org.junit.jupiter.api.Assertions.assertArrayEquals(

Review Comment:
   this should use AssertJ's `assertThat()`
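
   For reference, the AssertJ form of that array check as a standalone sketch (names and values made up):

   ```java
   import static org.assertj.core.api.Assertions.assertThat;

   import org.junit.jupiter.api.Test;

   class ByteArrayAssertExample {

     @Test
     void arrayContentsAreEqual() {
       byte[] expected = new byte[] {1, 2, 3};
       byte[] actual = new byte[] {1, 2, 3};

       // assertThat on arrays compares contents, so this replaces Assert.assertArrayEquals
       assertThat(actual).as("Array contents should be equal").isEqualTo(expected);
     }
   }
   ```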



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Parallelize reading files in snapshot and migrate procedures [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10037:
URL: https://github.com/apache/iceberg/pull/10037#discussion_r1547529529


##
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##
@@ -171,6 +176,10 @@ public static List listPartition(
 }
   }
 
+  public static boolean isFileReadingParallelized() {

Review Comment:
   I don't think it's a good idea to introduce this, since this reflects 
whatever was set after calling `listPartition(...)` and depending on the order 
of calls, it could return something unexpected
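
   As a toy illustration of the problem (deliberately not Iceberg code), a static flag that only records the most recent call is order-dependent and tells a test nothing reliable:

   ```java
   // Toy example only: mimics a static "was the last listing parallel?" flag
   class MigrationUtilToy {
     private static boolean parallelized = false;

     static void listPartition(int parallelism) {
       parallelized = parallelism > 1;
     }

     static boolean isFileReadingParallelized() {
       return parallelized;
     }

     public static void main(String[] args) {
       listPartition(4);
       listPartition(1);
       // Prints false: the getter reflects only the most recent call,
       // so an assertion on it depends entirely on call order.
       System.out.println(isFileReadingParallelized());
     }
   }
   ```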



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Support identifier warehouses [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on code in PR #308:
URL: https://github.com/apache/iceberg-rust/pull/308#discussion_r1547526634


##
crates/catalog/rest/src/catalog.rs:
##
@@ -617,7 +617,13 @@ impl RestCatalog {
 props.extend(config);
 }
 
-let file_io = match 
self.config.warehouse.as_deref().or(metadata_location) {
+let warehouse_path = match self.config.warehouse.as_deref() {
+Some(url) if url.contains("://") => Some(url),

Review Comment:
   Yes, we don't need to raise an exception here; just matching `if Url::parse().is_ok()` would be enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Support identifier warehouses [iceberg-rust]

2024-04-02 Thread via GitHub


Fokko commented on code in PR #308:
URL: https://github.com/apache/iceberg-rust/pull/308#discussion_r1547503192


##
crates/catalog/rest/src/catalog.rs:
##
@@ -617,7 +617,13 @@ impl RestCatalog {
 props.extend(config);
 }
 
-let file_io = match 
self.config.warehouse.as_deref().or(metadata_location) {
+let warehouse_path = match self.config.warehouse.as_deref() {
+Some(url) if url.contains("://") => Some(url),

Review Comment:
   This is also happening [below in the 
`FileIO::from_path`](https://github.com/apache/iceberg-rust/blob/d57d91b9a72c516c6665d5faef349f52ebe59585/crates/iceberg/src/io.rs#L155),
 but it raises an exception we want to avoid.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Support identifier warehouses [iceberg-rust]

2024-04-02 Thread via GitHub


liurenjie1024 commented on PR #308:
URL: https://github.com/apache/iceberg-rust/pull/308#issuecomment-2031513654

   > Hi, @Fokko Thanks for this fix. It also reminds me: should we append the warehouse parameter to the `getConfig` call?
   
   It seems we already have that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Parallelize reading files in snapshot and migrate procedures [iceberg]

2024-04-02 Thread via GitHub


manuzhang commented on code in PR #10037:
URL: https://github.com/apache/iceberg/pull/10037#discussion_r1547447584


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##
@@ -232,4 +232,25 @@ public void testMigrateEmptyTable() throws Exception {
 Object result = scalarSql("CALL %s.system.migrate('%s')", catalogName, 
tableName);
 assertThat(result).isEqualTo(0L);
   }
+
+  @TestTemplate
+  public void testMigrateWithParallelism() throws IOException {
+assumeThat(catalogName).isEqualToIgnoringCase("spark_catalog");
+String location = Files.createTempDirectory(temp, 
"junit").toFile().toString();
+for (int p = -1; p <= 2; p++) {

Review Comment:
   Thanks @nastra , I've updated as @RussellSpitzer suggests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Disable checking links for Blogs section [iceberg]

2024-04-02 Thread via GitHub


manuzhang commented on issue #10060:
URL: https://github.com/apache/iceberg/issues/10060#issuecomment-2031439818

   I also see errors when checking the Maven repo links.
   
   ```
   ERROR: 21 dead links found!
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.13.2/iceberg-spark-runtime-0.13.2.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.14/0.13.2/iceberg-flink-runtime-1.14-0.13.2.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.13/0.13.2/iceberg-flink-runtime-1.13-0.13.2.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/0.13.2/iceberg-flink-runtime-1.12-0.13.2.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.13.2/iceberg-hive-runtime-0.13.2.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.13.1/iceberg-spark3-runtime-0.13.1.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.13.1/iceberg-spark-runtime-0.13.1.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.13/0.13.1/iceberg-flink-runtime-1.13-0.13.1.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/0.13.1/iceberg-flink-runtime-1.12-0.13.1.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.13.0/iceberg-spark-runtime-0.13.0.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.12/0.13.0/iceberg-flink-runtime-1.12-0.13.0.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.12.1/iceberg-spark3-runtime-0.12.1.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.12.1/iceberg-spark-runtime-0.12.1.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.12.0/iceberg-spark3-runtime-0.12.0.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.12.0/iceberg-hive-runtime-0.12.0.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.11.1/iceberg-hive-runtime-0.11.1.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime/0.11.0/iceberg-flink-runtime-0.11.0.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.10.0/iceberg-spark3-runtime-0.10.0.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.9.0/iceberg-spark3-runtime-0.9.0.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.9.0/iceberg-spark-runtime-0.9.0.jar
 → Status: 0
   [✖] 
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime/0.8.0-incubating/iceberg-spark-runtime-0.8.0-incubating.jar
 → Status: 0
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Open-api: update prefix param description [iceberg]

2024-04-02 Thread via GitHub


ajantha-bhat commented on code in PR #9870:
URL: https://github.com/apache/iceberg/pull/9870#discussion_r1547408417


##
open-api/rest-catalog-open-api.yaml:
##
@@ -1444,7 +1444,7 @@ components:
   schema:
 type: string
   required: true
-  description: An optional prefix in the path
+  description: Prefix in the path

Review Comment:
   If I change `prefix` to `required: false`, the spec validation fails:
   
   ```
   Spec is invalid.
   Issues:
   
components.parameters.For path parameter prefix the required value 
should be true
   
   
   Execution failed for task ':iceberg-open-api:validateRESTCatalogSpec'.
   > Validation failed.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Parallelize reading files in snapshot and migrate procedures [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10037:
URL: https://github.com/apache/iceberg/pull/10037#discussion_r1547402202


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##
@@ -232,4 +232,25 @@ public void testMigrateEmptyTable() throws Exception {
 Object result = scalarSql("CALL %s.system.migrate('%s')", catalogName, 
tableName);
 assertThat(result).isEqualTo(0L);
   }
+
+  @TestTemplate
+  public void testMigrateWithParallelism() throws IOException {
+assumeThat(catalogName).isEqualToIgnoringCase("spark_catalog");
+String location = Files.createTempDirectory(temp, 
"junit").toFile().toString();
+for (int p = -1; p <= 2; p++) {

Review Comment:
   > I was trying that initially, but there's no way to add parameters for just 
one test.
   
   This is because the test is already parameterized at the class level.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Open-api: update prefix param description [iceberg]

2024-04-02 Thread via GitHub


Fokko commented on code in PR #9870:
URL: https://github.com/apache/iceberg/pull/9870#discussion_r1547395200


##
open-api/rest-catalog-open-api.yaml:
##
@@ -1444,7 +1444,7 @@ components:
   schema:
 type: string
   required: true
-  description: An optional prefix in the path
+  description: Prefix in the path

Review Comment:
   What's the error you're seeing?
   
   The prefix is optional, but it depends on the definition. We can also update 
the default to be an empty string.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Open-api: update prefix param description [iceberg]

2024-04-02 Thread via GitHub


ajantha-bhat commented on PR #9870:
URL: https://github.com/apache/iceberg/pull/9870#issuecomment-2031310056

   cc: @jbonofre 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547300740


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testHttpClientGetConnectionConfig() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testHttpClientWithSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @Test
+  public void testHttpClientInvalidConnectionTimeout() {

Review Comment:
   please also add a separate test method where the socket timeout is invalid
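
   For illustration, such a method could mirror the invalid-connection-timeout test above. This sketch assumes it sits in `TestHTTPClient` next to the quoted tests (reusing that class's `URI` constant and imports), and the asserted exception type would need to match whatever `HTTPClient` actually throws for a malformed value:

   ```java
   @Test
   public void testInvalidSocketTimeout() {
     Map<String, String> properties =
         ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, "invalidMs");

     Assertions.assertThatThrownBy(() -> HTTPClient.builder(properties).uri(URI).build())
         .isInstanceOf(RuntimeException.class);
   }
   ```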



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547305188


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testHttpClientGetConnectionConfig() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testHttpClientWithSocketTimeout() throws IOException {

Review Comment:
   should there also be a test verifying the connection timeout?
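
   One possible (untested) approach is to point the client at a non-routable address so the TCP handshake never completes and the configured connect timeout fires; whether the failure surfaces as a wrapped `ConnectTimeoutException` or some other cause would have to be confirmed against the implementation:

   ```java
   @Test
   public void testConnectionTimeoutExceeded() throws IOException {
     Map<String, String> properties =
         ImmutableMap.of(HTTPClient.REST_CONNECTION_TIMEOUT_MS, "100");

     // 10.255.255.1 is a non-routable address, so the connection attempt should
     // hang until the configured 100 ms connect timeout fires.
     try (HTTPClient client =
         HTTPClient.builder(properties).uri("http://10.255.255.1").build()) {
       Assertions.assertThatThrownBy(() -> client.head("any/path", ImmutableMap.of(), (unused) -> {}))
           .isInstanceOf(RuntimeException.class);
     }
   }
   ```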



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547304325


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testHttpClientGetConnectionConfig() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testHttpClientWithSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @Test
+  public void testHttpClientInvalidConnectionTimeout() {

Review Comment:
   also no need for the `HttpClient` prefix in the test names as this is 
already obvious, because the test class itself is called `TestHTTPClient` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547301639


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testHttpClientGetConnectionConfig() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testHttpClientWithSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @Test
+  public void testHttpClientInvalidConnectionTimeout() {
+// We expect a Runtime exception

Review Comment:
   this is obvious from the check further above, so no need for the comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547301269


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testHttpClientGetConnectionConfig() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testHttpClientWithSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @Test
+  public void testHttpClientInvalidConnectionTimeout() {
+// We expect a Runtime exception
+String connectionTimeoutMsStr = "invalidMs";
+Map properties =
+ImmutableMap.of(HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
connectionTimeoutMsStr);
+
+Assertions.assertThatThrownBy(() -> 
HTTPClient.builder(properties).uri(URI).build())
+.isInstanceOf(RuntimeException.class);

Review Comment:
   this should have a `.hasMessage(...)` check
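
   For example (the exact text depends on how the property is parsed, so the value below is only a guess at the offending input appearing in the message):

   ```java
   Assertions.assertThatThrownBy(() -> HTTPClient.builder(properties).uri(URI).build())
       .isInstanceOf(RuntimeException.class)
       .hasMessageContaining("invalidMs");
   ```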



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-02 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1547300740


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,59 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testHttpClientGetConnectionConfig() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testHttpClientWithSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @Test
+  public void testHttpClientInvalidConnectionTimeout() {

Review Comment:
   what about an invalid socket timeout?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org


