Re: [I] [JDBC Catalog] Table commit fails if iceberg_type field is NULL [iceberg]
jbonofre commented on issue #10046: URL: https://github.com/apache/iceberg/issues/10046#issuecomment-2044198936

@jurossiar Thanks for the report. It's expected that the `iceberg_type` column is added and the records updated with `NULL` if you define `jdbc.schema-version=V1` (`V0` "preserves" the "old" schema). However, if you don't explicitly define `V1`, the schema should not be updated. That's the issue I saw: the catalog tries to update the schema by default. I'm testing a fix today.
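As a stopgap until the fix lands, explicitly pinning the catalog to the V0 schema should avoid the update attempt, per the `V0`/`V1` behavior described above. A minimal sketch (URI, warehouse location, and catalog name are placeholders):

```java
import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.jdbc.JdbcCatalog;
import org.apache.iceberg.relocated.com.google.common.collect.Maps;

// Sketch only: explicitly pin the JDBC catalog to the V0 schema so it keeps
// the old iceberg_tables layout; connection details below are placeholders.
Map<String, String> properties = Maps.newHashMap();
properties.put(CatalogProperties.URI, "jdbc:postgresql://localhost:5432/iceberg");
properties.put(CatalogProperties.WAREHOUSE_LOCATION, "s3://my-bucket/warehouse");
properties.put("jdbc.schema-version", "V0");

JdbcCatalog catalog = new JdbcCatalog();
catalog.initialize("jdbc_catalog", properties);
```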
Re: [PR] feat: init iceberg writer [iceberg-rust]
ZENOTME commented on code in PR #275: URL: https://github.com/apache/iceberg-rust/pull/275#discussion_r1556935437

## crates/iceberg/src/writer/base_writer/data_file_writer.rs:

```diff
@@ -0,0 +1,323 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module provides `DataFileWriter`.
+
+use crate::spec::{DataContentType, DataFile, Struct};
+use crate::writer::file_writer::FileWriter;
+use crate::writer::CurrentFileStatus;
+use crate::writer::{file_writer::FileWriterBuilder, IcebergWriter, IcebergWriterBuilder};
+use crate::Result;
+use arrow_array::RecordBatch;
+use itertools::Itertools;
+
+/// Builder for `DataFileWriter`.
+#[derive(Clone)]
+pub struct DataFileWriterBuilder<B: FileWriterBuilder> {
+    inner: B,
+}
+
+impl<B: FileWriterBuilder> DataFileWriterBuilder<B> {
+    /// Create a new `DataFileWriterBuilder` using a `FileWriterBuilder`.
+    pub fn new(inner: B) -> Self {
+        Self { inner }
+    }
+}
+
+/// Config for `DataFileWriter`.
+pub struct DataFileWriterConfig {
+    partition_value: Struct,
+}
+
+impl DataFileWriterConfig {
+    /// Create a new `DataFileWriterConfig` with partition value.
+    pub fn new(partition_value: Option<Struct>) -> Self {
+        Self {
+            partition_value: partition_value.unwrap_or(Struct::empty()),
+        }
+    }
+}
+
+#[async_trait::async_trait]
+impl<B: FileWriterBuilder> IcebergWriterBuilder for DataFileWriterBuilder<B> {
+    type R = DataFileWriter<B>;
+    type C = DataFileWriterConfig;
+
+    async fn build(self, config: Self::C) -> Result<Self::R> {
+        Ok(DataFileWriter {
+            inner_writer: self.inner.clone().build().await?,
+            builder: self.inner,
+            partition_value: config.partition_value,
+        })
+    }
+}
+
+/// A writer that writes data within one spec/partition.
+pub struct DataFileWriter<B: FileWriterBuilder> {
+    builder: B,
+    inner_writer: B::R,
+    partition_value: Struct,
+}
+
+#[async_trait::async_trait]
+impl<B: FileWriterBuilder> IcebergWriter for DataFileWriter<B> {
+    async fn write(&mut self, batch: RecordBatch) -> Result<()> {
+        self.inner_writer.write(&batch).await
+    }
+
+    async fn flush(&mut self) -> Result<Vec<DataFile>> {
+        let writer = std::mem::replace(&mut self.inner_writer, self.builder.clone().build().await?);
```

Review Comment: Because `FileWriter` only provides the interface `close(self)`, which keeps it simple. To provide the interface `flush(&mut self)`, we create a new `FileWriter` on each flush. A `FileWriter` may write the data into one or multiple files; there is no restriction on that.

> What will happen if user calling flush on the same file twice?

It's safe. The semantics of `flush` are to flush the data written so far to storage and generate the files.
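For readers following the thread, here is a hedged sketch of how the replace-then-close `flush` could be completed; the shape of `close()`'s return value and the builder method names are assumptions for illustration, not the exact PR code:

```rust
// Sketch only: assumed completion of the flush-by-replacement pattern above.
async fn flush(&mut self) -> Result<Vec<DataFile>> {
    // Swap in a fresh inner writer so this writer keeps accepting batches,
    // then close the old writer to materialize the files written so far.
    let writer = std::mem::replace(&mut self.inner_writer, self.builder.clone().build().await?);
    let data_files = writer
        .close()
        .await?
        .into_iter()
        .map(|mut builder| {
            // Stamp the content type and partition value onto each produced file.
            builder.content(DataContentType::Data);
            builder.partition(self.partition_value.clone());
            builder.build().expect("data file should be valid")
        })
        .collect_vec();
    Ok(data_files)
}
```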
Re: [I] iceberg-flink: Switch tests to JUnit5 + AssertJ-style assertions [iceberg]
tomtongue commented on issue #9087: URL: https://github.com/apache/iceberg/issues/9087#issuecomment-2044183666

@nastra If no one is assigned to this, could you assign it to me? I believe three task areas remain for completing the migration: flink, the other spark versions, and data.
Re: [PR] feat: init iceberg writer [iceberg-rust]
Xuanwo commented on code in PR #275: URL: https://github.com/apache/iceberg-rust/pull/275#discussion_r1556917470

## crates/iceberg/src/writer/base_writer/data_file_writer.rs:

```diff
@@ -0,0 +1,323 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module provides `DataFileWriter`.
+
+use crate::spec::{DataContentType, DataFile, Struct};
+use crate::writer::file_writer::FileWriter;
+use crate::writer::CurrentFileStatus;
+use crate::writer::{file_writer::FileWriterBuilder, IcebergWriter, IcebergWriterBuilder};
+use crate::Result;
+use arrow_array::RecordBatch;
+use itertools::Itertools;
+
+/// Builder for `DataFileWriter`.
+#[derive(Clone)]
+pub struct DataFileWriterBuilder<B: FileWriterBuilder> {
+    inner: B,
+}
+
+impl<B: FileWriterBuilder> DataFileWriterBuilder<B> {
+    /// Create a new `DataFileWriterBuilder` using a `FileWriterBuilder`.
+    pub fn new(inner: B) -> Self {
+        Self { inner }
+    }
+}
+
+/// Config for `DataFileWriter`.
+pub struct DataFileWriterConfig {
+    partition_value: Struct,
+}
+
+impl DataFileWriterConfig {
+    /// Create a new `DataFileWriterConfig` with partition value.
+    pub fn new(partition_value: Option<Struct>) -> Self {
+        Self {
+            partition_value: partition_value.unwrap_or(Struct::empty()),
+        }
+    }
+}
+
+#[async_trait::async_trait]
+impl<B: FileWriterBuilder> IcebergWriterBuilder for DataFileWriterBuilder<B> {
+    type R = DataFileWriter<B>;
+    type C = DataFileWriterConfig;
+
+    async fn build(self, config: Self::C) -> Result<Self::R> {
+        Ok(DataFileWriter {
+            inner_writer: self.inner.clone().build().await?,
+            builder: self.inner,
+            partition_value: config.partition_value,
+        })
+    }
+}
+
+/// A writer that writes data within one spec/partition.
+pub struct DataFileWriter<B: FileWriterBuilder> {
+    builder: B,
+    inner_writer: B::R,
+    partition_value: Struct,
+}
+
+#[async_trait::async_trait]
+impl<B: FileWriterBuilder> IcebergWriter for DataFileWriter<B> {
+    async fn write(&mut self, batch: RecordBatch) -> Result<()> {
+        self.inner_writer.write(&batch).await
+    }
+
+    async fn flush(&mut self) -> Result<Vec<DataFile>> {
+        let writer = std::mem::replace(&mut self.inner_writer, self.builder.clone().build().await?);
```

Review Comment: Hi, I'm a bit confused about this. Why do we need this? What will happen if a user calls `flush` on the same file twice?
Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]
yuqi1129 commented on issue #10100: URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2044152196

> @yuqi1129: Do I understand correctly, that your main issue is that the cleaner threads are remaining, and those are the only ones causing the issue?

Yeah, I want a mechanism to explicitly close them.
Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]
kevinjqliu commented on issue #584: URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2044152708

> I would argue that the Python one is correct

Yeah, me too. But I think Java Iceberg doesn't support this, since parquet files with an `ABC-GG-1-A` column will be read as the Iceberg column `ABC_x2DGG_x2D1_x2DA`. I think it's worth opening an issue to track this in the main Iceberg repo.

PyIceberg can already read a parquet file with a "sanitized" column name; this is handled by #83. [file_project_schema](https://github.com/apache/iceberg-python/pull/83/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR834), which is the schema with the "sanitized" column name, is used to read the parquet files. So we just need to ensure that PyIceberg writes parquet files with the same behavior as Java Iceberg. #590 verifies that fixing the write behavior will fix the entire roundtrip described in the issue.
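For context, the `_xHH` sanitization described above can be sketched as follows; the function name and exact validity rules here are illustrative assumptions, not PyIceberg's actual implementation:

```python
# Illustrative sketch: characters that are not valid in Avro-style field
# names are replaced with their hex code point, written as _xHH.
def sanitize_column_name(name: str) -> str:
    def is_valid(ch: str, first: bool) -> bool:
        return ch.isalpha() or ch == "_" or (not first and ch.isdigit())

    return "".join(
        ch if is_valid(ch, i == 0) else f"_x{ord(ch):02X}"
        for i, ch in enumerate(name)
    )

# '-' is code point 0x2D, reproducing the mapping mentioned above.
assert sanitize_column_name("ABC-GG-1-A") == "ABC_x2DGG_x2D1_x2DA"
```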
Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]
yuqi1129 commented on issue #10100: URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2044151328

> Can you use the daemon thread from the jdk (via systemScheduler)?

Then how can I use the `systemScheduler` here? I can't change the following code from an external project, as it's part of an Iceberg inner class.

```java
// Code block from CachedClientPool#init
clientPoolCache =
    Caffeine.newBuilder()
        .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
        .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
        .scheduler(
            // This daemon thread can't be explicitly closed.
            Scheduler.forScheduledExecutorService(
                ThreadPools.newScheduledPool("hive-metastore-cleaner", 1)))
        .build();
```
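For comparison, a minimal sketch of what the suggestion quoted above would look like if Iceberg adopted it; this is hypothetical, not current Iceberg code. `Scheduler.systemScheduler()` is Caffeine's built-in scheduler backed by a JVM-wide daemon thread (Java 9+):

```java
// Hypothetical variant of CachedClientPool#init using Caffeine's built-in
// system scheduler instead of a dedicated cleaner pool; with this, no
// "hive-metastore-cleaner" thread is created that callers would need to shut down.
clientPoolCache =
    Caffeine.newBuilder()
        .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
        .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
        .scheduler(Scheduler.systemScheduler())
        .build();
```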
Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]
pvary commented on issue #10100: URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2044147391

@yuqi1129: Do I understand correctly, that your main issue is that the cleaner threads are remaining, and those are the only ones causing the issue?
Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]
ben-manes commented on issue #10100: URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2044074963

Can you use the daemon thread from the JDK (via `systemScheduler`)?
Re: [PR] Add Pagination To List Apis [iceberg]
rahil-c commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556673446

## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:

```diff
@@ -490,12 +522,29 @@ public void createNamespace(
   @Override
   public List<Namespace> listNamespaces(SessionContext context, Namespace namespace) {
```

Review Comment: added test
Re: [I] Residuals cannot be applied to dates/times/timestamps read by BaseParquetReaders [iceberg]
github-actions[bot] closed issue #2253: Residuals cannot be applied to dates/times/timestamps read by BaseParquetReaders URL: https://github.com/apache/iceberg/issues/2253
Re: [I] Unset no longer present table props while propagating props to HMS [iceberg]
github-actions[bot] closed issue #2249: Unset no longer present table props while propagating props to HMS URL: https://github.com/apache/iceberg/issues/2249
Re: [I] Parquet: Vectorized reads should support reading INT96 timestamps [iceberg]
github-actions[bot] closed issue #2236: Parquet: Vectorized reads should support reading INT96 timestamps URL: https://github.com/apache/iceberg/issues/2236
Re: [I] NPE while closing OrcFileAppender [iceberg]
github-actions[bot] commented on issue #2470: URL: https://github.com/apache/iceberg/issues/2470#issuecomment-2043917692

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] Residuals cannot be applied to dates/times/timestamps read by BaseParquetReaders [iceberg]
github-actions[bot] commented on issue #2253: URL: https://github.com/apache/iceberg/issues/2253#issuecomment-2043917536

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Unset no longer present table props while propagating props to HMS [iceberg]
github-actions[bot] commented on issue #2249: URL: https://github.com/apache/iceberg/issues/2249#issuecomment-2043917514

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Parquet: Vectorized reads should support reading INT96 timestamps [iceberg]
github-actions[bot] commented on issue #2236: URL: https://github.com/apache/iceberg/issues/2236#issuecomment-2043917494

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Support creating tags [iceberg-python]
enkidulan commented on issue #573: URL: https://github.com/apache/iceberg-python/issues/573#issuecomment-2043897402

> Are you interested in creating the API for this? :)

I am. I'll try to create a PR this week.
Re: [I] [JDBC Catalog] Table commit fails if iceberg_type field is NULL [iceberg]
jurossiar commented on issue #10046: URL: https://github.com/apache/iceberg/issues/10046#issuecomment-2043653883

Notice: we had the same issue. The `iceberg_tables` table was updated, adding the `iceberg_type` column (with NULL values), even without setting the environment variable `jdbc.schema-version=V1`.
Re: [PR] Docs: Document support for binary in truncate transform [iceberg]
amogh-jahagirdar commented on PR #10079: URL: https://github.com/apache/iceberg/pull/10079#issuecomment-2043652460

Merged. Thanks for addressing this @TheNeuralBit! Thanks for the reviews @Fokko @rdblue
Re: [I] Discrepancy between partition truncation transform in spec vs code (org.apache.iceberg.transforms.Transform) [iceberg]
amogh-jahagirdar closed issue #5251: Discrepancy between partition truncation transform in spec vs code (org.apache.iceberg.transforms.Transform) URL: https://github.com/apache/iceberg/issues/5251
Re: [PR] Docs: Document support for binary in truncate transform [iceberg]
amogh-jahagirdar merged PR #10079: URL: https://github.com/apache/iceberg/pull/10079
Re: [PR] Add Pagination To List Apis [iceberg]
rahil-c commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556425249

## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:

```diff
@@ -274,14 +277,34 @@ public void setConf(Object newConf) {
   @Override
   public List<TableIdentifier> listTables(SessionContext context, Namespace ns) {
     checkNamespaceIsValid(ns);
+    Map<String, String> queryParams = Maps.newHashMap();
```

Review Comment: Sure, will fix this.
Re: [PR] Add Pagination To List Apis [iceberg]
rahil-c commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556399763

## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:

```diff
@@ -224,6 +226,7 @@ public void initialize(String name, Map<String, String> unresolved) {
           client, tokenRefreshExecutor(), token, expiresAtMillis(mergedProps), catalogAuth);
     }

+    this.restPageSize = mergedProps.get(REST_PAGE_SIZE);
```

Review Comment: will fix this
[PR] StructType field `required` optional by default [iceberg-python]
MehulBatra opened a new pull request, #592: URL: https://github.com/apache/iceberg-python/pull/592

* Type change: default = false
* Updated test cases to align with recent changes in the codebase (Unit + Integration)
Re: [PR] Docs: Fix On-screen display issues and minor expressions on Branching and Tagging DDL [iceberg]
lawofcycles commented on code in PR #10091: URL: https://github.com/apache/iceberg/pull/10091#discussion_r1556378194

## docs/docs/spark-ddl.md:

```diff
@@ -499,17 +500,18 @@ AS OF VERSION 1234
 -- CREATE audit-branch at snapshot 1234, retain audit-branch for 31 days, and retain the latest 31 days. The latest 3 snapshot snapshots, and 2 days worth of snapshots.
 ALTER TABLE prod.db.sample CREATE BRANCH `audit-branch`
-AS OF VERSION 1234 RETAIN 30 DAYS
+AS OF VERSION 1234 RETAIN 31 DAYS
```

Review Comment: I've updated the description
Re: [PR] Add Pagination To List Apis [iceberg]
danielcweeks commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556370624

## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:

```diff
@@ -224,6 +226,7 @@ public void initialize(String name, Map<String, String> unresolved) {
           client, tokenRefreshExecutor(), token, expiresAtMillis(mergedProps), catalogAuth);
     }

+    this.restPageSize = mergedProps.get(REST_PAGE_SIZE);
```

Review Comment: We should probably add a validation that this is a positive integer if set.
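A minimal sketch of the suggested validation, assuming Iceberg's usual `Preconditions` style; the message wording and exact placement are assumptions, not the merged code:

```java
// Sketch only: validate the configured page size before storing it.
String pageSize = mergedProps.get(REST_PAGE_SIZE);
if (pageSize != null) {
  Preconditions.checkArgument(
      Integer.parseInt(pageSize) > 0,
      "Invalid value for %s, must be a positive integer: %s",
      REST_PAGE_SIZE,
      pageSize);
}
this.restPageSize = pageSize;
```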
Re: [PR] Add Pagination To List Apis [iceberg]
danielcweeks commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556366558

## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:

```diff
@@ -274,14 +277,34 @@ public void setConf(Object newConf) {
   @Override
   public List<TableIdentifier> listTables(SessionContext context, Namespace ns) {
     checkNamespaceIsValid(ns);
+    Map<String, String> queryParams = Maps.newHashMap();
```

Review Comment: I feel like this could be consolidated using a `do ... while (...)`, since we always need to make at least one call. That might simplify things and prevent having two separate call paths.
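A hedged sketch of the `do ... while` consolidation being suggested; the response accessors (`identifiers()`, `nextPageToken()`) and query parameter names follow this PR's pagination design and are assumptions, not the merged code:

```java
// Sketch only: one loop that always issues at least one request and keeps
// paging while the server returns a continuation token.
List<TableIdentifier> tables = Lists.newArrayList();
Map<String, String> queryParams = Maps.newHashMap();
if (restPageSize != null) {
  queryParams.put("pageSize", restPageSize);
}

String pageToken = null;
do {
  if (pageToken != null) {
    queryParams.put("pageToken", pageToken);
  }

  ListTablesResponse response =
      client.get(
          paths.tables(ns),
          queryParams,
          ListTablesResponse.class,
          headers(context),
          ErrorHandlers.namespaceErrorHandler());
  tables.addAll(response.identifiers());
  pageToken = response.nextPageToken();
} while (pageToken != null);

return ImmutableList.copyOf(tables);
```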
Re: [PR] Docs: Added Upsolver to vendor list [iceberg]
jasonf20 commented on code in PR #10096: URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556353763

## site/docs/vendors.md:

```diff
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query engine](https://trino.io
 ### [Tabular](https://tabular.io)

 [Tabular](https://tabular.io/product/) is a managed warehouse and automation platform. Tabular offers a central store for analytic data that can be used with any query engine or processing framework that supports Iceberg. Tabular warehouses add role-based access control and automatic optimization, clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing REST and Hive catalogs, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC.
```

Review Comment: changed
Re: [PR] Docs: Added Upsolver to vendor list [iceberg]
bitsondatadev commented on code in PR #10096: URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556290485

## site/docs/vendors.md:

```diff
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query engine](https://trino.io
 ### [Tabular](https://tabular.io)

 [Tabular](https://tabular.io/product/) is a managed warehouse and automation platform. Tabular offers a central store for analytic data that can be used with any query engine or processing framework that supports Iceberg. Tabular warehouses add role-based access control and automatic optimization, clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing REST and Hive catalogs, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC.
```

Review Comment: Based on the recommendations of the PMC, we also would like to remove any utm codes please.

```suggestion
[Upsolver](https://upsolver.com) is a streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing REST and Hive catalogs, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp) or can be deployed in your AWS VPC.
```
Re: [PR] Validate overwrite filter [iceberg-python]
Fokko commented on PR #582: URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2043480905

> If we wanted to handle the validation only in the delete function by checking if we would end up rewriting files, above pattern would succeed by deleting level = 'INFO' and dt = '2024-02-01' because this deletion is a pure metadata operation.

This is what I tried to explain in the earlier comment above: sometimes you have to rewrite files because an earlier partitioning strategy still has rows that match. I'm adding that currently in https://github.com/apache/iceberg-python/pull/569. A table can have older manifests that were written using an older partition spec.

> Static overwrite on the other hand, would eagerly validate the predicate expression against the table schema, and the values in the arrow table and throw instead.

I missed this part. We can add it, but I would say that it is up to the user. To simplify, this means doing this additional check (pseudocode):

```python
def overwrite(df: pa.Table, overwrite_filter: Union[str, BooleanExpression]) -> None:
    row_filter = _parse_row_filter(overwrite_filter)  # Turns the str into a boolean expression
    pa_row_filter = expression_to_pyarrow(row_filter)
    num_invalid_rows = len(df) - len(df.filter(pa_row_filter))
    if num_invalid_rows > 0:
        raise ValueError(f"Found {num_invalid_rows} rows that don't match the overwrite predicate")
```
Re: [PR] Validate overwrite filter [iceberg-python]
syun64 commented on PR #582: URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2043441654

Hi @Fokko @adrianqin. I think the goal of this PR is to distinguish the semantics of a 'static overwrite' onto a partitioned table from those of a 'delete' + 'append'. As we discussed in the last [PyIceberg Sync](https://docs.google.com/document/d/1FnWwTJ5VBKdsvklPxt4r1Lb06DB2fFMJG3HITeafrU8/edit#heading=h.g2i9mi9crf0m), if we believe that a partitioned table overwrite should maintain the expectation of comparing the values provided in the overwrite_filter to the values in the arrow table, I think we need these validations. We would first need to run these validations on the predicate expression provided in the overwrite_filter, so that we can then compare the values in the arrow table.

In the community sync, we discussed whether the community was in favor of drawing a distinction between `delete + append` and `overwrite`, and I think we all gravitated _somewhat_ towards the latter. For example, where dt and level are both partition columns:

```python
overwrite_filter = "level = 'INFO' AND dt = '2024-03-25'"

# expected behavior -> deletes partition level = 'INFO' and dt = '2024-03-26'
df = pa.Table.from_pylist(
    [
        {"level": "INFO", "msg": "hi", "dt": date(2024, 3, 26)},
        {"level": "ERROR", "msg": "bye", "dt": date(2024, 3, 26)},
    ],
)
tbl.overwrite(df, overwrite_filter)
```

If we wanted to handle the validation only in the `delete` function by checking whether we would end up rewriting files, the above pattern would succeed by deleting level = 'INFO' and dt = '2024-02-01', because these are pure metadata operations. We would then add new data files for level = 'INFO' and dt = '2024-03-26', and level = 'ERROR' and dt = '2024-03-26'.

A static overwrite, on the other hand, would eagerly validate the predicate expression against the table schema and the values in the arrow table, and throw instead.
Re: [PR] Docs: Added Upsolver to vendor list [iceberg]
jasonf20 commented on code in PR #10096: URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556236365

## site/docs/vendors.md:

```diff
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query engine](https://trino.io
 ### [Tabular](https://tabular.io)

 [Tabular](https://tabular.io/product/) is a managed warehouse and automation platform. Tabular offers a central store for analytic data that can be used with any query engine or processing framework that supports Iceberg. Tabular warehouses add role-based access control and automatic optimization, clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing lakes cataloged by Tabular or AWS Glue Data Catalog, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC.
```

Review Comment: Currently Upsolver doesn't support JDBC / Nessie Iceberg catalogs. I agree that the generic phrasing doesn't necessarily mean we support all catalogs, though.
Re: [PR] Docs: Added Upsolver to vendor list [iceberg]
jasonf20 commented on code in PR #10096: URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556232872

## site/docs/vendors.md:

```diff
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query engine](https://trino.io
 ### [Tabular](https://tabular.io)

 [Tabular](https://tabular.io/product/) is a managed warehouse and automation platform. Tabular offers a central store for analytic data that can be used with any query engine or processing framework that supports Iceberg. Tabular warehouses add role-based access control and automatic optimization, clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing lakes cataloged by Tabular or AWS Glue Data Catalog, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC.
```

Review Comment: Sorry, I pushed to the wrong repository; the fix should be there now.
[PR] [draft] Use Parquet's getRowIndexOffset support instead of calculating it [iceberg]
wypoon opened a new pull request, #10107: URL: https://github.com/apache/iceberg/pull/10107

(no comment)
Re: [PR] Manifest list encryption [iceberg]
RussellSpitzer commented on code in PR #7770: URL: https://github.com/apache/iceberg/pull/7770#discussion_r1556171156

## core/src/main/java/org/apache/iceberg/BaseSnapshot.java:

```diff
@@ -90,7 +127,9 @@ class BaseSnapshot implements Snapshot {
     this.summary = summary;
     this.schemaId = schemaId;
     this.manifestListLocation = null;
+    this.manifestListSize = 0L;
     this.v1ManifestLocations = v1ManifestLocations;
+    this.manifestListKeyMetadata = null;
```

Review Comment: Same as comment above (group manifest vars)
Re: [PR] Docs: Added Upsolver to vendor list [iceberg]
ajantha-bhat commented on code in PR #10096: URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556087620

## site/docs/vendors.md:

```diff
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query engine](https://trino.io
 ### [Tabular](https://tabular.io)

 [Tabular](https://tabular.io/product/) is a managed warehouse and automation platform. Tabular offers a central store for analytic data that can be used with any query engine or processing framework that supports Iceberg. Tabular warehouses add role-based access control and automatic optimization, clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing lakes cataloged by Tabular or AWS Glue Data Catalog, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC.
```

Review Comment: @bitsondatadev is right. @jasonf20 didn't update the PR (perhaps forgot to push). I do have a question about this statement, "Upsolver connects to your existing lakes cataloged by Tabular or AWS Glue Data Catalog" (or the suggested statement with "REST and Hive catalogs"): does this mean Upsolver doesn't work with the other native Iceberg catalogs like JDBC, Nessie, and HMS? If it uses the generic catalog API, then it should work with all catalogs, and we can simplify the statement to "Upsolver connects to your existing Iceberg catalog".
[I] Spark procedure to compute partition stats. [iceberg]
ajantha-bhat opened a new issue, #10106: URL: https://github.com/apache/iceberg/issues/10106

### Feature Request / Improvement

Based on the experiments from https://github.com/apache/iceberg/pull/9437, a Spark action is not effective, as the serialization cost of each partition stats entry is expensive. We need a table API in the core module to compute stats in a distributed way. But we still need a SQL way to compute the partition stats; hence, we will call the core API via a SQL call procedure.

### Query engine

None
[I] Add a table API to compute partition stats. [iceberg]
ajantha-bhat opened a new issue, #10105: URL: https://github.com/apache/iceberg/issues/10105

### Feature Request / Improvement

Based on the experiments from https://github.com/apache/iceberg/pull/9437, a Spark action is not effective, as the serialization cost of each partition stats entry is expensive. We need a table API in the core module to compute stats in a distributed way. The local algorithm will be similar to the POC from https://github.com/apache/iceberg/pull/9437

### Query engine

None
[PR] Docs: Update releases.md for Spark scala versions [iceberg]
liko9 opened a new pull request, #10104: URL: https://github.com/apache/iceberg/pull/10104

I found the Releases page for downloads quite confusing about the versions involved, so I updated the language and links a bit to make them clearer.
Re: [PR] Docs: Added Upsolver to vendor list [iceberg]
bitsondatadev commented on code in PR #10096: URL: https://github.com/apache/iceberg/pull/10096#discussion_r1555998294

## site/docs/vendors.md:

```diff
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query engine](https://trino.io
 ### [Tabular](https://tabular.io)

 [Tabular](https://tabular.io/product/) is a managed warehouse and automation platform. Tabular offers a central store for analytic data that can be used with any query engine or processing framework that supports Iceberg. Tabular warehouses add role-based access control and automatic optimization, clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing lakes cataloged by Tabular or AWS Glue Data Catalog, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC.
```

Review Comment: Hey @jasonf20, I'm on my phone, so maybe it's not refreshing, but I don't see the update. I'll be back at my computer shortly to check on this.
Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]
yuqi1129 commented on issue #10100: URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2042946236

> @yuqi1129: Could setting the `client.pool.cache.eviction-interval-ms`, and decreasing the `clients` size help your case?

I'm afraid it won't work for me. Decreasing `client.pool.cache.eviction-interval-ms` doesn't change the fact that the `hive-metastore-cleaner` thread pool is a daemon thread; it is only closed automatically when the JVM exits. We can't close it from an isolated class loader, as we won't be able to get its reference.
Re: [PR] Extend HTTPClient Builder to allow setting a proxy server [iceberg]
nastra commented on code in PR #10052: URL: https://github.com/apache/iceberg/pull/10052#discussion_r1555942482

## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:

```diff
@@ -121,6 +128,103 @@ public void testHeadFailure() throws JsonProcessingException {
     testHttpMethodOnFailure(HttpMethod.HEAD);
   }

+  @Test
+  public void testProxyServer() throws IOException {
+    int proxyPort = 1070;
+    String proxyHostName = "localhost";
+    try (ClientAndServer proxyServer = startClientAndServer(proxyPort);
+        RESTClient clientWithProxy =
+            HTTPClient.builder(ImmutableMap.of())
+                .uri(URI)
+                .withProxy(proxyHostName, proxyPort)
+                .build()) {
+      String path = "v1/config";
+
+      HttpRequest mockRequest =
+          request("/" + path).withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+
+      HttpResponse mockResponse = response().withStatusCode(200);
+
+      mockServer.when(mockRequest).respond(mockResponse);
+      proxyServer.when(mockRequest).respond(mockResponse);
+
+      clientWithProxy.head(path, ImmutableMap.of(), (onError) -> {});
+      proxyServer.verify(mockRequest, VerificationTimes.exactly(1));
+    }
+  }
+
+  @Test
+  public void testProxyCredentialProviderWithoutProxyServerFailsBuild() throws IOException {
```

Review Comment:

```suggestion
  public void testProxyCredentialProviderWithoutProxyServer() throws IOException {
```
Re: [PR] Extend HTTPClient Builder to allow setting a proxy server [iceberg]
nastra commented on code in PR #10052: URL: https://github.com/apache/iceberg/pull/10052#discussion_r1555940535 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -121,6 +128,103 @@ public void testHeadFailure() throws JsonProcessingException { testHttpMethodOnFailure(HttpMethod.HEAD); } + @Test + public void testProxyServer() throws IOException { +int proxyPort = 1070; +String proxyHostName = "localhost"; +try (ClientAndServer proxyServer = startClientAndServer(proxyPort); +RESTClient clientWithProxy = +HTTPClient.builder(ImmutableMap.of()) +.uri(URI) +.withProxy(proxyHostName, proxyPort) +.build()) { + String path = "v1/config"; + + HttpRequest mockRequest = + request("/" + path).withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + + HttpResponse mockResponse = response().withStatusCode(200); + + mockServer.when(mockRequest).respond(mockResponse); + proxyServer.when(mockRequest).respond(mockResponse); + + clientWithProxy.head(path, ImmutableMap.of(), (onError) -> {}); + proxyServer.verify(mockRequest, VerificationTimes.exactly(1)); +} + } + + @Test + public void testProxyCredentialProviderWithoutProxyServerFailsBuild() throws IOException { +int proxyPort = 1070; +String proxyHostName = "localhost"; +String authorizedUsername = "test-username"; Review Comment: these can be inlined -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Extend HTTPClient Builder to allow setting a proxy server [iceberg]
nastra commented on code in PR #10052: URL: https://github.com/apache/iceberg/pull/10052#discussion_r1555939815 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -121,6 +128,103 @@ public void testHeadFailure() throws JsonProcessingException { testHttpMethodOnFailure(HttpMethod.HEAD); } + @Test + public void testProxyServer() throws IOException { +int proxyPort = 1070; +String proxyHostName = "localhost"; +try (ClientAndServer proxyServer = startClientAndServer(proxyPort); +RESTClient clientWithProxy = +HTTPClient.builder(ImmutableMap.of()) +.uri(URI) +.withProxy(proxyHostName, proxyPort) +.build()) { + String path = "v1/config"; + + HttpRequest mockRequest = + request("/" + path).withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + + HttpResponse mockResponse = response().withStatusCode(200); + + mockServer.when(mockRequest).respond(mockResponse); + proxyServer.when(mockRequest).respond(mockResponse); + + clientWithProxy.head(path, ImmutableMap.of(), (onError) -> {}); + proxyServer.verify(mockRequest, VerificationTimes.exactly(1)); +} + } + + @Test + public void testProxyCredentialProviderWithoutProxyServerFailsBuild() throws IOException { +int proxyPort = 1070; +String proxyHostName = "localhost"; +String authorizedUsername = "test-username"; +String authorizedPassword = "test-password"; +HttpHost proxy = new HttpHost(proxyHostName, proxyPort); +BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider(); +credentialsProvider.setCredentials( +new AuthScope(proxy), +new UsernamePasswordCredentials(authorizedUsername, authorizedPassword.toCharArray())); +Assertions.assertThatThrownBy( +() -> { + HTTPClient.builder(ImmutableMap.of()) + .uri(URI) + .withProxyCredentialsProvider(credentialsProvider) Review Comment: there should also be a separate test method where the proxy hostname is set to null -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Change DataScan to accept Metadata and io [iceberg-python]
Fokko merged PR #581: URL: https://github.com/apache/iceberg-python/pull/581 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]
pvary commented on issue #10100: URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2042811946 @yuqi1129: Could setting `client.pool.cache.eviction-interval-ms` and decreasing the `clients` size help your case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Pyarrow type error [iceberg-python]
bigluck commented on issue #541: URL: https://github.com/apache/iceberg-python/issues/541#issuecomment-2042792612 @Fokko it sounds good to me! :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Pyarrow type error [iceberg-python]
Fokko commented on issue #541: URL: https://github.com/apache/iceberg-python/issues/541#issuecomment-2042789157 Ciao @bigluck. Thanks for jumping in here. Until V3 is finalized, we can add a flag to cast nanosecond precision to microsecond precision. Would that work for you? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
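To illustrate what such a flag could do, here is a minimal sketch of the downcast with pyarrow (an illustration only, not pyiceberg's actual implementation; `downcast_ns_to_us` is a hypothetical helper):

```python
import pyarrow as pa

def downcast_ns_to_us(table: pa.Table) -> pa.Table:
    """Cast timestamp[ns] columns (tz-aware or not) down to microsecond
    precision so they fit the Iceberg V1/V2 timestamp types."""
    fields = []
    for f in table.schema:
        if pa.types.is_timestamp(f.type) and f.type.unit == "ns":
            fields.append(pa.field(f.name, pa.timestamp("us", tz=f.type.tz),
                                   nullable=f.nullable, metadata=f.metadata))
        else:
            fields.append(f)
    # safe=False permits the lossy ns -> us truncation
    return table.cast(pa.schema(fields), safe=False)

t = pa.table({"ds": pa.array([1_700_000_000_000_000_123], type=pa.timestamp("ns"))})
print(downcast_ns_to_us(t).schema)  # ds: timestamp[us]
```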
Re: [PR] Flink: Don't fail to serialize IcebergSourceSplit when there is too many delete files [iceberg]
elkhand commented on PR #9464: URL: https://github.com/apache/iceberg/pull/9464#issuecomment-2042732429 @javrasya, I wanted to check whether you will have time to update the PR this week. If not, I can take over and update this PR accordingly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Pyarrow type error [iceberg-python]
bigluck commented on issue #541: URL: https://github.com/apache/iceberg-python/issues/541#issuecomment-2042703204 So, `timestamp_ns` & `timestamptz_ns` were added in `v3` of the Iceberg spec; pyiceberg currently supports `v1` & `v2`. In my case, the column was generated by this user-provided snippet: ``` df['ds'] = pd.to_datetime(df["pickup_datetime"]).dt.date ``` Unfortunately, I cannot control what a user writes or how she produces the table. What's the recommended solution for downcasting unsupported column types into something less precise, without raising an error? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Pyarrow type error [iceberg-python]
bigluck commented on issue #541: URL: https://github.com/apache/iceberg-python/issues/541#issuecomment-2042634726 I'm facing a similar issue in my code. Tested using main@7fcdb8d25dfa2498ba98a2b8e8d2b327d85fa7c9 (the commit after `Minor fixes, #523 followup (#563)` and `Cast data to Iceberg Table's pyarrow schema (#523)`) In my case I'm creating a new table from this arrow schema: ``` ds: timestamp[ns] yhat: double yhat_lower: double yhat_upper: double -- schema metadata -- ``` This is the full stacktrace: ``` Traceback (most recent call last): File "/bpln/cba723bb/82eddf43/pip/runtime/s3write/invoke.py", line 62, in invoke self._pyiceberg_write_model(model) File "/bpln/cba723bb/82eddf43/pip/runtime/s3write/invoke.py", line 137, in _pyiceberg_write_model pyiceberg_table = catalog.create_table( ^ File "/tmp/runtime/shared/pyiceberg_patch.py", line 105, in create_table iceberg_schema = self._convert_schema_if_needed(schema) ^^ File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/catalog/__init__.py", line 559, in _convert_schema_if_needed schema: Schema = visit_pyarrow(schema, _ConvertToIcebergWithoutIDs()) # type: ignore File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper return dispatch(args[0].__class__)(*args, **kw) File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/io/pyarrow.py", line 682, in _ return visitor.schema(obj, visit_pyarrow(pa.struct(obj), visitor)) ^^ File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper return dispatch(args[0].__class__)(*args, **kw) File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/io/pyarrow.py", line 691, in _ result = visit_pyarrow(field.type, visitor) ^^ File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper return dispatch(args[0].__class__)(*args, **kw) File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/io/pyarrow.py", line 726, in _ return visitor.primitive(obj) ^^ File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/io/pyarrow.py", line 899, in primitive raise TypeError(f"Unsupported type: {primitive}") TypeError: Unsupported type: timestamp[ns] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg use PARTITIONED BY (days(ts)) cause wrong partition name [iceberg]
tcguanshuhuai closed issue #10102: Iceberg use PARTITIONED BY (days(ts)) cause wrong partition name URL: https://github.com/apache/iceberg/issues/10102 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg use PARTITIONED BY (days(ts)) cause wrong partition name [iceberg]
tcguanshuhuai commented on issue #10102: URL: https://github.com/apache/iceberg/issues/10102#issuecomment-2042571758 Thanks a lot. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg use PARTITIONED BY (days(ts)) cause wrong partition name [iceberg]
Kapkiai commented on issue #10102: URL: https://github.com/apache/iceberg/issues/10102#issuecomment-2042569175 I don't think it's a bug. Converting the epoch gives '2024-06-20 16:00:00' at UTC. I believe Spark interprets timestamps at UTC based on its tz/system settings, and you are in a different timezone. I have faced similar challenges before. When I really don't need timezone info, I mostly prefer using Spark's timestamp_ntz. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
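The interpretation above is easy to reproduce. A small sketch, assuming an illustrative epoch of 1718899200 seconds (2024-06-20 16:00:00 UTC): the same instant renders as a different wall-clock day, and therefore suggests a different `days(ts)` partition, depending on the zone used to display it:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

epoch_s = 1_718_899_200  # 2024-06-20 16:00:00 UTC

# The days(ts) transform works on the UTC instant: partition 2024-06-20
print(datetime.fromtimestamp(epoch_s, tz=timezone.utc))
# A session in UTC+8 displays the same instant as the next day, 2024-06-21,
# which is the kind of mismatch reported in this issue
print(datetime.fromtimestamp(epoch_s, tz=ZoneInfo("Asia/Shanghai")))
```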
Re: [PR] Docs: Added Upsolver to vendor list [iceberg]
jasonf20 commented on code in PR #10096: URL: https://github.com/apache/iceberg/pull/10096#discussion_r1555685176 ## site/docs/vendors.md: ## @@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query engine](https://trino.io ### [Tabular](https://tabular.io) [Tabular](https://tabular.io/product/) is a managed warehouse and automation platform. Tabular offers a central store for analytic data that can be used with any query engine or processing framework that supports Iceberg. Tabular warehouses add role-based access control and automatic optimization, clustering, and compaction to Iceberg tables. + +### [Upsolver](https://upsolver.com) + +[Upsolver](https://upsolver.com) is a simple to use, high volume streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing lakes cataloged by Tabular or AWS Glue Data Catalog, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC. Review Comment: Sure, submitted an update with this suggestion. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Docs: Added Upsolver to vendor list [iceberg]
bitsondatadev commented on code in PR #10096: URL: https://github.com/apache/iceberg/pull/10096#discussion_r1554976560 ## site/docs/vendors.md: ## @@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query engine](https://trino.io ### [Tabular](https://tabular.io) [Tabular](https://tabular.io/product/) is a managed warehouse and automation platform. Tabular offers a central store for analytic data that can be used with any query engine or processing framework that supports Iceberg. Tabular warehouses add role-based access control and automatic optimization, clustering, and compaction to Iceberg tables. + +### [Upsolver](https://upsolver.com) + +[Upsolver](https://upsolver.com) is a simple to use, high volume streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing lakes cataloged by Tabular or AWS Glue Data Catalog, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC. Review Comment: ```suggestion [Upsolver](https://upsolver.com) is a streaming data ingestion and table management solution for Apache Iceberg. With Upsolver, users can easily ingest batch and streaming data from files, streams and databases (CDC) into [Iceberg tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables). In addition, Upsolver connects to your existing REST and Hive catalogs, and [analyzes the health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables) of your tables. Use Upsolver to continuously optimize tables by compacting small files, sorting and compressing, repartitioning, and cleaning up dangling files and expired manifests. Upsolver is available from the [Upsolver Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can be deployed in your AWS VPC. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] iceberg-aws: Switch tests to JUnit5 + AssertJ-style assertions [iceberg]
nastra closed issue #9080: iceberg-aws: Switch tests to JUnit5 + AssertJ-style assertions URL: https://github.com/apache/iceberg/issues/9080 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Move iceberg-orc files to iceberg-core module [iceberg]
ajantha-bhat closed issue #8454: Move iceberg-orc files to iceberg-core module URL: https://github.com/apache/iceberg/issues/8454 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Move iceberg-parquet files to iceberg-core module [iceberg]
ajantha-bhat closed issue #8453: Move iceberg-parquet files to iceberg-core module URL: https://github.com/apache/iceberg/issues/8453 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
nastra merged PR #10086: URL: https://github.com/apache/iceberg/pull/10086 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Implement spark action to compute partition stats [iceberg]
ajantha-bhat closed issue #8459: Implement spark action to compute partition stats URL: https://github.com/apache/iceberg/issues/8459 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Implement spark action to compute partition stats [iceberg]
ajantha-bhat commented on issue #8459: URL: https://github.com/apache/iceberg/issues/8459#issuecomment-2042446172 Based on the experiments from https://github.com/apache/iceberg/pull/9437, a Spark action is not effective, as the serialization cost of each partition stats entry is expensive. I will instead implement an API in the core module to compute stats in a distributed way, and a Spark procedure to call this API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Implement incremental update using commit stats (SnapshotSummary) [iceberg]
ajantha-bhat commented on issue #8461: URL: https://github.com/apache/iceberg/issues/8461#issuecomment-2042443146 Lower priority; I will be working on this once all other planned activities from #8450 are done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
nastra commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555620195 ## aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java: ## @@ -229,10 +218,10 @@ public void testRenameTable() { glueCatalog.renameTable( TableIdentifier.of(namespace, tableName), TableIdentifier.of(namespace, newTableName)); Table renamedTable = glueCatalog.loadTable(TableIdentifier.of(namespace, newTableName)); -Assert.assertEquals(table.location(), renamedTable.location()); -Assert.assertEquals(table.schema().toString(), renamedTable.schema().toString()); -Assert.assertEquals(table.spec(), renamedTable.spec()); -Assert.assertEquals(table.currentSnapshot(), renamedTable.currentSnapshot()); +assertThat(renamedTable.location()).isEqualTo(table.location()); + assertThat(renamedTable.schema()).asString().isEqualTo(table.schema().toString()); Review Comment: it's because `Schema` doesn't implement equals() -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Implement incremental update using commit stats (SnapshotSummary) [iceberg]
ajantha-bhat opened a new issue, #8461: URL: https://github.com/apache/iceberg/issues/8461 ### Feature Request / Improvement This could be an experimental direction and can be controlled by a flag. This might bloat up the table metadata file size when millions of partitions are present. ### Query engine None -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
nk1506 commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r186399 ## aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java: ## @@ -229,10 +218,10 @@ public void testRenameTable() { glueCatalog.renameTable( TableIdentifier.of(namespace, tableName), TableIdentifier.of(namespace, newTableName)); Table renamedTable = glueCatalog.loadTable(TableIdentifier.of(namespace, newTableName)); -Assert.assertEquals(table.location(), renamedTable.location()); -Assert.assertEquals(table.schema().toString(), renamedTable.schema().toString()); -Assert.assertEquals(table.spec(), renamedTable.spec()); -Assert.assertEquals(table.currentSnapshot(), renamedTable.currentSnapshot()); +assertThat(renamedTable.location()).isEqualTo(table.location()); + assertThat(renamedTable.schema()).asString().isEqualTo(table.schema().toString()); Review Comment: Yeah, just wondering why the PartitionSpec and Snapshot comparisons are working. As per my understanding, `isEqualTo` compares object values, so ideally it should not fail unless I am missing something. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Change DataScan to accept Metadata and io [iceberg-python]
Fokko commented on code in PR #581: URL: https://github.com/apache/iceberg-python/pull/581#discussion_r142312 ## pyiceberg/io/pyarrow.py: ## @@ -1089,7 +1091,7 @@ def project_table( deletes_per_file.get(task.file.file_path), case_sensitive, limit, -table.name_mapping(), +None, Review Comment: Ah, yes, I just added `None` temporarily to silence the IDEA, but then forgot about it. Thanks for raising the PR :D Good to have this tested as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r128393 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +138,59 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testSocketAndConnectionTimeoutSet() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testSocketTimeout() throws IOException { +long socketTimeoutMs = 2000L; +Map properties = +ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); +String path = "socket/timeout/path"; + +try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) { + HttpRequest mockRequest = + request() + .withPath("/" + path) + .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of + // 2 seconds + HttpResponse mockResponse = + response() + .withStatusCode(200) + .withBody("Delayed response") + .withDelay(TimeUnit.MILLISECONDS, 5000); + mockServer.when(mockRequest).respond(mockResponse); + + Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {})) + .cause() + .isInstanceOf(SocketTimeoutException.class); +} + } + + @ParameterizedTest + @ValueSource(strings = {HTTPClient.REST_CONNECTION_TIMEOUT_MS, HTTPClient.REST_SOCKET_TIMEOUT_MS}) + public void testInvalidTimeout(String timeoutMsType) { Review Comment: can you also please add a test where the timeout is -1? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Validate overwrite filter [iceberg-python]
Fokko commented on code in PR #582: URL: https://github.com/apache/iceberg-python/pull/582#discussion_r1555374364 ## pyiceberg/io/pyarrow.py: ## @@ -1776,7 +1776,10 @@ def write_parquet(task: WriteTask) -> DataFile: fo = io.new_output(file_path) with fo.create(overwrite=True) as fos: with pq.ParquetWriter(fos, schema=arrow_file_schema, **parquet_writer_kwargs) as writer: -writer.write(pa.Table.from_batches(task.record_batches), row_group_size=row_group_size) +arrow_table = pa.Table.from_batches(task.record_batches) +# align the columns accordingly in case input arrow table has columns in order different from iceberg table Review Comment: Can you provide an example of when this would happen? This only handles top-level columns. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
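For context on when the orders can diverge: nothing stops a user from building the Arrow table with columns in a different order than the Iceberg schema. Reordering top-level columns in pyarrow is a cheap, metadata-only operation (a sketch with hypothetical column names):

```python
import pyarrow as pa

# The user-provided table has columns in a different order than the schema
t = pa.table({"yhat": [1.0], "ds": ["2024-04-08"]})

# Align to the (hypothetical) Iceberg schema order before writing
aligned = t.select(["ds", "yhat"])
print(aligned.column_names)  # ['ds', 'yhat']
```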
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r125438 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +138,59 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testSocketAndConnectionTimeoutSet() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testSocketTimeout() throws IOException { +long socketTimeoutMs = 2000L; +Map properties = +ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); +String path = "socket/timeout/path"; + +try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) { + HttpRequest mockRequest = + request() + .withPath("/" + path) + .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of + // 2 seconds + HttpResponse mockResponse = + response() + .withStatusCode(200) + .withBody("Delayed response") + .withDelay(TimeUnit.MILLISECONDS, 5000); + mockServer.when(mockRequest).respond(mockResponse); + + Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {})) + .cause() + .isInstanceOf(SocketTimeoutException.class); Review Comment: this should also have `.hasMessage("Read timed out")` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
tomtongue commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r117972 ## aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java: ## @@ -258,10 +247,10 @@ public void testRenameTableFailsToCreateNewTable() { .hasMessageContaining("Table already exists"); // old table can still be read with same metadata Table oldTable = glueCatalog.loadTable(id); -Assert.assertEquals(table.location(), oldTable.location()); -Assert.assertEquals(table.schema().toString(), oldTable.schema().toString()); -Assert.assertEquals(table.spec(), oldTable.spec()); -Assert.assertEquals(table.currentSnapshot(), oldTable.currentSnapshot()); +assertThat(oldTable.location()).isEqualTo(table.location()); + assertThat(oldTable.schema()).asString().isEqualTo(table.schema().toString()); Review Comment: As commented above, it's necessary because tests comparing the objects directly fail. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
tomtongue commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r119947 ## aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java: ## @@ -84,43 +84,35 @@ public void testCreateTable() { // verify table exists in Glue GetTableResponse response = glue.getTable(GetTableRequest.builder().databaseName(namespace).name(tableName).build()); -Assert.assertEquals(namespace, response.table().databaseName()); -Assert.assertEquals(tableName, response.table().name()); -Assert.assertEquals( - BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH), - response.table().parameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP)); -Assert.assertTrue( -response -.table() -.parameters() -.containsKey(BaseMetastoreTableOperations.METADATA_LOCATION_PROP)); -Assert.assertEquals( -schema.columns().size(), response.table().storageDescriptor().columns().size()); -Assert.assertEquals(partitionSpec.fields().size(), response.table().partitionKeys().size()); -Assert.assertEquals( -"additionalLocations should match", - tableLocationProperties.values().stream().sorted().collect(Collectors.toList()), -response.table().storageDescriptor().additionalLocations().stream() -.sorted() -.collect(Collectors.toList())); +assertThat(response.table().databaseName()).isEqualTo(namespace); +assertThat(response.table().name()).isEqualTo(tableName); +assertThat(response.table().parameters()) +.containsEntry( +BaseMetastoreTableOperations.TABLE_TYPE_PROP, + BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH)) +.containsKey(BaseMetastoreTableOperations.METADATA_LOCATION_PROP); + assertThat(response.table().storageDescriptor().columns()).hasSameSizeAs(schema.columns()); + assertThat(response.table().partitionKeys()).hasSameSizeAs(partitionSpec.fields()); +assertThat(response.table().storageDescriptor().additionalLocations()) Review Comment: thanks for the suggestion. I think the suggestion drops the sorting check. So this uses `isEqualTo` and it's not flaky. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
tomtongue commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r117631 ## aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java: ## @@ -229,10 +218,10 @@ public void testRenameTable() { glueCatalog.renameTable( TableIdentifier.of(namespace, tableName), TableIdentifier.of(namespace, newTableName)); Table renamedTable = glueCatalog.loadTable(TableIdentifier.of(namespace, newTableName)); -Assert.assertEquals(table.location(), renamedTable.location()); -Assert.assertEquals(table.schema().toString(), renamedTable.schema().toString()); -Assert.assertEquals(table.spec(), renamedTable.spec()); -Assert.assertEquals(table.currentSnapshot(), renamedTable.currentSnapshot()); +assertThat(renamedTable.location()).isEqualTo(table.location()); + assertThat(renamedTable.schema()).asString().isEqualTo(table.schema().toString()); Review Comment: It's necessary because tests comparing the objects directly fail. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
nk1506 commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r113397 ## aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java: ## @@ -229,10 +218,10 @@ public void testRenameTable() { glueCatalog.renameTable( TableIdentifier.of(namespace, tableName), TableIdentifier.of(namespace, newTableName)); Table renamedTable = glueCatalog.loadTable(TableIdentifier.of(namespace, newTableName)); -Assert.assertEquals(table.location(), renamedTable.location()); -Assert.assertEquals(table.schema().toString(), renamedTable.schema().toString()); -Assert.assertEquals(table.spec(), renamedTable.spec()); -Assert.assertEquals(table.currentSnapshot(), renamedTable.currentSnapshot()); +assertThat(renamedTable.location()).isEqualTo(table.location()); + assertThat(renamedTable.schema()).asString().isEqualTo(table.schema().toString()); Review Comment: nit: just wondering why we cast to `asString()` here? Other object comparisons are done using `isEqualTo`. ## aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java: ## @@ -258,10 +247,10 @@ public void testRenameTableFailsToCreateNewTable() { .hasMessageContaining("Table already exists"); // old table can still be read with same metadata Table oldTable = glueCatalog.loadTable(id); -Assert.assertEquals(table.location(), oldTable.location()); -Assert.assertEquals(table.schema().toString(), oldTable.schema().toString()); -Assert.assertEquals(table.spec(), oldTable.spec()); -Assert.assertEquals(table.currentSnapshot(), oldTable.currentSnapshot()); +assertThat(oldTable.location()).isEqualTo(table.location()); + assertThat(oldTable.schema()).asString().isEqualTo(table.schema().toString()); Review Comment: same -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] read from Iceberg table throw java.lang.ArrayIndexOutOfBoundsException: 3 [iceberg]
nastra commented on issue #10103: URL: https://github.com/apache/iceberg/issues/10103#issuecomment-2042278015 @jiantao-vungle do you have a small reproducible example? Without that it's quite difficult to reproduce this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
nk1506 commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r103258 ## aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java: ## @@ -84,43 +84,35 @@ public void testCreateTable() { // verify table exists in Glue GetTableResponse response = glue.getTable(GetTableRequest.builder().databaseName(namespace).name(tableName).build()); -Assert.assertEquals(namespace, response.table().databaseName()); -Assert.assertEquals(tableName, response.table().name()); -Assert.assertEquals( - BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH), - response.table().parameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP)); -Assert.assertTrue( -response -.table() -.parameters() -.containsKey(BaseMetastoreTableOperations.METADATA_LOCATION_PROP)); -Assert.assertEquals( -schema.columns().size(), response.table().storageDescriptor().columns().size()); -Assert.assertEquals(partitionSpec.fields().size(), response.table().partitionKeys().size()); -Assert.assertEquals( -"additionalLocations should match", - tableLocationProperties.values().stream().sorted().collect(Collectors.toList()), -response.table().storageDescriptor().additionalLocations().stream() -.sorted() -.collect(Collectors.toList())); +assertThat(response.table().databaseName()).isEqualTo(namespace); +assertThat(response.table().name()).isEqualTo(tableName); +assertThat(response.table().parameters()) +.containsEntry( +BaseMetastoreTableOperations.TABLE_TYPE_PROP, + BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH)) +.containsKey(BaseMetastoreTableOperations.METADATA_LOCATION_PROP); + assertThat(response.table().storageDescriptor().columns()).hasSameSizeAs(schema.columns()); + assertThat(response.table().partitionKeys()).hasSameSizeAs(partitionSpec.fields()); +assertThat(response.table().storageDescriptor().additionalLocations()) Review Comment: Previously the assertion was happening after sorting the values. Do you think it can become flaky someday? Why not something like `assertThat(response.table().storageDescriptor().additionalLocations()).containsAll(tableLocationProperties.values());` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
nk1506 commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555485467 ## aws/src/integration/java/org/apache/iceberg/aws/dynamodb/TestDynamoDbCatalog.java: ## @@ -99,22 +100,23 @@ public void testCreateNamespace() { .tableName(catalogTableName) .key(DynamoDbCatalog.namespacePrimaryKey(namespace)) .build()); -Assert.assertTrue("namespace must exist", response.hasItem()); -Assert.assertEquals( -"namespace must be stored in DynamoDB", -namespace.toString(), -response.item().get("namespace").s()); -Assertions.assertThatThrownBy(() -> catalog.createNamespace(namespace)) +assertThat(response.hasItem()).as("namespace must exist").isTrue(); +assertThat(response.item()) +.as("namespace must be stored in DynamoDB") +.hasEntrySatisfying( +"namespace", +attributeValue -> assertThat(attributeValue.s()).isEqualTo(namespace.toString())); Review Comment: Makes sense, I think `hasEntrySatisfying` sounds good as per the above reasons. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
harishch1998 commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1555456733 ## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java: ## @@ -133,6 +136,67 @@ public void testDynamicHttpRequestInterceptorLoading() { assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties); } + @Test + public void testSocketAndConnectionTimeoutSet() { +long connectionTimeoutMs = 10L; +int socketTimeoutMs = 10; +Map properties = +ImmutableMap.of( +HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs), +HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); + +ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties); +assertThat(connectionConfig).isNotNull(); + assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs); + assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs); + } + + @Test + public void testSocketTimeout() throws IOException { +long socketTimeoutMs = 2000L; +Map properties = +ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs)); +String path = "socket/timeout/path"; + +try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) { + HttpRequest mockRequest = + request() + .withPath("/" + path) + .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT)); + // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of + // 2 seconds + HttpResponse mockResponse = + response() + .withStatusCode(200) + .withBody("Delayed response") + .withDelay(TimeUnit.MILLISECONDS, 5000); + mockServer.when(mockRequest).respond(mockResponse); + + Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {})) + .cause() + .isInstanceOf(SocketTimeoutException.class); +} + } + + @Test + public void testInvalidConnectionTimeout() { +testInvalidTimeouts(HTTPClient.REST_CONNECTION_TIMEOUT_MS); + } + + @Test + public void testInvalidSocketTimeout() { +testInvalidTimeouts(HTTPClient.REST_SOCKET_TIMEOUT_MS); + } + + private void testInvalidTimeouts(String timeoutMsType) { Review Comment: I'm so sorry for the repeated reviews. Fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Validate overwrite filter [iceberg-python]
Fokko commented on PR #582: URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2042152141 Hi Adrian, thanks for working on this and the very comprehensive write-up. My first question is: what is the main goal of this PR? Let me elaborate with more context. Looking at the Spark syntax, this originated from the Hive era and assumes that there is a single partition spec on the table. With Iceberg, the partitioning can evolve over time, and therefore older partitioning strategies can also be present. ```sql CREATE TABLE prod.my_app.logs ( env STRING, logline STRING, created_at TIMESTAMP_NTZ -- Nobody likes timezones ) USING iceberg PARTITIONED BY (months(created_at)) -- Later on when there is more data REPLACE PARTITION FIELD dt_month WITH days(created_at) -- Or, when we want to split the logs per environment REPLACE PARTITION FIELD dt_month WITH days(env, months(created_at)) ``` There is a fair chance that we have to rewrite actual data. When updating the spec, a full rewrite of the table is not done directly; otherwise it would be very costly to update the partition spec on big tables. If the data is still partitioned monthly and you update data using the new (daily) partitioning, it will read in the Parquet files that match the filter. Let's say that you do `INSERT OVERWRITE PARTITION (created_at='2024-04-08')`: it will read in data that has been written into a monthly partitioning, filter out the data for `2024-04-08`, and write back the remaining data. The new data will be appended to the table using the new partitioning scheme. > Example spark sql counterpart in iceberg spark static overwrite for full table overwrite is I think the top one is a [dynamic overwrite](https://iceberg.apache.org/docs/latest/spark-writes/#dynamic-overwrite) since it does not explicitly call out the partition that it will overwrite. In that case we should compute all the affected partition values by applying the current partition spec on the dataframe. Based on that we can first delete the affected partitions, and then append the data. ```python def overwrite_dynamic(df: pa.Table) -> None: ``` The second one looks like a static overwrite: ```python def overwrite(df: pa.Table, expr: Union[BooleanExpression, str]) -> None: ... overwrite(df, "level = 'INFO'") ``` I agree that this is a bit odd: ``` mismatched input '>' expecting {')', ','}(line 3, pos 22) == SQL == INSERT OVERWRITE lacus.test.spark_staticoverwrite_partition_clause_and_data_reltship_multicols PARTITION (n_legs > 2) --^^^ SELECT color,animals FROM tmp_view ``` Open questions: - But in Iceberg there is no technical reason to not allow this. Do we want to prevent the user from doing this? - Should we add an option that will blow up when we have to rewrite files? If you do your predicates correctly and you don't evolve the partition spec, then no parquet files should be opened and your overwrites will be crazy efficient. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
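To make the static path concrete, here is a usage sketch (hypothetical catalog, table, and column names; `overwrite_filter` is the parameter shape this PR is working toward, not a settled API):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("default")  # assumes a configured catalog named "default"
table = catalog.load_table("my_app.logs")

new_info_rows = pa.table({
    "env": ["prod", "prod"],
    "logline": ["a", "b"],
    "level": ["INFO", "INFO"],
})

# Static overwrite: delete all rows matching the predicate, then append the
# new data. If the predicate aligns with partition boundaries and the spec
# has not evolved, whole files are dropped and no Parquet file is rewritten.
table.overwrite(new_info_rows, overwrite_filter=EqualTo("level", "INFO"))
```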
[I] java.lang.ArrayIndexOutOfBoundsException: 3 [iceberg]
jiantao-vungle opened a new issue, #10103: URL: https://github.com/apache/iceberg/issues/10103 ### Apache Iceberg version 1.3.1 ### Query engine Spark ### Please describe the bug Environment Spark: 3.4.1 Iceberg: 1.3.1 Description The following exception is thrown when reading from an Iceberg table. I did some research and found that the table has a column named `publisher_payout_type_at_delivery` with only three options (`CPM`, `REVENUE_SHARE` and `FLAT_CPM`), but the exception suggests that a fourth option was used. Exception ``` Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:63) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 4.0 failed 4 times, most recent failure: Lost task 38.3 in stage 4.0 (TID 314) (172.26.2.4 executor 4): java.lang.ArrayIndexOutOfBoundsException: 3 at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory$DictionaryStringAccessor.getUTF8String(GenericArrowVectorAccessorFactory.java:423) at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:138) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) Caused by: java.lang.ArrayIndexOutOfBoundsException: 3 at
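To make the failure mode concrete (this is an editorial sketch, not code from the report or from Iceberg): a dictionary-encoded string column stores per-row integer codes that index into a dictionary of distinct values. If a code of 3 reaches a dictionary holding only the three values above, the lookup fails exactly like the accessor in the stack trace.

```java
// Minimal illustration of the failing lookup (hypothetical, for clarity only).
public class DictionaryLookupDemo {
  public static void main(String[] args) {
    String[] dictionary = {"CPM", "REVENUE_SHARE", "FLAT_CPM"}; // 3 entries, valid codes 0..2
    int encodedCode = 3; // a code pointing past the end of the dictionary
    // Throws java.lang.ArrayIndexOutOfBoundsException (reported as ": 3" on Java 8)
    System.out.println(dictionary[encodedCode]);
  }
}
```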
Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]
Fokko commented on issue #584: URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2042077008

Generated a Parquet file using both Spark and Python:

![image](https://github.com/apache/iceberg-python/assets/1134248/b382632a-e5ef-4c3d-82bd-6efbe2ced53f)
![image](https://github.com/apache/iceberg-python/assets/1134248/a79c1cc3-f4ff-41d4-99d8-837475cf4be8)

Looking at the Python file:
```
parq 0-0-624f6b17-7d8d-435e-9fbe-217caeb25ca4.parquet --schema

# Schema
required group field_id=-1 schema {
  optional double field_id=1 ABC-GG-1-A;
}
```
And the Spark file:
```
parq 0-0-875786b8-dbcd-4f8c-80a7-d261205a0333-0-1.parquet --schema

# Schema
required group field_id=-1 table {
  required double field_id=1 ABC_x2DGG_x2D1_x2DA;
}
```
I would argue that the Python one is correct, but I'm probably missing some context. The names shouldn't matter in the end, so we should probably look columns up by field ID.
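A sketch of what ID-based resolution could look like with parquet-mr (an editorial illustration; the helper name and the idea of applying it here are assumptions, not code from either project). Both files above carry `field_id=1`, so resolving by ID sidesteps the `ABC-GG-1-A` vs `ABC_x2DGG_x2D1_x2DA` naming mismatch:

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public final class FieldIdResolver {
  private FieldIdResolver() {}

  /** Returns the top-level field carrying the given Iceberg field ID, or null if absent. */
  public static Type findByFieldId(MessageType fileSchema, int fieldId) {
    for (Type field : fileSchema.getFields()) {
      if (field.getId() != null && field.getId().intValue() == fieldId) {
        return field; // matches regardless of how the writer spelled the column name
      }
    }
    return null;
  }
}
```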
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1555349405

## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
## @@ -133,6 +136,67 @@ public void testDynamicHttpRequestInterceptorLoading() {
     assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testSocketAndConnectionTimeoutSet() {
+    long connectionTimeoutMs = 10L;
+    int socketTimeoutMs = 10;
+    Map<String, String> properties =
+        ImmutableMap.of(
+            HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs),
+            HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs));
+
+    ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties);
+    assertThat(connectionConfig).isNotNull();
+    assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+    assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testSocketTimeout() throws IOException {
+    long socketTimeoutMs = 2000L;
+    Map<String, String> properties =
+        ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs));
+    String path = "socket/timeout/path";
+
+    try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+      HttpRequest mockRequest =
+          request()
+              .withPath("/" + path)
+              .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+      // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of
+      // 2 seconds
+      HttpResponse mockResponse =
+          response()
+              .withStatusCode(200)
+              .withBody("Delayed response")
+              .withDelay(TimeUnit.MILLISECONDS, 5000);
+      mockServer.when(mockRequest).respond(mockResponse);
+
+      Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {}))
+          .cause()
+          .isInstanceOf(SocketTimeoutException.class);
+    }
+  }
+
+  @Test
+  public void testInvalidConnectionTimeout() {
+    testInvalidTimeouts(HTTPClient.REST_CONNECTION_TIMEOUT_MS);
+  }
+
+  @Test
+  public void testInvalidSocketTimeout() {
+    testInvalidTimeouts(HTTPClient.REST_SOCKET_TIMEOUT_MS);
+  }
+
+  private void testInvalidTimeouts(String timeoutMsType) {

Review Comment:
```suggestion
  @ParameterizedTest
  @ValueSource(strings = {HTTPClient.REST_CONNECTION_TIMEOUT_MS, HTTPClient.REST_SOCKET_TIMEOUT_MS})
  public void testInvalidTimeout(String timeoutType) {
```
Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]
nastra commented on code in PR #10053: URL: https://github.com/apache/iceberg/pull/10053#discussion_r1555349405

## core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
## @@ -133,6 +136,67 @@ public void testDynamicHttpRequestInterceptorLoading() {
     assertThat(((TestHttpRequestInterceptor) interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testSocketAndConnectionTimeoutSet() {
+    long connectionTimeoutMs = 10L;
+    int socketTimeoutMs = 10;
+    Map<String, String> properties =
+        ImmutableMap.of(
+            HTTPClient.REST_CONNECTION_TIMEOUT_MS, String.valueOf(connectionTimeoutMs),
+            HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs));
+
+    ConnectionConfig connectionConfig = HTTPClient.configureConnectionConfig(properties);
+    assertThat(connectionConfig).isNotNull();
+    assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+    assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testSocketTimeout() throws IOException {
+    long socketTimeoutMs = 2000L;
+    Map<String, String> properties =
+        ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, String.valueOf(socketTimeoutMs));
+    String path = "socket/timeout/path";
+
+    try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+      HttpRequest mockRequest =
+          request()
+              .withPath("/" + path)
+              .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+      // Setting a response delay of 5 seconds to simulate hitting the configured socket timeout of
+      // 2 seconds
+      HttpResponse mockResponse =
+          response()
+              .withStatusCode(200)
+              .withBody("Delayed response")
+              .withDelay(TimeUnit.MILLISECONDS, 5000);
+      mockServer.when(mockRequest).respond(mockResponse);
+
+      Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), (unused) -> {}))
+          .cause()
+          .isInstanceOf(SocketTimeoutException.class);
+    }
+  }
+
+  @Test
+  public void testInvalidConnectionTimeout() {
+    testInvalidTimeouts(HTTPClient.REST_CONNECTION_TIMEOUT_MS);
+  }
+
+  @Test
+  public void testInvalidSocketTimeout() {
+    testInvalidTimeouts(HTTPClient.REST_SOCKET_TIMEOUT_MS);
+  }
+
+  private void testInvalidTimeouts(String timeoutMsType) {

Review Comment:
```suggestion
  private void checkInvalidTimeout(String timeoutMsType) {
```
Re: [PR] Docs: Fix On-screen display issues and minor expressions on Branching and Tagging DDL [iceberg]
nastra commented on code in PR #10091: URL: https://github.com/apache/iceberg/pull/10091#discussion_r1555345466

## docs/docs/spark-ddl.md:
## @@ -499,17 +500,18 @@ AS OF VERSION 1234
 -- CREATE audit-branch at snapshot 1234, retain audit-branch for 31 days, and retain the latest 31 days. The latest 3 snapshot snapshots, and 2 days worth of snapshots.
 ALTER TABLE prod.db.sample CREATE BRANCH `audit-branch`
-AS OF VERSION 1234 RETAIN 30 DAYS
+AS OF VERSION 1234 RETAIN 31 DAYS

Review Comment:
rather than changing it here, can you please change it in the description above?
Re: [PR] #9073 Junit 4 tests switched to JUnit 5 [iceberg]
nastra commented on code in PR #9793: URL: https://github.com/apache/iceberg/pull/9793#discussion_r1555322359

## data/src/test/java/org/apache/iceberg/data/TestDataFileIndexStatsFilters.java:
## @@ -137,9 +135,11 @@ public void testPositionDeletePlanningPath() throws IOException {
       tasks = Lists.newArrayList(tasksIterable);
     }
 
-    Assert.assertEquals("Should produce one task", 1, tasks.size());
+    Assertions.assertThat(1).isEqualTo(tasks.size()).as("Should produce one task");

Review Comment:
please statically import `assertThat()`

## data/src/test/java/org/apache/iceberg/data/DeleteReadTests.java:
## @@ -314,7 +315,7 @@ public void testEqualityDeletesWithRequiredEqColumn() throws IOException {
     StructLikeSet actual = rowSet(tableName, table, "id");
 
     if (expectPruned()) {
-      assertThat(actual).as("Table should contain expected rows").isEqualTo(expected);
+      Assertions.assertThat(actual).as("Table should contain expected rows").isEqualTo(expected);

Review Comment:
please revert changes to this class

## data/src/test/java/org/apache/iceberg/data/TestDataFileIndexStatsFilters.java:
## @@ -163,10 +163,11 @@ public void testPositionDeletePlanningPathFilter() throws IOException {
       tasks = Lists.newArrayList(tasksIterable);
     }
 
-    Assert.assertEquals("Should produce one task", 1, tasks.size());
+    Assertions.assertThat(tasks.size()).isEqualTo(1).as("Should produce one task");

Review Comment:
same as above

## data/src/test/java/org/apache/iceberg/data/TestLocalScan.java:
## @@ -175,67 +173,110 @@ private void overwriteExistingData() throws IOException {
         .commit();
   }
 
-  private void appendData() throws IOException {
+  private void appendData(Object fileExt) throws IOException {
+    FileFormat fileFormat = FileFormat.fromString(fileExt.toString());
     DataFile file12 =
-        writeFile(sharedTableLocation, format.addExtension("file-12"), file1SecondSnapshotRecords);
+        writeFile(
+            sharedTableLocation, fileFormat.addExtension("file-12"), file1SecondSnapshotRecords);
     DataFile file22 =
-        writeFile(sharedTableLocation, format.addExtension("file-22"), file2SecondSnapshotRecords);
+        writeFile(
+            sharedTableLocation, fileFormat.addExtension("file-22"), file2SecondSnapshotRecords);
     DataFile file32 =
-        writeFile(sharedTableLocation, format.addExtension("file-32"), file3SecondSnapshotRecords);
+        writeFile(
+            sharedTableLocation, fileFormat.addExtension("file-32"), file3SecondSnapshotRecords);
 
     sharedTable.newFastAppend().appendFile(file12).appendFile(file22).appendFile(file32).commit();
 
     DataFile file13 =
-        writeFile(sharedTableLocation, format.addExtension("file-13"), file1ThirdSnapshotRecords);
+        writeFile(
+            sharedTableLocation, fileFormat.addExtension("file-13"), file1ThirdSnapshotRecords);
     DataFile file23 =
-        writeFile(sharedTableLocation, format.addExtension("file-23"), file2ThirdSnapshotRecords);
+        writeFile(
+            sharedTableLocation, fileFormat.addExtension("file-23"), file2ThirdSnapshotRecords);
     DataFile file33 =
-        writeFile(sharedTableLocation, format.addExtension("file-33"), file3ThirdSnapshotRecords);
+        writeFile(
+            sharedTableLocation, fileFormat.addExtension("file-33"), file3ThirdSnapshotRecords);
 
     sharedTable.newFastAppend().appendFile(file13).appendFile(file23).appendFile(file33).commit();
   }
 
-  @Before
+  @BeforeEach
   public void createTables() throws IOException {
-    File location = temp.newFolder("shared");
-    Assert.assertTrue(location.delete());
+    File location = temp;
+    Assertions.assertThat(location.delete()).isTrue();
     this.sharedTableLocation = location.toString();
     this.sharedTable =
         TABLES.create(
             SCHEMA,
             PartitionSpec.unpartitioned(),
-            ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, format.name()),
+            ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, temp.getName()),
             sharedTableLocation);
 
     Record record = GenericRecord.create(SCHEMA);
 
-    DataFile file1 =
-        writeFile(sharedTableLocation, format.addExtension("file-1"), file1FirstSnapshotRecords);
-
-    Record nullData = record.copy();
-    nullData.setField("id", 11L);
-    nullData.setField("data", null);
-
-    DataFile file2 =
-        writeFile(sharedTableLocation, format.addExtension("file-2"), file2FirstSnapshotRecords);
-    DataFile file3 =
-        writeFile(sharedTableLocation, format.addExtension("file-3"), file3FirstSnapshotRecords);
-
-    // commit the test data
-    sharedTable.newAppend().appendFile(file1).appendFile(file2).appendFile(file3).commit();
+    String[] format = {"parquet", "orc", "avro"};
+    Arrays.stream(format)
+        .forEach(
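For readers following the review: with `assertThat` statically imported, the first assertion above could read as below. This is an editorial sketch, not part of the PR; note that in AssertJ the actual value belongs inside `assertThat()`, and `as()` only takes effect when it comes before the assertion call.

```java
import static org.assertj.core.api.Assertions.assertThat;

// Description first, then the assertion; 'tasks' is the list built in the test.
assertThat(tasks).as("Should produce one task").hasSize(1);
```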
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
tomtongue commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555328887

## aws/src/integration/java/org/apache/iceberg/aws/dynamodb/TestDynamoDbCatalog.java:
## @@ -99,22 +100,23 @@ public void testCreateNamespace() {
             .tableName(catalogTableName)
             .key(DynamoDbCatalog.namespacePrimaryKey(namespace))
             .build());
-    Assert.assertTrue("namespace must exist", response.hasItem());
-    Assert.assertEquals(
-        "namespace must be stored in DynamoDB",
-        namespace.toString(),
-        response.item().get("namespace").s());
-    Assertions.assertThatThrownBy(() -> catalog.createNamespace(namespace))
+    assertThat(response.hasItem()).as("namespace must exist").isTrue();
+    assertThat(response.item())
+        .as("namespace must be stored in DynamoDB")
+        .hasEntrySatisfying(
+            "namespace",
+            attributeValue -> assertThat(attributeValue.s()).isEqualTo(namespace.toString()));

Review Comment:
Thanks for the suggestion. I think your suggestion is fine, but my change also checks whether `response.item()` actually contains the `namespace` key. If the value is extracted directly with `get("namespace")`, that containment is never checked. So to follow your suggestion, it would have to be written as:
```
assertThat(response.item()).containsKey("namespace");
assertThat(response.item().get("namespace").s())...
```
That's why I'm using `hasEntrySatisfying` here. If I'm wrong here, please correct me.
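Spelled out, the two-step alternative from the comment above would look like this (an editorial completion of the elided assertion, inferred from the `hasEntrySatisfying` variant shown in the diff):

```java
// Check key presence first, then compare the stored value.
assertThat(response.item()).containsKey("namespace");
assertThat(response.item().get("namespace").s()).isEqualTo(namespace.toString());
```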
Re: [I] Hive Catalog cannot create table with TimestamptzType field [iceberg-python]
HonahX closed issue #583: Hive Catalog cannot create table with TimestamptzType field URL: https://github.com/apache/iceberg-python/issues/583
Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]
HonahX merged PR #585: URL: https://github.com/apache/iceberg-python/pull/585
Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]
HonahX commented on PR #585: URL: https://github.com/apache/iceberg-python/pull/585#issuecomment-2042016912 @Fokko Thanks for reviewing!
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
tomtongue commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555324898

## aws/src/integration/java/org/apache/iceberg/aws/lakeformation/TestLakeFormationAwsClientFactory.java:
## @@ -165,7 +165,8 @@ public void testLakeFormationEnabledGlueCatalog() throws Exception {
       glueCatalog.createNamespace(allowedNamespace);
     } catch (GlueException e) {
       LOG.error("fail to create Glue database", e);
-      Assert.fail("create namespace should succeed");
+
+      fail("create namespace should succeed");

Review Comment:
Thanks. Will update this line.
[I] Iceberg with PARTITIONED BY (days(ts)) causes wrong partition name [iceberg]
tcguanshuhuai opened a new issue, #10102: URL: https://github.com/apache/iceberg/issues/10102

### Apache Iceberg version
1.5.0 (latest release)

### Query engine
Spark

### Please describe the bug
Step 1: start Spark SQL with Iceberg
```
spark-sql --jars /root/myjars/iceberg-spark-runtime-3.3_2.12-1.5.0.jar \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_prod.type=hive \
  --conf spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hadoop_prod.type=hadoop \
  --conf spark.sql.catalog.hadoop_prod.warehouse=hdfs://ns:8020/user/iceberg/warehouse/hadoop_prod
```

Step 2: create an Iceberg table
```
CREATE TABLE hadoop_prod.iceberg_db.sample_hidden_partition (
  id bigint,
  data string,
  ts timestamp)
USING iceberg
PARTITIONED BY (days(ts));
```

Step 3: insert a record
```
-- Timestamp 1718899200 is 2024-06-21 00:00:00
insert into table hadoop_prod.iceberg_db.sample_hidden_partition values(2, '2021-01-02', cast(1718899200 as timestamp));
```

A SELECT returns the correct value:
```
spark-sql (default)> select * from hadoop_prod.iceberg_db.sample_hidden_partition;
id	data	ts
2	2021-01-02	2024-06-21 00:00:00
```

However, the HDFS path is wrong:
```
spark-sql (default)> dfs -ls /user/iceberg/warehouse/hadoop_prod/iceberg_db/sample_hidden_partition/data;
```
**Expected result:**
```
drwxr-xr-x   - root supergroup          0 2024-04-08 14:36 /user/iceberg/warehouse/hadoop_prod/iceberg_db/sample_hidden_partition/data/ts_day=2024-06-21
```
**Actual result:**
```
drwxr-xr-x   - root supergroup          0 2024-04-08 14:36 /user/iceberg/warehouse/hadoop_prod/iceberg_db/sample_hidden_partition/data/ts_day=2024-06-20
```
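An editorial note that may explain the discrepancy (not from the original report): 1718899200 epoch seconds is 2024-06-21 00:00:00 only in a UTC+8 session timezone; in UTC it is 2024-06-20 16:00:00, and Iceberg's `days(ts)` transform partitions on the UTC day while Spark renders timestamps in the session timezone. A quick check in plain Java, assuming an Asia/Shanghai session:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class PartitionDayCheck {
  public static void main(String[] args) {
    Instant ts = Instant.ofEpochSecond(1_718_899_200L);
    // Day used by Iceberg's days() transform (UTC): 2024-06-20
    System.out.println(ts.atZone(ZoneOffset.UTC).toLocalDate());
    // Day as displayed in a UTC+8 Spark session: 2024-06-21
    System.out.println(ts.atZone(ZoneId.of("Asia/Shanghai")).toLocalDate());
  }
}
```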
Re: [PR] Migrate AWS tests to JUnit5 [iceberg]
nastra commented on code in PR #10086: URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555315428

## aws/src/integration/java/org/apache/iceberg/aws/lakeformation/TestLakeFormationAwsClientFactory.java:
## @@ -165,7 +165,8 @@ public void testLakeFormationEnabledGlueCatalog() throws Exception {
       glueCatalog.createNamespace(allowedNamespace);
     } catch (GlueException e) {
       LOG.error("fail to create Glue database", e);
-      Assert.fail("create namespace should succeed");
+
+      fail("create namespace should succeed");

Review Comment:
rather than using `fail` with a try-catch, it's better to have something like `assertThatNoException().isThrownBy(() -> glueCatalog.createNamespace(allowedNamespace))`
Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]
Fokko commented on code in PR #585: URL: https://github.com/apache/iceberg-python/pull/585#discussion_r1555301526

## mkdocs/docs/configuration.md:
## @@ -232,6 +232,18 @@ catalog:
     s3.secret-access-key: password
 ```
 
+In case of Hive 2.x:
+
+```yaml
+catalog:
+  default:
+    uri: thrift://localhost:9083
+    hive.hive2-compatible: true
+    s3.endpoint: http://localhost:9000
+    s3.access-key-id: admin
+    s3.secret-access-key: password
+```

Review Comment:
````suggestion
When using Hive 2.x, make sure to set the compatibility flag:

```yaml
catalog:
  default:
    ...
    hive.hive2-compatible: true
```
````
Re: [I] Implement incremental update using commit stats (SnapshotSummary) [iceberg]
ajantha-bhat commented on issue #8461: URL: https://github.com/apache/iceberg/issues/8461#issuecomment-2041964709

Based on the experiments in https://github.com/apache/iceberg/pull/9437, a Spark action is not effective because the serialization cost of each partition stats entry is expensive. Instead, I will implement an API in the core module to compute stats in a distributed way, plus a Spark procedure that calls this API.
Re: [I] Implement incremental update using commit stats (SnapshotSummary) [iceberg]
ajantha-bhat closed issue #8461: Implement incremental update using commit stats (SnapshotSummary) URL: https://github.com/apache/iceberg/issues/8461
Re: [I] Implement `Closeable` interface for `HiveCatalog` and `HiveClientPool` [iceberg]
yuqi1129 commented on issue #10100: URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2041959506

@pvary Thank you for your kind reply. The reason we need to close the `HiveCatalog` and `HiveClientPool` is that we are working on a metadata management system, https://github.com/datastrato/gravitino, and Gravitino manages several catalogs. Each Iceberg catalog goes through the following operations:

1. Create an Iceberg catalog. In this step, Gravitino uses an `IsolatedClassloader` to load the classes that belong to the Iceberg catalog.
2. Run `LoadTable`, `LoadSchema`, and other operations under that classloader.
3. Finally, drop the Iceberg catalog when it's no longer needed, so we need to close the catalog, close the classloader, and release the associated resources.

However, due to a daemon thread introduced by `CachedClientPool`, the classloader can't be garbage-collected, which eventually leads to an OOM.

```java
// Code block from CachedClientPool#init
clientPoolCache =
    Caffeine.newBuilder()
        .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
        .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
        .scheduler(
            // This daemon thread can't be explicitly closed.
            Scheduler.forScheduledExecutorService(
                ThreadPools.newScheduledPool("hive-metastore-cleaner", 1)))
        .build();
```
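One way such a leak is typically avoided (an editorial sketch of the general pattern, not a proposed Iceberg patch; class and field names are illustrative): own the scheduler's executor explicitly so that closing the pool cache also shuts the cleaner thread down, letting the classloader become collectible.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
import com.github.benmanes.caffeine.cache.Scheduler;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class CloseablePoolCache<P extends AutoCloseable> implements AutoCloseable {
  // The executor is held in a field, not created anonymously, so close() can stop it.
  private final ScheduledExecutorService cleaner =
      Executors.newSingleThreadScheduledExecutor(
          task -> {
            Thread thread = new Thread(task, "client-pool-cleaner");
            thread.setDaemon(true);
            return thread;
          });

  private final Cache<String, P> cache =
      Caffeine.newBuilder()
          .expireAfterAccess(5, TimeUnit.MINUTES)
          .removalListener(
              (String key, P pool, RemovalCause cause) -> {
                try {
                  if (pool != null) {
                    pool.close();
                  }
                } catch (Exception e) {
                  // log and continue; eviction must not throw
                }
              })
          .scheduler(Scheduler.forScheduledExecutorService(cleaner))
          .build();

  @Override
  public void close() {
    cache.invalidateAll(); // closes cached pools via the removal listener
    cleaner.shutdownNow(); // stops the cleaner thread that pins the classloader
  }
}
```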