Re: [I] [JDBC Catalog] Table commit fails if iceberg_type field is NULL [iceberg]

2024-04-08 Thread via GitHub


jbonofre commented on issue #10046:
URL: https://github.com/apache/iceberg/issues/10046#issuecomment-2044198936

   @jurossiar Thanks for the report. It's normal that the `iceberg_type` column is added and the records are updated with `NULL` if you define `jdbc.schema-version=V1` (`V0` "preserves" the "old" schema). However, if you don't explicitly define `V1`, the schema should not be updated. That's the issue I saw: the catalog tries to update the schema by default.
   
   I'm testing a fix today.
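
   For anyone needing a workaround until the fix lands, here is a minimal, hedged sketch of pinning the catalog to the legacy schema via the `jdbc.schema-version` property; the URI and surrounding setup are illustrative assumptions, not taken from the report:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Hedged sketch: catalog properties that keep the "old" V0 schema for
   // iceberg_tables, so the catalog should not attempt the V1 migration.
   public class JdbcCatalogV0Props {
     public static void main(String[] args) {
       Map<String, String> props = new HashMap<>();
       props.put("uri", "jdbc:postgresql://localhost:5432/iceberg"); // illustrative
       props.put("jdbc.schema-version", "V0"); // preserve the legacy schema
       props.forEach((k, v) -> System.out.println(k + "=" + v));
     }
   }
   ```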





Re: [PR] feat: init iceberg writer [iceberg-rust]

2024-04-08 Thread via GitHub


ZENOTME commented on code in PR #275:
URL: https://github.com/apache/iceberg-rust/pull/275#discussion_r1556935437


##
crates/iceberg/src/writer/base_writer/data_file_writer.rs:
##
@@ -0,0 +1,323 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module provides `DataFileWriter`.
+
+use crate::spec::{DataContentType, DataFile, Struct};
+use crate::writer::file_writer::FileWriter;
+use crate::writer::CurrentFileStatus;
+use crate::writer::{file_writer::FileWriterBuilder, IcebergWriter, IcebergWriterBuilder};
+use crate::Result;
+use arrow_array::RecordBatch;
+use itertools::Itertools;
+
+/// Builder for `DataFileWriter`.
+#[derive(Clone)]
+pub struct DataFileWriterBuilder<B: FileWriterBuilder> {
+    inner: B,
+}
+
+impl<B: FileWriterBuilder> DataFileWriterBuilder<B> {
+    /// Create a new `DataFileWriterBuilder` using a `FileWriterBuilder`.
+    pub fn new(inner: B) -> Self {
+        Self { inner }
+    }
+}
+
+/// Config for `DataFileWriter`.
+pub struct DataFileWriterConfig {
+    partition_value: Struct,
+}
+
+impl DataFileWriterConfig {
+    /// Create a new `DataFileWriterConfig` with partition value.
+    pub fn new(partition_value: Option<Struct>) -> Self {
+        Self {
+            partition_value: partition_value.unwrap_or(Struct::empty()),
+        }
+    }
+}
+
+#[async_trait::async_trait]
+impl<B: FileWriterBuilder> IcebergWriterBuilder for DataFileWriterBuilder<B> {
+    type R = DataFileWriter<B>;
+    type C = DataFileWriterConfig;
+
+    async fn build(self, config: Self::C) -> Result<Self::R> {
+        Ok(DataFileWriter {
+            inner_writer: self.inner.clone().build().await?,
+            builder: self.inner,
+            partition_value: config.partition_value,
+        })
+    }
+}
+
+/// A writer that writes data within one spec/partition.
+pub struct DataFileWriter<B: FileWriterBuilder> {
+    builder: B,
+    inner_writer: B::R,
+    partition_value: Struct,
+}
+
+#[async_trait::async_trait]
+impl<B: FileWriterBuilder> IcebergWriter for DataFileWriter<B> {
+    async fn write(&mut self, batch: RecordBatch) -> Result<()> {
+        self.inner_writer.write(&batch).await
+    }
+
+    async fn flush(&mut self) -> Result<Vec<DataFile>> {
+        let writer = std::mem::replace(&mut self.inner_writer, self.builder.clone().build().await?);

Review Comment:
   Because `FileWriter` only provides the interface `close(self)`, which keeps it simple. To provide the interface `flush(&mut self)`, we create a new `FileWriter` each time we flush.
   
   A `FileWriter` may write the data into one or multiple files; there is no restriction on that.
   
   > What will happen if a user calls `flush` on the same file twice?
   
   It's safe. The semantics of `flush` are to flush the data written so far to storage and generate the files.
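   
   To make the ownership constraint concrete, here is a minimal, runnable Rust sketch of the flush-by-replace pattern described above; the types are stand-ins, not the actual iceberg-rust API:
   
   ```rust
   // Stand-in for a FileWriter whose close(self) consumes the writer.
   struct FakeFileWriter {
       rows: usize,
   }
   
   impl FakeFileWriter {
       fn new() -> Self {
           Self { rows: 0 }
       }
   
       fn write(&mut self, batch_rows: usize) {
           self.rows += batch_rows;
       }
   
       // Like `FileWriter::close(self)`: consumes the writer, returns the files.
       fn close(self) -> Vec<String> {
           vec![format!("data-file-with-{}-rows", self.rows)]
       }
   }
   
   struct DataFileWriterSketch {
       inner: FakeFileWriter,
   }
   
   impl DataFileWriterSketch {
       // `flush(&mut self)` can't call `close(self)` on a borrowed field, so it
       // swaps in a fresh inner writer and closes the one it replaced.
       fn flush(&mut self) -> Vec<String> {
           let old = std::mem::replace(&mut self.inner, FakeFileWriter::new());
           old.close()
       }
   }
   
   fn main() {
       let mut writer = DataFileWriterSketch { inner: FakeFileWriter::new() };
       writer.inner.write(100);
       println!("{:?}", writer.flush()); // files for the first 100 rows
       println!("{:?}", writer.flush()); // flushing again is safe: fresh writer, new files
   }
   ```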
   






Re: [I] iceberg-flink: Switch tests to JUnit5 + AssertJ-style assertions [iceberg]

2024-04-08 Thread via GitHub


tomtongue commented on issue #9087:
URL: https://github.com/apache/iceberg/issues/9087#issuecomment-2044183666

   @nastra If no one is assigned to this, could you assign it to me? I believe there are three remaining tasks for completing the migration: flink, the other spark versions, and data.





Re: [PR] feat: init iceberg writer [iceberg-rust]

2024-04-08 Thread via GitHub


Xuanwo commented on code in PR #275:
URL: https://github.com/apache/iceberg-rust/pull/275#discussion_r1556917470


##
crates/iceberg/src/writer/base_writer/data_file_writer.rs:
##
@@ -0,0 +1,323 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module provides `DataFileWriter`.
+
+use crate::spec::{DataContentType, DataFile, Struct};
+use crate::writer::file_writer::FileWriter;
+use crate::writer::CurrentFileStatus;
+use crate::writer::{file_writer::FileWriterBuilder, IcebergWriter, IcebergWriterBuilder};
+use crate::Result;
+use arrow_array::RecordBatch;
+use itertools::Itertools;
+
+/// Builder for `DataFileWriter`.
+#[derive(Clone)]
+pub struct DataFileWriterBuilder<B: FileWriterBuilder> {
+    inner: B,
+}
+
+impl<B: FileWriterBuilder> DataFileWriterBuilder<B> {
+    /// Create a new `DataFileWriterBuilder` using a `FileWriterBuilder`.
+    pub fn new(inner: B) -> Self {
+        Self { inner }
+    }
+}
+
+/// Config for `DataFileWriter`.
+pub struct DataFileWriterConfig {
+    partition_value: Struct,
+}
+
+impl DataFileWriterConfig {
+    /// Create a new `DataFileWriterConfig` with partition value.
+    pub fn new(partition_value: Option<Struct>) -> Self {
+        Self {
+            partition_value: partition_value.unwrap_or(Struct::empty()),
+        }
+    }
+}
+
+#[async_trait::async_trait]
+impl<B: FileWriterBuilder> IcebergWriterBuilder for DataFileWriterBuilder<B> {
+    type R = DataFileWriter<B>;
+    type C = DataFileWriterConfig;
+
+    async fn build(self, config: Self::C) -> Result<Self::R> {
+        Ok(DataFileWriter {
+            inner_writer: self.inner.clone().build().await?,
+            builder: self.inner,
+            partition_value: config.partition_value,
+        })
+    }
+}
+
+/// A writer that writes data within one spec/partition.
+pub struct DataFileWriter<B: FileWriterBuilder> {
+    builder: B,
+    inner_writer: B::R,
+    partition_value: Struct,
+}
+
+#[async_trait::async_trait]
+impl<B: FileWriterBuilder> IcebergWriter for DataFileWriter<B> {
+    async fn write(&mut self, batch: RecordBatch) -> Result<()> {
+        self.inner_writer.write(&batch).await
+    }
+
+    async fn flush(&mut self) -> Result<Vec<DataFile>> {
+        let writer = std::mem::replace(&mut self.inner_writer, self.builder.clone().build().await?);

Review Comment:
   
   Hi, I'm a bit confused about this. Why do we need this? What will happen if a user calls `flush` on the same file twice?






Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]

2024-04-08 Thread via GitHub


yuqi1129 commented on issue #10100:
URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2044152196

   > @yuqi1129: Do I understand correctly, that your main issue is that the 
cleaner threads are remaining, and those are the only ones causing the issue?
   
   Yeah, I want a mechanism to explicitly close them.





Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

2024-04-08 Thread via GitHub


kevinjqliu commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2044152708

   > I would argue that the Python one is correct
   
   Yeah, me too. But I think Java Iceberg doesn't support this, since parquet files with an `ABC-GG-1-A` column will be read as Iceberg column `ABC_x2DGG_x2D1_x2DA`. I think it's worth opening an issue to track this in the main Iceberg repo.
   
   PyIceberg can already read a parquet file with a "sanitized" column name; this is handled by #83. [file_project_schema](https://github.com/apache/iceberg-python/pull/83/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR834), which is the schema with the "sanitized" column name, is used to read the parquet files.
   
   So we just need to ensure that PyIceberg writes parquet files with the same behavior as Java Iceberg.
   
   #590 verifies that fixing the write behavior will fix the entire roundtrip described in the issue.
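   
   As an aside, the `ABC-GG-1-A` -> `ABC_x2DGG_x2D1_x2DA` mapping above is reproducible with a small Python sketch of the assumed sanitization rule (each character that is invalid in an Avro-style name becomes `_x` plus its hex code point; `-` is `0x2D`):
   
   ```python
   # Hedged sketch of the assumed column-name sanitization; the real
   # Java/PyIceberg implementations may differ in edge cases.
   def sanitize_char(c: str) -> str:
       if c.isalnum() or c == "_":
           return c
       return "_x" + format(ord(c), "X")
   
   
   def sanitize_name(name: str) -> str:
       return "".join(sanitize_char(c) for c in name)
   
   
   print(sanitize_name("ABC-GG-1-A"))  # ABC_x2DGG_x2D1_x2DA
   ```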





Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]

2024-04-08 Thread via GitHub


yuqi1129 commented on issue #10100:
URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2044151328

   > Can you use the daemon thread from the jdk (via systemScheduler)?
   
   Then how can I use the `systemScheduler` here? I can't change the following code from an external project, as it's part of an Iceberg internal class.
   
   ```
   // Code block from CachedClientPool#init
   clientPoolCache =
       Caffeine.newBuilder()
           .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
           .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
           .scheduler(
               // This daemon thread can't be explicitly closed.
               Scheduler.forScheduledExecutorService(
                   ThreadPools.newScheduledPool("hive-metastore-cleaner", 1)))
           .build();
   ```
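   
   For reference, a hedged sketch of what the suggestion would look like if applied inside Iceberg itself (which is the point: a downstream project can't make this change). Caffeine's `Scheduler.systemScheduler()` runs on a shared JDK daemon thread, so no dedicated cleaner pool is created; a String cache stands in for `HiveClientPool` here:
   
   ```java
   import com.github.benmanes.caffeine.cache.Cache;
   import com.github.benmanes.caffeine.cache.Caffeine;
   import com.github.benmanes.caffeine.cache.Scheduler;
   import java.util.concurrent.TimeUnit;
   
   // Hedged sketch, not the actual CachedClientPool code.
   public class SystemSchedulerSketch {
     public static void main(String[] args) throws InterruptedException {
       Cache<String, String> cache =
           Caffeine.newBuilder()
               .expireAfterAccess(100, TimeUnit.MILLISECONDS)
               .removalListener((key, value, cause) ->
                   System.out.println("closed " + key + " (" + cause + ")"))
               // Shared JDK daemon thread; nothing for callers to shut down.
               .scheduler(Scheduler.systemScheduler())
               .build();
       cache.put("pool-1", "hive-client-pool");
       Thread.sleep(500); // give the scheduler time to run the eviction
     }
   }
   ```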





Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]

2024-04-08 Thread via GitHub


pvary commented on issue #10100:
URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2044147391

   @yuqi1129: Do I understand correctly that your main issue is that the cleaner threads remain, and that those are the only ones causing the issue?





Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]

2024-04-08 Thread via GitHub


ben-manes commented on issue #10100:
URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2044074963

   Can you use the daemon thread from the JDK (via `systemScheduler`)?





Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-08 Thread via GitHub


rahil-c commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556673446


##
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:
##
@@ -490,12 +522,29 @@ public void createNamespace(
 
   @Override
   public List<Namespace> listNamespaces(SessionContext context, Namespace namespace) {

Review Comment:
   added test






Re: [I] Residuals cannot be applied to dates/times/timestamps read by BaseParquetReaders [iceberg]

2024-04-08 Thread via GitHub


github-actions[bot] closed issue #2253: Residuals cannot be applied to 
dates/times/timestamps read by BaseParquetReaders
URL: https://github.com/apache/iceberg/issues/2253





Re: [I] Unset no longer present table props while propagating props to HMS [iceberg]

2024-04-08 Thread via GitHub


github-actions[bot] closed issue #2249: Unset no longer present table props 
while propagating props to HMS
URL: https://github.com/apache/iceberg/issues/2249





Re: [I] Parquet: Vectorized reads should support reading INT96 timestamps [iceberg]

2024-04-08 Thread via GitHub


github-actions[bot] closed issue #2236: Parquet: Vectorized reads should 
support reading INT96 timestamps
URL: https://github.com/apache/iceberg/issues/2236





Re: [I] NPE while closing OrcFileAppender [iceberg]

2024-04-08 Thread via GitHub


github-actions[bot] commented on issue #2470:
URL: https://github.com/apache/iceberg/issues/2470#issuecomment-2043917692

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.





Re: [I] Residuals cannot be applied to dates/times/timestamps read by BaseParquetReaders [iceberg]

2024-04-08 Thread via GitHub


github-actions[bot] commented on issue #2253:
URL: https://github.com/apache/iceberg/issues/2253#issuecomment-2043917536

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.





Re: [I] Unset no longer present table props while propagating props to HMS [iceberg]

2024-04-08 Thread via GitHub


github-actions[bot] commented on issue #2249:
URL: https://github.com/apache/iceberg/issues/2249#issuecomment-2043917514

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.





Re: [I] Parquet: Vectorized reads should support reading INT96 timestamps [iceberg]

2024-04-08 Thread via GitHub


github-actions[bot] commented on issue #2236:
URL: https://github.com/apache/iceberg/issues/2236#issuecomment-2043917494

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.





Re: [I] Support creating tags [iceberg-python]

2024-04-08 Thread via GitHub


enkidulan commented on issue #573:
URL: https://github.com/apache/iceberg-python/issues/573#issuecomment-2043897402

   >  Are you interested in creating the API for this? :)
   
   I am. I'll try to create a PR this week.





Re: [I] [JDBC Catalog] Table commit fails if iceberg_type field is NULL [iceberg]

2024-04-08 Thread via GitHub


jurossiar commented on issue #10046:
URL: https://github.com/apache/iceberg/issues/10046#issuecomment-2043653883

   Notice: we had the same issue. The `iceberg_tables` table was updated, adding the `iceberg_type` column (with NULL values), even without setting the environment variable `jdbc.schema-version=V1`.





Re: [PR] Docs: Document support for binary in truncate transform [iceberg]

2024-04-08 Thread via GitHub


amogh-jahagirdar commented on PR #10079:
URL: https://github.com/apache/iceberg/pull/10079#issuecomment-2043652460

   Merged. Thanks for addressing this @TheNeuralBit! Thanks for the reviews @Fokko @rdblue.
   





Re: [I] Discrepancy between partition truncation transform in spec vs code (org.apache.iceberg.transforms.Transform) [iceberg]

2024-04-08 Thread via GitHub


amogh-jahagirdar closed issue #5251: Discrepancy between partition truncation 
transform in spec vs code (org.apache.iceberg.transforms.Transform) 
URL: https://github.com/apache/iceberg/issues/5251





Re: [PR] Docs: Document support for binary in truncate transform [iceberg]

2024-04-08 Thread via GitHub


amogh-jahagirdar merged PR #10079:
URL: https://github.com/apache/iceberg/pull/10079





Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-08 Thread via GitHub


rahil-c commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556425249


##
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:
##
@@ -274,14 +277,34 @@ public void setConf(Object newConf) {
   @Override
   public List<TableIdentifier> listTables(SessionContext context, Namespace ns) {
     checkNamespaceIsValid(ns);
+    Map<String, String> queryParams = Maps.newHashMap();

Review Comment:
   Sure, will fix this.






Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-08 Thread via GitHub


rahil-c commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556399763


##
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:
##
@@ -224,6 +226,7 @@ public void initialize(String name, Map<String, String> unresolved) {
       client, tokenRefreshExecutor(), token, expiresAtMillis(mergedProps), catalogAuth);
     }
 
+    this.restPageSize = mergedProps.get(REST_PAGE_SIZE);

Review Comment:
   Will fix this.






[PR] StructType field `required` optional by default [iceberg-python]

2024-04-08 Thread via GitHub


MehulBatra opened a new pull request, #592:
URL: https://github.com/apache/iceberg-python/pull/592

   * Changed the `required` default to `false`
   * Updated test cases to align with recent changes in the codebase (unit + integration)





Re: [PR] Docs: Fix On-screen display issues and minor expressions on Branching and Tagging DDL [iceberg]

2024-04-08 Thread via GitHub


lawofcycles commented on code in PR #10091:
URL: https://github.com/apache/iceberg/pull/10091#discussion_r1556378194


##
docs/docs/spark-ddl.md:
##
@@ -499,17 +500,18 @@ AS OF VERSION 1234
 
 -- CREATE audit-branch at snapshot 1234, retain audit-branch for 31 days, and 
retain the latest 31 days. The latest 3 snapshot snapshots, and 2 days worth of 
snapshots. 
 ALTER TABLE prod.db.sample CREATE BRANCH `audit-branch`
-AS OF VERSION 1234 RETAIN 30 DAYS 
+AS OF VERSION 1234 RETAIN 31 DAYS 

Review Comment:
   I've updated the description






Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-08 Thread via GitHub


danielcweeks commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556370624


##
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:
##
@@ -224,6 +226,7 @@ public void initialize(String name, Map<String, String> unresolved) {
       client, tokenRefreshExecutor(), token, expiresAtMillis(mergedProps), catalogAuth);
     }
 
+    this.restPageSize = mergedProps.get(REST_PAGE_SIZE);

Review Comment:
   We should probably add a validation that this is a positive integer if set.
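   
   A hedged sketch of what that validation could look like; the names and messages are illustrative, not the actual `RESTSessionCatalog` change:
   
   ```java
   // Hedged sketch: reject a configured page size that is not a positive
   // integer; fall back to a default when the property is unset.
   public class PageSizeValidation {
     static int parsePageSize(String value, int defaultPageSize) {
       if (value == null) {
         return defaultPageSize;
       }
       int pageSize;
       try {
         pageSize = Integer.parseInt(value);
       } catch (NumberFormatException e) {
         throw new IllegalArgumentException("Invalid page size: " + value, e);
       }
       if (pageSize <= 0) {
         throw new IllegalArgumentException("Page size must be positive: " + pageSize);
       }
       return pageSize;
     }
   
     public static void main(String[] args) {
       System.out.println(parsePageSize("100", 1000)); // 100
       System.out.println(parsePageSize(null, 1000)); // 1000
     }
   }
   ```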






Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-08 Thread via GitHub


danielcweeks commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1556366558


##
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:
##
@@ -274,14 +277,34 @@ public void setConf(Object newConf) {
   @Override
   public List<TableIdentifier> listTables(SessionContext context, Namespace ns) {
     checkNamespaceIsValid(ns);
+    Map<String, String> queryParams = Maps.newHashMap();

Review Comment:
   I feel like this could be consolidated using a `do ... while (...)`, since we always need to make at least one call. That might simplify things and prevent having two separate call paths.
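   
   For illustration, a hedged sketch of the `do ... while` shape; the paging source and token handling are stand-ins, not the real REST client API:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Hedged sketch of consolidating paginated listing into one do/while loop.
   public class PaginationSketch {
     interface PageSource {
       // Appends one page of results to sink; returns the next page token,
       // or null when there are no more pages.
       String fetchPage(String pageToken, List<String> sink);
     }
   
     static List<String> listAll(PageSource source) {
       List<String> results = new ArrayList<>();
       String pageToken = null;
       do {
         // A single call path: the first request simply carries no token.
         pageToken = source.fetchPage(pageToken, results);
       } while (pageToken != null);
       return results;
     }
   
     public static void main(String[] args) {
       String[][] pages = {{"table1", "table2"}, {"table3"}};
       PageSource source = (token, sink) -> {
         int page = token == null ? 0 : Integer.parseInt(token);
         for (String t : pages[page]) {
           sink.add(t);
         }
         return page + 1 < pages.length ? String.valueOf(page + 1) : null;
       };
       System.out.println(listAll(source)); // [table1, table2, table3]
     }
   }
   ```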






Re: [PR] Docs: Added Upsolver to vendor list [iceberg]

2024-04-08 Thread via GitHub


jasonf20 commented on code in PR #10096:
URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556353763


##
site/docs/vendors.md:
##
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query 
engine](https://trino.io
 ### [Tabular](https://tabular.io)
 
 [Tabular](https://tabular.io/product/) is a managed warehouse and automation 
platform. Tabular offers a central store for analytic data that can be used 
with any query engine or processing framework that supports Iceberg. Tabular 
warehouses add role-based access control and automatic optimization, 
clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a streaming data ingestion and table 
management solution for Apache Iceberg. With Upsolver, users can easily ingest 
batch and streaming data from files, streams and databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing REST and Hive catalogs, and 
[analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.

Review Comment:
   changed






Re: [PR] Docs: Added Upsolver to vendor list [iceberg]

2024-04-08 Thread via GitHub


bitsondatadev commented on code in PR #10096:
URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556290485


##
site/docs/vendors.md:
##
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query 
engine](https://trino.io
 ### [Tabular](https://tabular.io)
 
 [Tabular](https://tabular.io/product/) is a managed warehouse and automation 
platform. Tabular offers a central store for analytic data that can be used 
with any query engine or processing framework that supports Iceberg. Tabular 
warehouses add role-based access control and automatic optimization, 
clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a streaming data ingestion and table 
management solution for Apache Iceberg. With Upsolver, users can easily ingest 
batch and streaming data from files, streams and databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing REST and Hive catalogs, and 
[analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.

Review Comment:
   Based on the recommendations of the PMC, we also would like to remove any 
utm codes please.
   
   ```suggestion
   [Upsolver](https://upsolver.com) is a streaming data ingestion and table 
management solution for Apache Iceberg. With Upsolver, users can easily ingest 
batch and streaming data from files, streams and databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing REST and Hive catalogs, and 
[analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp) or can be deployed in your 
AWS VPC.
   ```







Re: [PR] Validate overwrite filter [iceberg-python]

2024-04-08 Thread via GitHub


Fokko commented on PR #582:
URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2043480905

   > If we wanted to handle the validation only in the delete function by checking if we would end up rewriting files, above pattern would succeed by deleting level = 'INFO' and dt = '2024-02-01' because this deletion is a pure metadata operation.
   
   This is what I tried to explain in the comment earlier above: sometimes you have to do rewrites, because there was a different partitioning strategy before under which some rows still match. I'm adding that currently in https://github.com/apache/iceberg-python/pull/569. A table can have older manifests that are still written using an older partition spec.
   
   > Static overwrite on the other hand, would eagerly validate the predicate expression against the table schema, and the values in the arrow table and throw instead.
   
   I missed this part. We can add it, but I would say that it is up to the user. To simplify, this means doing this additional check (pseudocode):
   
   ```python
   def overwrite(df: pa.Table, overwrite_filter: Union[str, BooleanExpression]) -> None:
       row_filter = _parse_row_filter(overwrite_filter)  # Turns the str into a boolean expression
       pa_row_filter = expression_to_pyarrow(row_filter)
       num_invalid_rows = len(df) - len(df.filter(pa_row_filter))
       if num_invalid_rows > 0:
           raise ValueError(f"Found {num_invalid_rows} rows that don't match the overwrite predicate")
   ```





Re: [PR] Validate overwrite filter [iceberg-python]

2024-04-08 Thread via GitHub


syun64 commented on PR #582:
URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2043441654

   Hi @Fokko @adrianqin I think the goal of this PR is to draw a distinction between the semantics of a 'static overwrite' onto a partitioned table and those of a 'delete' + 'append'.
   
   As we discussed in the last [PyIceberg Sync](https://docs.google.com/document/d/1FnWwTJ5VBKdsvklPxt4r1Lb06DB2fFMJG3HITeafrU8/edit#heading=h.g2i9mi9crf0m), if we believe that a partitioned table overwrite should maintain the expectation of comparing the values provided in the overwrite_filter to the values provided in the arrow table, I think we'd need these validations. We would first need to run these validations on the predicate expression provided in the overwrite_filter so that we can then compare the values in the arrow table.
   
   In the community sync, we discussed whether the community was in favor of drawing a distinction between `delete + append` and `overwrite`, and I think we all gravitated _somewhat_ towards the latter.
   
   For example, where dt and level are both partition columns:
   ```
   overwrite_filter = "level = 'INFO' AND dt = '2024-03-25'"
   
   # expected behavior -> deletes partition level = 'INFO' and dt = '2024-03-26'
   
   df = pa.Table.from_pylist(
       [
           {"level": "INFO", "msg": "hi", "dt": date(2024, 3, 26)},
           {"level": "ERROR", "msg": "bye", "dt": date(2024, 3, 26)},
       ],
   )
   
   tbl.overwrite(df, overwrite_filter)
   ```
   
   If we wanted to handle the validation only in the `delete` function by checking whether we would end up rewriting files, the above pattern would succeed by deleting level = 'INFO' and dt = '2024-02-01', because these are pure metadata operations. And then we would add new data files for level = 'INFO' and dt = '2024-03-26' & level = 'ERROR' and dt = '2024-03-26'.
   
   Static overwrite, on the other hand, would eagerly validate the predicate expression against the table schema and the values in the arrow table, and throw instead.





Re: [PR] Docs: Added Upsolver to vendor list [iceberg]

2024-04-08 Thread via GitHub


jasonf20 commented on code in PR #10096:
URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556236365


##
site/docs/vendors.md:
##
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query 
engine](https://trino.io
 ### [Tabular](https://tabular.io)
 
 [Tabular](https://tabular.io/product/) is a managed warehouse and automation 
platform. Tabular offers a central store for analytic data that can be used 
with any query engine or processing framework that supports Iceberg. Tabular 
warehouses add role-based access control and automatic optimization, 
clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming 
data ingestion and table management solution for Apache Iceberg. With Upsolver, 
users can easily ingest batch and streaming data from files, streams and 
databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing lakes cataloged by Tabular or 
AWS Glue Data Catalog, and [analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.

Review Comment:
   Currently Upsolver doesn't support JDBC / Nessie Iceberg catalogs. I agree that the generic phrasing doesn't necessarily imply we support all catalogs, though.






Re: [PR] Docs: Added Upsolver to vendor list [iceberg]

2024-04-08 Thread via GitHub


jasonf20 commented on code in PR #10096:
URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556232872


##
site/docs/vendors.md:
##
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query 
engine](https://trino.io
 ### [Tabular](https://tabular.io)
 
 [Tabular](https://tabular.io/product/) is a managed warehouse and automation 
platform. Tabular offers a central store for analytic data that can be used 
with any query engine or processing framework that supports Iceberg. Tabular 
warehouses add role-based access control and automatic optimization, 
clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming 
data ingestion and table management solution for Apache Iceberg. With Upsolver, 
users can easily ingest batch and streaming data from files, streams and 
databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing lakes cataloged by Tabular or 
AWS Glue Data Catalog, and [analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.

Review Comment:
Sorry, I pushed to the wrong repository; the fix should be there now.






[PR] [draft] Use Parquet's getRowIndexOffset support instead of calculating it [iceberg]

2024-04-08 Thread via GitHub


wypoon opened a new pull request, #10107:
URL: https://github.com/apache/iceberg/pull/10107

   (no comment)





Re: [PR] Manifest list encryption [iceberg]

2024-04-08 Thread via GitHub


RussellSpitzer commented on code in PR #7770:
URL: https://github.com/apache/iceberg/pull/7770#discussion_r1556171156


##
core/src/main/java/org/apache/iceberg/BaseSnapshot.java:
##
@@ -90,7 +127,9 @@ class BaseSnapshot implements Snapshot {
 this.summary = summary;
 this.schemaId = schemaId;
 this.manifestListLocation = null;
+this.manifestListSize = 0L;
 this.v1ManifestLocations = v1ManifestLocations;
+this.manifestListKeyMetadata = null;

Review Comment:
   Same as comment above (group manifest vars)






Re: [PR] Docs: Added Upsolver to vendor list [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat commented on code in PR #10096:
URL: https://github.com/apache/iceberg/pull/10096#discussion_r1556087620


##
site/docs/vendors.md:
##
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query 
engine](https://trino.io
 ### [Tabular](https://tabular.io)
 
 [Tabular](https://tabular.io/product/) is a managed warehouse and automation 
platform. Tabular offers a central store for analytic data that can be used 
with any query engine or processing framework that supports Iceberg. Tabular 
warehouses add role-based access control and automatic optimization, 
clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming 
data ingestion and table management solution for Apache Iceberg. With Upsolver, 
users can easily ingest batch and streaming data from files, streams and 
databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing lakes cataloged by Tabular or 
AWS Glue Data Catalog, and [analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.

Review Comment:
   @bitsondatadev is right. @jasonf20 didn't update the PR (maybe forgot to push).
   
   I do have a question about this statement, "Upsolver connects to your existing lakes cataloged by Tabular or AWS Glue Data Catalog", and the suggested statement with "REST and Hive catalogs":
   
   Does this mean Upsolver doesn't work with the other native Iceberg catalogs like JDBC, Nessie, and HMS? If we are using the generic catalog API, then it should work with all catalogs, and we can simplify the statement to "Upsolver connects to your existing Iceberg catalog".






[I] Spark procedure to compute partition stats. [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat opened a new issue, #10106:
URL: https://github.com/apache/iceberg/issues/10106

   ### Feature Request / Improvement
   
   Based on the experiments from https://github.com/apache/iceberg/pull/9437, a Spark action is not effective, as the serialization cost of each partition stats entry is expensive. We need a table API in the core module to compute stats in a distributed way.
   
   But we still need a SQL way to compute the partition stats. Hence, we will be calling the core API via a SQL call procedure.
   
   ### Query engine
   
   None





[I] Add a table API to compute partition stats. [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat opened a new issue, #10105:
URL: https://github.com/apache/iceberg/issues/10105

   ### Feature Request / Improvement
   
   Based on the experiments from https://github.com/apache/iceberg/pull/9437, a Spark action is not effective, as the serialization cost of each partition stats entry is expensive.
   
   We need a table API in the core module to compute stats in a distributed way. The local algorithm will be similar to the POC from https://github.com/apache/iceberg/pull/9437.
   
   ### Query engine
   
   None





[PR] Docs: Update releases.md for Spark scala versions [iceberg]

2024-04-08 Thread via GitHub


liko9 opened a new pull request, #10104:
URL: https://github.com/apache/iceberg/pull/10104

   I found the Releases page for downloads quite confusing regarding the versions involved, so I updated the language and links to be clearer.





Re: [PR] Docs: Added Upsolver to vendor list [iceberg]

2024-04-08 Thread via GitHub


bitsondatadev commented on code in PR #10096:
URL: https://github.com/apache/iceberg/pull/10096#discussion_r1555998294


##
site/docs/vendors.md:
##
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query 
engine](https://trino.io
 ### [Tabular](https://tabular.io)
 
 [Tabular](https://tabular.io/product/) is a managed warehouse and automation 
platform. Tabular offers a central store for analytic data that can be used 
with any query engine or processing framework that supports Iceberg. Tabular 
warehouses add role-based access control and automatic optimization, 
clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming 
data ingestion and table management solution for Apache Iceberg. With Upsolver, 
users can easily ingest batch and streaming data from files, streams and 
databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing lakes cataloged by Tabular or 
AWS Glue Data Catalog, and [analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.

Review Comment:
   Hey @jasonf20 im on my phone so maybe it's not refreshing but I don't see 
the update. I'll be back at my computer shortly to check on this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]

2024-04-08 Thread via GitHub


yuqi1129 commented on issue #10100:
URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2042946236

   > @yuqi1129: Could setting the `client.pool.cache.eviction-interval-ms`, and 
decreasing the `clients` size help your case?
   
   I'm afraid it won't work for me. Decreasing 
`client.pool.cache.eviction-interval-ms` does not change the fact that the 
`hive-metastore-cleaner` thread pool is a daemon thread; it is only closed 
automatically when the JVM exits. We can't close it from an isolated class 
loader because we won't be able to get its reference. 
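   
   For illustration, a minimal sketch of what a closeable pool could look 
like; the `scheduler` field is a hypothetical stand-in for the 
`hive-metastore-cleaner` executor, and this is not the actual Iceberg 
implementation:
   
   ```java
   import java.io.Closeable;
   import java.util.concurrent.ScheduledExecutorService;
   
   // Sketch only: a pool that owns its cleaner executor and releases its
   // threads explicitly instead of relying on daemon threads dying with
   // the JVM.
   class CloseableClientPool implements Closeable {
     // Hypothetical stand-in for the "hive-metastore-cleaner" scheduler.
     private final ScheduledExecutorService scheduler;
   
     CloseableClientPool(ScheduledExecutorService scheduler) {
       this.scheduler = scheduler;
     }
   
     @Override
     public void close() {
       // Lets callers in isolated class loaders release the threads
       // deterministically, e.g. via try-with-resources.
       scheduler.shutdownNow();
     }
   }
   ```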


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Extend HTTPClient Builder to allow setting a proxy server [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10052:
URL: https://github.com/apache/iceberg/pull/10052#discussion_r1555942482


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -121,6 +128,103 @@ public void testHeadFailure() throws 
JsonProcessingException {
 testHttpMethodOnFailure(HttpMethod.HEAD);
   }
 
+  @Test
+  public void testProxyServer() throws IOException {
+int proxyPort = 1070;
+String proxyHostName = "localhost";
+try (ClientAndServer proxyServer = startClientAndServer(proxyPort);
+RESTClient clientWithProxy =
+HTTPClient.builder(ImmutableMap.of())
+.uri(URI)
+.withProxy(proxyHostName, proxyPort)
+.build()) {
+  String path = "v1/config";
+
+  HttpRequest mockRequest =
+  request("/" + 
path).withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+
+  HttpResponse mockResponse = response().withStatusCode(200);
+
+  mockServer.when(mockRequest).respond(mockResponse);
+  proxyServer.when(mockRequest).respond(mockResponse);
+
+  clientWithProxy.head(path, ImmutableMap.of(), (onError) -> {});
+  proxyServer.verify(mockRequest, VerificationTimes.exactly(1));
+}
+  }
+
+  @Test
+  public void testProxyCredentialProviderWithoutProxyServerFailsBuild() throws 
IOException {

Review Comment:
   ```suggestion
 public void testProxyCredentialProviderWithoutProxyServer() throws 
IOException {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Extend HTTPClient Builder to allow setting a proxy server [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10052:
URL: https://github.com/apache/iceberg/pull/10052#discussion_r1555940535


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -121,6 +128,103 @@ public void testHeadFailure() throws 
JsonProcessingException {
 testHttpMethodOnFailure(HttpMethod.HEAD);
   }
 
+  @Test
+  public void testProxyServer() throws IOException {
+int proxyPort = 1070;
+String proxyHostName = "localhost";
+try (ClientAndServer proxyServer = startClientAndServer(proxyPort);
+RESTClient clientWithProxy =
+HTTPClient.builder(ImmutableMap.of())
+.uri(URI)
+.withProxy(proxyHostName, proxyPort)
+.build()) {
+  String path = "v1/config";
+
+  HttpRequest mockRequest =
+  request("/" + 
path).withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+
+  HttpResponse mockResponse = response().withStatusCode(200);
+
+  mockServer.when(mockRequest).respond(mockResponse);
+  proxyServer.when(mockRequest).respond(mockResponse);
+
+  clientWithProxy.head(path, ImmutableMap.of(), (onError) -> {});
+  proxyServer.verify(mockRequest, VerificationTimes.exactly(1));
+}
+  }
+
+  @Test
+  public void testProxyCredentialProviderWithoutProxyServerFailsBuild() throws 
IOException {
+int proxyPort = 1070;
+String proxyHostName = "localhost";
+String authorizedUsername = "test-username";

Review Comment:
   these can be inlined
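   
   For instance, a rough sketch of the inlined form, reusing the values from 
this test:
   
   ```java
   // Sketch of the inlined form; the values are the ones used in this test.
   BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider();
   credentialsProvider.setCredentials(
       new AuthScope(new HttpHost("localhost", 1070)),
       new UsernamePasswordCredentials("test-username",
           "test-password".toCharArray()));
   ```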



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Extend HTTPClient Builder to allow setting a proxy server [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10052:
URL: https://github.com/apache/iceberg/pull/10052#discussion_r1555939815


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -121,6 +128,103 @@ public void testHeadFailure() throws 
JsonProcessingException {
 testHttpMethodOnFailure(HttpMethod.HEAD);
   }
 
+  @Test
+  public void testProxyServer() throws IOException {
+int proxyPort = 1070;
+String proxyHostName = "localhost";
+try (ClientAndServer proxyServer = startClientAndServer(proxyPort);
+RESTClient clientWithProxy =
+HTTPClient.builder(ImmutableMap.of())
+.uri(URI)
+.withProxy(proxyHostName, proxyPort)
+.build()) {
+  String path = "v1/config";
+
+  HttpRequest mockRequest =
+  request("/" + 
path).withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+
+  HttpResponse mockResponse = response().withStatusCode(200);
+
+  mockServer.when(mockRequest).respond(mockResponse);
+  proxyServer.when(mockRequest).respond(mockResponse);
+
+  clientWithProxy.head(path, ImmutableMap.of(), (onError) -> {});
+  proxyServer.verify(mockRequest, VerificationTimes.exactly(1));
+}
+  }
+
+  @Test
+  public void testProxyCredentialProviderWithoutProxyServerFailsBuild() throws 
IOException {
+int proxyPort = 1070;
+String proxyHostName = "localhost";
+String authorizedUsername = "test-username";
+String authorizedPassword = "test-password";
+HttpHost proxy = new HttpHost(proxyHostName, proxyPort);
+BasicCredentialsProvider credentialsProvider = new 
BasicCredentialsProvider();
+credentialsProvider.setCredentials(
+new AuthScope(proxy),
+new UsernamePasswordCredentials(authorizedUsername, 
authorizedPassword.toCharArray()));
+Assertions.assertThatThrownBy(
+() -> {
+  HTTPClient.builder(ImmutableMap.of())
+  .uri(URI)
+  .withProxyCredentialsProvider(credentialsProvider)

Review Comment:
   there should also be a separate test method where the proxy hostname is set 
to null
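   
   Perhaps something along these lines; whether the builder throws a 
`NullPointerException` or an `IllegalArgumentException` for a null hostname is 
an assumption here:
   
   ```java
   @Test
   public void testNullProxyHostnameFailsBuild() {
     // Sketch only: assumes the builder rejects a null proxy hostname at
     // build time; the exact exception type depends on the validation in
     // this PR.
     Assertions.assertThatThrownBy(
             () ->
                 HTTPClient.builder(ImmutableMap.of())
                     .uri(URI)
                     .withProxy(null, 1070)
                     .build())
         .isInstanceOf(NullPointerException.class);
   }
   ```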



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change DataScan to accept Metadata and io [iceberg-python]

2024-04-08 Thread via GitHub


Fokko merged PR #581:
URL: https://github.com/apache/iceberg-python/pull/581


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]

2024-04-08 Thread via GitHub


pvary commented on issue #10100:
URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2042811946

   @yuqi1129: Could setting the `client.pool.cache.eviction-interval-ms`, and 
decreasing the `clients` size help your case?
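   
   For reference, a minimal sketch of passing those two properties when 
initializing the catalog; the catalog name and the concrete values are 
illustrative only:
   
   ```java
   import java.util.Map;
   import org.apache.iceberg.hive.HiveCatalog;
   
   public class HiveCatalogPoolConfig {
     public static void main(String[] args) {
       HiveCatalog catalog = new HiveCatalog();
       catalog.initialize(
           "hive", // hypothetical catalog name
           Map.of(
               "clients", "2", // smaller client pool
               "client.pool.cache.eviction-interval-ms", "60000"));
     }
   }
   ```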


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Pyarrow type error [iceberg-python]

2024-04-08 Thread via GitHub


bigluck commented on issue #541:
URL: https://github.com/apache/iceberg-python/issues/541#issuecomment-2042792612

   @Fokko it sounds good to me! :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Pyarrow type error [iceberg-python]

2024-04-08 Thread via GitHub


Fokko commented on issue #541:
URL: https://github.com/apache/iceberg-python/issues/541#issuecomment-2042789157

   Ciao @bigluck. Thanks for jumping in here. Until V3 is finalized, we can add 
a flag to cast nanosecond timestamps to microsecond precision. Would that work 
for you?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Don't fail to serialize IcebergSourceSplit when there is too many delete files [iceberg]

2024-04-08 Thread via GitHub


elkhand commented on PR #9464:
URL: https://github.com/apache/iceberg/pull/9464#issuecomment-2042732429

   @javrasya wanted to check if you will have time to update the PR this week. 
If not, I can take over and update this PR accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Pyarrow type error [iceberg-python]

2024-04-08 Thread via GitHub


bigluck commented on issue #541:
URL: https://github.com/apache/iceberg-python/issues/541#issuecomment-2042703204

   So, `timestamp_ns` & `timestamptz_ns` were added in `v3` of the Iceberg 
spec; pyiceberg right now supports `v1` and `v2`.
   
   In my case, the column has been generated by this user-generated snippet:
   
   ```
   df['ds'] = pd.to_datetime(df["pickup_datetime"]).dt.date
   ```
   
   Unfortunately, I cannot control what a user writes and how she produces 
the table.
   
   What's the recommended solution for downcasting unsupported column types 
into something less precise, without raising an error?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Pyarrow type error [iceberg-python]

2024-04-08 Thread via GitHub


bigluck commented on issue #541:
URL: https://github.com/apache/iceberg-python/issues/541#issuecomment-2042634726

   I'm facing a similar issue in my code.
   
   Tested using main@7fcdb8d25dfa2498ba98a2b8e8d2b327d85fa7c9 (the commit after 
`Minor fixes, #523 followup (#563)` and `Cast data to Iceberg Table's pyarrow 
schema (#523)`)
   
   In my case I'm creating a new table from this arrow schema:
   ```
   ds: timestamp[ns]
   yhat: double
   yhat_lower: double
   yhat_upper: double
   -- schema metadata --
   ```
   
   This is the full stacktrace:
   ```
   Traceback (most recent call last):
 File "/bpln/cba723bb/82eddf43/pip/runtime/s3write/invoke.py", line 62, in 
invoke
   self._pyiceberg_write_model(model)
 File "/bpln/cba723bb/82eddf43/pip/runtime/s3write/invoke.py", line 137, in 
_pyiceberg_write_model
   pyiceberg_table = catalog.create_table(
 ^
 File "/tmp/runtime/shared/pyiceberg_patch.py", line 105, in create_table
   iceberg_schema = self._convert_schema_if_needed(schema)
^^
 File 
"/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/catalog/__init__.py", line 
559, in _convert_schema_if_needed
   schema: Schema = visit_pyarrow(schema, _ConvertToIcebergWithoutIDs())  # 
type: ignore

 File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper
   return dispatch(args[0].__class__)(*args, **kw)
  
 File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/io/pyarrow.py", line 
682, in _
   return visitor.schema(obj, visit_pyarrow(pa.struct(obj), visitor))
  ^^
 File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper
   return dispatch(args[0].__class__)(*args, **kw)
  
 File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/io/pyarrow.py", line 
691, in _
   result = visit_pyarrow(field.type, visitor)
^^
 File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper
   return dispatch(args[0].__class__)(*args, **kw)
  
 File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/io/pyarrow.py", line 
726, in _
   return visitor.primitive(obj)
  ^^
 File "/bpln/cba723bb/82eddf43/pip/pyiceberg/pyiceberg/io/pyarrow.py", line 
899, in primitive
   raise TypeError(f"Unsupported type: {primitive}")
   TypeError: Unsupported type: timestamp[ns]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Iceberg use PARTITIONED BY (days(ts)) cause wrong partition name [iceberg]

2024-04-08 Thread via GitHub


tcguanshuhuai closed issue #10102: Iceberg use PARTITIONED BY (days(ts)) cause 
wrong partition name
URL: https://github.com/apache/iceberg/issues/10102


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Iceberg use PARTITIONED BY (days(ts)) cause wrong partition name [iceberg]

2024-04-08 Thread via GitHub


tcguanshuhuai commented on issue #10102:
URL: https://github.com/apache/iceberg/issues/10102#issuecomment-2042571758

   Thanks a lot.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Iceberg use PARTITIONED BY (days(ts)) cause wrong partition name [iceberg]

2024-04-08 Thread via GitHub


Kapkiai commented on issue #10102:
URL: https://github.com/apache/iceberg/issues/10102#issuecomment-2042569175

   I don’t think it’s a bug. Converting the epoch gives ‘2024-06-20 16:00:00’ 
at UTC. I believe Spark interprets timestamps at UTC based on its tz/system 
settings, and you are in a different timezone. I have faced similar challenges 
before. When I really don’t need timezone info, I mostly prefer using Spark’s 
timestamp_ntz.
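   
   As a hedged illustration, assuming the mismatch comes from the session time 
zone, pinning Spark's session time zone to UTC makes `days(ts)` line up with 
the UTC calendar date of the stored epoch value:
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class UtcSessionExample {
     public static void main(String[] args) {
       // Sketch only: pin the session time zone so that days(ts) partition
       // values reflect the UTC calendar date rather than the local one.
       SparkSession spark =
           SparkSession.builder()
               .appName("utc-demo") // hypothetical app name
               .master("local[*]") // for a local sketch
               .config("spark.sql.session.timeZone", "UTC")
               .getOrCreate();
       spark.stop();
     }
   }
   ```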


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Docs: Added Upsolver to vendor list [iceberg]

2024-04-08 Thread via GitHub


jasonf20 commented on code in PR #10096:
URL: https://github.com/apache/iceberg/pull/10096#discussion_r1555685176


##
site/docs/vendors.md:
##
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query 
engine](https://trino.io
 ### [Tabular](https://tabular.io)
 
 [Tabular](https://tabular.io/product/) is a managed warehouse and automation 
platform. Tabular offers a central store for analytic data that can be used 
with any query engine or processing framework that supports Iceberg. Tabular 
warehouses add role-based access control and automatic optimization, 
clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming 
data ingestion and table management solution for Apache Iceberg. With Upsolver, 
users can easily ingest batch and streaming data from files, streams and 
databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing lakes cataloged by Tabular or 
AWS Glue Data Catalog, and [analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.

Review Comment:
   Sure, submitted and update with this suggestion. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Docs: Added Upsolver to vendor list [iceberg]

2024-04-08 Thread via GitHub


bitsondatadev commented on code in PR #10096:
URL: https://github.com/apache/iceberg/pull/10096#discussion_r1554976560


##
site/docs/vendors.md:
##
@@ -71,3 +71,7 @@ Starburst is a commercial offering for the [Trino query 
engine](https://trino.io
 ### [Tabular](https://tabular.io)
 
 [Tabular](https://tabular.io/product/) is a managed warehouse and automation 
platform. Tabular offers a central store for analytic data that can be used 
with any query engine or processing framework that supports Iceberg. Tabular 
warehouses add role-based access control and automatic optimization, 
clustering, and compaction to Iceberg tables.
+
+### [Upsolver](https://upsolver.com)
+
+[Upsolver](https://upsolver.com) is a simple to use, high volume streaming 
data ingestion and table management solution for Apache Iceberg. With Upsolver, 
users can easily ingest batch and streaming data from files, streams and 
databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing lakes cataloged by Tabular or 
AWS Glue Data Catalog, and [analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.

Review Comment:
   ```suggestion
   [Upsolver](https://upsolver.com) is a streaming data ingestion and table 
management solution for Apache Iceberg. With Upsolver, users can easily ingest 
batch and streaming data from files, streams and databases (CDC) into [Iceberg 
tables](https://docs.upsolver.com/reference/sql-commands/iceberg-tables/upsolver-managed-tables).
 In addition, Upsolver connects to your existing REST and Hive catalogs, and 
[analyzes the 
health](https://docs.upsolver.com/how-to-guides/apache-iceberg/optimize-your-iceberg-tables)
 of your tables. Use Upsolver to continuously optimize tables by compacting 
small files, sorting and compressing, repartitioning, and cleaning up dangling 
files and expired manifests. Upsolver is available from the [Upsolver 
Cloud](https://www.upsolver.com/sqlake-signup-wp?utm_page=iceberg.org) or can 
be deployed in your AWS VPC.
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] iceberg-aws: Switch tests to JUnit5 + AssertJ-style assertions [iceberg]

2024-04-08 Thread via GitHub


nastra closed issue #9080: iceberg-aws: Switch tests to JUnit5 + AssertJ-style 
assertions
URL: https://github.com/apache/iceberg/issues/9080


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Move iceberg-orc files to iceberg-core module [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat closed issue #8454: Move iceberg-orc files to iceberg-core module
URL: https://github.com/apache/iceberg/issues/8454


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Move iceberg-parquet files to iceberg-core module [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat closed issue #8453: Move iceberg-parquet files to iceberg-core 
module
URL: https://github.com/apache/iceberg/issues/8453


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


nastra merged PR #10086:
URL: https://github.com/apache/iceberg/pull/10086


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement spark action to compute partition stats [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat closed issue #8459: Implement spark action to compute partition 
stats
URL: https://github.com/apache/iceberg/issues/8459


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement spark action to compute partition stats [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat commented on issue #8459:
URL: https://github.com/apache/iceberg/issues/8459#issuecomment-2042446172

   Based on the experiments from https://github.com/apache/iceberg/pull/9437, a 
Spark action is not effective because the serialization cost of each partition 
stats entry is expensive.
   
   Will implement an API in the core module to compute stats in a distributed 
way, and a Spark procedure to call this API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement incremental update using commit stats (SnapshotSummary) [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat commented on issue #8461:
URL: https://github.com/apache/iceberg/issues/8461#issuecomment-2042443146

   Lower priority; will be working on this once all other planned activities 
from #8450 are done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555620195


##
aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java:
##
@@ -229,10 +218,10 @@ public void testRenameTable() {
 glueCatalog.renameTable(
 TableIdentifier.of(namespace, tableName), 
TableIdentifier.of(namespace, newTableName));
 Table renamedTable = glueCatalog.loadTable(TableIdentifier.of(namespace, 
newTableName));
-Assert.assertEquals(table.location(), renamedTable.location());
-Assert.assertEquals(table.schema().toString(), 
renamedTable.schema().toString());
-Assert.assertEquals(table.spec(), renamedTable.spec());
-Assert.assertEquals(table.currentSnapshot(), 
renamedTable.currentSnapshot());
+assertThat(renamedTable.location()).isEqualTo(table.location());
+
assertThat(renamedTable.schema()).asString().isEqualTo(table.schema().toString());

Review Comment:
   it's because `Schema` doesn't implement equals()
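   
   If a direct assertion is preferred over comparing `toString()` output, one 
hedged alternative, assuming `Schema#sameSchema` is available in this Iceberg 
version, would be:
   
   ```java
   // Compares schema structure while ignoring schema IDs; assumes
   // Schema#sameSchema exists in the Iceberg version used here.
   assertThat(renamedTable.schema().sameSchema(table.schema())).isTrue();
   ```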



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Implement incremental update using commit stats (SnapshotSummary) [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat opened a new issue, #8461:
URL: https://github.com/apache/iceberg/issues/8461

   ### Feature Request / Improvement
   
   This could be an experimental direction and can be controlled by a flag.
   
   This might bloat up the table metadata file size when millions of partitions 
are present. 
   
   
   ### Query engine
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


nk1506 commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r186399


##
aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java:
##
@@ -229,10 +218,10 @@ public void testRenameTable() {
 glueCatalog.renameTable(
 TableIdentifier.of(namespace, tableName), 
TableIdentifier.of(namespace, newTableName));
 Table renamedTable = glueCatalog.loadTable(TableIdentifier.of(namespace, 
newTableName));
-Assert.assertEquals(table.location(), renamedTable.location());
-Assert.assertEquals(table.schema().toString(), 
renamedTable.schema().toString());
-Assert.assertEquals(table.spec(), renamedTable.spec());
-Assert.assertEquals(table.currentSnapshot(), 
renamedTable.currentSnapshot());
+assertThat(renamedTable.location()).isEqualTo(table.location());
+
assertThat(renamedTable.schema()).asString().isEqualTo(table.schema().toString());

Review Comment:
   Yeah, just wondering why the `PartitionSpec` and `Snapshot` comparisons are 
working. 
   As per my understanding, `isEqualTo` compares object values, so ideally it 
should not fail unless I am missing something. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change DataScan to accept Metadata and io [iceberg-python]

2024-04-08 Thread via GitHub


Fokko commented on code in PR #581:
URL: https://github.com/apache/iceberg-python/pull/581#discussion_r142312


##
pyiceberg/io/pyarrow.py:
##
@@ -1089,7 +1091,7 @@ def project_table(
 deletes_per_file.get(task.file.file_path),
 case_sensitive,
 limit,
-table.name_mapping(),
+None,

Review Comment:
   Ah, yes, I just added `None` temporarily to silence the IDE warning, but 
then forgot about it. Thanks for raising the PR :D Good to have this tested as 
well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r128393


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +138,59 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testSocketAndConnectionTimeoutSet() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @ParameterizedTest
+  @ValueSource(strings = {HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
HTTPClient.REST_SOCKET_TIMEOUT_MS})
+  public void testInvalidTimeout(String timeoutMsType) {

Review Comment:
   can you also please add a test where the timeout is -1?
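   
   A possible shape for that case; that a negative value fails with an 
`IllegalArgumentException` during config parsing is an assumption here:
   
   ```java
   @Test
   public void testNegativeTimeout() {
     // Sketch only: assumes negative timeouts are rejected while building
     // the ConnectionConfig.
     Map<String, String> properties =
         ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, "-1");
     Assertions.assertThatThrownBy(
             () -> HTTPClient.configureConnectionConfig(properties))
         .isInstanceOf(IllegalArgumentException.class);
   }
   ```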



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Validate overwrite filter [iceberg-python]

2024-04-08 Thread via GitHub


Fokko commented on code in PR #582:
URL: https://github.com/apache/iceberg-python/pull/582#discussion_r1555374364


##
pyiceberg/io/pyarrow.py:
##
@@ -1776,7 +1776,10 @@ def write_parquet(task: WriteTask) -> DataFile:
 fo = io.new_output(file_path)
 with fo.create(overwrite=True) as fos:
 with pq.ParquetWriter(fos, schema=arrow_file_schema, 
**parquet_writer_kwargs) as writer:
-writer.write(pa.Table.from_batches(task.record_batches), 
row_group_size=row_group_size)
+arrow_table = pa.Table.from_batches(task.record_batches)
+# align the columns accordingly in case input arrow table has 
columns in order different from iceberg table

Review Comment:
   Can you provide an example of when this would happen? This only handles 
top-level columns.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r125438


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +138,59 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testSocketAndConnectionTimeoutSet() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);

Review Comment:
   this should also have `.hasMessage("Read timed out")`
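   
   i.e., the full chain would look roughly like:
   
   ```java
   Assertions.assertThatThrownBy(
           () -> client.head(path, ImmutableMap.of(), (unused) -> {}))
       .cause()
       .isInstanceOf(SocketTimeoutException.class)
       .hasMessage("Read timed out");
   ```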



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


tomtongue commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r117972


##
aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java:
##
@@ -258,10 +247,10 @@ public void testRenameTableFailsToCreateNewTable() {
 .hasMessageContaining("Table already exists");
 // old table can still be read with same metadata
 Table oldTable = glueCatalog.loadTable(id);
-Assert.assertEquals(table.location(), oldTable.location());
-Assert.assertEquals(table.schema().toString(), 
oldTable.schema().toString());
-Assert.assertEquals(table.spec(), oldTable.spec());
-Assert.assertEquals(table.currentSnapshot(), oldTable.currentSnapshot());
+assertThat(oldTable.location()).isEqualTo(table.location());
+
assertThat(oldTable.schema()).asString().isEqualTo(table.schema().toString());

Review Comment:
   As commented above, it's necessary because the comparison between the 
objects fails otherwise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


tomtongue commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r119947


##
aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java:
##
@@ -84,43 +84,35 @@ public void testCreateTable() {
 // verify table exists in Glue
 GetTableResponse response =
 
glue.getTable(GetTableRequest.builder().databaseName(namespace).name(tableName).build());
-Assert.assertEquals(namespace, response.table().databaseName());
-Assert.assertEquals(tableName, response.table().name());
-Assert.assertEquals(
-
BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH),
-
response.table().parameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP));
-Assert.assertTrue(
-response
-.table()
-.parameters()
-.containsKey(BaseMetastoreTableOperations.METADATA_LOCATION_PROP));
-Assert.assertEquals(
-schema.columns().size(), 
response.table().storageDescriptor().columns().size());
-Assert.assertEquals(partitionSpec.fields().size(), 
response.table().partitionKeys().size());
-Assert.assertEquals(
-"additionalLocations should match",
-
tableLocationProperties.values().stream().sorted().collect(Collectors.toList()),
-response.table().storageDescriptor().additionalLocations().stream()
-.sorted()
-.collect(Collectors.toList()));
+assertThat(response.table().databaseName()).isEqualTo(namespace);
+assertThat(response.table().name()).isEqualTo(tableName);
+assertThat(response.table().parameters())
+.containsEntry(
+BaseMetastoreTableOperations.TABLE_TYPE_PROP,
+
BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH))
+.containsKey(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+
assertThat(response.table().storageDescriptor().columns()).hasSameSizeAs(schema.columns());
+
assertThat(response.table().partitionKeys()).hasSameSizeAs(partitionSpec.fields());
+assertThat(response.table().storageDescriptor().additionalLocations())

Review Comment:
   thanks for the suggestion. I think the suggestion drops the sorting check, 
so this uses `isEqualTo` and it's not a flaky test.






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


tomtongue commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r117631


##
aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java:
##
@@ -229,10 +218,10 @@ public void testRenameTable() {
 glueCatalog.renameTable(
 TableIdentifier.of(namespace, tableName), 
TableIdentifier.of(namespace, newTableName));
 Table renamedTable = glueCatalog.loadTable(TableIdentifier.of(namespace, 
newTableName));
-Assert.assertEquals(table.location(), renamedTable.location());
-Assert.assertEquals(table.schema().toString(), 
renamedTable.schema().toString());
-Assert.assertEquals(table.spec(), renamedTable.spec());
-Assert.assertEquals(table.currentSnapshot(), 
renamedTable.currentSnapshot());
+assertThat(renamedTable.location()).isEqualTo(table.location());
+
assertThat(renamedTable.schema()).asString().isEqualTo(table.schema().toString());

Review Comment:
   It's necessary because the comparison between the objects fails otherwise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


nk1506 commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r113397


##
aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java:
##
@@ -229,10 +218,10 @@ public void testRenameTable() {
 glueCatalog.renameTable(
 TableIdentifier.of(namespace, tableName), 
TableIdentifier.of(namespace, newTableName));
 Table renamedTable = glueCatalog.loadTable(TableIdentifier.of(namespace, 
newTableName));
-Assert.assertEquals(table.location(), renamedTable.location());
-Assert.assertEquals(table.schema().toString(), 
renamedTable.schema().toString());
-Assert.assertEquals(table.spec(), renamedTable.spec());
-Assert.assertEquals(table.currentSnapshot(), 
renamedTable.currentSnapshot());
+assertThat(renamedTable.location()).isEqualTo(table.location());
+
assertThat(renamedTable.schema()).asString().isEqualTo(table.schema().toString());

Review Comment:
   nit: Just wondering why cast to `asString()` here ? Other Object comparison 
we are doing using `isEqualTo` 



##
aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java:
##
@@ -258,10 +247,10 @@ public void testRenameTableFailsToCreateNewTable() {
 .hasMessageContaining("Table already exists");
 // old table can still be read with same metadata
 Table oldTable = glueCatalog.loadTable(id);
-Assert.assertEquals(table.location(), oldTable.location());
-Assert.assertEquals(table.schema().toString(), 
oldTable.schema().toString());
-Assert.assertEquals(table.spec(), oldTable.spec());
-Assert.assertEquals(table.currentSnapshot(), oldTable.currentSnapshot());
+assertThat(oldTable.location()).isEqualTo(table.location());
+
assertThat(oldTable.schema()).asString().isEqualTo(table.schema().toString());

Review Comment:
   same



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] read from Iceberg table throw java.lang.ArrayIndexOutOfBoundsException: 3 [iceberg]

2024-04-08 Thread via GitHub


nastra commented on issue #10103:
URL: https://github.com/apache/iceberg/issues/10103#issuecomment-2042278015

   @jiantao-vungle do you have a small reproducible example? Without that it's 
quite difficult to reproduce this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


nk1506 commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r103258


##
aws/src/integration/java/org/apache/iceberg/aws/glue/TestGlueCatalogTable.java:
##
@@ -84,43 +84,35 @@ public void testCreateTable() {
 // verify table exists in Glue
 GetTableResponse response =
 
glue.getTable(GetTableRequest.builder().databaseName(namespace).name(tableName).build());
-Assert.assertEquals(namespace, response.table().databaseName());
-Assert.assertEquals(tableName, response.table().name());
-Assert.assertEquals(
-
BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH),
-
response.table().parameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP));
-Assert.assertTrue(
-response
-.table()
-.parameters()
-.containsKey(BaseMetastoreTableOperations.METADATA_LOCATION_PROP));
-Assert.assertEquals(
-schema.columns().size(), 
response.table().storageDescriptor().columns().size());
-Assert.assertEquals(partitionSpec.fields().size(), 
response.table().partitionKeys().size());
-Assert.assertEquals(
-"additionalLocations should match",
-
tableLocationProperties.values().stream().sorted().collect(Collectors.toList()),
-response.table().storageDescriptor().additionalLocations().stream()
-.sorted()
-.collect(Collectors.toList()));
+assertThat(response.table().databaseName()).isEqualTo(namespace);
+assertThat(response.table().name()).isEqualTo(tableName);
+assertThat(response.table().parameters())
+.containsEntry(
+BaseMetastoreTableOperations.TABLE_TYPE_PROP,
+
BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH))
+.containsKey(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+
assertThat(response.table().storageDescriptor().columns()).hasSameSizeAs(schema.columns());
+
assertThat(response.table().partitionKeys()).hasSameSizeAs(partitionSpec.fields());
+assertThat(response.table().storageDescriptor().additionalLocations())

Review Comment:
   previously the assertion was happening after sorting the values. Do you 
think it can become flaky someday? 
   Why not something like
   
`assertThat(response.table().storageDescriptor().additionalLocations()).containsAll(tableLocationProperties.values());`
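   
   Or, as an order-insensitive sketch that also checks sizes, with plain 
AssertJ:
   
   ```java
   // Order-insensitive and checks both size and contents, so a changed
   // iteration order in the Glue response cannot make the test flaky.
   assertThat(response.table().storageDescriptor().additionalLocations())
       .containsExactlyInAnyOrderElementsOf(tableLocationProperties.values());
   ```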
 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


nk1506 commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555485467


##
aws/src/integration/java/org/apache/iceberg/aws/dynamodb/TestDynamoDbCatalog.java:
##
@@ -99,22 +100,23 @@ public void testCreateNamespace() {
 .tableName(catalogTableName)
 .key(DynamoDbCatalog.namespacePrimaryKey(namespace))
 .build());
-Assert.assertTrue("namespace must exist", response.hasItem());
-Assert.assertEquals(
-"namespace must be stored in DynamoDB",
-namespace.toString(),
-response.item().get("namespace").s());
-Assertions.assertThatThrownBy(() -> catalog.createNamespace(namespace))
+assertThat(response.hasItem()).as("namespace must exist").isTrue();
+assertThat(response.item())
+.as("namespace must be stored in DynamoDB")
+.hasEntrySatisfying(
+"namespace",
+attributeValue -> 
assertThat(attributeValue.s()).isEqualTo(namespace.toString()));

Review Comment:
   Makes sense, I think `hasEntrySatisfying` sounds good as per the above 
reasons. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-08 Thread via GitHub


harishch1998 commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1555456733


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,67 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testSocketAndConnectionTimeoutSet() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @Test
+  public void testInvalidConnectionTimeout() {
+testInvalidTimeouts(HTTPClient.REST_CONNECTION_TIMEOUT_MS);
+  }
+
+  @Test
+  public void testInvalidSocketTimeout() {
+testInvalidTimeouts(HTTPClient.REST_SOCKET_TIMEOUT_MS);
+  }
+
+  private void testInvalidTimeouts(String timeoutMsType) {

Review Comment:
   I'm so sorry for the repeated reviews. Fixed. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Validate overwrite filter [iceberg-python]

2024-04-08 Thread via GitHub


Fokko commented on PR #582:
URL: https://github.com/apache/iceberg-python/pull/582#issuecomment-2042152141

   Hi Adrian, thanks for working on this and the very comprehensive write-up. My first question is: what is the main goal of this PR?
   
   Let me elaborate with more context. Looking at the Spark syntax, this originated from the Hive era and assumes that there is a single partition spec on the table. With Iceberg, the partitioning can evolve over time, and therefore older partitioning strategies can also be present.
   
   ```sql
   CREATE TABLE prod.my_app.logs (
     env        STRING,
     logline    STRING,
     created_at TIMESTAMP_NTZ -- Nobody likes timezones
   )
   USING iceberg
   PARTITIONED BY (months(created_at))
   
   -- Later on, when there is more data, switch to daily partitioning
   ALTER TABLE prod.my_app.logs REPLACE PARTITION FIELD dt_month WITH days(created_at)
   
   -- Or, when we also want to split the logs per environment
   ALTER TABLE prod.my_app.logs ADD PARTITION FIELD env
   ```
   
   There is a fair chance that we have to rewrite actual data. When the spec is updated, a full rewrite of the table is not done directly; otherwise it would be very costly to update the partition spec on big tables. If the data is still partitioned monthly and you update it using the new (daily) partitioning, it will read in the Parquet files that match the filter. Let's say you do `INSERT OVERWRITE PARTITION (created_at='2024-04-08')`: it will read in data that was written with the monthly partitioning, filter out the data for `2024-04-08`, and write back the remaining data. The new data will be appended to the table using the new partitioning scheme.
   
   > Example spark sql counterpart in iceberg spark static overwrite for full 
table overwrite is
   
   I think the top one is a [dynamic overwrite](https://iceberg.apache.org/docs/latest/spark-writes/#dynamic-overwrite) since it does not explicitly call out the partition that it will overwrite.
   
   In that case we should compute all the affected partition values by applying the current partition spec to the dataframe. Based on that, we can first delete the affected partitions and then append the data.
   
   ```python
   def overwrite_dynamic(df: pa.Table) -> None:
   ```
   
   The second one looks like a static overwrite:
   
   ```python
   def overwrite(df: pa.Table, expr: Union[BooleanExpression, str]) -> None:
   ...
   
   
   overwrite(df, "level = 'INFO'")
   ```
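   
   For what it's worth, the predicate could presumably also be accepted as a typed expression; a sketch using pyiceberg's expression DSL with the proposed `overwrite` signature above:
   
   ```python
   from pyiceberg.expressions import EqualTo
   
   # Equivalent to overwrite(df, "level = 'INFO'"), but with a typed expression.
   overwrite(df, EqualTo("level", "INFO"))
   ```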
   
   I agree that this is a bit odd:
   
   ```
   mismatched input '>' expecting {')', ','}(line 3, pos 22)
   
   == SQL ==
   INSERT OVERWRITE
   
lacus.test.spark_staticoverwrite_partition_clause_and_data_reltship_multicols
   PARTITION (n_legs > 2)
   --^^^
   SELECT color,animals FROM  tmp_view
   ```
   
   Open questions:
   
   - But in Iceberg there is no technical reason to not allow this. Do we want to prevent the user from doing this?
   - Should we add an option that blows up when we have to rewrite files? If you get your predicates right, and you don't evolve the partition spec, then no Parquet files should be opened and your overwrites will be extremely efficient.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] java.lang.ArrayIndexOutOfBoundsException: 3 [iceberg]

2024-04-08 Thread via GitHub


jiantao-vungle opened a new issue, #10103:
URL: https://github.com/apache/iceberg/issues/10103

   ### Apache Iceberg version
   
   1.3.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 
   
   Environment
   
   Spark: 3.4.1
   Iceberg: 1.3.1
   
   Description
   
   The following exception is thrown when reading from an Iceberg table. I did some research and found that there is a column named `publisher_payout_type_at_delivery` in the table; it has only three options, `CPM`, `REVENUE_SHARE` and `FLAT_CPM`, but the exception suggests that a fourth option was being accessed.
   
   Exception
   
   ```
   Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:63)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
   Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 38 in stage 4.0 failed 4 times, most recent failure: Lost task 
38.3 in stage 4.0 (TID 314) (172.26.2.4 executor 4): 
java.lang.ArrayIndexOutOfBoundsException: 3
at 
org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory$DictionaryStringAccessor.getUTF8String(GenericArrowVectorAccessorFactory.java:423)
at 
org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:138)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
   
   Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)
at scala.Option.foreach(Option.scala:407)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
at 

Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

2024-04-08 Thread via GitHub


Fokko commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2042077008

   Generated a Parquet file using both Spark and Python:
   
   
![image](https://github.com/apache/iceberg-python/assets/1134248/b382632a-e5ef-4c3d-82bd-6efbe2ced53f)
   
![image](https://github.com/apache/iceberg-python/assets/1134248/a79c1cc3-f4ff-41d4-99d8-837475cf4be8)
   
   Looking at the Python file:
   ```
   parq 0-0-624f6b17-7d8d-435e-9fbe-217caeb25ca4.parquet --schema 
   
# Schema 

   required group field_id=-1 schema {
 optional double field_id=1 ABC-GG-1-A;
   }
   ```
   
   And the Spark file:
   ```
   parq 0-0-875786b8-dbcd-4f8c-80a7-d261205a0333-0-1.parquet --schema 
   
# Schema 

   required group field_id=-1 table {
 required double field_id=1 ABC_x2DGG_x2D1_x2DA;
   }
   ```
   
   I would argue that the Python one is correct, but I'm probably missing some context. The names shouldn't matter in the end, so we should probably look the columns up by field ID.
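   
   To make that concrete, a minimal sketch (not the actual pyiceberg implementation) of resolving columns by Iceberg field ID instead of by name, assuming a recent pyarrow that surfaces the `PARQUET:field_id` field metadata:
   
   ```python
   import pyarrow.parquet as pq
   
   # Map each Iceberg field ID to the physical column name in the file,
   # so the (sanitized or raw) name no longer matters.
   schema = pq.read_schema("0-0-624f6b17-7d8d-435e-9fbe-217caeb25ca4.parquet")
   by_field_id = {}
   for field in schema:
       metadata = field.metadata or {}
       field_id = metadata.get(b"PARQUET:field_id")
       if field_id is not None:
           by_field_id[int(field_id)] = field.name
   
   print(by_field_id)  # e.g. {1: 'ABC-GG-1-A'}
   ```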


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1555349405


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,67 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testSocketAndConnectionTimeoutSet() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @Test
+  public void testInvalidConnectionTimeout() {
+testInvalidTimeouts(HTTPClient.REST_CONNECTION_TIMEOUT_MS);
+  }
+
+  @Test
+  public void testInvalidSocketTimeout() {
+testInvalidTimeouts(HTTPClient.REST_SOCKET_TIMEOUT_MS);
+  }
+
+  private void testInvalidTimeouts(String timeoutMsType) {

Review Comment:
   ```suggestion
 @ParameterizedTest
  @ValueSource(strings = {HTTPClient.REST_CONNECTION_TIMEOUT_MS, HTTPClient.REST_SOCKET_TIMEOUT_MS})
 public void testInvalidTimeout(String timeoutType) {
   ```
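   
   (For this suggestion to compile, the test class presumably also needs imports for `org.junit.jupiter.params.ParameterizedTest` and `org.junit.jupiter.params.provider.ValueSource`.)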



Re: [PR] Introduce two properties for reading the connection timeout and socke… [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10053:
URL: https://github.com/apache/iceberg/pull/10053#discussion_r1555349405


##
core/src/test/java/org/apache/iceberg/rest/TestHTTPClient.java:
##
@@ -133,6 +136,67 @@ public void testDynamicHttpRequestInterceptorLoading() {
 assertThat(((TestHttpRequestInterceptor) 
interceptor).properties).isEqualTo(properties);
   }
 
+  @Test
+  public void testSocketAndConnectionTimeoutSet() {
+long connectionTimeoutMs = 10L;
+int socketTimeoutMs = 10;
+Map properties =
+ImmutableMap.of(
+HTTPClient.REST_CONNECTION_TIMEOUT_MS, 
String.valueOf(connectionTimeoutMs),
+HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+
+ConnectionConfig connectionConfig = 
HTTPClient.configureConnectionConfig(properties);
+assertThat(connectionConfig).isNotNull();
+
assertThat(connectionConfig.getConnectTimeout().getDuration()).isEqualTo(connectionTimeoutMs);
+
assertThat(connectionConfig.getSocketTimeout().getDuration()).isEqualTo(socketTimeoutMs);
+  }
+
+  @Test
+  public void testSocketTimeout() throws IOException {
+long socketTimeoutMs = 2000L;
+Map properties =
+ImmutableMap.of(HTTPClient.REST_SOCKET_TIMEOUT_MS, 
String.valueOf(socketTimeoutMs));
+String path = "socket/timeout/path";
+
+try (HTTPClient client = HTTPClient.builder(properties).uri(URI).build()) {
+  HttpRequest mockRequest =
+  request()
+  .withPath("/" + path)
+  .withMethod(HttpMethod.HEAD.name().toUpperCase(Locale.ROOT));
+  // Setting a response delay of 5 seconds to simulate hitting the 
configured socket timeout of
+  // 2 seconds
+  HttpResponse mockResponse =
+  response()
+  .withStatusCode(200)
+  .withBody("Delayed response")
+  .withDelay(TimeUnit.MILLISECONDS, 5000);
+  mockServer.when(mockRequest).respond(mockResponse);
+
+  Assertions.assertThatThrownBy(() -> client.head(path, ImmutableMap.of(), 
(unused) -> {}))
+  .cause()
+  .isInstanceOf(SocketTimeoutException.class);
+}
+  }
+
+  @Test
+  public void testInvalidConnectionTimeout() {
+testInvalidTimeouts(HTTPClient.REST_CONNECTION_TIMEOUT_MS);
+  }
+
+  @Test
+  public void testInvalidSocketTimeout() {
+testInvalidTimeouts(HTTPClient.REST_SOCKET_TIMEOUT_MS);
+  }
+
+  private void testInvalidTimeouts(String timeoutMsType) {

Review Comment:
   ```suggestion
 private void checkInvalidTimeout(String timeoutMsType) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Docs: Fix On-screen display issues and minor expressions on Branching and Tagging DDL [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10091:
URL: https://github.com/apache/iceberg/pull/10091#discussion_r1555345466


##
docs/docs/spark-ddl.md:
##
@@ -499,17 +500,18 @@ AS OF VERSION 1234
 
 -- CREATE audit-branch at snapshot 1234, retain audit-branch for 31 days, and 
retain the latest 31 days. The latest 3 snapshot snapshots, and 2 days worth of 
snapshots. 
 ALTER TABLE prod.db.sample CREATE BRANCH `audit-branch`
-AS OF VERSION 1234 RETAIN 30 DAYS 
+AS OF VERSION 1234 RETAIN 31 DAYS 

Review Comment:
   rather than changing it here, can you please change it in the description 
above?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] #9073 Junit 4 tests switched to JUnit 5 [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #9793:
URL: https://github.com/apache/iceberg/pull/9793#discussion_r1555322359


##
data/src/test/java/org/apache/iceberg/data/TestDataFileIndexStatsFilters.java:
##
@@ -137,9 +135,11 @@ public void testPositionDeletePlanningPath() throws 
IOException {
   tasks = Lists.newArrayList(tasksIterable);
 }
 
-Assert.assertEquals("Should produce one task", 1, tasks.size());
+Assertions.assertThat(1).isEqualTo(tasks.size()).as("Should produce one 
task");

Review Comment:
   please statically import `assertThat()`



##
data/src/test/java/org/apache/iceberg/data/DeleteReadTests.java:
##
@@ -314,7 +315,7 @@ public void testEqualityDeletesWithRequiredEqColumn() 
throws IOException {
 StructLikeSet actual = rowSet(tableName, table, "id");
 
 if (expectPruned()) {
-  assertThat(actual).as("Table should contain expected 
rows").isEqualTo(expected);
+  Assertions.assertThat(actual).as("Table should contain expected 
rows").isEqualTo(expected);

Review Comment:
   please revert changes to this class



##
data/src/test/java/org/apache/iceberg/data/TestDataFileIndexStatsFilters.java:
##
@@ -163,10 +163,11 @@ public void testPositionDeletePlanningPathFilter() throws 
IOException {
   tasks = Lists.newArrayList(tasksIterable);
 }
 
-Assert.assertEquals("Should produce one task", 1, tasks.size());
+Assertions.assertThat(tasks.size()).isEqualTo(1).as("Should produce one 
task");

Review Comment:
   same as above



##
data/src/test/java/org/apache/iceberg/data/TestLocalScan.java:
##
@@ -175,67 +173,110 @@ private void overwriteExistingData() throws IOException {
 .commit();
   }
 
-  private void appendData() throws IOException {
+  private void appendData(Object fileExt) throws IOException {
+FileFormat fileFormat = FileFormat.fromString(fileExt.toString());
 DataFile file12 =
-writeFile(sharedTableLocation, format.addExtension("file-12"), 
file1SecondSnapshotRecords);
+writeFile(
+sharedTableLocation, fileFormat.addExtension("file-12"), 
file1SecondSnapshotRecords);
 DataFile file22 =
-writeFile(sharedTableLocation, format.addExtension("file-22"), 
file2SecondSnapshotRecords);
+writeFile(
+sharedTableLocation, fileFormat.addExtension("file-22"), 
file2SecondSnapshotRecords);
 DataFile file32 =
-writeFile(sharedTableLocation, format.addExtension("file-32"), 
file3SecondSnapshotRecords);
+writeFile(
+sharedTableLocation, fileFormat.addExtension("file-32"), 
file3SecondSnapshotRecords);
 
 
sharedTable.newFastAppend().appendFile(file12).appendFile(file22).appendFile(file32).commit();
 
 DataFile file13 =
-writeFile(sharedTableLocation, format.addExtension("file-13"), 
file1ThirdSnapshotRecords);
+writeFile(
+sharedTableLocation, fileFormat.addExtension("file-13"), 
file1ThirdSnapshotRecords);
 DataFile file23 =
-writeFile(sharedTableLocation, format.addExtension("file-23"), 
file2ThirdSnapshotRecords);
+writeFile(
+sharedTableLocation, fileFormat.addExtension("file-23"), 
file2ThirdSnapshotRecords);
 DataFile file33 =
-writeFile(sharedTableLocation, format.addExtension("file-33"), 
file3ThirdSnapshotRecords);
+writeFile(
+sharedTableLocation, fileFormat.addExtension("file-33"), 
file3ThirdSnapshotRecords);
 
 
sharedTable.newFastAppend().appendFile(file13).appendFile(file23).appendFile(file33).commit();
   }
 
-  @Before
+  @BeforeEach
   public void createTables() throws IOException {
-File location = temp.newFolder("shared");
-Assert.assertTrue(location.delete());
+File location = temp;
+Assertions.assertThat(location.delete()).isTrue();
 this.sharedTableLocation = location.toString();
 this.sharedTable =
 TABLES.create(
 SCHEMA,
 PartitionSpec.unpartitioned(),
-ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, 
format.name()),
+ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, 
temp.getName()),
 sharedTableLocation);
 
 Record record = GenericRecord.create(SCHEMA);
 
-DataFile file1 =
-writeFile(sharedTableLocation, format.addExtension("file-1"), 
file1FirstSnapshotRecords);
-
-Record nullData = record.copy();
-nullData.setField("id", 11L);
-nullData.setField("data", null);
-
-DataFile file2 =
-writeFile(sharedTableLocation, format.addExtension("file-2"), 
file2FirstSnapshotRecords);
-DataFile file3 =
-writeFile(sharedTableLocation, format.addExtension("file-3"), 
file3FirstSnapshotRecords);
-
-// commit the test data
-
sharedTable.newAppend().appendFile(file1).appendFile(file2).appendFile(file3).commit();
+String[] format = {"parquet", "orc", "avro"};
+Arrays.stream(format)
+.forEach(
+

Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


tomtongue commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555328887


##
aws/src/integration/java/org/apache/iceberg/aws/dynamodb/TestDynamoDbCatalog.java:
##
@@ -99,22 +100,23 @@ public void testCreateNamespace() {
 .tableName(catalogTableName)
 .key(DynamoDbCatalog.namespacePrimaryKey(namespace))
 .build());
-Assert.assertTrue("namespace must exist", response.hasItem());
-Assert.assertEquals(
-"namespace must be stored in DynamoDB",
-namespace.toString(),
-response.item().get("namespace").s());
-Assertions.assertThatThrownBy(() -> catalog.createNamespace(namespace))
+assertThat(response.hasItem()).as("namespace must exist").isTrue();
+assertThat(response.item())
+.as("namespace must be stored in DynamoDB")
+.hasEntrySatisfying(
+"namespace",
+attributeValue -> 
assertThat(attributeValue.s()).isEqualTo(namespace.toString()));

Review Comment:
   Thanks for the suggestion. Your suggestion seems fine, but my change also checks whether `response.item` has the `namespace` key at all. If the value is simply extracted via `get("namespace")`, that existence check doesn't happen in the test. So to follow your suggestion, this part would have to be written like:
   
   ```
   assertThat(response.item()).containsKey("namespace")
   assertThat(response.item().get("namespace").s())...
   ```
   
   That's why I'm using `hasEntrySatisfying` here. If I'm wrong here, please 
correct me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Hive Catalog cannot create table with TimestamptzType field [iceberg-python]

2024-04-08 Thread via GitHub


HonahX closed issue #583: Hive Catalog cannot create table with TimestamptzType 
field
URL: https://github.com/apache/iceberg-python/issues/583


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]

2024-04-08 Thread via GitHub


HonahX merged PR #585:
URL: https://github.com/apache/iceberg-python/pull/585


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]

2024-04-08 Thread via GitHub


HonahX commented on PR #585:
URL: https://github.com/apache/iceberg-python/pull/585#issuecomment-2042016912

   @Fokko Thanks for reviewing!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


tomtongue commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555324898


##
aws/src/integration/java/org/apache/iceberg/aws/lakeformation/TestLakeFormationAwsClientFactory.java:
##
@@ -165,7 +165,8 @@ public void testLakeFormationEnabledGlueCatalog() throws 
Exception {
   glueCatalog.createNamespace(allowedNamespace);
 } catch (GlueException e) {
   LOG.error("fail to create Glue database", e);
-  Assert.fail("create namespace should succeed");
+
+  fail("create namespace should succeed");

Review Comment:
   Thanks. Will update this line.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Iceberg use PARTITIONED BY (days(ts)) cause wrong partition name [iceberg]

2024-04-08 Thread via GitHub


tcguanshuhuai opened a new issue, #10102:
URL: https://github.com/apache/iceberg/issues/10102

   ### Apache Iceberg version
   
   1.5.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 
   
   Step 1: start Spark SQL with Iceberg
   spark-sql --jars /root/myjars/iceberg-spark-runtime-3.3_2.12-1.5.0.jar \
   --conf 
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
 \
   --conf 
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
   --conf spark.sql.catalog.spark_catalog.type=hive \
   --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog 
\
   --conf spark.sql.catalog.hive_prod.type=hive \
   --conf 
spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
   --conf spark.sql.catalog.hadoop_prod.type=hadoop \
   --conf 
spark.sql.catalog.hadoop_prod.warehouse=hdfs://ns:8020/user/iceberg/warehouse/hadoop_prod
   
   Step 2: create an Iceberg table
   CREATE TABLE hadoop_prod.iceberg_db.sample_hidden_partition (
   id bigint,
   data string,
   ts timestamp)
   USING iceberg
   PARTITIONED BY (days(ts));
   
   Step 3: insert a record
   //Timestamp: 1718899200  is  2024-06-21 00:00:00
   insert into table hadoop_prod.iceberg_db.sample_hidden_partition values(2, 
'2021-01-02', cast(1718899200 as timestamp) );
   
   Using a SELECT query, I can get the correct value:
   spark-sql (default)> select * from 
hadoop_prod.iceberg_db.sample_hidden_partition;
   id    data         ts
   2     2021-01-02   2024-06-21 00:00:00
   
   However, the HDFS path is wrong:
   spark-sql (default)> dfs -ls 
/user/iceberg/warehouse/hadoop_prod/iceberg_db/sample_hidden_partition/data;
   
   **Expect result:** 
   drwxr-xr-x   - root supergroup  0 2024-04-08 14:36 
/user/iceberg/warehouse/hadoop_prod/iceberg_db/sample_hidden_partition/data/ts_day=2024-06-21
   
   **Actual result:**
   drwxr-xr-x   - root supergroup  0 2024-04-08 14:36 
/user/iceberg/warehouse/hadoop_prod/iceberg_db/sample_hidden_partition/data/ts_day=2024-06-20
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-08 Thread via GitHub


nastra commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555315428


##
aws/src/integration/java/org/apache/iceberg/aws/lakeformation/TestLakeFormationAwsClientFactory.java:
##
@@ -165,7 +165,8 @@ public void testLakeFormationEnabledGlueCatalog() throws 
Exception {
   glueCatalog.createNamespace(allowedNamespace);
 } catch (GlueException e) {
   LOG.error("fail to create Glue database", e);
-  Assert.fail("create namespace should succeed");
+
+  fail("create namespace should succeed");

Review Comment:
   rather than using `fail` with a try-catch, it's better to have something 
like `assertThatNoException().isThrownBy(() -> 
glueCatalog.createNamespace(allowedNamespace))`
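   
   For illustration, the whole try/catch could then collapse to something like this (a sketch, assuming AssertJ 3.17+ with a static import):
   
   ```java
   import static org.assertj.core.api.Assertions.assertThatNoException;
   
   // Replaces the try/catch + LOG.error + fail() pattern with one assertion.
   assertThatNoException()
       .isThrownBy(() -> glueCatalog.createNamespace(allowedNamespace));
   ```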



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]

2024-04-08 Thread via GitHub


Fokko commented on code in PR #585:
URL: https://github.com/apache/iceberg-python/pull/585#discussion_r1555301526


##
mkdocs/docs/configuration.md:
##
@@ -232,6 +232,18 @@ catalog:
 s3.secret-access-key: password
 ```
 
+In case of Hive 2.x:
+
+```yaml
+catalog:
+  default:
+uri: thrift://localhost:9083
+hive.hive2-compatible: true
+s3.endpoint: http://localhost:9000
+s3.access-key-id: admin
+s3.secret-access-key: password
+```

Review Comment:
   ```suggestion
   When using Hive 2.x, make sure to set the compatibility flag:
   
   ```yaml
   catalog:
 default:
   hive.hive2-compatible: true
   ```
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]

2024-04-08 Thread via GitHub


Fokko commented on code in PR #585:
URL: https://github.com/apache/iceberg-python/pull/585#discussion_r1555301526


##
mkdocs/docs/configuration.md:
##
@@ -232,6 +232,18 @@ catalog:
 s3.secret-access-key: password
 ```
 
+In case of Hive 2.x:
+
+```yaml
+catalog:
+  default:
+uri: thrift://localhost:9083
+hive.hive2-compatible: true
+s3.endpoint: http://localhost:9000
+s3.access-key-id: admin
+s3.secret-access-key: password
+```

Review Comment:
   ```suggestion
   When using Hive 2.x, make sure to set the compatibility flag:
   
   ```yaml
   catalog:
 default:
   ...
   hive.hive2-compatible: true
   ```
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement incremental update using commit stats (SnapshotSummary) [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat commented on issue #8461:
URL: https://github.com/apache/iceberg/issues/8461#issuecomment-2041964709

   Based on the experiments from https://github.com/apache/iceberg/pull/9437, the Spark action is not effective, as the serialization cost of each partition stats entry is expensive.
   
   Will implement an API in the core module to compute stats in a distributed way, and a Spark procedure to call this API.
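   
   For illustration only, the call shape might end up looking something like this (hypothetical procedure name and signature):
   
   ```sql
   CALL spark_catalog.system.compute_partition_stats(table => 'db.sample')
   ```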


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement incremental update using commit stats (SnapshotSummary) [iceberg]

2024-04-08 Thread via GitHub


ajantha-bhat closed issue #8461: Implement incremental update using commit 
stats (SnapshotSummary)
URL: https://github.com/apache/iceberg/issues/8461


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]

2024-04-08 Thread via GitHub


yuqi1129 commented on issue #10100:
URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2041959506

   @pvary 
   Thank you for your kind reply. 
   
   The reason we need to close the `HiveCatalog` and `HiveClientPool` is that we are working on a metadata management system, https://github.com/datastrato/gravitino, and Gravitino will manage several catalogs. Each Iceberg catalog goes through the following operations:
   
   1. Create an Iceberg catalog. In this step, Gravitino will use an `IsolatedClassloader` to load the classes that belong to the Iceberg catalog.
   2. `LoadTable`, `LoadSchema`, and other operations run under that classloader.
   3. Finally, we drop the Iceberg catalog if it's no longer needed, so we need to close the catalog, close the classloader, and release the associated resources. However, due to a daemon thread introduced by `CachedClientPool`, the classloader can't be garbage-collected, which eventually leads to an OOM.
   
   ```java
   
   // Code block from CachedClientPool#init
   clientPoolCache =
 Caffeine.newBuilder()
 .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
 .removalListener((ignored, value, cause) -> ((HiveClientPool) 
value).close())
 .scheduler(
// This daemon thread can't be explicitly closed.
 Scheduler.forScheduledExecutorService(
 ThreadPools.newScheduledPool("hive-metastore-cleaner", 
1)))
 .build();
   ```
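   
   For illustration, one possible shape of a fix (a hypothetical sketch, not the current Iceberg code): keep a handle to the scheduler's executor so that closing the catalog can also shut down the daemon thread:
   
   ```java
   // Hypothetical sketch: retain the cleaner pool so close() can shut it
   // down explicitly, which lets the classloader be garbage-collected.
   private final ScheduledExecutorService cleanerPool =
       ThreadPools.newScheduledPool("hive-metastore-cleaner", 1);
   
   private void init() {
     clientPoolCache =
         Caffeine.newBuilder()
             .expireAfterAccess(evictionInterval, TimeUnit.MILLISECONDS)
             .removalListener((ignored, value, cause) -> ((HiveClientPool) value).close())
             .scheduler(Scheduler.forScheduledExecutorService(cleanerPool))
             .build();
   }
   
   @Override
   public void close() {
     cleanerPool.shutdownNow(); // releases the daemon thread
   }
   ```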


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org