Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-12-12 Thread via GitHub


amogh-jahagirdar commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1882456464


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java:
##
@@ -231,7 +233,6 @@ public void testMergeWithVectorizedReads() {
 
   @TestTemplate
   public void testCoalesceMerge() {
-    assumeThat(formatVersion).isLessThan(3);

Review Comment:
   Addressing a miss from my previous PR for writing DVs in V3: instead of skipping this coalesce case when the format version is less than 3, we should assert the expected DVs (the else-if case below).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-19 Thread via GitHub


amogh-jahagirdar commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1848422899


##
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:
##
@@ -154,7 +155,7 @@ public void testDeleteWithVectorizedReads() throws NoSuchTableException {
   }
 
   @Test
-  public void testCoalesceDelete() throws Exception {
+  public void testCoalesceDeleteWithPartitionGranularity() throws Exception {

Review Comment:
   I missed that this is shared between CoW and MoR, so renaming the test to include granularity doesn't make sense as it is here. Let me see how to fix this.







Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-13 Thread via GitHub


nastra commented on PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#issuecomment-2474032570

   I'm in favor of doing a normal backport so that all Spark versions get the same improvement.





Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-13 Thread via GitHub


amogh-jahagirdar commented on PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#issuecomment-2474025071

   >@amogh-jahagirdar I don't see https://github.com/apache/iceberg/pull/11273 being back-ported to Spark 3.3, 3.4 yet. Shall we skip changes to Spark 3.3, 3.4 until that is done?
   
   @manuzhang Definitely agree that we shouldn't merge this change to 3.3/3.4 until the maintenance piece is backported; I'm working on that backport. That said, I think we can leave this PR open for discussion. I also wanted to raise it to see which tests would need to be updated in CI (and to get clarity on whether the 3.3/3.4 tests differ somehow).
   
   What I'd propose is to backport the sync maintenance to 3.3/3.4, since I think we'll need that anyway, and then, if there's consensus here/on the mailing list to change the default table property, we can update all the Spark tests in one go.
   Alternatively, as @aokolnychyi mentioned above, we could change just the specific Spark conf if we don't want to backport to older Spark versions. I'm leaning towards doing the backport, because I think we should take a stance on whether the new default is generally better, rather than making it specific to a Spark version.





Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-13 Thread via GitHub


aokolnychyi commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1840638062


##
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:
##
@@ -154,7 +155,7 @@ public void testDeleteWithVectorizedReads() throws NoSuchTableException {
   }
 
   @Test
-  public void testCoalesceDelete() throws Exception {
+  public void testCoalesceDeleteWithPartitionGranularity() throws Exception {

Review Comment:
   This makes sense to me.






Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-13 Thread via GitHub


aokolnychyi commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1840651564


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -90,6 +90,8 @@ private static Map<String, String> persistedProperties(Map<String, String> rawPr
     persistedProperties.put(
         TableProperties.PARQUET_COMPRESSION,
         TableProperties.PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0);
+    persistedProperties.put(
+        TableProperties.DELETE_GRANULARITY,
+        TableProperties.DELETE_GRANULARITY_DEFAULT_SINCE_1_8_0);

Review Comment:
   This matches what we did for zstd in Parquet. I think it should be fine as 
is.
   
   An alternative is to change the default in our Spark conf classes, but it 
will be hard to differentiate between new and existing tables. Switching to 
file-scoped deletes for existing workloads may be a bigger disruption.
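To illustrate the trade-off being discussed, here is a minimal, hypothetical sketch (not the actual Iceberg API; class and method names are simplified) of the two places a default can live: stamped into the table's persisted properties at creation time, or supplied by the engine conf at resolution time. Engine-side resolution typically checks the write option, then the table property, then falls back to the engine default:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: where the delete-granularity default can come from.
public class GranularityDefault {
  static final String PROP = "write.delete.granularity";

  // Engine-side resolution: write option > table property > engine default.
  static String resolve(Map<String, String> writeOptions,
                        Map<String, String> tableProps,
                        String engineDefault) {
    if (writeOptions.containsKey(PROP)) {
      return writeOptions.get(PROP);
    }
    return tableProps.getOrDefault(PROP, engineDefault);
  }

  public static void main(String[] args) {
    // Existing table: property absent, the engine default applies.
    Map<String, String> existing = new HashMap<>();
    System.out.println(resolve(new HashMap<>(), existing, "partition")); // partition

    // New table: "file" was persisted at creation, so it wins regardless
    // of which engine (or engine version) later writes to the table.
    Map<String, String> created = new HashMap<>();
    created.put(PROP, "file");
    System.out.println(resolve(new HashMap<>(), created, "partition")); // file
  }
}
```

Persisting the value at creation (the approach in this PR) makes the behavior engine-independent, while an engine-conf default could not distinguish new tables from existing ones.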






Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-13 Thread via GitHub


aokolnychyi commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1840651564


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -90,6 +90,8 @@ private static Map<String, String> persistedProperties(Map<String, String> rawPr
     persistedProperties.put(
         TableProperties.PARQUET_COMPRESSION,
         TableProperties.PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0);
+    persistedProperties.put(
+        TableProperties.DELETE_GRANULARITY,
+        TableProperties.DELETE_GRANULARITY_DEFAULT_SINCE_1_8_0);

Review Comment:
   This matches what we did for zstd in Parquet. I think it should be fine. An alternative is to change the default in our Spark conf classes, if we want to make this specific to Spark. In theory, that would allow us to change this value only for particular versions of Spark (if we don't want to cherry-pick this to older Spark versions).
   
   ```
   public DeleteGranularity deleteGranularity() {
     String valueAsString =
         confParser
             .stringConf()
             .option(SparkWriteOptions.DELETE_GRANULARITY)
             .tableProperty(TableProperties.DELETE_GRANULARITY)
             .defaultValue(TableProperties.DELETE_GRANULARITY_DEFAULT) // this will have to change
             .parse();
     return DeleteGranularity.fromString(valueAsString);
   }
   ```






Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-12 Thread via GitHub


manuzhang commented on PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#issuecomment-2472286633

   I don't see #11273 being back-ported to Spark 3.3, 3.4 yet. Shall we skip 
changes to Spark 3.3, 3.4 until that is done?





Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-12 Thread via GitHub


nastra commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1837939172


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -90,6 +90,8 @@ private static Map<String, String> persistedProperties(Map<String, String> rawPr
     persistedProperties.put(
         TableProperties.PARQUET_COMPRESSION,
         TableProperties.PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0);
+    persistedProperties.put(

Review Comment:
   Can we add a simple test to `TestTableMetadata` to make sure this is properly set on new tables and not set on existing ones?
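A sketch of what such a test could assert, using hypothetical helper names rather than the real `TableMetadata` API: defaults are injected only on the creation path, and explicit user settings win, while existing-table properties pass through untouched:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the two code paths the requested test would cover.
public class PersistedDefaultsTest {
  static final String PROP = "write.delete.granularity";

  // Creation path: stamp the default, but let explicit user settings win.
  static Map<String, String> persistedProperties(Map<String, String> raw) {
    Map<String, String> persisted = new HashMap<>();
    persisted.put(PROP, "file");
    persisted.putAll(raw);
    return persisted;
  }

  // Update path for existing tables: properties pass through untouched.
  static Map<String, String> updatedProperties(Map<String, String> current) {
    return new HashMap<>(current);
  }

  public static void main(String[] args) {
    // New table: default is set.
    if (!"file".equals(persistedProperties(new HashMap<>()).get(PROP))) {
      throw new AssertionError("new table should get the default");
    }
    // New table with an explicit setting: user value preserved.
    Map<String, String> raw = new HashMap<>();
    raw.put(PROP, "partition");
    if (!"partition".equals(persistedProperties(raw).get(PROP))) {
      throw new AssertionError("explicit setting should win");
    }
    // Existing table: no default injected.
    if (updatedProperties(new HashMap<>()).containsKey(PROP)) {
      throw new AssertionError("existing table should stay unchanged");
    }
    System.out.println("ok");
  }
}
```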






Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-11 Thread via GitHub


amogh-jahagirdar commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1837005354


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewritePositionDeleteFilesProcedure.java:
##
@@ -49,7 +49,7 @@ private void createTable(boolean partitioned) throws Exception {
     String partitionStmt = partitioned ? "PARTITIONED BY (id)" : "";
     sql(
         "CREATE TABLE %s (id bigint, data string) USING iceberg %s TBLPROPERTIES"
-            + "('format-version'='2', 'write.delete.mode'='merge-on-read')",
+            + "('format-version'='2', 'write.delete.mode'='merge-on-read', 'write.delete.granularity'='partitioned')",

Review Comment:
   See below: the tests for the underlying action already cover both partition and file granularity, so we have coverage of both cases. I don't know how much value it would add to rewrite all the tests here that have expectations based on partition granularity to expect file granularity instead, or to add procedure-level tests for both.



##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewritePositionDeleteFilesAction.java:
##
@@ -741,7 +741,9 @@ private Map<String, String> tableProperties() {
         TableProperties.FORMAT_VERSION,
         "2",
         TableProperties.DEFAULT_FILE_FORMAT,
-        format.toString());
+        format.toString(),
+        TableProperties.DELETE_GRANULARITY,
+        DeleteGranularity.PARTITION.toString());

Review Comment:
   This is just the default setup for a table in this test; we also test file-scoped deletes for this action, in `testFileGranularity` in this test class.



##
spark/v3.4/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:
##
@@ -154,7 +155,7 @@ public void testDeleteWithVectorizedReads() throws NoSuchTableException {
   }
 
   @Test
-  public void testCoalesceDelete() throws Exception {
+  public void testCoalesceDeleteWithPartitionGranularity() throws Exception {

Review Comment:
   I updated this test to use partition-scoped deletes, since I think the original intent of the test was to verify the output files when the shuffle blocks are small and coalesced into a single task (with an unpartitioned table) as part of AQE.
   
   Without making this test specific to partition-scoped deletes, we wouldn't really be testing how AQE coalescing to a single task affects the number of output files, and it would be undifferentiated from the other tests, which already cover file-scoped deletes.






Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-08 Thread via GitHub


amogh-jahagirdar commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1834587711


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -90,6 +90,8 @@ private static Map<String, String> persistedProperties(Map<String, String> rawPr
     persistedProperties.put(
         TableProperties.PARQUET_COMPRESSION,
         TableProperties.PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0);
+    persistedProperties.put(

Review Comment:
   Yes, we'll definitely need to update the failing Hive tests in their current form, since it looks like they currently have expectations on the persisted properties.
   
   >Maybe worth to add a test to check how to read "previous" files which don't 
contain the property ?
   
   In terms of general metadata file reading, the core library will just read the metadata file with or without the property (since the properties section is just a string-to-string map). As for how engine integrations handle the missing property, at the moment only Spark really leverages this property, with a default option in case it's not defined, and we already have a [test](https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/TestSparkWriteConf.java#L140C15-L140C21) verifying this default handling when it's missing.
   
   I'll double-check the other integrations though.
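The fallback behavior described above can be sketched with a simplified enum standing in for Iceberg's `DeleteGranularity` (names are illustrative, not the real API): a reader tolerates metadata written before the property existed by substituting a default when the key is absent:

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: tolerate metadata files that predate the property.
public class GranularityFallback {
  enum Granularity {
    PARTITION, FILE;

    static Granularity fromString(String value) {
      return valueOf(value.toUpperCase(Locale.ROOT));
    }
  }

  // Missing property -> default; present property -> parsed value.
  static Granularity read(Map<String, String> tableProps, Granularity dflt) {
    String value = tableProps.get("write.delete.granularity");
    return value == null ? dflt : Granularity.fromString(value);
  }

  public static void main(String[] args) {
    // "Previous" metadata without the property falls back to the default.
    System.out.println(read(Map.of(), Granularity.PARTITION)); // PARTITION
    // Newer metadata with the property persisted uses the stored value.
    System.out.println(read(Map.of("write.delete.granularity", "file"),
        Granularity.PARTITION)); // FILE
  }
}
```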






Re: [PR] Core: Change Delete granularity to file for new tables [iceberg]

2024-11-05 Thread via GitHub


jbonofre commented on code in PR #11478:
URL: https://github.com/apache/iceberg/pull/11478#discussion_r1830477007


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -90,6 +90,8 @@ private static Map<String, String> persistedProperties(Map<String, String> rawPr
     persistedProperties.put(
         TableProperties.PARQUET_COMPRESSION,
         TableProperties.PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0);
+    persistedProperties.put(

Review Comment:
   By adding this, we need to update a few tests (like the Hive ones), since we now see `"write.delete.granularity"="file"` in the persisted metadata.
   
   I propose updating those tests.
   
   It may also be worth adding a test to check how we read "previous" metadata files that don't contain the property.


