Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-05 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2073859092


##
pom.xml:
##
@@ -155,21 +155,25 @@
       <groupId>org.apache.parquet</groupId>
       <artifactId>parquet-avro</artifactId>
       <version>${parquet.version}</version>
+      <scope>provided</scope>

Review Comment:
   ok done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@xtable.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-05 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2073849932


##
pom.xml:
##
@@ -155,21 +155,25 @@
       <groupId>org.apache.parquet</groupId>
       <artifactId>parquet-avro</artifactId>
       <version>${parquet.version}</version>
+      <scope>provided</scope>

Review Comment:
   yes



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-05 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2073845150


##
pom.xml:
##
@@ -155,21 +155,25 @@
       <groupId>org.apache.parquet</groupId>
       <artifactId>parquet-avro</artifactId>
       <version>${parquet.version}</version>
+      <scope>provided</scope>

Review Comment:
   you mean the scope should be removed?



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-05 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2073843732


##
xtable-core/pom.xml:
##
@@ -57,6 +57,12 @@
 
 
 
+
+

Review Comment:
   ok



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-05 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2073841260


##
pom.xml:
##
@@ -155,21 +155,25 @@
       <groupId>org.apache.parquet</groupId>
       <artifactId>parquet-avro</artifactId>
       <version>${parquet.version}</version>
+      <scope>provided</scope>

Review Comment:
   Can you revert these changes to the scope in this file? They should no
longer be needed.
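
For context, Maven's `provided` scope puts a dependency on the compile and test classpaths but leaves it out of the packaged artifact, on the assumption that the runtime environment supplies it. Omitting `<scope>` falls back to the default `compile` scope. A minimal sketch of the reverted entry, assuming the hunk sits inside a standard `<dependency>` element (the enclosing element is not visible in the quoted diff, so it is an assumption here):

```xml
<!-- Sketch only: the enclosing <dependency> element is assumed, not taken from the PR. -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>${parquet.version}</version>
  <!-- No <scope> element: Maven applies the default scope, compile. -->
</dependency>
```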



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-05 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2073841939


##
xtable-core/pom.xml:
##
@@ -57,6 +57,12 @@
 
 
 
+
+

Review Comment:
   Please, no more commented-out lines in final drafts of PRs; they will only
lead to confusion around the intention of the code.



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-04 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2072781543


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.BooleanStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+public class TestParquetStatsExtractor {
+  @Builder.Default
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @TempDir static java.nio.file.Path tempDir = Paths.get("./");
+
+  public static List initBooleanFileTest(File file) throws IOException {
+    // create the parquet file by parsing a schema
+    Path path = new Path(file.toURI());
+    Configuration configuration = new Configuration();
+
+    MessageType schema =
+        MessageTypeParser.parseMessageType("message m { required group a {required boolean b;}}");
+    String[] columnPath = {"a", "b"};
+    ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+    CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+    BooleanStatistics stats = new BooleanStatistics();
+    stats.updateStats(true);
+    stats.updateStats(false);
+
+    // write the string columned file
+
+    ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+    w.start();
+    w.startBlock(3);
+    w.startColumn(c1, 5, codec);
+    w.writeDataPage(2, 4, BytesInput.fromInt(1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+    w.writeDataPage(3, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+    w.endColumn();
+    w.endBlock();
+    w.startBlock(4);
+    w.startColumn(c1, 8, codec);
+    w.writeDataPage(7, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+    w.endColumn();
+    w.endBlock();
+    w.end(new HashMap());
+
+    // reconstruct the stats for the InternalDataFile testing object
+    boolean minStat = stats.genericGetMin();
+    boolean maxStat = stats.genericGetMax();
+    PrimitiveType primitiveType =
+        new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.BOOLEAN, "b");
+    List<Integer> col1NumValTotSize = new ArrayList<>(Arrays.asList(5, 8)); // start column indexes
+    List<Integer> col2NumValTotSize = new ArrayList<>(Arrays.asList(54, 27));
+    List<ColumnStat> testColumnStats = new ArrayList<>();
+    String[] columnDotPath = {"a.b", "a.b"};
+    for (int i = 0; i < columnDotPath.length; i++) {
+      testColumnStats.add(
+          ColumnStat.builder()
+              .field(
+                  InternalField.builder()
+                      .name(primitiveType.getName())
+  

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-04 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2072781268


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunSync.java:
##
@@ -115,6 +117,126 @@ public class RunSync {
   "The interval in seconds to schedule the loop. Requires --continuousMode to be set. Defaults to 5 seconds.")
   .addOption(HELP_OPTION, "help", false, "Displays help information to run this utility");
 
+  static SourceTable sourceTableBuilder(

Review Comment:
   I synced this branch with the remote main.



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-04 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2072672051


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunSync.java:
##
@@ -115,6 +117,126 @@ public class RunSync {
   "The interval in seconds to schedule the loop. Requires --continuousMode to be set. Defaults to 5 seconds.")
   .addOption(HELP_OPTION, "help", false, "Displays help information to run this utility");
 
+  static SourceTable sourceTableBuilder(

Review Comment:
   @unical1988 if you can update your fork with the latest master, it will
hopefully reduce the diffs shown in the pull request and help speed up the
review. Otherwise, I need to review all of these changes as well before signing
off to merge.



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-04 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2072669763


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2071095411


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2071078576


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2071064145


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2071061180


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.BooleanStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+public class TestParquetStatsExtractor {
+  @Builder.Default
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @TempDir static java.nio.file.Path tempDir = Paths.get("./");
+
+  public static List initBooleanFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+MessageTypeParser.parseMessageType("message m { required group a {required boolean b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+BooleanStatistics stats = new BooleanStatistics();
+stats.updateStats(true);
+stats.updateStats(false);
+
+// write the string columned file
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+w.writeDataPage(2, 4, BytesInput.fromInt(1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.startBlock(4);
+w.startColumn(c1, 8, codec);
+w.writeDataPage(7, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.end(new HashMap());
+
+// reconstruct the stats for the InternalDataFile testing object
+boolean minStat = stats.genericGetMin();
+boolean maxStat = stats.genericGetMax();
+PrimitiveType primitiveType =
+new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.BOOLEAN, "b");
+List col1NumValTotSize = new ArrayList<>(Arrays.asList(5, 8)); // start column indexes
+List col2NumValTotSize = new ArrayList<>(Arrays.asList(54, 27));
+List testColumnStats = new ArrayList<>();
+String[] columnDotPath = {"a.b", "a.b"};
+for (int i = 0; i < columnDotPath.length; i++) {
+  testColumnStats.add(
+  ColumnStat.builder()
+  .field(
+  InternalField.builder()
+  .name(primitiveType.getNa

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2071058398


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.BooleanStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+public class TestParquetStatsExtractor {
+  @Builder.Default
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @TempDir static java.nio.file.Path tempDir = Paths.get("./");
+
+  public static List initBooleanFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+MessageTypeParser.parseMessageType("message m { required group a {required boolean b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+BooleanStatistics stats = new BooleanStatistics();
+stats.updateStats(true);
+stats.updateStats(false);
+
+// write the string columned file
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+w.writeDataPage(2, 4, BytesInput.fromInt(1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.startBlock(4);
+w.startColumn(c1, 8, codec);
+w.writeDataPage(7, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.end(new HashMap());
+
+// reconstruct the stats for the InternalDataFile testing object
+boolean minStat = stats.genericGetMin();
+boolean maxStat = stats.genericGetMax();
+PrimitiveType primitiveType =
+new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.BOOLEAN, "b");
+List col1NumValTotSize = new ArrayList<>(Arrays.asList(5, 8)); // start column indexes
+List col2NumValTotSize = new ArrayList<>(Arrays.asList(54, 27));
+List testColumnStats = new ArrayList<>();
+String[] columnDotPath = {"a.b", "a.b"};
+for (int i = 0; i < columnDotPath.length; i++) {
+  testColumnStats.add(
+  ColumnStat.builder()
+  .field(
+  InternalField.builder()
+  .name(primitiveType.getName())
+  

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2071018989


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.BooleanStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+public class TestParquetStatsExtractor {
+  @Builder.Default
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @TempDir static java.nio.file.Path tempDir = Paths.get("./");
+
+  public static List initBooleanFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+MessageTypeParser.parseMessageType("message m { required group a {required boolean b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+BooleanStatistics stats = new BooleanStatistics();
+stats.updateStats(true);
+stats.updateStats(false);
+
+// write the string columned file
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+w.writeDataPage(2, 4, BytesInput.fromInt(1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.startBlock(4);
+w.startColumn(c1, 8, codec);
+w.writeDataPage(7, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.end(new HashMap());
+
+// reconstruct the stats for the InternalDataFile testing object
+boolean minStat = stats.genericGetMin();
+boolean maxStat = stats.genericGetMax();
+PrimitiveType primitiveType =
+new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.BOOLEAN, "b");
+List col1NumValTotSize = new ArrayList<>(Arrays.asList(5, 8)); // start column indexes
+List col2NumValTotSize = new ArrayList<>(Arrays.asList(54, 27));
+List testColumnStats = new ArrayList<>();
+String[] columnDotPath = {"a.b", "a.b"};
+for (int i = 0; i < columnDotPath.length; i++) {
+  testColumnStats.add(
+  ColumnStat.builder()
+  .field(
+  InternalField.builder()
+  .name(primitiveType.getName())
+  

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2071007602


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -131,11 +132,11 @@ public static void main(String[] args) throws Exception {
 
Paths.get(cmd.getOptionValue(CATALOG_SOURCE_AND_TARGET_CONFIG_PATH {
   datasetConfig = YAML_MAPPER.readValue(inputStream, DatasetConfig.class);
 }
-
-byte[] customConfig = getCustomConfigurations(cmd, HADOOP_CONFIG_PATH);
+String hadoopConfigpath = getValueFromConfig(cmd, HADOOP_CONFIG_PATH);
+byte[] customConfig = getCustomConfigurations(hadoopConfigpath);
 Configuration hadoopConf = loadHadoopConf(customConfig);
-
-customConfig = getCustomConfigurations(cmd, CONVERTERS_CONFIG_PATH);
+String conversionProviderConfigpath = getValueFromConfig(cmd, CONVERTERS_CONFIG_PATH);
+customConfig = getCustomConfigurations(conversionProviderConfigpath);

Review Comment:
   I saw the build error on the file names, which has been fixed; all tests should pass now.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070998177


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -131,11 +132,11 @@ public static void main(String[] args) throws Exception {
 
Paths.get(cmd.getOptionValue(CATALOG_SOURCE_AND_TARGET_CONFIG_PATH {
   datasetConfig = YAML_MAPPER.readValue(inputStream, DatasetConfig.class);
 }
-
-byte[] customConfig = getCustomConfigurations(cmd, HADOOP_CONFIG_PATH);
+String hadoopConfigpath = getValueFromConfig(cmd, HADOOP_CONFIG_PATH);
+byte[] customConfig = getCustomConfigurations(hadoopConfigpath);
 Configuration hadoopConf = loadHadoopConf(customConfig);
-
-customConfig = getCustomConfigurations(cmd, CONVERTERS_CONFIG_PATH);
+String conversionProviderConfigpath = getValueFromConfig(cmd, CONVERTERS_CONFIG_PATH);
+customConfig = getCustomConfigurations(conversionProviderConfigpath);

Review Comment:
   done






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070994895


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -131,11 +132,11 @@ public static void main(String[] args) throws Exception {
 
Paths.get(cmd.getOptionValue(CATALOG_SOURCE_AND_TARGET_CONFIG_PATH {
   datasetConfig = YAML_MAPPER.readValue(inputStream, DatasetConfig.class);
 }
-
-byte[] customConfig = getCustomConfigurations(cmd, HADOOP_CONFIG_PATH);
+String hadoopConfigpath = getValueFromConfig(cmd, HADOOP_CONFIG_PATH);
+byte[] customConfig = getCustomConfigurations(hadoopConfigpath);
 Configuration hadoopConf = loadHadoopConf(customConfig);
-
-customConfig = getCustomConfigurations(cmd, CONVERTERS_CONFIG_PATH);
+String conversionProviderConfigpath = getValueFromConfig(cmd, CONVERTERS_CONFIG_PATH);
+customConfig = getCustomConfigurations(conversionProviderConfigpath);

Review Comment:
   Can you try syncing your fork's master branch to see if that cleans up the 
diff?






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070994289


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,451 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.BooleanStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+public class TestParquetStatsExtractor {
+  @Builder.Default
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @TempDir static java.nio.file.Path tempDir = Paths.get("./");
+
+  public static List initBooleanFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+MessageTypeParser.parseMessageType("message m { required group a {required boolean b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+BooleanStatistics stats = new BooleanStatistics();
+stats.updateStats(true);
+stats.updateStats(false);
+
+// write the string columned file
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+w.writeDataPage(2, 4, BytesInput.fromInt(1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.startBlock(4);
+w.startColumn(c1, 8, codec);
+w.writeDataPage(7, 4, BytesInput.fromInt(0), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.end(new HashMap());
+
+// reconstruct the stats for the InternalDataFile testing object
+boolean minStat = stats.genericGetMin();
+boolean maxStat = stats.genericGetMax();
+PrimitiveType primitiveType =
+new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.BOOLEAN, "b");
+List col1NumValTotSize = new ArrayList<>(Arrays.asList(5, 8)); // start column indexes
+List col2NumValTotSize = new ArrayList<>(Arrays.asList(54, 27));
+List testColumnStats = new ArrayList<>();
+String[] columnDotPath = {"a.b", "a.b"};
+for (int i = 0; i < columnDotPath.length; i++) {
+  testColumnStats.add(
+  ColumnStat.builder()
+  .field(
+  InternalField.builder()
+  .name(primitiveType.getNa

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070992394


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -131,11 +132,11 @@ public static void main(String[] args) throws Exception {
 
Paths.get(cmd.getOptionValue(CATALOG_SOURCE_AND_TARGET_CONFIG_PATH {
   datasetConfig = YAML_MAPPER.readValue(inputStream, DatasetConfig.class);
 }
-
-byte[] customConfig = getCustomConfigurations(cmd, HADOOP_CONFIG_PATH);
+String hadoopConfigpath = getValueFromConfig(cmd, HADOOP_CONFIG_PATH);
+byte[] customConfig = getCustomConfigurations(hadoopConfigpath);
 Configuration hadoopConf = loadHadoopConf(customConfig);
-
-customConfig = getCustomConfigurations(cmd, CONVERTERS_CONFIG_PATH);
+String conversionProviderConfigpath = getValueFromConfig(cmd, CONVERTERS_CONFIG_PATH);
+customConfig = getCustomConfigurations(conversionProviderConfigpath);

Review Comment:
   OK, all fixes have been applied; you can run the build again and see if everything is fine.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070991423


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -131,11 +132,11 @@ public static void main(String[] args) throws Exception {
 
Paths.get(cmd.getOptionValue(CATALOG_SOURCE_AND_TARGET_CONFIG_PATH {
   datasetConfig = YAML_MAPPER.readValue(inputStream, DatasetConfig.class);
 }
-
-byte[] customConfig = getCustomConfigurations(cmd, HADOOP_CONFIG_PATH);
+String hadoopConfigpath = getValueFromConfig(cmd, HADOOP_CONFIG_PATH);
+byte[] customConfig = getCustomConfigurations(hadoopConfigpath);
 Configuration hadoopConf = loadHadoopConf(customConfig);
-
-customConfig = getCustomConfigurations(cmd, CONVERTERS_CONFIG_PATH);
+String conversionProviderConfigpath = getValueFromConfig(cmd, CONVERTERS_CONFIG_PATH);
+customConfig = getCustomConfigurations(conversionProviderConfigpath);

Review Comment:
   No worries then, it is showing up as a diff in the UI.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070987126


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -131,11 +132,11 @@ public static void main(String[] args) throws Exception {
 
Paths.get(cmd.getOptionValue(CATALOG_SOURCE_AND_TARGET_CONFIG_PATH {
   datasetConfig = YAML_MAPPER.readValue(inputStream, DatasetConfig.class);
 }
-
-byte[] customConfig = getCustomConfigurations(cmd, HADOOP_CONFIG_PATH);
+String hadoopConfigpath = getValueFromConfig(cmd, HADOOP_CONFIG_PATH);
+byte[] customConfig = getCustomConfigurations(hadoopConfigpath);
 Configuration hadoopConf = loadHadoopConf(customConfig);
-
-customConfig = getCustomConfigurations(cmd, CONVERTERS_CONFIG_PATH);
+String conversionProviderConfigpath = getValueFromConfig(cmd, CONVERTERS_CONFIG_PATH);
+customConfig = getCustomConfigurations(conversionProviderConfigpath);

Review Comment:
   These changes are in the latest version of the project; why should they be removed?






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070980016


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -131,11 +132,11 @@ public static void main(String[] args) throws Exception {
 
Paths.get(cmd.getOptionValue(CATALOG_SOURCE_AND_TARGET_CONFIG_PATH {
   datasetConfig = YAML_MAPPER.readValue(inputStream, DatasetConfig.class);
 }
-
-byte[] customConfig = getCustomConfigurations(cmd, HADOOP_CONFIG_PATH);
+String hadoopConfigpath = getValueFromConfig(cmd, HADOOP_CONFIG_PATH);
+byte[] customConfig = getCustomConfigurations(hadoopConfigpath);
 Configuration hadoopConf = loadHadoopConf(customConfig);
-
-customConfig = getCustomConfigurations(cmd, CONVERTERS_CONFIG_PATH);
+String conversionProviderConfigpath = getValueFromConfig(cmd, CONVERTERS_CONFIG_PATH);
+customConfig = getCustomConfigurations(conversionProviderConfigpath);

Review Comment:
   Where do these changes come from? Can you point out the original, correct version of this class?






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070975946


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetSchemaExtractor.java:
##
@@ -0,0 +1,349 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Types;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+
+public class TestParquetSchemaExtractor {
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @Test
+  public void testPrimitiveTypes() {
+
+InternalSchema primitive1 =
+InternalSchema.builder().name("integer").dataType(InternalType.INT).build();
+InternalSchema primitive2 =
+InternalSchema.builder().name("string").dataType(InternalType.STRING).build();
+
+Map fixedDecimalMetadata = new HashMap<>();
+fixedDecimalMetadata.put(InternalSchema.MetadataKey.DECIMAL_PRECISION, 6);
+fixedDecimalMetadata.put(InternalSchema.MetadataKey.DECIMAL_SCALE, 5);
+InternalSchema decimalType =
+InternalSchema.builder()
+.name("decimal")
+.dataType(InternalType.DECIMAL)
+.isNullable(false)
+.metadata(fixedDecimalMetadata)
+.build();
+
+Type stringPrimitiveType =
+Types.required(PrimitiveTypeName.BINARY)
+.as(LogicalTypeAnnotation.stringType())
+.named("string");
+
+Type intPrimitiveType =
+Types.required(PrimitiveTypeName.INT32)
+.as(LogicalTypeAnnotation.intType(32, false))
+.named("integer");
+
+Type decimalPrimitive =
+Types.required(PrimitiveTypeName.INT32)
+.as(LogicalTypeAnnotation.decimalType(5, 6))
+.named("decimal");
+
+Assertions.assertEquals(primitive1, schemaExtractor.toInternalSchema(intPrimitiveType, null));
+
+Assertions.assertEquals(
+primitive2, schemaExtractor.toInternalSchema(stringPrimitiveType, null));
+
+Assertions.assertEquals(decimalType, schemaExtractor.toInternalSchema(decimalPrimitive, null));
+
+// tests for timestamp and date
+InternalSchema testDate =
+InternalSchema.builder().name("date").dataType(InternalType.DATE).isNullable(false).build();
+
+Map millisMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.MILLIS);
+Map microsMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.MICROS);
+Map nanosMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.NANOS);
+
+InternalSchema testTimestampMillis =
+InternalSchema.builder()
+.name("timestamp_millis")
+.dataType(InternalType.TIMESTAMP_NTZ)
+.isNullable(false)
+.metadata(millisMetadata)
+.build();
+
+InternalSchema testTimestampMicros =
+InternalSchema.builder()
+.name("timestamp_micros")
+.dataType(InternalType.TIMESTAMP)
+.isNullable(false)
+.metadata(microsMetadata)
+.build();
+
+InternalSchema testTimestampNanos =
+InternalSchema.builder()
+.name("timestamp_nanos")
+.dataType(InternalType.TIMESTAMP_NTZ)
+.isNullable(false)
+.metadata(nanos

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070975619


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -131,11 +132,11 @@ public static void main(String[] args) throws Exception {
 
Paths.get(cmd.getOptionValue(CATALOG_SOURCE_AND_TARGET_CONFIG_PATH {
   datasetConfig = YAML_MAPPER.readValue(inputStream, DatasetConfig.class);
 }
-
-byte[] customConfig = getCustomConfigurations(cmd, HADOOP_CONFIG_PATH);
+String hadoopConfigpath = getValueFromConfig(cmd, HADOOP_CONFIG_PATH);
+byte[] customConfig = getCustomConfigurations(hadoopConfigpath);
 Configuration hadoopConf = loadHadoopConf(customConfig);
-
-customConfig = getCustomConfigurations(cmd, CONVERTERS_CONFIG_PATH);
+String conversionProviderConfigpath = getValueFromConfig(cmd, CONVERTERS_CONFIG_PATH);
+customConfig = getCustomConfigurations(conversionProviderConfigpath);

Review Comment:
   Can you remove these changes?






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-05-01 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2070975349


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetSchemaExtractor.java:
##
@@ -0,0 +1,349 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Types;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+
+public class TestParquetSchemaExtractor {
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @Test
+  public void testPrimitiveTypes() {
+
+InternalSchema primitive1 =
+InternalSchema.builder().name("integer").dataType(InternalType.INT).build();
+InternalSchema primitive2 =
+InternalSchema.builder().name("string").dataType(InternalType.STRING).build();
+
+Map<InternalSchema.MetadataKey, Object> fixedDecimalMetadata = new HashMap<>();
+fixedDecimalMetadata.put(InternalSchema.MetadataKey.DECIMAL_PRECISION, 6);
+fixedDecimalMetadata.put(InternalSchema.MetadataKey.DECIMAL_SCALE, 5);
+InternalSchema decimalType =
+InternalSchema.builder()
+.name("decimal")
+.dataType(InternalType.DECIMAL)
+.isNullable(false)
+.metadata(fixedDecimalMetadata)
+.build();
+
+Type stringPrimitiveType =
+Types.required(PrimitiveTypeName.BINARY)
+.as(LogicalTypeAnnotation.stringType())
+.named("string");
+
+Type intPrimitiveType =
+Types.required(PrimitiveTypeName.INT32)
+.as(LogicalTypeAnnotation.intType(32, false))
+.named("integer");
+
+Type decimalPrimitive =
+Types.required(PrimitiveTypeName.INT32)
+.as(LogicalTypeAnnotation.decimalType(5, 6))
+.named("decimal");
+
+Assertions.assertEquals(primitive1, schemaExtractor.toInternalSchema(intPrimitiveType, null));
+
+Assertions.assertEquals(
+primitive2, schemaExtractor.toInternalSchema(stringPrimitiveType, null));
+
+Assertions.assertEquals(decimalType, schemaExtractor.toInternalSchema(decimalPrimitive, null));
+
+// tests for timestamp and date
+InternalSchema testDate =
+InternalSchema.builder().name("date").dataType(InternalType.DATE).isNullable(false).build();
+
+Map<InternalSchema.MetadataKey, Object> millisMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.MILLIS);
+Map<InternalSchema.MetadataKey, Object> microsMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.MICROS);
+Map<InternalSchema.MetadataKey, Object> nanosMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.NANOS);
+
+InternalSchema testTimestampMillis =
+InternalSchema.builder()
+.name("timestamp_millis")
+.dataType(InternalType.TIMESTAMP_NTZ)
+.isNullable(false)
+.metadata(millisMetadata)
+.build();
+
+InternalSchema testTimestampMicros =
+InternalSchema.builder()
+.name("timestamp_micros")
+.dataType(InternalType.TIMESTAMP)
+.isNullable(false)
+.metadata(microsMetadata)
+.build();
+
+InternalSchema testTimestampNanos =
+InternalSchema.builder()
+.name("timestamp_nanos")
+.dataType(InternalType.TIMESTAMP_NTZ)
+.isNullable(false)
+.metadata(nanos

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-28 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2065243897


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetSchemaExtractor.java:
##
@@ -0,0 +1,349 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Types;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+
+public class TestParquetSchemaExtractor {
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @Test
+  public void testPrimitiveTypes() {
+
+InternalSchema primitive1 =
+InternalSchema.builder().name("integer").dataType(InternalType.INT).build();
+InternalSchema primitive2 =
+InternalSchema.builder().name("string").dataType(InternalType.STRING).build();
+
+Map<InternalSchema.MetadataKey, Object> fixedDecimalMetadata = new HashMap<>();
+fixedDecimalMetadata.put(InternalSchema.MetadataKey.DECIMAL_PRECISION, 6);
+fixedDecimalMetadata.put(InternalSchema.MetadataKey.DECIMAL_SCALE, 5);
+InternalSchema decimalType =
+InternalSchema.builder()
+.name("decimal")
+.dataType(InternalType.DECIMAL)
+.isNullable(false)
+.metadata(fixedDecimalMetadata)
+.build();
+
+Type stringPrimitiveType =
+Types.required(PrimitiveTypeName.BINARY)
+.as(LogicalTypeAnnotation.stringType())
+.named("string");
+
+Type intPrimitiveType =
+Types.required(PrimitiveTypeName.INT32)
+.as(LogicalTypeAnnotation.intType(32, false))
+.named("integer");
+
+Type decimalPrimitive =
+Types.required(PrimitiveTypeName.INT32)
+.as(LogicalTypeAnnotation.decimalType(5, 6))
+.named("decimal");
+
+Assertions.assertEquals(primitive1, schemaExtractor.toInternalSchema(intPrimitiveType, null));
+
+Assertions.assertEquals(
+primitive2, schemaExtractor.toInternalSchema(stringPrimitiveType, null));
+
+Assertions.assertEquals(decimalType, schemaExtractor.toInternalSchema(decimalPrimitive, null));
+
+// tests for timestamp and date
+InternalSchema testDate =
+InternalSchema.builder().name("date").dataType(InternalType.DATE).isNullable(false).build();
+
+Map<InternalSchema.MetadataKey, Object> millisMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.MILLIS);
+Map<InternalSchema.MetadataKey, Object> microsMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.MICROS);
+Map<InternalSchema.MetadataKey, Object> nanosMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.NANOS);
+
+InternalSchema testTimestampMillis =
+InternalSchema.builder()
+.name("timestamp_millis")
+.dataType(InternalType.TIMESTAMP_NTZ)
+.isNullable(false)
+.metadata(millisMetadata)
+.build();
+
+InternalSchema testTimestampMicros =
+InternalSchema.builder()
+.name("timestamp_micros")
+.dataType(InternalType.TIMESTAMP)
+.isNullable(false)
+.metadata(microsMetadata)
+.build();
+
+InternalSchema testTimestampNanos =
+InternalSchema.builder()
+.name("timestamp_nanos")
+.dataType(InternalType.TIMESTAMP_NTZ)
+.isNullable(false)
+.metadata(nanosMetadata)

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-28 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2065243782


##
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java:
##
@@ -154,9 +153,7 @@ public static void main(String[] args) throws Exception {
 TargetTable targetTable =
 TargetTable.builder()
 .name(sourceTable.getName())
-.basePath(

Review Comment:
   done.



##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetSchemaExtractor.java:
##
@@ -0,0 +1,349 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Types;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+
+public class TestParquetSchemaExtractor {
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+
+  @Test
+  public void testPrimitiveTypes() {
+
+InternalSchema primitive1 =
+InternalSchema.builder().name("integer").dataType(InternalType.INT).build();
+InternalSchema primitive2 =
+InternalSchema.builder().name("string").dataType(InternalType.STRING).build();
+
+Map<InternalSchema.MetadataKey, Object> fixedDecimalMetadata = new HashMap<>();
+fixedDecimalMetadata.put(InternalSchema.MetadataKey.DECIMAL_PRECISION, 6);
+fixedDecimalMetadata.put(InternalSchema.MetadataKey.DECIMAL_SCALE, 5);
+InternalSchema decimalType =
+InternalSchema.builder()
+.name("decimal")
+.dataType(InternalType.DECIMAL)
+.isNullable(false)
+.metadata(fixedDecimalMetadata)
+.build();
+
+Type stringPrimitiveType =
+Types.required(PrimitiveTypeName.BINARY)
+.as(LogicalTypeAnnotation.stringType())
+.named("string");
+
+Type intPrimitiveType =
+Types.required(PrimitiveTypeName.INT32)
+.as(LogicalTypeAnnotation.intType(32, false))
+.named("integer");
+
+Type decimalPrimitive =
+Types.required(PrimitiveTypeName.INT32)
+.as(LogicalTypeAnnotation.decimalType(5, 6))
+.named("decimal");
+
+Assertions.assertEquals(primitive1, schemaExtractor.toInternalSchema(intPrimitiveType, null));
+
+Assertions.assertEquals(
+primitive2, schemaExtractor.toInternalSchema(stringPrimitiveType, null));
+
+Assertions.assertEquals(decimalType, schemaExtractor.toInternalSchema(decimalPrimitive, null));
+
+// tests for timestamp and date
+InternalSchema testDate =
+InternalSchema.builder().name("date").dataType(InternalType.DATE).isNullable(false).build();
+
+Map<InternalSchema.MetadataKey, Object> millisMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.MILLIS);
+Map<InternalSchema.MetadataKey, Object> microsMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.MICROS);
+Map<InternalSchema.MetadataKey, Object> nanosMetadata =
+Collections.singletonMap(
+InternalSchema.MetadataKey.TIMESTAMP_PRECISION, InternalSchema.MetadataValue.NANOS);
+
+InternalSchema testTimestampMillis =
+InternalSchema.builder()
+.name("timestamp_millis")
+.dataType(InternalType.TIMESTAMP_NTZ)
+.isNullable(false)
+.metadata(millisMetadata)
+.build();
+
+InternalSchema testTimestampMicros =
+InternalSchema.builder()
+.name("timestamp_micros")
+.dataType(

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-28 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2065243993


##
xtable-core/pom.xml:
##
@@ -57,6 +57,18 @@
 
 
 
+
+
+org.apache.parquet
+parquet-avro
+
+
+
+

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-28 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2065223441


##
xtable-core/pom.xml:
##
@@ -57,6 +57,18 @@
 
 
 
+
+
+org.apache.parquet
+parquet-avro
+
+
+
+

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-28 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2065220190


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetPartitionValueExtractor.java:
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import java.time.Instant;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import lombok.AccessLevel;
+import lombok.NoArgsConstructor;
+
+import org.apache.xtable.model.config.InputPartitionField;
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalPartitionField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+
+/** Partition value extractor for Parquet. */
+@NoArgsConstructor(access = AccessLevel.PRIVATE)
+public class ParquetPartitionValueExtractor {

Review Comment:
   I think it is fixed now.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-28 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2065210056


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetPartitionValueExtractor.java:
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import java.time.Instant;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import lombok.AccessLevel;
+import lombok.NoArgsConstructor;
+
+import org.apache.xtable.model.config.InputPartitionField;
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalPartitionField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+
+/** Partition value extractor for Parquet. */
+@NoArgsConstructor(access = AccessLevel.PRIVATE)
+public class ParquetPartitionValueExtractor {

Review Comment:
   It looks like some other files were also removed unintentionally.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-28 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2065207675


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetPartitionValueExtractor.java:
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import java.time.Instant;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import lombok.AccessLevel;
+import lombok.NoArgsConstructor;
+
+import org.apache.xtable.model.config.InputPartitionField;
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalPartitionField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+
+/** Partition value extractor for Parquet. */
+@NoArgsConstructor(access = AccessLevel.PRIVATE)
+public class ParquetPartitionValueExtractor {

Review Comment:
   How about now, were they removed?






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-28 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2065189159


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetPartitionValueExtractor.java:
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import java.time.Instant;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import lombok.AccessLevel;
+import lombok.NoArgsConstructor;
+
+import org.apache.xtable.model.config.InputPartitionField;
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalPartitionField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+
+/** Partition value extractor for Parquet. */
+@NoArgsConstructor(access = AccessLevel.PRIVATE)
+public class ParquetPartitionValueExtractor {

Review Comment:
   Yes that is fine






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-24 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2059393035


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetPartitionValueExtractor.java:
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import java.time.Instant;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import lombok.AccessLevel;
+import lombok.NoArgsConstructor;
+
+import org.apache.xtable.model.config.InputPartitionField;
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalPartitionField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+
+/** Partition value extractor for Parquet. */
+@NoArgsConstructor(access = AccessLevel.PRIVATE)
+public class ParquetPartitionValueExtractor {

Review Comment:
   Can I just remove the class and push the changes?






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-24 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2059380914


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetPartitionValueExtractor.java:
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+
+import java.time.Instant;
+import java.time.OffsetDateTime;
+import java.time.ZoneOffset;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import lombok.AccessLevel;
+import lombok.NoArgsConstructor;
+
+import org.apache.xtable.model.config.InputPartitionField;
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalPartitionField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+
+/** Partition value extractor for Parquet. */
+@NoArgsConstructor(access = AccessLevel.PRIVATE)
+public class ParquetPartitionValueExtractor {

Review Comment:
   Let's pull this into a separate branch as discussed in the developer sync. 






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-24 Thread via GitHub


unical1988 commented on PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#issuecomment-2829055243

   @the-other-tim-brown all of your review comments have been addressed and the
updates have been pushed.





Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-24 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2059359561


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+    return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   The test passes even with Lombok's generated `equals()`.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-24 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2059359308


##
xtable-api/src/main/java/org/apache/xtable/model/stat/ColumnStat.java:
##
@@ -31,9 +33,20 @@
 @Value
 @Builder(toBuilder = true)
 public class ColumnStat {
-  InternalField field;
-  Range range;
-  long numNulls;
-  long numValues;
-  long totalSize;
+InternalField field;
+Range range;
+long numNulls;
+long numValues;
+long totalSize;
+
+public boolean equals(ColumnStat colStat) {

Review Comment:
   It seems you are right about `equals()`: the test still passes even after 
removing the custom `equals()`.
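
   One likely reason the custom method made no difference: a method declared as `equals(ColumnStat)` overloads `Object.equals(Object)` instead of overriding it, so list comparisons and JUnit's `assertEquals(Object, Object)` never call it. The sketch below is not the PR's code. `OverloadedStat` and `OverriddenStat` are hypothetical stand-ins for `ColumnStat`, reduced to a single field, and only illustrate the overload/override distinction.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: these classes are hypothetical stand-ins, not the xtable ColumnStat.
final class EqualsOverloadDemo {

  /** Declares equals(OverloadedStat), which overloads rather than overrides Object.equals. */
  static final class OverloadedStat {
    private final long numValues;

    OverloadedStat(long numValues) {
      this.numValues = numValues;
    }

    // Chosen only when the argument's compile-time type is OverloadedStat.
    public boolean equals(OverloadedStat other) {
      return other != null && numValues == other.numValues;
    }
  }

  /** Overrides equals(Object) and hashCode(), i.e. ordinary value equality. */
  static final class OverriddenStat {
    private final long numValues;

    OverriddenStat(long numValues) {
      this.numValues = numValues;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof OverriddenStat
          && numValues == ((OverriddenStat) other).numValues;
    }

    @Override
    public int hashCode() {
      return Long.hashCode(numValues);
    }
  }

  // List.equals compares elements through equals(Object), so the overload is bypassed
  // and the inherited identity comparison is used instead.
  static boolean overloadedListsEqual() {
    List<OverloadedStat> left = Arrays.asList(new OverloadedStat(5));
    List<OverloadedStat> right = Arrays.asList(new OverloadedStat(5));
    return left.equals(right);
  }

  // Same comparison, but the elements override equals(Object), so values are compared.
  static boolean overriddenListsEqual() {
    List<OverriddenStat> left = Arrays.asList(new OverriddenStat(5));
    List<OverriddenStat> right = Arrays.asList(new OverriddenStat(5));
    return left.equals(right);
  }

  public static void main(String[] args) {
    System.out.println("overloaded: " + overloadedListsEqual());
    System.out.println("overridden: " + overriddenListsEqual());
  }
}
```

   Under these assumptions the overloaded variant compares by identity inside collections, which is why relying on an `equals(Object)` override (hand-written or generated) is the safer choice for test assertions on lists of stats.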






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-24 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2059311381


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,327 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+import org.apache.parquet.column.statistics.BooleanStatistics;
+import org.apache.hadoop.conf.Configuration;
+
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+  @Builder.Default
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+  @TempDir
+  static java.nio.file.Path tempDir = Paths.get("./");
+
+
+
+  public static List initBooleanFileTest(File file) throws 
IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+MessageTypeParser.parseMessageType("message m { required group a 
{required boolean b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+BooleanStatistics stats = new BooleanStatistics();
+stats.updateStats(true);
+stats.updateStats(false);
+
+
+
+/*
+byte byteTrue = (byte)(true?1:0);
+byte byteFalse = (byte)(false?1:0);
+*/
+
+
+// write the string columned file
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+//w.startColumn(c1, 2, codec);
+w.writeDataPage(2, 4, BytesInput.fromInt(1), stats, BIT_PACKED, 
BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.fromInt(0), stats, BIT_PACKED, 
BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.startBlock(4);
+w.startColumn(c1, 8, codec);
+//w.startColumn(c1, 1, codec);
+w.writeDataPage(7, 4, BytesInput.fromInt(0), stats, BIT_PACKED, 
BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.end(new HashMap());
+
+// reconstruct the stats for the InternalDataFile testing object
+//byte[] minStat = stats.getMinBytes();
+boolean minStat = stats.genericGetMin();
+//byte[] maxStat = stats.getMaxBytes();
+boolean maxStat = stats.genericGetMax();
+PrimitiveType primitiveType =
+//   new PrimitiveType(Repetition.REQUIRED, 
PrimitiveTypeName.BINARY, "b");
+new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.BOOLEAN, 
"b");
+List col1NumValTotSize = new ArrayList<>(Arrays.asLis

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-23 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2057368359


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+    return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   I'm not sure what the problem is with the Lombok `equals()`, but I'm sure 
there is one.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-23 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2057284571


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,327 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ 
+package org.apache.xtable.parquet;
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+import org.apache.parquet.column.statistics.BooleanStatistics;
+import org.apache.hadoop.conf.Configuration;
+
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+  @Builder.Default
+  private static final ParquetSchemaExtractor schemaExtractor =
+  ParquetSchemaExtractor.getInstance();
+  @TempDir
+  static java.nio.file.Path tempDir = Paths.get("./");
+
+
+
+  public static List initBooleanFileTest(File file) throws 
IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+MessageTypeParser.parseMessageType("message m { required group a 
{required boolean b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+BooleanStatistics stats = new BooleanStatistics();
+stats.updateStats(true);
+stats.updateStats(false);
+
+
+
+/*
+byte byteTrue = (byte)(true?1:0);
+byte byteFalse = (byte)(false?1:0);
+*/
+
+
+// write the string columned file
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+//w.startColumn(c1, 2, codec);
+w.writeDataPage(2, 4, BytesInput.fromInt(1), stats, BIT_PACKED, 
BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.fromInt(0), stats, BIT_PACKED, 
BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.startBlock(4);
+w.startColumn(c1, 8, codec);
+//w.startColumn(c1, 1, codec);
+w.writeDataPage(7, 4, BytesInput.fromInt(0), stats, BIT_PACKED, 
BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.end(new HashMap());
+
+// reconstruct the stats for the InternalDataFile testing object
+//byte[] minStat = stats.getMinBytes();
+boolean minStat = stats.genericGetMin();
+//byte[] maxStat = stats.getMaxBytes();
+boolean maxStat = stats.genericGetMax();
+PrimitiveType primitiveType =
+//   new PrimitiveType(Repetition.REQUIRED, 
PrimitiveTypeName.BINARY, "b");
+new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.BOOLEAN, 
"b");
+List col1NumValTotSize = new ArrayList<>(Arr

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-22 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2055081712


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required 
group a {required binary b;}}");

Review Comment:
   A test for a file with a boolean-typed schema has been added.
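
   To make the expected range for the boolean case concrete: the boolean test in this PR feeds `true` and `false` into the statistics object, so if booleans are ordered the way `Boolean.compare` orders them (false before true), the range should come out as `[false, true]`. The sketch below is only an illustration under that assumption. `BooleanMinMaxDemo` is a hypothetical class and does not reproduce Parquet's `BooleanStatistics`.

```java
// Sketch only: a hypothetical tracker, not Parquet's BooleanStatistics.
final class BooleanMinMaxDemo {
  private boolean seenValue;
  private boolean min;
  private boolean max;
  private long numValues;

  /** Folds one value into the running min/max, ordering false before true. */
  void update(boolean value) {
    if (!seenValue) {
      min = value;
      max = value;
      seenValue = true;
    } else {
      if (Boolean.compare(value, min) < 0) {
        min = value;
      }
      if (Boolean.compare(value, max) > 0) {
        max = value;
      }
    }
    numValues++;
  }

  boolean getMin() {
    return min;
  }

  boolean getMax() {
    return max;
  }

  long getNumValues() {
    return numValues;
  }

  /** Convenience factory that folds all given values into a fresh tracker. */
  static BooleanMinMaxDemo of(boolean... values) {
    BooleanMinMaxDemo stats = new BooleanMinMaxDemo();
    for (boolean value : values) {
      stats.update(value);
    }
    return stats;
  }

  public static void main(String[] args) {
    BooleanMinMaxDemo stats = of(true, false);
    System.out.println(stats.getMin() + " " + stats.getMax() + " " + stats.getNumValues());
  }
}
```

   A test for the boolean column could then assert on the min, the max, and the value count separately, which keeps a failure message specific to the stat that is wrong.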






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-22 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2055081712


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required 
group a {required binary b;}}");

Review Comment:
   I will add a file with a boolean-typed schema to round out the stats tests.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-22 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2055082721


##
xtable-api/src/main/java/org/apache/xtable/model/stat/ColumnStat.java:
##
@@ -31,9 +33,20 @@
 @Value
 @Builder(toBuilder = true)
 public class ColumnStat {
-  InternalField field;
-  Range range;
-  long numNulls;
-  long numValues;
-  long totalSize;
+InternalField field;
+Range range;
+long numNulls;
+long numValues;
+long totalSize;
+
+public boolean equals(ColumnStat colStat) {

Review Comment:
   What's the command to regenerate the `equals` method? I will try to use 
that instead.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-22 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2055080740


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new 
ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List getColumnStatsForaFile(ParquetMetadata 
footer) {
+return getStatsForFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional getMaxFromColumnStats(List 
columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+public static Map> 
getStatsForFile(ParquetMetadata footer) {
+Map> columnDescStats = new 
HashMap<>();
+MessageType schema = parquetMetadataExtractor.getSchema(footer);
+List columns = new ArrayList<>();
+columns =
+footer.getBlocks().stream()
+.flatMap(blockMetaData -> 
blockMetaData.getColumns().stream())
+.collect(Collectors.toList());
+columnDescStats =
+columns.stream()
+.collect(
+Collectors.groupingBy(
+columnMetaData ->
+
schema.getColumnDescription(columnMetaData.getPath().toArray()),
+Collectors.mapping(
+columnMetaData ->
+ColumnStat.builder()
+.field(
+
InternalField.builder()
+   
 .name(columnMetaData.getPrimitiveType().getName())
+   
 .fieldId(
+   
 columnMetaData.getPrimitiveType().getId() == null
+   
 ? null
+   
 

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2053087027


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required 
group a {required binary b;}}");

Review Comment:
   We could add such a complex file structure in a subsequent test after the 
merge. For now I am adding separate tests (different methods) for every 
primitive type. I also don't think a complex file structure matters here, since 
we only read the footer stats to compare.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2053087027


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");

Review Comment:
   We could add such a complex file structure in a subsequent test after the 
merge. For now I am adding separate tests (different methods) for every 
primitive type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@xtable.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2053052590


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}

Review Comment:
   That is needed in the conversionSource object.
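
   For context: in the quoted diff `INSTANCE` is initialized to `null`, so 
`getInstance()` would hand `null` to any caller. A minimal sketch of an 
eagerly initialized singleton, assuming the extractor is stateless; 
`StatsExtractorSketch` and `describe` are hypothetical names used only for 
illustration, not the PR's code:

```java
// Hypothetical stand-in for a stateless extractor; not the PR's actual class.
final class StatsExtractorSketch {
  // Eagerly create the single instance so getInstance() never returns null.
  private static final StatsExtractorSketch INSTANCE = new StatsExtractorSketch();

  // Private constructor prevents callers from creating extra instances.
  private StatsExtractorSketch() {}

  static StatsExtractorSketch getInstance() {
    return INSTANCE;
  }

  // Placeholder instance method, only here to show the instance is usable.
  String describe() {
    return "stats-extractor";
  }

  public static void main(String[] args) {
    StatsExtractorSketch a = StatsExtractorSketch.getInstance();
    StatsExtractorSketch b = StatsExtractorSketch.getInstance();
    System.out.println(a == b); // true: the same instance is returned each time
    System.out.println(a.describe());
  }
}
```

   If every method on the class stays `static`, callers may not need the 
instance at all; the sketch only covers the case where one is required.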






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2053048740


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");

Review Comment:
   We can do this in a single test or in separate ones. The main thing is that 
we test all the cases. If it's in a single test, then you can make a file with 
various fields of different types.
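
   A rough sketch of the single-test variant, assuming plain Java collections 
stand in for the Parquet file and its footer statistics. 
`PerTypeRangeSketch`, `range`, and `rangesForAllTypes` are hypothetical names; 
no Parquet API is used here:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: one "file" with several columns of different types.
final class PerTypeRangeSketch {
  // Returns [min, max] for one column's values, playing the role of extracted stats.
  static <T extends Comparable<T>> List<T> range(List<T> values) {
    return Arrays.asList(Collections.min(values), Collections.max(values));
  }

  // One entry per primitive kind, so a single test can cover every type.
  static Map<String, List<?>> rangesForAllTypes() {
    List<Integer> ints = Arrays.asList(7, -3, 42);
    List<Double> doubles = Arrays.asList(1.5, 0.25, 9.0);
    List<String> binaries = Arrays.asList("b", "a", "c");

    Map<String, List<?>> ranges = new LinkedHashMap<>();
    ranges.put("int32", range(ints));
    ranges.put("double", range(doubles));
    ranges.put("binary", range(binaries));
    return ranges;
  }

  public static void main(String[] args) {
    // Print the [min, max] pair computed for each column type.
    rangesForAllTypes().forEach((type, r) -> System.out.println(type + " -> " + r));
  }
}
```

   The separate-methods variant would call `range` once per test method 
instead; either layout exercises the same cases.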






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#issuecomment-2819519782

   @the-other-tim-brown I marked all of your addressed reviews and comments 
here as resolved. Apart from a few other minor comments, the main work still 
to be done is adding the test for the file schema through separate methods for 
the statsExtractor.





Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052968641


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List<ColumnStat> getColumnStatsForaFile(ParquetMetadata footer) {
+return getStatsForFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional<Long> getMaxFromColumnStats(List<ColumnStat> columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+public static Map<ColumnDescriptor, List<ColumnStat>> getStatsForFile(ParquetMetadata footer) {
+Map<ColumnDescriptor, List<ColumnStat>> columnDescStats = new HashMap<>();
+MessageType schema = parquetMetadataExtractor.getSchema(footer);
+List<ColumnChunkMetaData> columns = new ArrayList<>();
+columns =
+footer.getBlocks().stream()
+.flatMap(blockMetaData -> blockMetaData.getColumns().stream())
+.collect(Collectors.toList());
+columnDescStats =
+columns.stream()
+.collect(
+Collectors.groupingBy(
+columnMetaData ->
+schema.getColumnDescription(columnMetaData.getPath().toArray()),
+Collectors.mapping(
+columnMetaData ->
+ColumnStat.builder()
+.field(
+InternalField.builder()
+.name(columnMetaData.getPrimitiveType().getName())
+.fieldId(
+columnMetaData.getPrimitiveType().getId() == null
+? null

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052948382


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List<ColumnStat> getColumnStatsForaFile(ParquetMetadata footer) {
+return getStatsForFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional<Long> getMaxFromColumnStats(List<ColumnStat> columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+public static Map<ColumnDescriptor, List<ColumnStat>> getStatsForFile(ParquetMetadata footer) {
+Map<ColumnDescriptor, List<ColumnStat>> columnDescStats = new HashMap<>();
+MessageType schema = parquetMetadataExtractor.getSchema(footer);
+List<ColumnChunkMetaData> columns = new ArrayList<>();
+columns =
+footer.getBlocks().stream()
+.flatMap(blockMetaData -> blockMetaData.getColumns().stream())
+.collect(Collectors.toList());
+columnDescStats =
+columns.stream()
+.collect(
+Collectors.groupingBy(
+columnMetaData ->
+schema.getColumnDescription(columnMetaData.getPath().toArray()),
+Collectors.mapping(
+columnMetaData ->
+ColumnStat.builder()
+.field(
+InternalField.builder()
+.name(columnMetaData.getPrimitiveType().getName())
+.fieldId(
+columnMetaData.getPrimitiveType().getId() == null
+? null

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052935909


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List<ColumnStat> getColumnStatsForaFile(ParquetMetadata footer) {
+return getStatsForFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional<Long> getMaxFromColumnStats(List<ColumnStat> columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+public static Map<ColumnDescriptor, List<ColumnStat>> getStatsForFile(ParquetMetadata footer) {
+Map<ColumnDescriptor, List<ColumnStat>> columnDescStats = new HashMap<>();
+MessageType schema = parquetMetadataExtractor.getSchema(footer);
+List<ColumnChunkMetaData> columns = new ArrayList<>();
+columns =
+footer.getBlocks().stream()
+.flatMap(blockMetaData -> blockMetaData.getColumns().stream())
+.collect(Collectors.toList());
+columnDescStats =
+columns.stream()
+.collect(
+Collectors.groupingBy(
+columnMetaData ->
+schema.getColumnDescription(columnMetaData.getPath().toArray()),
+Collectors.mapping(
+columnMetaData ->
+ColumnStat.builder()
+.field(
+InternalField.builder()
+.name(columnMetaData.getPrimitiveType().getName())
+.fieldId(
+columnMetaData.getPrimitiveType().getId() == null
+? null

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052932326


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");

Review Comment:
   Are you suggesting that the file schema type tests should be done in 
separate methods? For example, instead of `initFileTest()` we would have 
`initIntFileTest`, `initBinaryFileTest`, and `initFloatFileTest`, where the 
respective file types are written separately?






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052924243


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List<ColumnStat> getColumnStatsForaFile(ParquetMetadata footer) {
+return getStatsForFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional<Long> getMaxFromColumnStats(List<ColumnStat> columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+public static Map<ColumnDescriptor, List<ColumnStat>> getStatsForFile(ParquetMetadata footer) {
+Map<ColumnDescriptor, List<ColumnStat>> columnDescStats = new HashMap<>();
+MessageType schema = parquetMetadataExtractor.getSchema(footer);
+List<ColumnChunkMetaData> columns = new ArrayList<>();
+columns =
+footer.getBlocks().stream()
+.flatMap(blockMetaData -> blockMetaData.getColumns().stream())
+.collect(Collectors.toList());
+columnDescStats =
+columns.stream()
+.collect(
+Collectors.groupingBy(
+columnMetaData ->
+schema.getColumnDescription(columnMetaData.getPath().toArray()),
+Collectors.mapping(
+columnMetaData ->
+ColumnStat.builder()
+.field(
+InternalField.builder()
+.name(columnMetaData.getPrimitiveType().getName())
+.fieldId(
+columnMetaData.getPrimitiveType().getId() == null
+? null
+   

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052920709


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   Yes, I am sure: I first tested with the generated equals() and it didn't 
work. As for the list-size equality check, I can add it. For now, that's the 
solution I thought of; maybe in the future this code could be altered to use 
the generated methods.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052915560


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052912672


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   The implementation here does not assert that the lists are the same size, so 
you do not know whether there are more entries in obj2's stats.
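
To illustrate the point about list sizes, here is a minimal sketch. The class and method names are hypothetical, and plain strings stand in for the column stats; it is not the PR's code. It shows that an index-based comparison driven only by the first list's size accepts a second list that has extra entries, while a size check (or `List.equals`) rejects it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

class SizeCheckSketch {
  // Index-based comparison in the style quoted in this thread:
  // it only walks the indices of `left`.
  static boolean indexOnlyEquals(List<String> left, List<String> right) {
    return IntStream.range(0, left.size()).allMatch(i -> left.get(i).equals(right.get(i)));
  }

  // Same comparison, guarded by the size check the review asks for.
  static boolean sizeCheckedEquals(List<String> left, List<String> right) {
    return left.size() == right.size() && indexOnlyEquals(left, right);
  }

  public static void main(String[] args) {
    List<String> stats = Arrays.asList("a", "b");
    List<String> moreStats = Arrays.asList("a", "b", "c");
    System.out.println(indexOnlyEquals(stats, moreStats)); // true: the extra entry is never seen
    System.out.println(sizeCheckedEquals(stats, moreStats)); // false
    System.out.println(stats.equals(moreStats)); // false: List.equals also compares sizes
  }
}
```

In the opposite direction, when the second list is shorter, the unguarded version throws an `IndexOutOfBoundsException` instead of returning false, which is another reason to check the sizes first.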






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052911563


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   The implementation of a collection's equals will be very similar. Are you 
sure there is a problem with the equals method?
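
As a sketch of what a collection's equals does, the example below uses a hypothetical `Stat` class as a stand-in for ColumnStat (it is not the XTable class). Per the `java.util.List` contract, two lists are equal when they have the same size and their elements are pairwise equal in order, regardless of the list implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Objects;

class ListEqualsSketch {
  // Hypothetical stand-in for ColumnStat with value-based equals/hashCode.
  static final class Stat {
    final String name;
    final long numValues;

    Stat(String name, long numValues) {
      this.name = name;
      this.numValues = numValues;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (!(o instanceof Stat)) return false;
      Stat other = (Stat) o;
      return numValues == other.numValues && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
      return Objects.hash(name, numValues);
    }
  }

  // Null-safe list comparison, equivalent in effect to the generated null check plus equals.
  static boolean sameStats(List<Stat> a, List<Stat> b) {
    return Objects.equals(a, b);
  }

  public static void main(String[] args) {
    List<Stat> left = new ArrayList<>(Arrays.asList(new Stat("id", 3L), new Stat("name", 3L)));
    List<Stat> right = new LinkedList<>(Arrays.asList(new Stat("id", 3L), new Stat("name", 3L)));
    List<Stat> reordered = Arrays.asList(new Stat("name", 3L), new Stat("id", 3L));
    System.out.println(sameStats(left, right)); // true: same size, equal elements in order
    System.out.println(sameStats(left, reordered)); // false: element order matters
  }
}
```

Because the list comparison delegates to each element's `equals`, a mismatch reported by the list usually points to the elements, their order, or the sizes rather than to the list comparison itself.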






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052898303


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   The list comparison that delombok generates for columnStats is the following:

   `if (this$columnStats == null ? other$columnStats != null : !this$columnStats.equals(other$columnStats)) return false;`

   whereas the working test I implemented is the following:
   ```
   IntStream.range(0, this.getColumnStats().size())
       .allMatch(i -> this.getColumnStats().get(i).equals(obj2.getColumnStats().get(i)))
   ```






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052878456


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052876868


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   can you elaborate on why?






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052863515


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   The generated `equals` does NOT work on the ColumnStats.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052857533


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052654864


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");

Review Comment:
   @unical1988 remember that we are volunteering our time here. This is not 
part of my day job. I have 
[asked](https://github.com/apache/incubator-xtable/pull/669#discussion_r2018802552) 
previously about ensuring the outputs align with the project's expectations 
but do not see enough tests to ensure that currently. When dealing with 
people's data, it is very important to get the outputs right, or else you end 
up with files being skipped due to improper column-level statistics and 
therefore incorrect results in the output of your query engine. This is the 
reason for the high bar.
   
   Additionally, this feedback is almost identical to what we went through on 
the [schema conversion](https://github.com/apache/incubator-xtable/pull/669#discussion_r2040724521), 
where the test cases kept being removed by commenting them out instead of 
being additive. Instead of commenting out cases, you should always be expanding 
upon the initial test case you came up with to ensure any future changes do not 
break the initial set of cases you were handling.
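
To make the file-skipping concern concrete, here is a minimal sketch of min/max pruning. The `FileStats` class, the `filesToScan` helper, and the file names are all hypothetical; real query engines implement this differently. It assumes a second file that really holds values 10 to 20 and shows how an understated max makes an equality lookup skip it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StatsPruningSketch {
  // Hypothetical per-file summary: min and max of one integer column.
  static final class FileStats {
    final String path;
    final long min;
    final long max;

    FileStats(String path, long min, long max) {
      this.path = path;
      this.min = min;
      this.max = max;
    }
  }

  // Keeps only the files whose [min, max] range can contain `value`.
  static List<String> filesToScan(List<FileStats> files, long value) {
    List<String> result = new ArrayList<>();
    for (FileStats f : files) {
      if (f.min <= value && value <= f.max) {
        result.add(f.path);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumption for this sketch: b.parquet really holds values 10..20.
    List<FileStats> correct =
        Arrays.asList(new FileStats("a.parquet", 0, 9), new FileStats("b.parquet", 10, 20));
    // Same files, but the extracted max for b.parquet is wrong (12 instead of 20).
    List<FileStats> wrong =
        Arrays.asList(new FileStats("a.parquet", 0, 9), new FileStats("b.parquet", 10, 12));
    System.out.println(filesToScan(correct, 15)); // [b.parquet]
    System.out.println(filesToScan(wrong, 15)); // []: the file with the matching rows is skipped
  }
}
```

A test suite that only grows, with each new case added next to the earlier ones, is what catches this kind of silent regression in extracted statistics.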






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052623474


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052618935


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new 
ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List<ColumnStat> getColumnStatsForaFile(ParquetMetadata footer) {
+return getStatsForFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional<Long> getMaxFromColumnStats(List<ColumnStat> columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+public static Map<ColumnDescriptor, List<ColumnStat>> getStatsForFile(ParquetMetadata footer) {
+Map<ColumnDescriptor, List<ColumnStat>> columnDescStats = new HashMap<>();
+MessageType schema = parquetMetadataExtractor.getSchema(footer);
+List<ColumnChunkMetaData> columns = new ArrayList<>();
+columns =
+footer.getBlocks().stream()
+.flatMap(blockMetaData -> 
blockMetaData.getColumns().stream())
+.collect(Collectors.toList());
+columnDescStats =
+columns.stream()
+.collect(
+Collectors.groupingBy(
+columnMetaData ->
+
schema.getColumnDescription(columnMetaData.getPath().toArray()),
+Collectors.mapping(
+columnMetaData ->
+ColumnStat.builder()
+.field(
+
InternalField.builder()
+   
 .name(columnMetaData.getPrimitiveType().getName())
+   
 .fieldId(
+   
 columnMetaData.getPrimitiveType().getId() == null
+   
 ? null
+   
 

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052616509


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new 
ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List<ColumnStat> getColumnStatsForaFile(ParquetMetadata footer) {
+return getStatsForFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional<Long> getMaxFromColumnStats(List<ColumnStat> columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+public static Map<ColumnDescriptor, List<ColumnStat>> getStatsForFile(ParquetMetadata footer) {
+Map<ColumnDescriptor, List<ColumnStat>> columnDescStats = new HashMap<>();
+MessageType schema = parquetMetadataExtractor.getSchema(footer);
+List<ColumnChunkMetaData> columns = new ArrayList<>();
+columns =
+footer.getBlocks().stream()
+.flatMap(blockMetaData -> 
blockMetaData.getColumns().stream())
+.collect(Collectors.toList());
+columnDescStats =
+columns.stream()
+.collect(
+Collectors.groupingBy(
+columnMetaData ->
+
schema.getColumnDescription(columnMetaData.getPath().toArray()),
+Collectors.mapping(
+columnMetaData ->
+ColumnStat.builder()
+.field(
+
InternalField.builder()
+   
 .name(columnMetaData.getPrimitiveType().getName())
+   
 .fieldId(
+   
 columnMetaData.getPrimitiveType().getId() == null
+   
 ? null
+   

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052567178


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new 
ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List<ColumnStat> getColumnStatsForaFile(ParquetMetadata footer) {
+return getStatsForFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional<Long> getMaxFromColumnStats(List<ColumnStat> columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+public static Map<ColumnDescriptor, List<ColumnStat>> getStatsForFile(ParquetMetadata footer) {
+Map<ColumnDescriptor, List<ColumnStat>> columnDescStats = new HashMap<>();
+MessageType schema = parquetMetadataExtractor.getSchema(footer);
+List<ColumnChunkMetaData> columns = new ArrayList<>();
+columns =
+footer.getBlocks().stream()
+.flatMap(blockMetaData -> 
blockMetaData.getColumns().stream())
+.collect(Collectors.toList());
+columnDescStats =
+columns.stream()
+.collect(
+Collectors.groupingBy(
+columnMetaData ->
+
schema.getColumnDescription(columnMetaData.getPath().toArray()),
+Collectors.mapping(
+columnMetaData ->
+ColumnStat.builder()
+.field(
+
InternalField.builder()
+   
 .name(columnMetaData.getPrimitiveType().getName())
+   
 .fieldId(
+   
 columnMetaData.getPrimitiveType().getId() == null
+   
 ? null
+   
 

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052547092


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();

Review Comment:
   Why do you need this wrapper instead of just using the `toBuilder` directly?
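
To illustrate the question above, here is a minimal sketch of calling `toBuilder` directly. It assumes the class is annotated with Lombok's `@Builder(toBuilder = true)` (or the `@SuperBuilder` equivalent); the builder is written out by hand so the example compiles without Lombok. `DataFile`, its two fields, and `ToBuilderDemo` are hypothetical stand-ins, not the real `InternalDataFile`.

```java
// Hand-written stand-in for the builder Lombok would generate.
// DataFile and its fields are hypothetical, not the real InternalDataFile.
final class DataFile {
  final String path;
  final long recordCount;

  private DataFile(String path, long recordCount) {
    this.path = path;
    this.recordCount = recordCount;
  }

  static Builder builder() {
    return new Builder();
  }

  // Copies every field into a fresh builder; a builderFrom(...) wrapper would add nothing to this.
  Builder toBuilder() {
    return new Builder().path(path).recordCount(recordCount);
  }

  static final class Builder {
    private String path;
    private long recordCount;

    Builder path(String path) {
      this.path = path;
      return this;
    }

    Builder recordCount(long recordCount) {
      this.recordCount = recordCount;
      return this;
    }

    DataFile build() {
      return new DataFile(path, recordCount);
    }
  }
}

final class ToBuilderDemo {
  // Returns a copy of the original with only the record count changed.
  static DataFile withRecordCount(DataFile original, long recordCount) {
    return original.toBuilder().recordCount(recordCount).build();
  }
}
```

Callers that need a modified copy can chain from `toBuilder()` themselves, so a static wrapper only adds a second name for the same operation.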



##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,21 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public boolean equals(InternalDataFile obj2) {

Review Comment:
   Let's just use the generated equals
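
One reason to prefer the generated method, sketched below under assumptions: a method declared as `equals(InternalDataFile)` overloads `Object.equals(Object)` instead of overriding it, so collections (and most assertion helpers, which compare through `Object`) never call it. The classes `Overloaded`, `Overridden`, and `EqualsDemo` are hypothetical and only demonstrate the language rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical class: equals(Overloaded) is an overload, so equals(Object) stays identity-based.
final class Overloaded {
  final String path;

  Overloaded(String path) {
    this.path = path;
  }

  public boolean equals(Overloaded other) {
    return other != null && path.equals(other.path);
  }
}

// Hypothetical class: a proper override, similar in spirit to what a generated equals provides.
final class Overridden {
  final String path;

  Overridden(String path) {
    this.path = path;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof Overridden)) {
      return false;
    }
    return path.equals(((Overridden) obj).path);
  }

  @Override
  public int hashCode() {
    return Objects.hash(path);
  }
}

final class EqualsDemo {
  // List.contains dispatches to equals(Object), so the overload is ignored here.
  static boolean listFindsOverloaded() {
    List<Overloaded> files = new ArrayList<>();
    files.add(new Overloaded("a.parquet"));
    return files.contains(new Overloaded("a.parquet"));
  }

  // With a real override, an equal-by-value element is found.
  static boolean listFindsOverridden() {
    List<Overridden> files = new ArrayList<>();
    files.add(new Overridden("a.parquet"));
    return files.contains(new Overridden("a.parquet"));
  }
}
```

Calling the overload directly with a statically typed argument does compare by value, which is why the problem tends to stay hidden until the object is placed in a collection.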



##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import lombok.Value;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new 
ParquetStatsExtractor();
+
+
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+
+
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();

Review Comment:
   It will be easier for testing if you define these as instance variables so 
we can create an instance with mocks of these dependencies
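
A minimal sketch of the suggestion above, under assumptions: the collaborator becomes an instance field supplied through the constructor, so a test can hand in a fake or a mock. `MetadataReader`, `StatsExtractor`, and `countColumns` are hypothetical names standing in for the real extractor classes; in the actual code Lombok could generate the constructor.

```java
import java.util.List;

// Hypothetical collaborator interface standing in for a dependency such as ParquetMetadataExtractor.
interface MetadataReader {
  List<String> readColumnNames(String filePath);
}

// The dependency is an instance field set through the constructor rather than a static final field.
final class StatsExtractor {
  private final MetadataReader metadataReader;

  StatsExtractor(MetadataReader metadataReader) {
    this.metadataReader = metadataReader;
  }

  // Delegates to the injected reader, so the result depends only on what the reader returns.
  int countColumns(String filePath) {
    return metadataReader.readColumnNames(filePath).size();
  }
}
```

In a test, the reader can be a lambda or a mock, for example `new StatsExtractor(path -> java.util.Arrays.asList("a", "b"))`, and no Parquet file has to exist on disk.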



##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import lombok.Builder;
+import l

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052545271


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required 
group a {required binary b;}}");

Review Comment:
   @the-other-tim-brown to me this is vague. I added a test with a different 
file schema that works. I don't request reviews; I contribute and notify you 
about my contribution, and you do your job. If something is not up to your 
standards, let me know. Remember, I am new here.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-21 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2052538869


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+
+import lombok.Builder;
+
+import org.apache.hadoop.conf.Configuration;
+import java.nio.file.Paths;
+
+import java.nio.file.Files;
+//import java.nio.file.Path;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.*;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.storage.InternalDataFile;
+
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestParquetStatsExtractor {
+@TempDir
+static java.nio.file.Path tempDir = Paths.get("./");
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static List initFileTest(File file) throws IOException {
+// create the parquet file by parsing a schema
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema =
+//MessageTypeParser.parseMessageType("message m { required 
group a {required binary b;}}");

Review Comment:
   @unical1988 can you make sure the code is ready for review before requesting 
reviews? Commented-out code like this tells me it is not ready for review. 
These tests need to cover more than a single case, so you should combine the 
commented-out code with the active code.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-18 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2051342564


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.*;
+
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.jupiter.api.Test;
+import org.apache.parquet.schema.*;
+import org.junit.jupiter.api.Assertions;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import 
org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
+import 
org.apache.parquet.schema.LogicalTypeAnnotation.StringLogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.GroupType;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.xtable.model.stat.Range;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.format.Statistics;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.xtable.model.storage.FileFormat;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+
+import java.io.File;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.io.IOException;
+
+import lombok.Builder;
+import org.apache.parquet.schema.MessageTypeParser;
+
+
+
+import org.apache.xtable.model.stat.Range;
+
+
+public class TestParquetStatsExtractor {
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static ParquetFileReader createParquetFile(File file) throws 
IOException {
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema = MessageTypeParser.parseMessageType("message m { 
required group a {required binary b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+
+byte[] bytes1 = {0, 1, 2, 3};
+byte[] bytes2 = {2, 3, 4, 5};
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+
+// include statistics using updateStats()
+IntStatistics stats = new IntStatistics(); // or BinaryStatistics
+stats.updateStats(1);
+stats.updateStats(2);
+stats.updateStats(5);
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, 
path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+w.writeDataPage(2, 4, BytesInput.from(bytes1), stats, BIT_PACKED, 
BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.from(bytes1), stats, BIT_PACKED, 
BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.st

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-18 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2051336288


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import org.apache.xtable.model.schema.InternalSchema;
+
+import java.util.Collection;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.TreeSet;
+import java.util.Optional;
+
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import lombok.Builder;
+import org.apache.xtable.model.schema.InternalField;
+import lombok.Value;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.Encoding;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.hadoop.conf.Configuration;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new 
ParquetStatsExtractor();
+@Builder.Default

Review Comment:
   removed Builder annotation



##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetSchemaExtractor.java:
##
@@ -0,0 +1,487 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.List;
+import java.util.ArrayList;
+
+import org.apache.xtable.schema.SchemaUtils;
+import org.apache.xtable.exception.SchemaExtractorException;
+
+import java.util.Collections;
+import java.util.Optional;
+
+import lombok.AccessLevel;
+import lombok.NoArgsConstructor;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+import org.apache.xtable.hudi.idtracking.models.IdMapping;
+import org.apache.avro.Schema;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.apache.parquet.schema.Type;
+import org.apache.xtable.collectors.CustomCollectors;
+import org.apache.xtable.exception.UnsupportedSchemaTypeException;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.Type.ID;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.column.ColumnDescriptor;
+
+
+/**
+ * Class that converts parquet S

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041157148


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.*;
+
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.jupiter.api.Test;
+import org.apache.parquet.schema.*;
+import org.junit.jupiter.api.Assertions;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
+import org.apache.parquet.schema.LogicalTypeAnnotation.StringLogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.GroupType;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.xtable.model.stat.Range;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.format.Statistics;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.xtable.model.storage.FileFormat;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+
+import java.io.File;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.io.IOException;
+
+import lombok.Builder;
+import org.apache.parquet.schema.MessageTypeParser;
+
+
+
+import org.apache.xtable.model.stat.Range;
+
+
+public class TestParquetStatsExtractor {
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static ParquetFileReader createParquetFile(File file) throws IOException {
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema = MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+
+byte[] bytes1 = {0, 1, 2, 3};
+byte[] bytes2 = {2, 3, 4, 5};
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+
+// include statics using update()
+IntStatistics stats = new IntStatistics(); // or BinaryStatistics
+stats.updateStats(1);
+stats.updateStats(2);
+stats.updateStats(5);
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+w.writeDataPage(2, 4, BytesInput.from(bytes1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.from(bytes1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.st

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041156027


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.*;
+
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.jupiter.api.Test;
+import org.apache.parquet.schema.*;
+import org.junit.jupiter.api.Assertions;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
+import org.apache.parquet.schema.LogicalTypeAnnotation.StringLogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.GroupType;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.xtable.model.stat.Range;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.format.Statistics;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.xtable.model.storage.FileFormat;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+
+import java.io.File;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.io.IOException;
+
+import lombok.Builder;
+import org.apache.parquet.schema.MessageTypeParser;
+
+
+
+import org.apache.xtable.model.stat.Range;
+
+
+public class TestParquetStatsExtractor {
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static ParquetFileReader createParquetFile(File file) throws IOException {
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema = MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");

Review Comment:
   I already included IntStatistics and plan to add BinaryStatistics and 
BooleanStatistics shortly. As for the schema types, yes, other types could be 
covered in the tests as well.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041152946


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   I resorted to converting them into byte arrays and comparing them as such
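
   The PR's actual comparison code is not quoted in this thread, so the following is only a minimal sketch of what "converting to byte arrays and comparing" could look like, assuming Java's built-in serialization. `ByteCompareSketch` and `FileInfo` are hypothetical names, not XTable classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

final class ByteCompareSketch {
  // Hypothetical value type standing in for the objects being compared.
  static final class FileInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    final String path;
    final long lastModified;

    FileInfo(String path, long lastModified) {
      this.path = path;
      this.lastModified = lastModified;
    }
  }

  // Serializes the value with standard Java serialization.
  static byte[] toBytes(Serializable value) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(value);
    }
    return buffer.toByteArray();
  }

  // Two values are treated as equal when their serialized forms match byte for byte.
  static boolean sameBytes(Serializable left, Serializable right) throws IOException {
    return Arrays.equals(toBytes(left), toBytes(right));
  }

  public static void main(String[] args) throws IOException {
    FileInfo a = new FileInfo("a.parquet", 1L);
    FileInfo b = new FileInfo("a.parquet", 1L);
    FileInfo c = new FileInfo("a.parquet", 2L);
    System.out.println(sameBytes(a, b));
    System.out.println(sameBytes(a, c));
  }
}
```

   A comparison like this depends on the serialized form being deterministic; hash-based collections, for example, may serialize their entries in a different order for logically equal contents, which is one reason a field-based `equals` is usually easier to reason about.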






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041142002


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   But I am making sure they are equal, and using my equals() the test passes. 
I will check further...






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041140911


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.*;
+
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.jupiter.api.Test;
+import org.apache.parquet.schema.*;
+import org.junit.jupiter.api.Assertions;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
+import org.apache.parquet.schema.LogicalTypeAnnotation.StringLogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.GroupType;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.xtable.model.stat.Range;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.format.Statistics;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.xtable.model.storage.FileFormat;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+
+import java.io.File;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.io.IOException;
+
+import lombok.Builder;
+import org.apache.parquet.schema.MessageTypeParser;
+
+
+
+import org.apache.xtable.model.stat.Range;
+
+
+public class TestParquetStatsExtractor {
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static ParquetFileReader createParquetFile(File file) throws IOException {
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema = MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");

Review Comment:
   If you are referring to the stats, here are the statistics types Parquet supports:
[BinaryStatistics](https://www.javadoc.io/static/org.apache.parquet/parquet-column/1.7.0/org/apache/parquet/column/statistics/BinaryStatistics.html),
[BooleanStatistics](https://www.javadoc.io/static/org.apache.parquet/parquet-column/1.7.0/org/apache/parquet/column/statistics/BooleanStatistics.html),
[DoubleStatistics](https://www.javadoc.io/static/org.apache.parquet/parquet-column/1.7.0/org/apache/parquet/column/statistics/DoubleStatistics.html),
[FloatStatistics](https://www.javadoc.io/static/org.apache.parquet/parquet-column/1.7.0/org/apache/parquet/column/statistics/FloatStatistics.html),
[IntStatistics](https://www.javadoc.io/static/org.apache.parquet/parquet-column/1.7.0/org/apache/parquet/column/statistics/IntStatistics.html),
[LongStatistics](https://www.javadoc.io/static/org.apac

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041140907


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   Yes, it does. If the test is failing, it is because they are not equal.






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041140570


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   the one automatically generated doesn't work on the passed types






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041139881


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.*;
+
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.jupiter.api.Test;
+import org.apache.parquet.schema.*;
+import org.junit.jupiter.api.Assertions;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
+import org.apache.parquet.schema.LogicalTypeAnnotation.StringLogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.GroupType;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.xtable.model.stat.Range;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.format.Statistics;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.xtable.model.storage.FileFormat;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+
+import java.io.File;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.io.IOException;
+
+import lombok.Builder;
+import org.apache.parquet.schema.MessageTypeParser;
+
+
+
+import org.apache.xtable.model.stat.Range;
+
+
+public class TestParquetStatsExtractor {
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static ParquetFileReader createParquetFile(File file) throws IOException {
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema = MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");

Review Comment:
   Float, double, timestamps, date, ints, longs should all be included
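
   As a rough illustration of what a broader test schema could look like, the sketch below assembles Parquet message-type text with one column per requested type. The column names are hypothetical, and the `(DATE)` and `(TIMESTAMP_MILLIS)` annotations are an assumption about what the schema parser accepts; the resulting string is meant to be passed to `MessageTypeParser.parseMessageType` as in the test above, which this sketch does not do.

```java
final class AllTypesSchemaSketch {
  // {physical type, hypothetical column name, optional logical annotation}
  private static final String[][] COLUMNS = {
    {"int32", "int_col", ""},
    {"int64", "long_col", ""},
    {"float", "float_col", ""},
    {"double", "double_col", ""},
    {"int32", "date_col", " (DATE)"},
    {"int64", "ts_col", " (TIMESTAMP_MILLIS)"},
  };

  // Builds the message-type text with one required column per entry above.
  static String schemaText() {
    StringBuilder sb = new StringBuilder("message m {");
    for (String[] column : COLUMNS) {
      sb.append(" required ")
          .append(column[0])
          .append(' ')
          .append(column[1])
          .append(column[2])
          .append(';');
    }
    return sb.append(" }").toString();
  }

  public static void main(String[] args) {
    System.out.println(schemaText());
  }
}
```

   Each column would also need a matching statistics object when pages are written, so a parameterized test per type is probably simpler than one file holding every column.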






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


the-other-tim-brown commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041139548


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   There is already an `equals` method generated for the class by Lombok, so 
there is no need for a custom implementation.
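
   For illustration, the sketch below hand-writes the kind of field-by-field `equals` that Lombok generates, so that it runs without Lombok on the classpath. `DataFileSketch` and its fields are hypothetical stand-ins, not the real `InternalDataFile`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for a Lombok-annotated value class.
final class DataFileSketch {
  private final String physicalPath;
  private final long lastModified;
  private final List<String> columnStats;

  DataFileSketch(String physicalPath, long lastModified, List<String> columnStats) {
    this.physicalPath = physicalPath;
    this.lastModified = lastModified;
    this.columnStats = columnStats;
  }

  // Hand-written equivalent of a generated equals: compares every field by value.
  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof DataFileSketch)) {
      return false;
    }
    DataFileSketch other = (DataFileSketch) o;
    return lastModified == other.lastModified
        && Objects.equals(physicalPath, other.physicalPath)
        && Objects.equals(columnStats, other.columnStats);
  }

  @Override
  public int hashCode() {
    return Objects.hash(physicalPath, lastModified, columnStats);
  }

  public static void main(String[] args) {
    DataFileSketch a = new DataFileSketch("file:/tmp/a.parquet", 1L, Arrays.asList("b"));
    DataFileSketch b = new DataFileSketch("file:/tmp/a.parquet", 1L, Arrays.asList("b"));
    DataFileSketch c = new DataFileSketch("file:/tmp/a.parquet", 2L, Arrays.asList("b"));
    System.out.println(a.equals(b)); // true: all fields match
    System.out.println(a.equals(c)); // false: lastModified differs
  }
}
```

   With an `equals` like this in place, JUnit's `assertEquals(expected, actual)` already performs the value comparison, and a failure points to a field that genuinely differs.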






Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041119154


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.*;
+
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.jupiter.api.Test;
+import org.apache.parquet.schema.*;
+import org.junit.jupiter.api.Assertions;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
+import org.apache.parquet.schema.LogicalTypeAnnotation.StringLogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.GroupType;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.xtable.model.stat.Range;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.format.Statistics;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.xtable.model.storage.FileFormat;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+
+import java.io.File;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.io.IOException;
+
+import lombok.Builder;
+import org.apache.parquet.schema.MessageTypeParser;
+
+
+
+import org.apache.xtable.model.stat.Range;
+
+
+public class TestParquetStatsExtractor {
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static ParquetFileReader createParquetFile(File file) throws IOException {
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema = MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");

Review Comment:
   Do you mean types like float or binary rather than int (as in this test case)?
   



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041106066


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+    return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   It seems more relevant to keep it there, renamed to `equals()`.



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041095246


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+    return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   It seems more relevant to keep it there. The question, though, is whether the test
should compare whole `InternalDataFile` objects or just the stats. I think I will also
have to add methods to compare `Range`, `Field` and `ColumnStat`; they could be named
`equals()` inside the respective classes.



##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+    return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   Seems more relevant to keep it there.



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041097072


##
xtable-core/src/test/java/org/apache/xtable/parquet/TestParquetStatsExtractor.java:
##
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.*;
+
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.junit.jupiter.api.Test;
+import org.apache.parquet.schema.*;
+import org.junit.jupiter.api.Assertions;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.LogicalTypeAnnotation.IntLogicalTypeAnnotation;
+import org.apache.parquet.schema.LogicalTypeAnnotation.StringLogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.parquet.schema.Type;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.GroupType;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.MessageTypeParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.xtable.model.stat.Range;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.format.Statistics;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.xtable.model.storage.FileFormat;
+
+import static org.apache.parquet.column.Encoding.BIT_PACKED;
+import static org.apache.parquet.column.Encoding.PLAIN;
+
+import java.io.File;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Arrays;
+import java.util.ArrayList;
+import java.io.IOException;
+
+import lombok.Builder;
+import org.apache.parquet.schema.MessageTypeParser;
+
+
+
+import org.apache.xtable.model.stat.Range;
+
+
+public class TestParquetStatsExtractor {
+
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+
+public static ParquetFileReader createParquetFile(File file) throws IOException {
+Path path = new Path(file.toURI());
+Configuration configuration = new Configuration();
+
+MessageType schema = MessageTypeParser.parseMessageType("message m { required group a {required binary b;}}");
+String[] columnPath = {"a", "b"};
+ColumnDescriptor c1 = schema.getColumnDescription(columnPath);
+
+byte[] bytes1 = {0, 1, 2, 3};
+byte[] bytes2 = {2, 3, 4, 5};
+CompressionCodecName codec = CompressionCodecName.UNCOMPRESSED;
+
+// include statics using update()
+IntStatistics stats = new IntStatistics(); // or BinaryStatistics
+stats.updateStats(1);
+stats.updateStats(2);
+stats.updateStats(5);
+
+ParquetFileWriter w = new ParquetFileWriter(configuration, schema, path);
+w.start();
+w.startBlock(3);
+w.startColumn(c1, 5, codec);
+w.writeDataPage(2, 4, BytesInput.from(bytes1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.writeDataPage(3, 4, BytesInput.from(bytes1), stats, BIT_PACKED, BIT_PACKED, PLAIN);
+w.endColumn();
+w.endBlock();
+w.st

Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041098172


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetSchemaExtractor.java:
##
@@ -0,0 +1,487 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.List;
+import java.util.ArrayList;
+
+import org.apache.xtable.schema.SchemaUtils;
+import org.apache.xtable.exception.SchemaExtractorException;
+
+import java.util.Collections;
+import java.util.Optional;
+
+import lombok.AccessLevel;
+import lombok.NoArgsConstructor;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+import org.apache.xtable.hudi.idtracking.models.IdMapping;
+import org.apache.avro.Schema;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.apache.parquet.schema.Type;
+import org.apache.xtable.collectors.CustomCollectors;
+import org.apache.xtable.exception.UnsupportedSchemaTypeException;
+import org.apache.xtable.model.schema.InternalField;
+import org.apache.parquet.schema.Type.Repetition;
+import org.apache.xtable.model.schema.InternalSchema;
+import org.apache.xtable.model.schema.InternalType;
+import org.apache.parquet.schema.Type.ID;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
+import org.apache.parquet.schema.Types;
+import org.apache.parquet.column.ColumnDescriptor;
+
+
+/**
+ * Class that converts parquet Schema {@link Schema} to Canonical Schema {@link InternalSchema} and
+ * vice-versa. This conversion is fully reversible and there is a strict 1 to 1 mapping between
+ * parquet data types and canonical data types.
+ */
+@NoArgsConstructor(access = AccessLevel.PRIVATE)
+public class ParquetSchemaExtractor {
+// parquet only supports string keys in maps
+private static final InternalField MAP_KEY_FIELD =
+InternalField.builder()
+.name(InternalField.Constants.MAP_KEY_FIELD_NAME)
+.schema(
+InternalSchema.builder()
+.name("map_key")
+.dataType(InternalType.STRING)
+.isNullable(false)
+.build())
+.defaultValue("")
+.build();
+private static final ParquetSchemaExtractor INSTANCE = new ParquetSchemaExtractor();
+private static final String ELEMENT = "element";
+private static final String KEY = "key";
+private static final String VALUE = "value";
+
+public static ParquetSchemaExtractor getInstance() {
+return INSTANCE;
+}
+
+private static boolean isNullable(Type schema) {
+return schema.getRepetition() == Repetition.REQUIRED ? false : true;

Review Comment:
   ok
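
The remark being acknowledged is not quoted in this message. One plausible reading is that the ternary in `isNullable` is redundant, because it only maps a boolean to its negation. The sketch below uses a local enum as a stand-in for Parquet's `Type.Repetition` (an assumption made to keep the example self-contained) and shows that the shorter form returns the same result for every constant.

```java
public final class NullabilitySketch {

  // Stand-in for org.apache.parquet.schema.Type.Repetition.
  enum Repetition { REQUIRED, OPTIONAL, REPEATED }

  // Form used in the diff: a ternary that yields the negated comparison.
  static boolean isNullableTernary(Repetition repetition) {
    return repetition == Repetition.REQUIRED ? false : true;
  }

  // Equivalent form without the ternary.
  static boolean isNullable(Repetition repetition) {
    return repetition != Repetition.REQUIRED;
  }

  public static void main(String[] args) {
    for (Repetition r : Repetition.values()) {
      System.out.println(r + " nullable=" + isNullable(r));
    }
  }
}
```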
   



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041096739


##
xtable-core/src/main/java/org/apache/xtable/parquet/ParquetStatsExtractor.java:
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.xtable.parquet;
+
+import org.apache.xtable.model.schema.InternalSchema;
+
+import java.util.Collection;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.TreeSet;
+import java.util.Optional;
+
+import org.apache.xtable.model.stat.PartitionValue;
+import org.apache.xtable.model.stat.ColumnStat;
+import org.apache.xtable.model.stat.Range;
+import lombok.Builder;
+import org.apache.xtable.model.schema.InternalField;
+import lombok.Value;
+import org.apache.xtable.model.storage.InternalDataFile;
+import org.apache.hadoop.fs.*;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.Encoding;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+import org.apache.xtable.model.storage.FileFormat;
+import org.apache.xtable.model.config.InputPartitionFields;
+import org.apache.hadoop.conf.Configuration;
+
+@Value
+@Builder
+public class ParquetStatsExtractor {
+
+private static final ParquetStatsExtractor INSTANCE = null; // new ParquetStatsExtractor();
+@Builder.Default
+private static final ParquetPartitionValueExtractor partitionExtractor =
+ParquetPartitionValueExtractor.getInstance();
+@Builder.Default
+private static final ParquetSchemaExtractor schemaExtractor =
+ParquetSchemaExtractor.getInstance();
+@Builder.Default
+private static final ParquetMetadataExtractor parquetMetadataExtractor =
+ParquetMetadataExtractor.getInstance();
+
+private static final InputPartitionFields partitions = null;
+
+public static ParquetStatsExtractor getInstance() {
+return INSTANCE;
+}
+
+public static List<ColumnStat> getColumnStatsForaFile(ParquetMetadata footer) {
+return getStatsForaFile(footer).values().stream()
+.flatMap(List::stream)
+.collect(Collectors.toList());
+}
+
+private static Optional<Long> getMaxFromColumnStats(List<ColumnStat> columnStats) {
+return columnStats.stream()
+.filter(entry -> entry.getField().getParentPath() == null)
+.map(ColumnStat::getNumValues)
+.filter(numValues -> numValues > 0)
+.max(Long::compareTo);
+}
+
+
+public static Map> 
getStatsForaFile(ParquetMetadata footer) {

Review Comment:
   You could extract a footer once for a set of files; I understand the point of
changing this.
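
One way to read the benefit of taking a footer rather than a path is that the caller parses the footer once and hands the same object to every consumer. The sketch below is only an illustration under that assumption. `Footer`, `readFooter`, `columnNames` and `statCount` are hypothetical stand-ins, not XTable or Parquet APIs.

```java
import java.util.Arrays;
import java.util.List;

public final class FooterReuseSketch {

  // Hypothetical stand-in for the parsed file footer (ParquetMetadata in the diff).
  static final class Footer {
    final String path;
    final List<String> columns;

    Footer(String path, List<String> columns) {
      this.path = path;
      this.columns = columns;
    }
  }

  // Counts how often the expensive read happens.
  static int footerReads = 0;

  // Stand-in for opening the file and parsing its footer.
  static Footer readFooter(String path) {
    footerReads++;
    return new Footer(path, Arrays.asList("a.b"));
  }

  // Consumers accept the already parsed footer, so neither of them reopens the file.
  static List<String> columnNames(Footer footer) {
    return footer.columns;
  }

  static int statCount(Footer footer) {
    return footer.columns.size();
  }

  public static void main(String[] args) {
    Footer footer = readFooter("/tmp/a.parquet");
    System.out.println(columnNames(footer) + " " + statCount(footer) + " reads=" + footerReads);
  }
}
```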
   



Re: [PR] smaller PR for parquet [incubator-xtable]

2025-04-13 Thread via GitHub


unical1988 commented on code in PR #669:
URL: https://github.com/apache/incubator-xtable/pull/669#discussion_r2041095246


##
xtable-api/src/main/java/org/apache/xtable/model/storage/InternalDataFile.java:
##
@@ -52,4 +52,20 @@ public class InternalDataFile extends InternalFile {
   @Builder.Default @NonNull List<ColumnStat> columnStats = Collections.emptyList();
   // last modified time in millis since epoch
   long lastModified;
+  public static InternalDataFileBuilder builderFrom(InternalDataFile dataFile) {
+    return dataFile.toBuilder();
+  }
+
+  public static boolean compareFiles(InternalDataFile obj1, InternalDataFile obj2) {

Review Comment:
   The question is whether the test should compare whole `InternalDataFile` objects or
just the stats. I think I will also have to add methods to compare `Range`, `Field` and
`ColumnStat`; they could be named `equals()` inside the respective classes.
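
As a rough illustration of comparing just the stats: when each nested type defines `equals`, comparing two lists of stats needs no extra helper, because `List.equals` compares element by element. `RangeSketch` and `ColumnStatSketch` below are hypothetical stand-ins with assumed fields, not the project's `Range` and `ColumnStat`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public final class StatsEqualitySketch {

  // Hypothetical stand-in for a min/max range.
  static final class RangeSketch {
    final Object minValue;
    final Object maxValue;

    RangeSketch(Object minValue, Object maxValue) {
      this.minValue = minValue;
      this.maxValue = maxValue;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof RangeSketch)) {
        return false;
      }
      RangeSketch other = (RangeSketch) o;
      return Objects.equals(minValue, other.minValue) && Objects.equals(maxValue, other.maxValue);
    }

    @Override
    public int hashCode() {
      return Objects.hash(minValue, maxValue);
    }
  }

  // Hypothetical stand-in for per-column statistics; equality delegates to the nested range.
  static final class ColumnStatSketch {
    final String fieldName;
    final RangeSketch range;
    final long numValues;

    ColumnStatSketch(String fieldName, RangeSketch range, long numValues) {
      this.fieldName = fieldName;
      this.range = range;
      this.numValues = numValues;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof ColumnStatSketch)) {
        return false;
      }
      ColumnStatSketch other = (ColumnStatSketch) o;
      return numValues == other.numValues
          && Objects.equals(fieldName, other.fieldName)
          && Objects.equals(range, other.range);
    }

    @Override
    public int hashCode() {
      return Objects.hash(fieldName, range, numValues);
    }
  }

  // List.equals checks size and then each pair of elements with their own equals.
  static boolean sameStats(List<ColumnStatSketch> expected, List<ColumnStatSketch> actual) {
    return expected.equals(actual);
  }

  public static void main(String[] args) {
    List<ColumnStatSketch> expected = Arrays.asList(new ColumnStatSketch("a.b", new RangeSketch(1, 5), 3L));
    List<ColumnStatSketch> actual = Arrays.asList(new ColumnStatSketch("a.b", new RangeSketch(1, 5), 3L));
    System.out.println(sameStats(expected, actual));
  }
}
```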


