[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510175#comment-17510175
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

huaxingao commented on pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#issuecomment-1074572886


   @ggershinsky I updated the description. Please check again to see if it is 
clear to you. Thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Column resolution by ID
> ---
>
> Key: PARQUET-2006
> URL: https://issues.apache.org/jira/browse/PARQUET-2006
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Parquet resolves columns by name. In many usages, e.g. schema resolution, this 
> is a problem. Iceberg uses IDs and stores ID/name mappings. 
> This Jira adds column ID resolution support. 
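The rename problem the description alludes to can be shown with a minimal, hypothetical sketch (plain Java collections, not the parquet-mr API): a column renamed between file versions still resolves by its field ID, while a name lookup misses it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of name-based vs ID-based column resolution.
public class IdResolutionSketch {
    public static void main(String[] args) {
        // File footer schema: the column was written under the name "name", field ID 1.
        Map<Integer, String> fileColumnsById = new HashMap<>();
        fileColumnsById.put(1, "name");

        // Current table schema: the same column (ID 1) has been renamed to "full_name".
        int projectedId = 1;
        String projectedName = "full_name";

        // Name-based resolution fails after the rename...
        boolean foundByName = fileColumnsById.containsValue(projectedName);
        // ...while ID-based resolution still locates the column.
        String resolvedById = fileColumnsById.get(projectedId);

        System.out.println("foundByName=" + foundByName + " resolvedById=" + resolvedById);
    }
}
```

This is the gap the PR addresses: when field IDs are present and unique, a projection can be matched against the file schema by ID instead of by path name.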



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510141#comment-17510141
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

huaxingao commented on a change in pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r831619229



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
##
@@ -472,7 +488,13 @@ public static boolean 
getPageWriteChecksumEnabled(Configuration conf) {
 LOG.debug("Parquet properties are:\n{}", props);
 
 WriteContext fileWriteContext = writeSupport.init(conf);
-
+
+MessageType schema = fileWriteContext.getSchema();
+HashSet<Type.ID> ids = new HashSet<>();

Review comment:
   Fixed. Thanks!





[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510140#comment-17510140
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

huaxingao commented on a change in pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r831619033



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
##
@@ -878,11 +880,92 @@ public String getFile() {
 return blocks;
   }
 
-  public void setRequestedSchema(MessageType projection) {
+  private boolean uniqueId(GroupType schema, HashSet<Type.ID> ids) {
+boolean unique = true;
+List<Type> fields = schema.getFields();
+for (Type field : fields) {
+  if (field instanceof PrimitiveType) {
+Type.ID id = field.getId();
+if (id != null) {
+  if (ids.contains(id)) {
+return false;
+  }
+  ids.add(id);
+}
+  }
+
+  if (field instanceof GroupType) {
+Type.ID id = field.getId();
+if (id != null) {
+  if (ids.contains(id)) {
+return false;
+  }
+  ids.add(id);
+}
+if (unique) unique = uniqueId(field.asGroupType(), ids);
+  }
+}
+return unique;
+  }
+
+  public MessageType setRequestedSchema(MessageType projection) {
 paths.clear();
-for (ColumnDescriptor col : projection.getColumns()) {
+HashSet<Type.ID> ids = new HashSet<>();
+boolean fileSchemaIdUnique = uniqueId(fileMetaData.getSchema(), ids);
+ids = new HashSet<>();
+boolean projectionSchemaIdUnique = uniqueId(projection, ids);
+MessageType schema = null;
+// if ids are unique, use id resolution

Review comment:
   Good idea! Added `ParquetInputFormat.COLUMN_ID_RESOLUTION` to control 
this.
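The reply above concerns the duplicate-ID guard in the diff: ID resolution only kicks in when every field ID in both the file schema and the projection is unique. A simplified, self-contained sketch of that recursive check (toy field classes standing in for parquet-mr's `Type`/`GroupType`):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Sketch of the duplicate-ID walk the patch performs, with simplified types.
public class UniqueIdSketch {
    static class Field {
        final Integer id;                        // null = no field ID assigned
        final List<Field> children = new ArrayList<>();
        Field(Integer id) { this.id = id; }
    }

    static boolean uniqueIds(Field node, HashSet<Integer> seen) {
        for (Field child : node.children) {
            // A set add that returns false means this ID was already seen.
            if (child.id != null && !seen.add(child.id)) {
                return false;
            }
            // Groups are checked for their own ID above, then descended into.
            if (!child.children.isEmpty() && !uniqueIds(child, seen)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Field root = new Field(null);
        Field group = new Field(1);
        group.children.add(new Field(2));
        root.children.add(group);
        root.children.add(new Field(3));
        System.out.println(uniqueIds(root, new HashSet<>()));   // true

        root.children.add(new Field(2));   // reuse ID 2 at a different level
        System.out.println(uniqueIds(root, new HashSet<>()));   // false
    }
}
```

As in the patch, the walk short-circuits on the first repeated ID, and uniqueness is enforced globally across nesting levels, not per group.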





[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510139#comment-17510139
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

huaxingao commented on a change in pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r831618518



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
##
@@ -878,11 +880,92 @@ public String getFile() {
 return blocks;
   }
 
-  public void setRequestedSchema(MessageType projection) {
+  private boolean uniqueId(GroupType schema, HashSet<Type.ID> ids) {
+boolean unique = true;
+List<Type> fields = schema.getFields();
+for (Type field : fields) {
+  if (field instanceof PrimitiveType) {
+Type.ID id = field.getId();
+if (id != null) {
+  if (ids.contains(id)) {
+return false;
+  }
+  ids.add(id);
+}
+  }
+
+  if (field instanceof GroupType) {
+Type.ID id = field.getId();
+if (id != null) {
+  if (ids.contains(id)) {
+return false;
+  }
+  ids.add(id);
+}
+if (unique) unique = uniqueId(field.asGroupType(), ids);
+  }
+}
+return unique;
+  }
+
+  public MessageType setRequestedSchema(MessageType projection) {
 paths.clear();
-for (ColumnDescriptor col : projection.getColumns()) {
+HashSet<Type.ID> ids = new HashSet<>();

Review comment:
   Updated





[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510138#comment-17510138
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

huaxingao commented on a change in pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r831618391



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -181,7 +181,7 @@ public void initialize(ParquetFileReader reader, 
ParquetReadOptions options) {
 this.columnCount = requestedSchema.getPaths().size();
 // Setting the projection schema before running any filtering (e.g. 
getting filtered record count)
 // because projection impacts filtering
-reader.setRequestedSchema(requestedSchema);
+this.requestedSchema = reader.setRequestedSchema(requestedSchema);

Review comment:
   Fixed. Thanks!





[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510137#comment-17510137
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

huaxingao commented on a change in pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r831618277



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/SchemaCompatibilityValidator.java
##
@@ -170,6 +174,24 @@ public Void visit(Not not) {
 
   private <T extends Comparable<T>> void validateColumn(Column<T> column) {
 ColumnPath path = column.getColumnPath();
+HashSet<Type.ID> ids = new HashSet<>();

Review comment:
   Fixed. Thanks!





Re: Multiple pages with indexes vs multiple row groups with one data page per chunk

2022-03-21 Thread Weston Pace
This is speculative, since we haven't implemented page/column indices
in C++ yet, but I don't know if shrinking the row group size works well
if you have differently sized columns.  For example, imagine you have
a wide column of varstring "comments" that averages 50 bytes per cell.
If you then changed your row group to contain 20k elements (because
this is your target filterable chunk), you would end up with a 1MB page for
this wide column.  However, an int32 column would only have a single
80KB chunk, which is probably too small for the latency/bandwidth
tradeoff in most filesystems.
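The arithmetic behind this example is worth making explicit (uncompressed, with the assumed cell sizes from the message above):

```java
// Back-of-the-envelope chunk sizes for a 20k-row row group:
// a 50-byte-per-cell string column vs a 4-byte int32 column.
public class ChunkSizeSketch {
    public static void main(String[] args) {
        long rows = 20_000;
        long stringChunkBytes = rows * 50;   // wide varstring "comments" column
        long int32ChunkBytes  = rows * 4;    // narrow fixed-width column
        System.out.println("string chunk KB: " + stringChunkBytes / 1024);
        System.out.println("int32 chunk KB: " + int32ChunkBytes / 1024);
    }
}
```

At the same row count the two column chunks differ by more than 12x, so a row-group size tuned for the wide column leaves the narrow one far below an efficient read size for most filesystems.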


On Sat, Mar 19, 2022 at 3:23 AM Jacques Nadeau  wrote:
>
> I can take your comment two ways: what is the downside to large pages or
> what is the downside to small row groups.
>
> One of the key considerations I've dealt with is that page is the unit of
> compression and if I recall correctly, parquet uses block rather than
> stream compression. This means you typically need enough memory to hold at
> least one page per column worth of data in memory at a time. Bigger pages
> increase this requirement. With many concurrent cores and wide tables
> (hundreds or thousands of columns), this can be a substantial amount of
> memory. So you don't want pages too large in these cases.
>
> For row groups, being able to read only select columns can improve io.
> However, the smaller the column chunk (all pages for a column) in a row
> group, the less benefit one can gain by avoiding io. This is true for
> traditional distributed file systems as well as cloud stores (in either
> case you want to amortize the access time across a large enough transfer).
>
> So yes, the column indexes help performance of pruning pages which are
> typically one to two orders of magnitudes smaller than column chunks.
>
> One key advantage of column indexes even in the case that you fail to avoid
> io time is avoidance of decompression of skippable pages. In fact, an early
> version of column indexes we used was simply duping the page headers in the
> footer.
>
>
>
> On Sat, Mar 19, 2022, 4:42 AM Jorge Cardoso Leitão 
> wrote:
>
> > Hi,
> >
> > I am trying to understand the benefits of using multiple data pages and
> > indexes vs multiple row groups.
> >
> > Some basics first:
> >
> > row groups ensures that a sequence of rows are "aligned" at the group
> > boundary independently of how they are divided in pages:
> >
> > row group 1:
> > c1: |--p11--|--p12--|---p13---|
> > c2: |--p21--|---p22---|
> > c3: |-p31-|-p32-|-p23-|
> > ...
> >
> > rows in a page are encoded and compressed together. The more rows in a
> > page, the higher the encoding/compression potential.
> >
> > There is a tradeoff between selectivity potential and encoding/compression
> > potential, in that the more row groups/page boundaries there are, the
> > higher the selectivity potential, but the lower the encoding/compression
> > potential, and vice-versa.
> >
> > We currently have 4 methods to perform filter pushdown:
> > * row group selection via row group statistics
> > * row group selection via bloom filters
> > * page selection via page statistics (deprecated)
> > * page selection via page indexes and column indexes.
> >
> > Page indexes and column indexes store page locations and page statistics,
> > without having to read the pages sequentially (and without having to read
> > the page header).
> >
> > I am wondering why page selection is better than row group selection in
> > that why should we create multiple pages per row group instead of writing
> > more row groups with a single data page.
> >
> > Specifically, given a sequence of data pages from which we identified that
> > only a subset is valid, say in the example above a filter in column c3
> > selected `p32`, to apply the filter across columns, we need to select the
> > pages from all columns that are part of the interval of rows from which
> > that page belongs to:
> >
> > row group 1:
> > c1: |--p11--|--p12--|---p13---|
> > c2: |--p21--|---p22---|
> > c3: |-p31-|-p32-|-p23-|
> >--I--
> >
> > which, in this example is:
> >
> > c1: |--p11--|--p12--|
> > c2: |--p21--|---p22---|
> > c3: |-p31-|-p32-|-p23-|
> >--I--
> >
> > (i.e. we still had to load most of the data due to non-alignment between
> > pages).
> >
> > My thinking here is that If instead of creating multiple data pages per row
> > group, we could have created multiple row groups with a single data page:
> >
> > row group 1:
> > c1: |--p11--|
> > c2: |--p21--|
> > c3: |--p31--|
> > row group 2:
> > c1: |--p12--|
> > c2: |--p21--|
> > c3: |--p32--|
> > ...
> >
> > and the filter would be applied at the row group level, i.e. by only
> > selecting row group 2. I.e. the fact that row group boundaries enforce a
> > row alignment seems quite important for filter pushdown, as it minimizes
> > data load across columns.
> >
> > Is it the idea that pages solve a different purpose, and column and page
> > indexes 

[jira] [Commented] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510122#comment-17510122
 ] 

ASF GitHub Bot commented on PARQUET-2127:
-

shangxinli merged pull request #952:
URL: https://github.com/apache/parquet-mr/pull/952


   



> Security risk in latest parquet-jackson-1.12.2.jar
> --
>
> Key: PARQUET-2127
> URL: https://issues.apache.org/jira/browse/PARQUET-2127
> Project: Parquet
>  Issue Type: Improvement
>Reporter: phoebe chen
>Priority: Major
>
> The embedded jackson-databind 2.11.4 has a known security risk: a possible DoS 
> when using JDK serialization to serialize JsonNode 
> ([https://github.com/FasterXML/jackson-databind/issues/3328]); upgrading to 
> 2.13.1 or later fixes this.
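The remediation is a dependency version bump. As a rough sketch of what such a change looks like in a Maven build (the property name here is assumed for illustration and may not match parquet-mr's actual pom):

```xml
<!-- Hypothetical property name; parquet-mr's parent pom may organize this differently. -->
<properties>
  <jackson-databind.version>2.13.2</jackson-databind.version>
</properties>
```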




[jira] [Commented] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510088#comment-17510088
 ] 

ASF GitHub Bot commented on PARQUET-2127:
-

trevorurquhart opened a new pull request #952:
URL: https://github.com/apache/parquet-mr/pull/952


   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. 
 - https://issues.apache.org/jira/browse/PARQUET-2127
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] Tests all pass with upgrade of jackson to 2.13.2
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] No new functionality
   



[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509743#comment-17509743
 ] 

ASF GitHub Bot commented on PARQUET-2042:
-

mwong38 commented on a change in pull request #900:
URL: https://github.com/apache/parquet-mr/pull/900#discussion_r830941957



##
File path: 
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##
@@ -427,6 +485,218 @@ public void addBinary(Binary binary) {
 
   }
 
+  final class ProtoTimestampConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+final LogicalTypeAnnotation.TimestampLogicalTypeAnnotation 
logicalTypeAnnotation;
+
+public ProtoTimestampConverter(ParentValueContainer parent, 
LogicalTypeAnnotation.TimestampLogicalTypeAnnotation logicalTypeAnnotation) {
+  this.parent = parent;
+  this.logicalTypeAnnotation = logicalTypeAnnotation;
+}
+
+@Override
+public void addLong(long value) {
+  switch (logicalTypeAnnotation.getUnit()) {
+case MICROS:
+  parent.add(Timestamps.fromMicros(value));
+  break;
+case MILLIS:
+  parent.add(Timestamps.fromMillis(value));
+  break;
+case NANOS:
+  parent.add(Timestamps.fromNanos(value));
+  break;
+  }
+}
+  }
+
+  final class ProtoDateConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoDateConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addInt(int value) {
+  LocalDate localDate = LocalDate.ofEpochDay(value);
+  com.google.type.Date date = com.google.type.Date.newBuilder()

Review comment:
   I think in this particular case it's more clear to be explicit that this 
is the Google Protobuf Date, rather than, say, the Java built-in Date.
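The converter under discussion maps Parquet's DATE logical type (an int32 of days since the Unix epoch) onto google.type.Date's year/month/day fields; the unpacking step is plain java.time, sketched here without the protobuf dependency:

```java
import java.time.LocalDate;

// DATE logical type: days since 1970-01-01, unpacked into the calendar
// fields that ProtoDateConverter feeds into com.google.type.Date.
public class EpochDaySketch {
    public static void main(String[] args) {
        int epochDays = 19072;                         // sample stored DATE value
        LocalDate d = LocalDate.ofEpochDay(epochDays);
        System.out.println(d);                         // 2022-03-21
        System.out.println("year=" + d.getYear()
            + " month=" + d.getMonthValue()
            + " day=" + d.getDayOfMonth());
    }
}
```

Using the fully qualified `com.google.type.Date` in the converter, as the comment argues, keeps this target type unambiguous relative to `java.util.Date` and `java.time.LocalDate`.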





> Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay
> ---
>
> Key: PARQUET-2042
> URL: https://issues.apache.org/jira/browse/PARQUET-2042
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Reporter: Michael Wong
>Priority: Major
>
> Related to https://issues.apache.org/jira/browse/PARQUET-1595




[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509744#comment-17509744
 ] 

ASF GitHub Bot commented on PARQUET-2042:
-

mwong38 commented on a change in pull request #900:
URL: https://github.com/apache/parquet-mr/pull/900#discussion_r830942166



##
File path: 
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##
@@ -427,6 +485,218 @@ public void addBinary(Binary binary) {
 
   }
 
+  final class ProtoTimestampConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+final LogicalTypeAnnotation.TimestampLogicalTypeAnnotation 
logicalTypeAnnotation;
+
+public ProtoTimestampConverter(ParentValueContainer parent, 
LogicalTypeAnnotation.TimestampLogicalTypeAnnotation logicalTypeAnnotation) {
+  this.parent = parent;
+  this.logicalTypeAnnotation = logicalTypeAnnotation;
+}
+
+@Override
+public void addLong(long value) {
+  switch (logicalTypeAnnotation.getUnit()) {
+case MICROS:
+  parent.add(Timestamps.fromMicros(value));
+  break;
+case MILLIS:
+  parent.add(Timestamps.fromMillis(value));
+  break;
+case NANOS:
+  parent.add(Timestamps.fromNanos(value));
+  break;
+  }
+}
+  }
+
+  final class ProtoDateConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoDateConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addInt(int value) {
+  LocalDate localDate = LocalDate.ofEpochDay(value);
+  com.google.type.Date date = com.google.type.Date.newBuilder()
+.setYear(localDate.getYear())
+.setMonth(localDate.getMonthValue())
+.setDay(localDate.getDayOfMonth())
+.build();
+  parent.add(date);
+}
+  }
+
+  final class ProtoTimeConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+final LogicalTypeAnnotation.TimeLogicalTypeAnnotation 
logicalTypeAnnotation;
+
+public ProtoTimeConverter(ParentValueContainer parent, 
LogicalTypeAnnotation.TimeLogicalTypeAnnotation logicalTypeAnnotation) {
+  this.parent = parent;
+  this.logicalTypeAnnotation = logicalTypeAnnotation;
+}
+
+@Override
+public void addLong(long value) {
+  LocalTime localTime;
+  switch (logicalTypeAnnotation.getUnit()) {
+case MILLIS:
+  localTime = LocalTime.ofNanoOfDay(value * 1_000_000);
+  break;
+case MICROS:
+  localTime = LocalTime.ofNanoOfDay(value * 1_000);
+  break;
+case NANOS:
+  localTime = LocalTime.ofNanoOfDay(value);
+  break;
+default:
+  throw new IllegalArgumentException("Unrecognized TimeUnit: " + 
logicalTypeAnnotation.getUnit());
+  }
+  com.google.type.TimeOfDay timeOfDay = 
com.google.type.TimeOfDay.newBuilder()
+.setHours(localTime.getHour())
+.setMinutes(localTime.getMinute())
+.setSeconds(localTime.getSecond())
+.setNanos(localTime.getNano())
+.build();
+  parent.add(timeOfDay);
+}
+  }
+
+  final class ProtoDoubleValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoDoubleValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addDouble(double value) {
+  parent.add(DoubleValue.of(value));
+}
+  }
+
+  final class ProtoFloatValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoFloatValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addFloat(float value) {
+  parent.add(FloatValue.of(value));
+}
+  }
+
+  final class ProtoInt64ValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoInt64ValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addLong(long value) {
+  parent.add(Int64Value.of(value));
+}
+  }
+
+  final class ProtoUInt64ValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoUInt64ValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addLong(long value) {
+  parent.add(UInt64Value.of(value));
+}
+  }
+
+  final class ProtoInt32ValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoInt32ValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addInt(int value) {
+  parent.add(Int32Value.of(value));
+}
+  }
+
+  final class ProtoUInt32ValueConverter extends 

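The ProtoTimeConverter in the hunk above normalizes the annotated TIME unit (MILLIS/MICROS/NANOS) to nanos-of-day before decoding with `java.time`. A self-contained sketch of that normalization, with the unit passed as a plain string instead of parquet-mr's `LogicalTypeAnnotation.TimeUnit` enum:

```java
import java.time.LocalTime;

public class TimeUnitDemo {
    // Mirrors ProtoTimeConverter.addLong: scale the stored value to
    // nanoseconds of day, then let java.time split it into h/m/s/nanos
    // (the fields com.google.type.TimeOfDay carries).
    static LocalTime toLocalTime(long value, String unit) {
        switch (unit) {
            case "MILLIS": return LocalTime.ofNanoOfDay(value * 1_000_000L);
            case "MICROS": return LocalTime.ofNanoOfDay(value * 1_000L);
            case "NANOS":  return LocalTime.ofNanoOfDay(value);
            default: throw new IllegalArgumentException("Unrecognized TimeUnit: " + unit);
        }
    }

    public static void main(String[] args) {
        // 10:17:36.789 expressed as millis-of-day
        long millisOfDay = ((10L * 60 + 17) * 60 + 36) * 1000 + 789;
        System.out.println(toLocalTime(millisOfDay, "MILLIS")); // 10:17:36.789
    }
}
```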
[GitHub] [parquet-mr] mwong38 commented on a change in pull request #900: PARQUET-2042: Add support for unwrapping common Protobuf wrappers and…

2022-03-21 Thread GitBox


mwong38 commented on a change in pull request #900:
URL: https://github.com/apache/parquet-mr/pull/900#discussion_r830942166



##
File path: 
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##
@@ -427,6 +485,218 @@ public void addBinary(Binary binary) {
 
   }
 
+  final class ProtoTimestampConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+final LogicalTypeAnnotation.TimestampLogicalTypeAnnotation 
logicalTypeAnnotation;
+
+public ProtoTimestampConverter(ParentValueContainer parent, 
LogicalTypeAnnotation.TimestampLogicalTypeAnnotation logicalTypeAnnotation) {
+  this.parent = parent;
+  this.logicalTypeAnnotation = logicalTypeAnnotation;
+}
+
+@Override
+public void addLong(long value) {
+  switch (logicalTypeAnnotation.getUnit()) {
+case MICROS:
+  parent.add(Timestamps.fromMicros(value));
+  break;
+case MILLIS:
+  parent.add(Timestamps.fromMillis(value));
+  break;
+case NANOS:
+  parent.add(Timestamps.fromNanos(value));
+  break;
+  }
+}
+  }
+
+  final class ProtoDateConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoDateConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addInt(int value) {
+  LocalDate localDate = LocalDate.ofEpochDay(value);
+  com.google.type.Date date = com.google.type.Date.newBuilder()
+.setYear(localDate.getYear())
+.setMonth(localDate.getMonthValue())
+.setDay(localDate.getDayOfMonth())
+.build();
+  parent.add(date);
+}
+  }
+
+  final class ProtoTimeConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+final LogicalTypeAnnotation.TimeLogicalTypeAnnotation 
logicalTypeAnnotation;
+
+public ProtoTimeConverter(ParentValueContainer parent, 
LogicalTypeAnnotation.TimeLogicalTypeAnnotation logicalTypeAnnotation) {
+  this.parent = parent;
+  this.logicalTypeAnnotation = logicalTypeAnnotation;
+}
+
+@Override
+public void addLong(long value) {
+  LocalTime localTime;
+  switch (logicalTypeAnnotation.getUnit()) {
+case MILLIS:
+  localTime = LocalTime.ofNanoOfDay(value * 1_000_000);
+  break;
+case MICROS:
+  localTime = LocalTime.ofNanoOfDay(value * 1_000);
+  break;
+case NANOS:
+  localTime = LocalTime.ofNanoOfDay(value);
+  break;
+default:
+  throw new IllegalArgumentException("Unrecognized TimeUnit: " + 
logicalTypeAnnotation.getUnit());
+  }
+  com.google.type.TimeOfDay timeOfDay = 
com.google.type.TimeOfDay.newBuilder()
+.setHours(localTime.getHour())
+.setMinutes(localTime.getMinute())
+.setSeconds(localTime.getSecond())
+.setNanos(localTime.getNano())
+.build();
+  parent.add(timeOfDay);
+}
+  }
+
+  final class ProtoDoubleValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoDoubleValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addDouble(double value) {
+  parent.add(DoubleValue.of(value));
+}
+  }
+
+  final class ProtoFloatValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoFloatValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addFloat(float value) {
+  parent.add(FloatValue.of(value));
+}
+  }
+
+  final class ProtoInt64ValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoInt64ValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addLong(long value) {
+  parent.add(Int64Value.of(value));
+}
+  }
+
+  final class ProtoUInt64ValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoUInt64ValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addLong(long value) {
+  parent.add(UInt64Value.of(value));
+}
+  }
+
+  final class ProtoInt32ValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoInt32ValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addInt(int value) {
+  parent.add(Int32Value.of(value));
+}
+  }
+
+  final class ProtoUInt32ValueConverter extends PrimitiveConverter {
+
+final ParentValueContainer parent;
+
+public ProtoUInt32ValueConverter(ParentValueContainer parent) {
+  this.parent = parent;
+}
+
+@Override
+public void addLong(long value) {
+  




[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509738#comment-17509738
 ] 

ASF GitHub Bot commented on PARQUET-2042:
-

sheinbergon commented on pull request #900:
URL: https://github.com/apache/parquet-mr/pull/900#issuecomment-1073708633


   @mwong38 let me know if you want me to help in any way




> Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay
> ---
>
> Key: PARQUET-2042
> URL: https://issues.apache.org/jira/browse/PARQUET-2042
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-protobuf
>Reporter: Michael Wong
>Priority: Major
>
> Related to https://issues.apache.org/jira/browse/PARQUET-1595



--
This message was sent by Atlassian Jira
(v8.20.1#820001)





[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509660#comment-17509660
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

ggershinsky commented on a change in pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r830823180



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/SchemaCompatibilityValidator.java
##
@@ -170,6 +174,24 @@ public Void visit(Not not) {
 
   private <T extends Comparable<T>> void validateColumn(Column<T> column) {
     ColumnPath path = column.getColumnPath();
+    HashSet<Type.ID> ids = new HashSet<>();

Review comment:
   nit: this can be introduced after the `if`

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
##
@@ -472,7 +488,13 @@ public static boolean 
getPageWriteChecksumEnabled(Configuration conf) {
 LOG.debug("Parquet properties are:\n{}", props);
 
 WriteContext fileWriteContext = writeSupport.init(conf);
-
+
+    MessageType schema = fileWriteContext.getSchema();
+    HashSet<Type.ID> ids = new HashSet<>();

Review comment:
   this variable is not used outside `checkDuplicateId`; is there a way to 
hide it?

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
##
@@ -878,11 +880,92 @@ public String getFile() {
 return blocks;
   }
 
-  public void setRequestedSchema(MessageType projection) {
+  private boolean uniqueId(GroupType schema, HashSet<Type.ID> ids) {
+    boolean unique = true;
+    List<Type> fields = schema.getFields();
+    for (Type field : fields) {
+      if (field instanceof PrimitiveType) {
+        Type.ID id = field.getId();
+        if (id != null) {
+          if (ids.contains(id)) {
+            return false;
+          }
+          ids.add(id);
+        }
+      }
+
+      if (field instanceof GroupType) {
+        Type.ID id = field.getId();
+        if (id != null) {
+          if (ids.contains(id)) {
+            return false;
+          }
+          ids.add(id);
+        }
+        if (unique) unique = uniqueId(field.asGroupType(), ids);
+      }
+    }
+    return unique;
+  }
+
+  public MessageType setRequestedSchema(MessageType projection) {
     paths.clear();
-    for (ColumnDescriptor col : projection.getColumns()) {
+    HashSet<Type.ID> ids = new HashSet<>();

Review comment:
   this variable is not used outside `uniqueId`; is there a way to hide it?

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -181,7 +181,7 @@ public void initialize(ParquetFileReader reader, 
ParquetReadOptions options) {
 this.columnCount = requestedSchema.getPaths().size();
 // Setting the projection schema before running any filtering (e.g. 
getting filtered record count)
 // because projection impacts filtering
-reader.setRequestedSchema(requestedSchema);
+this.requestedSchema = reader.setRequestedSchema(requestedSchema);

Review comment:
   `this.requestedSchema` is set in the line 180; so this line overrides 
the class field setting. Maybe you can use a temp local variable (with 
appropriate name) in the line 180






> Column resolution by ID
> ---
>
> Key: PARQUET-2006
> URL: https://issues.apache.org/jira/browse/PARQUET-2006
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Parquet relies on column names for resolution. In many usages, e.g. schema
> resolution, this is a problem. Iceberg uses IDs and stores ID/name mappings.
> This Jira adds column ID resolution support.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
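Two of the review comments above ask whether the `HashSet` accumulator can be hidden, since callers never use it. One common shape for that, sketched here with a simplified stand-in for parquet-mr's `Type`/`GroupType` tree (the `Field` class below is hypothetical, not the PR's code): keep the recursive overload private and let a single-argument public entry point allocate the set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueIdCheck {
    // Simplified stand-in for parquet-mr's Type/GroupType: an optional id
    // (like Type.ID, which may be null) and nested children.
    static final class Field {
        final Integer id;
        final List<Field> children;
        Field(Integer id, Field... kids) { this.id = id; this.children = Arrays.asList(kids); }
    }

    // Public API: the accumulator never appears in the signature.
    static boolean hasUniqueIds(Field root) {
        return hasUniqueIds(root, new HashSet<>());
    }

    // Private recursion carries the set as an implementation detail.
    private static boolean hasUniqueIds(Field node, Set<Integer> seen) {
        if (node.id != null && !seen.add(node.id)) {
            return false; // Set.add returns false on a duplicate
        }
        for (Field child : node.children) {
            if (!hasUniqueIds(child, seen)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Field ok  = new Field(null, new Field(1), new Field(2, new Field(3)));
        Field dup = new Field(null, new Field(1), new Field(2, new Field(1)));
        System.out.println(hasUniqueIds(ok) + " " + hasUniqueIds(dup)); // true false
    }
}
```

The same pattern would apply to `uniqueId(...)` and `checkDuplicateId(...)` in the diffs above: only the wrapper stays public, and the early return on `Set.add` also removes the separate `contains` check.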





[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509653#comment-17509653
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

ggershinsky commented on a change in pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r830820621



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
##
@@ -878,11 +880,92 @@ public String getFile() {
 return blocks;
   }
 
-  public void setRequestedSchema(MessageType projection) {
+  private boolean uniqueId(GroupType schema, HashSet<Type.ID> ids) {
+    boolean unique = true;
+    List<Type> fields = schema.getFields();
+    for (Type field : fields) {
+      if (field instanceof PrimitiveType) {
+        Type.ID id = field.getId();
+        if (id != null) {
+          if (ids.contains(id)) {
+            return false;
+          }
+          ids.add(id);
+        }
+      }
+
+      if (field instanceof GroupType) {
+        Type.ID id = field.getId();
+        if (id != null) {
+          if (ids.contains(id)) {
+            return false;
+          }
+          ids.add(id);
+        }
+        if (unique) unique = uniqueId(field.asGroupType(), ids);
+      }
+    }
+    return unique;
+  }
+
+  public MessageType setRequestedSchema(MessageType projection) {
     paths.clear();
-    for (ColumnDescriptor col : projection.getColumns()) {
+    HashSet<Type.ID> ids = new HashSet<>();
+    boolean fileSchemaIdUnique = uniqueId(fileMetaData.getSchema(), ids);
+    ids = new HashSet<>();
+    boolean projectionSchemaIdUnique = uniqueId(projection, ids);
+    MessageType schema = null;
+    // if ids are unique, use id resolution

Review comment:
   +1 to an explicit control / visibility






> Column resolution by ID
> ---
>
> Key: PARQUET-2006
> URL: https://issues.apache.org/jira/browse/PARQUET-2006
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Parquet relies on column names for resolution. In many usages, e.g. schema
> resolution, this is a problem. Iceberg uses IDs and stores ID/name mappings.
> This Jira adds column ID resolution support.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
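The `// if ids are unique, use id resolution` branch that drew the "+1 to an explicit control / visibility" comment switches resolution modes silently. A sketch of the alternative the reviewers seem to prefer; this decision logic is an illustration of the suggestion, not what the PR implements: the caller opts in explicitly, and missing or duplicated IDs fail fast instead of falling back to names.

```java
public class ResolutionModeDemo {
    enum ColumnResolution { BY_NAME, BY_ID }

    // Caller-visible decision: ID resolution happens only when explicitly
    // requested, and non-unique IDs are an error rather than a silent
    // fallback to name-based resolution.
    static ColumnResolution choose(boolean idResolutionRequested,
                                   boolean fileIdsUnique,
                                   boolean projectionIdsUnique) {
        if (!idResolutionRequested) {
            return ColumnResolution.BY_NAME;
        }
        if (!fileIdsUnique || !projectionIdsUnique) {
            throw new IllegalStateException(
                "ID resolution requested but column IDs are missing or duplicated");
        }
        return ColumnResolution.BY_ID;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, false, false)); // BY_NAME
        System.out.println(choose(true, true, true));    // BY_ID
    }
}
```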





[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509652#comment-17509652
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

ggershinsky edited a comment on pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#issuecomment-1073553770


   hi @huaxingao , can you describe the lifecycle of the column IDs at a high 
level, either in the PR description, or in a comment? Where these IDs are 
stored (if in footer - which struct/field)? How are they set and written? Is 
the writer app expected to verify the uniqueness, or it can use this PR code 
for that? How the column IDs are read and used (is the reader app expected to 
do anything beyond using this PR code)?
   I think the answer to the last question is mostly provided, but it doesn't 
explicitly say what IDs are used (where they are stored / read from).




> Column resolution by ID
> ---
>
> Key: PARQUET-2006
> URL: https://issues.apache.org/jira/browse/PARQUET-2006
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Parquet relies on column names for resolution. In many usages, e.g. schema
> resolution, this is a problem. Iceberg uses IDs and stores ID/name mappings.
> This Jira adds column ID resolution support.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[jira] [Commented] (PARQUET-2006) Column resolution by ID

2022-03-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509651#comment-17509651
 ] 

ASF GitHub Bot commented on PARQUET-2006:
-

ggershinsky commented on pull request #950:
URL: https://github.com/apache/parquet-mr/pull/950#issuecomment-1073553770


   hi @huaxingao , can you describe the lifecycle of the column IDs at a high 
level, either in the PR description, or in a comment? Where these IDs are 
stored (if in footer - which struct/field)? How are they set and written? Is 
the writer app expected to verify the uniqueness, or it can use this PR code 
for that? How are the column IDs read and used (is the reader app expected to 
do anything beyond using this PR code)?
   I think the answer to the last question is mostly provided, but it doesn't 
explicitly say what IDs are used (where they are stored / read from).




> Column resolution by ID
> ---
>
> Key: PARQUET-2006
> URL: https://issues.apache.org/jira/browse/PARQUET-2006
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Parquet relies on column names for resolution. In many usages, e.g. schema
> resolution, this is a problem. Iceberg uses IDs and stores ID/name mappings.
> This Jira adds column ID resolution support.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

