[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=364561=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-364561 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 30/Dec/19 07:54
Start Date: 30/Dec/19 07:54
Worklog Time Spent: 10m
Work Description: stale[bot] commented on pull request #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 364561)
Time Spent: 1.5h (was: 1h 20m)

> Add infer schema support in ParquetIO and refactor ParquetTableProvider
> ---
>
> Key: BEAM-8344
> URL: https://issues.apache.org/jira/browse/BEAM-8344
> Project: Beam
> Issue Type: Improvement
> Components: dsl-sql, io-java-parquet
> Reporter: Vishwas
> Assignee: Vishwas
> Priority: Major
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Add support for inferring Beam Schema in ParquetIO.
> Refactor ParquetTable code to use Convert.rows().
> Remove unnecessary java class GenericRecordReadConverter.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=364560=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-364560 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 30/Dec/19 07:54
Start Date: 30/Dec/19 07:54
Worklog Time Spent: 10m
Work Description: stale[bot] commented on issue #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#issuecomment-569608337

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Issue Time Tracking
---
Worklog Id: (was: 364560)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=362420=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-362420 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 23/Dec/19 04:57
Start Date: 23/Dec/19 04:57
Worklog Time Spent: 10m
Work Description: stale[bot] commented on issue #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#issuecomment-568354908

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@beam.apache.org list. Thank you for your contributions.

Issue Time Tracking
---
Worklog Id: (was: 362420)
Time Spent: 1h 10m (was: 1h)
[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=333070=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333070 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 24/Oct/19 04:31
Start Date: 24/Oct/19 04:31
Worklog Time Spent: 10m
Work Description: bmv126 commented on issue #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#issuecomment-545739303

R: @amaliujia
R: @reuvenlax
R: @jbonofre

Issue Time Tracking
---
Worklog Id: (was: 333070)
Time Spent: 1h (was: 50m)
[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=329744=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-329744 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 17/Oct/19 10:48
Start Date: 17/Oct/19 10:48
Worklog Time Spent: 10m
Work Description: bmv126 commented on issue #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#issuecomment-543117923

@amaliujia @reuvenlax, can you have a look into this?

Issue Time Tracking
---
Worklog Id: (was: 329744)
Time Spent: 50m (was: 40m)
[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=327643=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-327643 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 14/Oct/19 07:37
Start Date: 14/Oct/19 07:37
Worklog Time Spent: 10m
Work Description: bmv126 commented on pull request #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#discussion_r334354220

## File path: sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
## @@ -122,15 +122,34 @@
    * pattern).
    */
   public static Read read(Schema schema) {
-    return new AutoValue_ParquetIO_Read.Builder().setSchema(schema).build();
+    return new AutoValue_ParquetIO_Read.Builder()
+        .setSchema(schema)
+        .setInferBeamSchema(false)
+        .build();
   }

   /**
    * Like {@link #read(Schema)}, but reads each file in a {@link PCollection} of {@link
    * org.apache.beam.sdk.io.FileIO.ReadableFile}, which allows more flexible usage.
    */
   public static ReadFiles readFiles(Schema schema) {
-    return new AutoValue_ParquetIO_ReadFiles.Builder().setSchema(schema).build();
+    return new AutoValue_ParquetIO_ReadFiles.Builder()
+        .setSchema(schema)
+        .setInferBeamSchema(false)
+        .build();
+  }
+
+  private static PCollection setBeamSchema(
+      PCollection pc, Class clazz, @Nullable Schema schema) {
+    org.apache.beam.sdk.schemas.Schema beamSchema =
+        org.apache.beam.sdk.schemas.utils.AvroUtils.getSchema(clazz, schema);
+    if (beamSchema != null) {

Review comment:
@amaliujia I have made the modification. Can you have a look?

Issue Time Tracking
---
Worklog Id: (was: 327643)
Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=325132=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325132 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 08/Oct/19 15:40
Start Date: 08/Oct/19 15:40
Worklog Time Spent: 10m
Work Description: bmv126 commented on pull request #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#discussion_r332587458

## File path: sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
## @@ -122,15 +122,34 @@
    * pattern).
    */
   public static Read read(Schema schema) {
-    return new AutoValue_ParquetIO_Read.Builder().setSchema(schema).build();
+    return new AutoValue_ParquetIO_Read.Builder()
+        .setSchema(schema)
+        .setInferBeamSchema(false)
+        .build();
   }

   /**
    * Like {@link #read(Schema)}, but reads each file in a {@link PCollection} of {@link
    * org.apache.beam.sdk.io.FileIO.ReadableFile}, which allows more flexible usage.
    */
   public static ReadFiles readFiles(Schema schema) {
-    return new AutoValue_ParquetIO_ReadFiles.Builder().setSchema(schema).build();
+    return new AutoValue_ParquetIO_ReadFiles.Builder()
+        .setSchema(schema)
+        .setInferBeamSchema(false)
+        .build();
+  }
+
+  private static PCollection setBeamSchema(
+      PCollection pc, Class clazz, @Nullable Schema schema) {
+    org.apache.beam.sdk.schemas.Schema beamSchema =
+        org.apache.beam.sdk.schemas.utils.AvroUtils.getSchema(clazz, schema);
+    if (beamSchema != null) {

Review comment:
Thanks for reviewing the code. The idea here was to align it with AvroIO (https://github.com/apache/beam/blob/ad5d3836a47fe2cbd552fe3908e15ffc7f777f11/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L339).

Currently, AvroUtils.getSchema() returns null if the type is anything other than GenericRecord. In ParquetIO we currently handle only GenericRecord, so this method will always return a schema.

Given your suggestion to make schema inference non-optional, and since we handle only GenericRecord in ParquetIO, I will modify the code to always set the beamSchema, without the boolean flag, as we do in AvroIO.

Issue Time Tracking
---
Worklog Id: (was: 325132)
Time Spent: 0.5h (was: 20m)
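The behaviour described in this comment can be sketched without Beam on the classpath. The following is a minimal illustration only, not the code from the pull request: `MiniSchema`, `MiniCollection`, `GenericRec`, and `inferSchema` are hypothetical stand-ins for Beam's schema, `PCollection`, Avro's `GenericRecord`, and `AvroUtils.getSchema()`. `inferSchema` only mimics the behaviour as stated in the comment (null for anything other than the generic record type) and makes no claim about the real implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Beam or Avro classes.
class SchemaAttachSketch {

  /** Stand-in for a Beam schema: an ordered list of field names. */
  static final class MiniSchema {
    final List<String> fieldNames;

    MiniSchema(List<String> fieldNames) {
      this.fieldNames = Collections.unmodifiableList(new ArrayList<>(fieldNames));
    }
  }

  /** Stand-in for Avro's GenericRecord type. */
  static final class GenericRec {}

  /** Stand-in for a PCollection; the schema stays null until one is attached. */
  static final class MiniCollection<T> {
    private MiniSchema schema;

    boolean hasSchema() {
      return schema != null;
    }

    MiniSchema getSchema() {
      return schema;
    }

    void setSchema(MiniSchema schema) {
      this.schema = schema;
    }
  }

  /** Mimics the behaviour described in the comment: null unless the type is the generic record. */
  static MiniSchema inferSchema(Class<?> clazz, List<String> avroFieldNames) {
    if (!GenericRec.class.equals(clazz) || avroFieldNames == null) {
      return null;
    }
    return new MiniSchema(avroFieldNames);
  }

  /** Attaches a schema whenever inference succeeds; no boolean flag is consulted. */
  static <T> MiniCollection<T> setBeamSchema(
      MiniCollection<T> pc, Class<T> clazz, List<String> avroFieldNames) {
    MiniSchema inferred = inferSchema(clazz, avroFieldNames);
    if (inferred != null) {
      pc.setSchema(inferred);
    }
    return pc;
  }

  public static void main(String[] args) {
    List<String> fields = Arrays.asList("id", "name");
    MiniCollection<GenericRec> records =
        setBeamSchema(new MiniCollection<>(), GenericRec.class, fields);
    MiniCollection<String> strings =
        setBeamSchema(new MiniCollection<>(), String.class, fields);
    System.out.println("GenericRec has schema: " + records.hasSchema());
    System.out.println("String has schema: " + strings.hasSchema());
  }
}
```

In this model a failed inference simply leaves the collection without a schema, so the caller never has to opt in or out.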
[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=323638=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-323638 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 04/Oct/19 18:11
Start Date: 04/Oct/19 18:11
Worklog Time Spent: 10m
Work Description: amaliujia commented on pull request #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#discussion_r331624690

## File path: sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
## @@ -122,15 +122,34 @@
    * pattern).
    */
   public static Read read(Schema schema) {
-    return new AutoValue_ParquetIO_Read.Builder().setSchema(schema).build();
+    return new AutoValue_ParquetIO_Read.Builder()
+        .setSchema(schema)
+        .setInferBeamSchema(false)
+        .build();
   }

   /**
    * Like {@link #read(Schema)}, but reads each file in a {@link PCollection} of {@link
    * org.apache.beam.sdk.io.FileIO.ReadableFile}, which allows more flexible usage.
    */
   public static ReadFiles readFiles(Schema schema) {
-    return new AutoValue_ParquetIO_ReadFiles.Builder().setSchema(schema).build();
+    return new AutoValue_ParquetIO_ReadFiles.Builder()
+        .setSchema(schema)
+        .setInferBeamSchema(false)
+        .build();
+  }
+
+  private static PCollection setBeamSchema(
+      PCollection pc, Class clazz, @Nullable Schema schema) {
+    org.apache.beam.sdk.schemas.Schema beamSchema =
+        org.apache.beam.sdk.schemas.utils.AvroUtils.getSchema(clazz, schema);
+    if (beamSchema != null) {

Review comment:
Because you won't throw an exception here if there is no beamSchema, why not just make "inferSchema" a non-optional action, so we don't need to set the boolean to control it?

Issue Time Tracking
---
Worklog Id: (was: 323638)
Time Spent: 20m (was: 10m)
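To make the reviewer's point concrete, the sketch below models the flag-guarded design from the diff next to the flag-free alternative. It is a hypothetical illustration, not Beam code: `ReadConfig`, `withInferBeamSchema`, `schemaWithFlag`, and `schemaAlways` are made-up names, and schemas are reduced to plain strings. Only the `false` default mirrors the `setInferBeamSchema(false)` call in the diff.

```java
// Hypothetical model of the two designs discussed in the review; not Beam code.
class InferFlagSketch {

  /** Stand-in for the read transform's configuration. */
  static final class ReadConfig {
    final String avroSchema;
    final boolean inferBeamSchema;

    ReadConfig(String avroSchema, boolean inferBeamSchema) {
      this.avroSchema = avroSchema;
      this.inferBeamSchema = inferBeamSchema;
    }

    /** Returns a copy with the flag changed (hypothetical opt-in method). */
    ReadConfig withInferBeamSchema(boolean infer) {
      return new ReadConfig(avroSchema, infer);
    }
  }

  /** Mirrors the factory in the diff: the flag defaults to false. */
  static ReadConfig read(String avroSchema) {
    return new ReadConfig(avroSchema, false);
  }

  /** Flag-guarded design: a schema is attached only if the caller opted in. */
  static String schemaWithFlag(ReadConfig config, String inferred) {
    return (config.inferBeamSchema && inferred != null) ? inferred : null;
  }

  /** Flag-free design: the null check alone decides whether a schema is attached. */
  static String schemaAlways(String inferred) {
    return inferred; // null simply means "no schema attached"
  }

  public static void main(String[] args) {
    ReadConfig defaults = read("avro-schema");
    System.out.println(schemaWithFlag(defaults, "beam-schema"));
    System.out.println(schemaWithFlag(defaults.withInferBeamSchema(true), "beam-schema"));
    System.out.println(schemaAlways("beam-schema"));
  }
}
```

Because a failed inference is already a silent no-op in both variants, the flag only adds a second way to reach the same "no schema" outcome, which is the redundancy the review comment points at.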
[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=322438=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-322438 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 03/Oct/19 06:33
Start Date: 03/Oct/19 06:33
Worklog Time Spent: 10m
Work Description: bmv126 commented on pull request #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721

Task Details:
Add support for inferring Beam Schema in ParquetIO.
Refactor ParquetTable code to use Convert.rows().
Remove unnecessary java class GenericRecordReadConverter.

R: @reuvenlax
R: @amaliujia