[jira] [Created] (PARQUET-1599) Fix to-avro to respect the overwrite option
Kengo Seki created PARQUET-1599:
-----------------------------------

Summary: Fix to-avro to respect the overwrite option
Key: PARQUET-1599
URL: https://issues.apache.org/jira/browse/PARQUET-1599
Project: Parquet
Issue Type: Bug
Components: parquet-cli
Reporter: Kengo Seki

parquet-cli's {{to-avro}} has an {{--overwrite}} option, and it works as expected:

{code}
$ ls -l output
total 8
-rw-r--r-- 1 sekikn staff 2010 Jun 13 12:37 sample.avro
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main to-avro sample.parquet -o output/sample.avro --overwrite
$ ls -l output
total 8
-rw-r--r-- 1 sekikn staff 2010 Jun 13 12:38 sample.avro
{code}

But even without this flag, it overwrites the existing file with no warning.

{code}
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main to-avro sample.parquet -o output/sample.avro
$ ls -l output
total 8
-rw-r--r-- 1 sekikn staff 2010 Jun 13 12:39 sample.avro
{code}

This behaviour should be fixed for consistency with other subcommands.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
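The behaviour requested in PARQUET-1599 can be sketched in plain Java. This is only an illustration built on java.nio.file, not parquet-cli's actual code: the class OverwriteGuard and the method openOutput are hypothetical names, and the real command writes through Hadoop's file system API rather than local paths.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of parquet-cli.
public final class OverwriteGuard {
  private OverwriteGuard() {}

  /**
   * Opens the output file for writing.
   * With overwrite=true an existing file is truncated.
   * With overwrite=false, CREATE_NEW makes the call throw
   * java.nio.file.FileAlreadyExistsException if the file is already there.
   */
  public static OutputStream openOutput(Path path, boolean overwrite) throws IOException {
    if (overwrite) {
      return Files.newOutputStream(path,
          StandardOpenOption.CREATE,
          StandardOpenOption.TRUNCATE_EXISTING,
          StandardOpenOption.WRITE);
    }
    return Files.newOutputStream(path,
        StandardOpenOption.CREATE_NEW,
        StandardOpenOption.WRITE);
  }
}
```

Failing with an exception, rather than silently replacing the file, is the point of the sketch: the caller has to opt in to overwriting.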
[jira] [Created] (PARQUET-1598) Improve error message when convert-csv fails due to an invalid input file name
Kengo Seki created PARQUET-1598:
-----------------------------------

Summary: Improve error message when convert-csv fails due to an invalid input file name
Key: PARQUET-1598
URL: https://issues.apache.org/jira/browse/PARQUET-1598
Project: Parquet
Issue Type: Improvement
Components: parquet-cli
Reporter: Kengo Seki

I ran parquet-cli's {{convert-csv}} without the {{--schema}} option on an input file whose name starts with a numeric character, and got the following error:

{code}
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main convert-csv 0sample.csv -o sample.parquet
Unknown error
shaded.parquet.org.apache.avro.SchemaParseException: Illegal initial character: 0sample
	at shaded.parquet.org.apache.avro.Schema.validateName(Schema.java:1498)
	at shaded.parquet.org.apache.avro.Schema.access$200(Schema.java:86)
	at shaded.parquet.org.apache.avro.Schema$Name.<init>(Schema.java:645)
	at shaded.parquet.org.apache.avro.Schema.createRecord(Schema.java:182)
	at shaded.parquet.org.apache.avro.SchemaBuilder$RecordBuilder.fields(SchemaBuilder.java:1805)
	at org.apache.parquet.cli.csv.AvroCSV.inferSchemaInternal(AvroCSV.java:158)
	at org.apache.parquet.cli.csv.AvroCSV.inferNullableSchema(AvroCSV.java:78)
	at org.apache.parquet.cli.commands.ConvertCSVCommand.run(ConvertCSVCommand.java:160)
	at org.apache.parquet.cli.Main.run(Main.java:147)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.parquet.cli.Main.main(Main.java:177)
{code}

This is because {{convert-csv}} uses the input file name as the name for the output schema, while Avro requires its schema name to match the regex pattern {{[A-Za-z_][A-Za-z0-9_]*}}.
So users have to change the input file name or use the {{--schema}} option explicitly, but this is not obvious from the error message.

It'd be nice if the message were improved, or if invalid characters in the schema name were automatically replaced with valid ones to avoid this problem.
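The second suggestion in PARQUET-1598, deriving a valid schema name automatically, might look roughly like the sketch below. It is an assumption about one possible approach, not the fix that parquet-cli adopted: SchemaNames and sanitize are hypothetical names, and replacing invalid characters with an underscore is an arbitrary choice.

```java
// Hypothetical helper for illustration; not part of parquet-cli or Avro.
public final class SchemaNames {
  private SchemaNames() {}

  /**
   * Derives a name matching [A-Za-z_][A-Za-z0-9_]* from a file's base name.
   * The replacement character '_' is an assumption made for this sketch.
   */
  public static String sanitize(String baseName) {
    if (baseName == null || baseName.isEmpty()) {
      return "_";
    }
    // Replace every character outside [A-Za-z0-9_] with an underscore.
    String cleaned = baseName.replaceAll("[^A-Za-z0-9_]", "_");
    // Digits are allowed inside the name but not as its first character.
    char first = cleaned.charAt(0);
    if (first >= '0' && first <= '9') {
      cleaned = "_" + cleaned;
    }
    return cleaned;
  }
}
```

With this rule the file name from the report, 0sample, would become _0sample, and a name such as my-file would become my_file.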
[jira] [Created] (PARQUET-1597) Fix parquet-cli's wrong or missing usage examples
Kengo Seki created PARQUET-1597:
-----------------------------------

Summary: Fix parquet-cli's wrong or missing usage examples
Key: PARQUET-1597
URL: https://issues.apache.org/jira/browse/PARQUET-1597
Project: Parquet
Issue Type: Bug
Components: parquet-cli
Reporter: Kengo Seki

1. The following usage examples for parquet-cli's {{to-avro}} fail because they lack the {{-o}} option. In addition, "sample.parquet" in the second example should be "sample.avro".

{code}
  Examples:

    # Create an Avro file from a Parquet file
    parquet to-avro sample.parquet sample.avro

    # Create an Avro file in HDFS from a local JSON file
    parquet to-avro path/to/sample.json hdfs:/user/me/sample.parquet

    # Create an Avro file from data in S3
    parquet to-avro s3:/data/path/sample.parquet sample.avro
{code}

2. The same applies to {{convert-csv}}.

{code}
  Examples:

    # Create a Parquet file from a CSV file
    parquet convert-csv sample.csv sample.parquet --schema schema.avsc

    # Create a Parquet file in HDFS from local CSV
    parquet convert-csv path/to/sample.csv hdfs:/user/me/sample.parquet --schema schema.avsc

    # Create an Avro file from CSV data in S3
    parquet convert-csv s3:/data/path/sample.csv sample.avro --format avro --schema s3:/schemas/schema.avsc
{code}

3. The {{meta}} command has an "Examples:" heading but no content under it.

{code}
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main help meta

Usage: parquet [general options] meta [command options]

  Description:

    Print a Parquet file's metadata

  Examples:

{code}
Floating point data compression for Apache Parquet
Dear all,

thank you for your work on the Apache Parquet format.

We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating point data in Apache Parquet.
The current encodings and compression algorithms offered in Apache Parquet are heavily specialized towards integer and text data.
Thus there is an opportunity to reduce both I/O throughput requirements and space requirements for floating point data by selecting a specialized compression algorithm.

Currently, I am investigating the available literature and the publicly available floating point (fp) compressors.
I am writing a report on my findings: the available algorithms, their strengths and weaknesses, compression rates, compression speeds and decompression speeds, and licenses.
Once finished, I will share the report with you and propose which ones are, in my opinion, good candidates for Apache Parquet.

The goal is to add a solution for both 32-bit and 64-bit fp types.
I think that it would be beneficial to offer at the very least two distinct paths.
The first one should offer fast compression and decompression speed with some, but not significant, saving in space.
The second one should offer slower compression and decompression speed but a decent compression rate.
Both would be lossless.
A lossy path will be investigated further and discussed with the community.

If I get approval from you, the developers, I can continue with adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.

Please let me know what you think of this idea and whether you have any concerns with the plan.

Best regards,
Martin Radev
[jira] [Commented] (PARQUET-1596) PARQUET-1375 broke parquet-cli's to-avro command
[ https://issues.apache.org/jira/browse/PARQUET-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862367#comment-16862367 ]

Fokko Driesprong commented on PARQUET-1596:
-------------------------------------------

Thanks [~sekikn] I'll come up with a test and a fix tomorrow right away. Cheers!

> PARQUET-1375 broke parquet-cli's to-avro command
> ------------------------------------------------
>
> Key: PARQUET-1596
> URL: https://issues.apache.org/jira/browse/PARQUET-1596
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Reporter: Kengo Seki
> Assignee: Fokko Driesprong
> Priority: Major
Re: bloomfilter and tokenisation
Hi Manik,

You could store "raw" as a LIST (so you have to tokenize in your ETL step) instead of BYTE_ARRAY and you then reap dictionary encoding benefits.

- Wes

On Wed, Jun 12, 2019 at 12:08 PM Manik Singla wrote:
>
> could someone guide on this one
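Tokenizing in the ETL step, as suggested in the reply above, could look roughly like the following sketch. It only produces the list of terms that would then be written as a LIST column; the class name LogTokenizer, the lower-casing, and the rule of splitting on runs of non-alphanumeric ASCII characters are all assumptions made for illustration, and real log lines may call for a different rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical ETL helper for illustration; not part of Parquet.
public final class LogTokenizer {
  private LogTokenizer() {}

  /**
   * Splits a raw log line into lower-case terms.
   * Any run of characters outside [a-z0-9] is treated as a separator,
   * which is an assumption of this sketch (non-ASCII text is dropped).
   */
  public static List<String> tokenize(String line) {
    List<String> tokens = new ArrayList<>();
    for (String term : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      // split() yields an empty first element when the line starts with a separator.
      if (!term.isEmpty()) {
        tokens.add(term);
      }
    }
    return tokens;
  }
}
```

Because the same terms recur across many log lines, a column of such tokens has far fewer distinct values than the raw lines, which is what lets dictionary encoding apply.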
[jira] [Assigned] (PARQUET-1596) PARQUET-1375 broke parquet-cli's to-avro command
[ https://issues.apache.org/jira/browse/PARQUET-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fokko Driesprong reassigned PARQUET-1596:
-----------------------------------------

Assignee: Fokko Driesprong

> PARQUET-1375 broke parquet-cli's to-avro command
> ------------------------------------------------
>
> Key: PARQUET-1596
> URL: https://issues.apache.org/jira/browse/PARQUET-1596
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Reporter: Kengo Seki
> Assignee: Fokko Driesprong
> Priority: Major
Re: bloomfilter and tokenisation
could someone guide on this one

Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold well."

On Tue, Jun 11, 2019 at 5:58 PM Manik Singla wrote:

> Hey Team
>
> I have started using parquet recently.
>
> The kind of data I save is something like
>
> *raw hostname cluster serviceName*
>
> where raw is the actual log lines.
>
> For raw, dictionary encoding doesn't work, as no two log lines are the same.
> But if we tokenise terms in the dictionary, then the dictionary can help here
> to filter out unwanted rows. For example, "parquet is a columnar format"
> will become "parquet", "is", "a", "columnar", "format".
>
> Also, I see mention of merging bloomfilter, but I am not sure if we are
> considering tokenisation there.
>
> Do we support some out-of-the-box way to tokenise text before dictionary
> encoding?
>
> Also, what are your views if we think to add it?
>
> Regards
> Manik Singla
[jira] [Created] (PARQUET-1596) PARQUET-1375 broke parquet-cli's to-avro command
Kengo Seki created PARQUET-1596:
-----------------------------------

Summary: PARQUET-1375 broke parquet-cli's to-avro command
Key: PARQUET-1596
URL: https://issues.apache.org/jira/browse/PARQUET-1596
Project: Parquet
Issue Type: Bug
Components: parquet-cli
Reporter: Kengo Seki

Given the following JSON file:

{code}
$ cat /tmp/sample.json
{ "id": 1, "name": "Alice" }
{ "id": 2, "name": "Bob" }
{ "id": 3, "name": "Carol" }
{ "id": 4, "name": "Dave" }
{code}

using {{to-avro}} on the master branch for converting this into avro fails with NPE:

{code}
$ git branch -v
* master 47398be7 PARQUET-1375: Upgrade to Jackson 2.9.9 (#616)
$ mvn clean install -DskipTests
(snip)
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ parquet-cli ---
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT.jar
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/pom.xml to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT.pom
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT-tests.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT-tests.jar
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT-runtime.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT-runtime.jar
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 14.769 s
[INFO] Finished at: 2019-06-12T23:52:57+09:00
[INFO]
$ mvn dependency:copy-dependencies
(snip)
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main to-avro /tmp/sample.json -o /tmp/sample.avro
Unknown error
java.lang.RuntimeException: Failed on record 0
	at org.apache.parquet.cli.commands.ToAvroCommand.run(ToAvroCommand.java:120)
	at org.apache.parquet.cli.Main.run(Main.java:147)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.parquet.cli.Main.main(Main.java:177)
Caused by: java.lang.NullPointerException
	at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:153)
	at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:145)
	at org.apache.parquet.cli.commands.ToAvroCommand.run(ToAvroCommand.java:112)
	... 3 more
$ echo $?
1
{code}

But with its previous revision, it succeeds:

{code}
$ git checkout HEAD^
HEAD is now at 9d6fb45e PARQUET-1576 Bump Apache Avro to 1.9.0 (#638)
$ mvn clean install -DskipTests
(snip)
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ parquet-cli ---
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT.jar
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/pom.xml to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT.pom
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT-tests.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT-tests.jar
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT-runtime.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT-runtime.jar
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 15.822 s
[INFO] Finished at: 2019-06-12T23:57:04+09:00
[INFO]
$ mvn dependency:copy-dependencies
(snip)
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main to-avro /tmp/sample.json -o /tmp/sample.avro
$ echo $?
0
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main head /tmp/sample.avro
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
{"id": 3, "name": "Carol"}
{"id": 4, "name": "Dave"}
{code}

Reverting the following code

{code:title=AvroJson.java}
  public static Iterator parser(final InputStream stream) {
    try(JsonParser parser = FACTORY.createParser(stream)) {
{code}

to

{code}
  public static Iterator parser(final InputStream stream) {
    try {
      JsonParser parser = FACTORY.createParser(stream);
[jira] [Updated] (PARQUET-1499) [parquet-mr] Add Java 11 to Travis
[ https://issues.apache.org/jira/browse/PARQUET-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fokko Driesprong updated PARQUET-1499:
--------------------------------------
Fix Version/s: format-2.7.0

> [parquet-mr] Add Java 11 to Travis
> ----------------------------------
>
> Key: PARQUET-1499
> URL: https://issues.apache.org/jira/browse/PARQUET-1499
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
[jira] [Updated] (PARQUET-1590) [parquet-format] Add Java 11 to Travis
[ https://issues.apache.org/jira/browse/PARQUET-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fokko Driesprong updated PARQUET-1590:
--------------------------------------
Fix Version/s: format-2.7.0

> [parquet-format] Add Java 11 to Travis
> --------------------------------------
>
> Key: PARQUET-1590
> URL: https://issues.apache.org/jira/browse/PARQUET-1590
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
[jira] [Commented] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861933#comment-16861933 ]

Fokko Driesprong commented on PARQUET-1588:
-------------------------------------------

My mistake, thanks!

> Bump Apache Thrift to 0.12.0 in parquet-format
> ----------------------------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
[jira] [Resolved] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi resolved PARQUET-1588.
------------------------------------
Resolution: Fixed

> Bump Apache Thrift to 0.12.0 in parquet-format
> ----------------------------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
[jira] [Updated] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi updated PARQUET-1588:
-----------------------------------
Fix Version/s: format-2.7.0

> Bump Apache Thrift to 0.12.0 in parquet-format
> ----------------------------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
[jira] [Commented] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861932#comment-16861932 ]

Zoltan Ivanfi commented on PARQUET-1588:
----------------------------------------

It already existed, just not as "2.7.0" but as "format-2.7.0" instead.

> Bump Apache Thrift to 0.12.0 in parquet-format
> ----------------------------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
[jira] [Commented] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861930#comment-16861930 ]

Fokko Driesprong commented on PARQUET-1588:
-------------------------------------------

[~zi] Should we target this to version 2.7.0?

> Bump Apache Thrift to 0.12.0 in parquet-format
> ----------------------------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
[jira] [Comment Edited] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861930#comment-16861930 ]

Fokko Driesprong edited comment on PARQUET-1588 at 6/12/19 9:16 AM:
--------------------------------------------------------------------

[~zi] Should we target this to version 2.7.0? I don't have permissions to create a version.

was (Author: fokko):
[~zi] Should we target this to version 2.7.0?

> Bump Apache Thrift to 0.12.0 in parquet-format
> ----------------------------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
[jira] [Commented] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861891#comment-16861891 ]

ASF GitHub Bot commented on PARQUET-1588:
-----------------------------------------

zivanfi commented on pull request #133: PARQUET-1588: Bump to Apache Thrift 0.12.0
URL: https://github.com/apache/parquet-format/pull/133

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Bump Apache Thrift to 0.12.0 in parquet-format
> ----------------------------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
[jira] [Reopened] (PARQUET-1588) Bump Apache Thrift to 0.12.0
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi reopened PARQUET-1588:
------------------------------------

As we discussed, let's stick to your original approach of separate JIRA-s for parquet-mr and parquet-format to better track what gets released in which version.

> Bump Apache Thrift to 0.12.0
> ----------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format
[ https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi updated PARQUET-1588:
-----------------------------------
Summary: Bump Apache Thrift to 0.12.0 in parquet-format (was: Bump Apache Thrift to 0.12.0)

> Bump Apache Thrift to 0.12.0 in parquet-format
> ----------------------------------------------
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Fokko Driesprong
> Assignee: Fokko Driesprong
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (PARQUET-1595) Parquet proto writer de-nest Protobuf wrapper classes
Ying Xu created PARQUET-1595:
--------------------------------

Summary: Parquet proto writer de-nest Protobuf wrapper classes
Key: PARQUET-1595
URL: https://issues.apache.org/jira/browse/PARQUET-1595
Project: Parquet
Issue Type: Improvement
Components: parquet-mr
Reporter: Ying Xu

The existing Parquet protobuf writer preserves the structure of any Protobuf Message objects. This works well in most cases. However, when dealing with [Protobuf wrapper messages|https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/wrappers.proto], users may prefer directly writing the de-nested value into the Parquet files, for ease of querying them directly (in query engines such as Hive/Presto).

Proposal:
* Implement a control flag, e.g., enableDenestingProtoWrappers, to control whether or not to denest Protobuf wrapper classes.
* When this flag is set to true, write the Protobuf wrapper classes as single primitive fields, based on the type of the wrapped *value* field.

||Protobuf Type||Parquet Type||
|BoolValue|boolean|
|BytesValue|binary|
|DoubleValue|double|
|FloatValue|float|
|Int32Value|int64 (32-bit, signed)|
|Int64Value|int64 (64-bit, signed)|
|StringValue|binary (string)|
|UInt32Value|int64 (32-bit, unsigned)|
|UInt64Value|int64 (64-bit, unsigned)|
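The type mapping proposed in PARQUET-1595 can be written down as a small lookup, shown here as a sketch only. It uses plain strings instead of the Protobuf and Parquet APIs so that it stays self-contained; the class WrapperTypes and the method denestedType are hypothetical names, the flag name is taken from the proposal, and the target types are copied verbatim from the table in the issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustration of the proposed mapping; not parquet-mr code.
public final class WrapperTypes {
  private static final Map<String, String> PARQUET_TYPE = new LinkedHashMap<>();

  static {
    // Keys are the full names of the messages defined in wrappers.proto.
    PARQUET_TYPE.put("google.protobuf.BoolValue", "boolean");
    PARQUET_TYPE.put("google.protobuf.BytesValue", "binary");
    PARQUET_TYPE.put("google.protobuf.DoubleValue", "double");
    PARQUET_TYPE.put("google.protobuf.FloatValue", "float");
    PARQUET_TYPE.put("google.protobuf.Int32Value", "int64 (32-bit, signed)");
    PARQUET_TYPE.put("google.protobuf.Int64Value", "int64 (64-bit, signed)");
    PARQUET_TYPE.put("google.protobuf.StringValue", "binary (string)");
    PARQUET_TYPE.put("google.protobuf.UInt32Value", "int64 (32-bit, unsigned)");
    PARQUET_TYPE.put("google.protobuf.UInt64Value", "int64 (64-bit, unsigned)");
  }

  private WrapperTypes() {}

  /**
   * Returns the primitive type a wrapper message would be written as,
   * or empty when the flag is off or the message is not a wrapper
   * (in which case the nested structure would be preserved).
   */
  public static Optional<String> denestedType(String protoFullName,
                                              boolean enableDenestingProtoWrappers) {
    if (!enableDenestingProtoWrappers) {
      return Optional.empty();
    }
    return Optional.ofNullable(PARQUET_TYPE.get(protoFullName));
  }
}
```

Returning an empty Optional for the flag-off case keeps today's behaviour as the default, which matches the proposal's intent of making de-nesting opt-in.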