[jira] [Created] (PARQUET-1599) Fix to-avro to respect the overwrite option

2019-06-12 Thread Kengo Seki (JIRA)
Kengo Seki created PARQUET-1599:
---

 Summary: Fix to-avro to respect the overwrite option
 Key: PARQUET-1599
 URL: https://issues.apache.org/jira/browse/PARQUET-1599
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cli
Reporter: Kengo Seki


parquet-cli's {{to-avro}} has an {{--overwrite}} option, and it works as expected:

{code}
$ ls -l output
total 8
-rw-r--r--  1 sekikn  staff  2010 Jun 13 12:37 sample.avro
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main to-avro sample.parquet -o output/sample.avro --overwrite
$ ls -l output
total 8
-rw-r--r--  1 sekikn  staff  2010 Jun 13 12:38 sample.avro
{code}

But even without this flag, it overwrites the existing file with no warning.

{code}
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main to-avro sample.parquet -o output/sample.avro
$ ls -l output
total 8
-rw-r--r--  1 sekikn  staff  2010 Jun 13 12:39 sample.avro
{code}

This behaviour should be fixed for consistency with other subcommands.
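
The expected behaviour could be sketched roughly as follows, using only {{java.nio.file}} from the JDK. The class and method names ({{OutputGuard}}, {{openOutput}}) are hypothetical and not taken from parquet-cli; the real command writes through Hadoop's FileSystem API, so an actual fix would likely look different.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of parquet-cli.
final class OutputGuard {
  /** Opens the output file, refusing to replace an existing one unless overwrite is set. */
  static OutputStream openOutput(Path path, boolean overwrite) throws IOException {
    if (overwrite) {
      // Create the file if needed, otherwise truncate the existing one.
      return Files.newOutputStream(path, StandardOpenOption.CREATE,
          StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
    }
    // CREATE_NEW throws FileAlreadyExistsException if the file is already there.
    return Files.newOutputStream(path, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
  }

  public static void main(String[] args) throws IOException {
    Path existing = Files.createTempFile("sample", ".avro");
    try {
      openOutput(existing, false).close();
    } catch (FileAlreadyExistsException e) {
      System.out.println("Refusing to overwrite: " + e.getFile());
    }
    openOutput(existing, true).close(); // succeeds because overwrite was requested
    Files.delete(existing);
  }
}
```

Without the flag, the existing file is left untouched and the caller can report the refusal instead of silently replacing the output.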



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1598) Improve error message when convert-csv fails due to an invalid input file name

2019-06-12 Thread Kengo Seki (JIRA)
Kengo Seki created PARQUET-1598:
---

 Summary: Improve error message when convert-csv fails due to an invalid input file name
 Key: PARQUET-1598
 URL: https://issues.apache.org/jira/browse/PARQUET-1598
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cli
Reporter: Kengo Seki


I ran parquet-cli's {{convert-csv}} without the {{--schema}} option on an input 
file whose name starts with a digit, and got the following error:

{code}
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main convert-csv 0sample.csv -o sample.parquet
Unknown error
shaded.parquet.org.apache.avro.SchemaParseException: Illegal initial character: 0sample
    at shaded.parquet.org.apache.avro.Schema.validateName(Schema.java:1498)
    at shaded.parquet.org.apache.avro.Schema.access$200(Schema.java:86)
    at shaded.parquet.org.apache.avro.Schema$Name.<init>(Schema.java:645)
    at shaded.parquet.org.apache.avro.Schema.createRecord(Schema.java:182)
    at shaded.parquet.org.apache.avro.SchemaBuilder$RecordBuilder.fields(SchemaBuilder.java:1805)
    at org.apache.parquet.cli.csv.AvroCSV.inferSchemaInternal(AvroCSV.java:158)
    at org.apache.parquet.cli.csv.AvroCSV.inferNullableSchema(AvroCSV.java:78)
    at org.apache.parquet.cli.commands.ConvertCSVCommand.run(ConvertCSVCommand.java:160)
    at org.apache.parquet.cli.Main.run(Main.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.parquet.cli.Main.main(Main.java:177)
{code}

This is because {{convert-csv}} uses the input file name as the name of the 
output schema, while Avro requires a schema name to match the regex 
pattern {{[A-Za-z_][A-Za-z0-9_]*}}.
So users have to rename the input file or pass the {{--schema}} option 
explicitly, but this is not obvious from the error message.
It would be nice if the message were improved, or if invalid characters in the 
schema name were replaced automatically to avoid this problem.
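
One possible shape for the automatic replacement is sketched below. This is an illustration only: the class {{SchemaNames}} and its methods are hypothetical names, not existing parquet-cli or Avro APIs, and replacing invalid characters with underscores is just one policy among several.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of parquet-cli or Avro.
final class SchemaNames {
  // The name pattern quoted in the description: [A-Za-z_][A-Za-z0-9_]*
  private static final Pattern VALID = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  /** Returns true if the name already satisfies the naming rule. */
  static boolean isValid(String name) {
    return VALID.matcher(name).matches();
  }

  /** Derives a valid schema name from a file name such as "0sample.csv". */
  static String fromFileName(String fileName) {
    // Drop the extension (everything from the last dot), if there is one.
    int dot = fileName.lastIndexOf('.');
    String base = dot > 0 ? fileName.substring(0, dot) : fileName;
    // Replace every character outside [A-Za-z0-9_] with an underscore.
    String cleaned = base.replaceAll("[^A-Za-z0-9_]", "_");
    // Prefix an underscore if the result is empty or starts with a digit.
    if (cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0))) {
      cleaned = "_" + cleaned;
    }
    return cleaned;
  }

  public static void main(String[] args) {
    System.out.println(fromFileName("0sample.csv")); // prints "_0sample"
    System.out.println(fromFileName("my-data.csv")); // prints "my_data"
  }
}
```

A stricter alternative would be to keep the current behaviour and only call something like {{isValid}} up front, so that the command can print a message pointing at {{--schema}} instead of a stack trace.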





[jira] [Created] (PARQUET-1597) Fix parquet-cli's wrong or missing usage examples

2019-06-12 Thread Kengo Seki (JIRA)
Kengo Seki created PARQUET-1597:
---

 Summary: Fix parquet-cli's wrong or missing usage examples
 Key: PARQUET-1597
 URL: https://issues.apache.org/jira/browse/PARQUET-1597
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cli
Reporter: Kengo Seki


1. The following usage examples for parquet-cli's {{to-avro}} fail because they 
lack the {{-o}} option.
   In addition, "sample.parquet" in the second example should be "sample.avro".

{code}
  Examples:

# Create an Avro file from a Parquet file
parquet to-avro sample.parquet sample.avro

# Create an Avro file in HDFS from a local JSON file
parquet to-avro path/to/sample.json hdfs:/user/me/sample.parquet

# Create an Avro file from data in S3
parquet to-avro s3:/data/path/sample.parquet sample.avro
{code}

2. The same applies to {{convert-csv}}.

{code}
  Examples:

# Create a Parquet file from a CSV file
parquet convert-csv sample.csv sample.parquet --schema schema.avsc

# Create a Parquet file in HDFS from local CSV
parquet convert-csv path/to/sample.csv hdfs:/user/me/sample.parquet --schema schema.avsc

# Create an Avro file from CSV data in S3
parquet convert-csv s3:/data/path/sample.csv sample.avro --format avro --schema s3:/schemas/schema.avsc
{code}

3. The meta command has an "Examples:" heading but lacks its content.

{code}
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main help meta

Usage: parquet [general options] meta  [command options]

  Description:

Print a Parquet file's metadata

  Examples:

{code}





Floating point data compression for Apache Parquet

2019-06-12 Thread Radev, Martin
Dear all,

thank you for your work on the Apache Parquet format.

We are a group of students at the Technical University of Munich who would like 
to extend the available compression and encoding options for 32-bit and 64-bit 
floating point data in Apache Parquet.
The current encodings and compression algorithms offered in Apache Parquet are 
heavily specialized towards integer and text data.
Thus there is an opportunity to reduce both the I/O throughput and the storage 
space needed for floating point data by selecting a specialized compression 
algorithm.

I am currently surveying the available literature and the publicly available 
FP compressors, and writing a report on my findings: the available algorithms, 
their strengths and weaknesses, compression ratios, compression and 
decompression speeds, and licenses. 
Once it is finished I will share the report with you and propose which ones 
are, in my opinion, good candidates for Apache Parquet.

The goal is to add a solution for both 32-bit and 64-bit FP types. I think that 
it would be beneficial to offer at the very least two distinct paths. The first 
one should offer fast compression and decompression with a modest saving in 
space. The second one should offer slower compression and decompression but a 
decent compression ratio. Both would be lossless. A lossy path will be 
investigated further and discussed with the community.

If I get approval from you, the developers, I can continue with adding 
support for the new encoding/compression options in the C++ implementation of 
Apache Parquet in Apache Arrow.

Please let me know what you think of this idea and whether you have any 
concerns with the plan.

Best regards,
Martin Radev



[jira] [Commented] (PARQUET-1596) PARQUET-1375 broke parquet-cli's to-avro command

2019-06-12 Thread Fokko Driesprong (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862367#comment-16862367
 ] 

Fokko Driesprong commented on PARQUET-1596:
---

Thanks [~sekikn], I'll come up with a test and a fix first thing tomorrow. Cheers!

> PARQUET-1375 broke parquet-cli's to-avro command
> 
>
> Key: PARQUET-1596
> URL: https://issues.apache.org/jira/browse/PARQUET-1596
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Reporter: Kengo Seki
>Assignee: Fokko Driesprong
>Priority: Major
>

Re: bloomfilter and tokenisation

2019-06-12 Thread Wes McKinney
Hi Manik,

You could store "raw" as a LIST (so you have to tokenize
in your ETL step) instead of BYTE_ARRAY and you then reap dictionary
encoding benefits.
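
As a rough illustration of that ETL-side tokenisation (a sketch only: the class name LogTokenizer and the whitespace-based splitting rule are assumptions, and Parquet itself does not provide this step):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical ETL-side helper for illustration; not part of Parquet.
final class LogTokenizer {
  /** Splits a raw log line on runs of whitespace and drops empty tokens. */
  static List<String> tokenize(String rawLine) {
    List<String> tokens = new ArrayList<>();
    for (String token : rawLine.trim().split("\\s+")) {
      if (!token.isEmpty()) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // prints [parquet, is, a, columnar, format]
    System.out.println(tokenize("parquet is a columnar format"));
  }
}
```

The resulting token list is what would be written to the list column in place of the single raw string.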

- Wes

On Wed, Jun 12, 2019 at 12:08 PM Manik Singla  wrote:
>
> Could someone guide me on this one?
>
> Regards
> Manik Singla
> +91-9996008893
> +91-9665639677
>
> "Life doesn't consist in holding good cards but playing those you hold
> well."
>
>
> On Tue, Jun 11, 2019 at 5:58 PM Manik Singla  wrote:
>
> > Hey Team
> >
> > I have started using parquet recently.
> >
> > The kind of data I save is something like
> >
> > *raw   hostname cluster serviceName  *
> >
> > where raw is the actual log lines.
> >
> > For raw, dictionary encoding doesn't work, as no two log lines are the same.
> > But if we tokenise the lines and put the terms in the dictionary, then the
> > dictionary can help here to filter out unwanted rows.  For example,
> > "parquet is a columnar format" will become
> > "parquet", "is", "a", "columnar", "format".
> >
> > Also, I see mention of merging bloom filters; I am not sure whether
> > tokenisation is being considered there.
> >
> > Do we support some out-of-the-box way to tokenise text before dictionary
> > encoding?
> >
> > Also, what are your views on adding it?
> >
> > Regards
> > Manik Singla
> > +91-9996008893
> > +91-9665639677
> >
> > "Life doesn't consist in holding good cards but playing those you hold
> > well."
> >


[jira] [Assigned] (PARQUET-1596) PARQUET-1375 broke parquet-cli's to-avro command

2019-06-12 Thread Fokko Driesprong (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong reassigned PARQUET-1596:
-

Assignee: Fokko Driesprong

> PARQUET-1375 broke parquet-cli's to-avro command
> 
>
> Key: PARQUET-1596
> URL: https://issues.apache.org/jira/browse/PARQUET-1596
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Reporter: Kengo Seki
>Assignee: Fokko Driesprong
>Priority: Major
>

Re: bloomfilter and tokenisation

2019-06-12 Thread Manik Singla
Could someone guide me on this one?

Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."


On Tue, Jun 11, 2019 at 5:58 PM Manik Singla  wrote:

> Hey Team
>
> I have started using parquet recently.
>
> The kind of data I save is something like
>
> *raw   hostname cluster serviceName  *
>
> where raw is the actual log lines.
>
> For raw, dictionary encoding doesn't work, as no two log lines are the same.
> But if we tokenise the lines and put the terms in the dictionary, then the
> dictionary can help here to filter out unwanted rows.  For example,
> "parquet is a columnar format" will become
> "parquet", "is", "a", "columnar", "format".
>
> Also, I see mention of merging bloom filters; I am not sure whether
> tokenisation is being considered there.
>
> Do we support some out-of-the-box way to tokenise text before dictionary
> encoding?
>
> Also, what are your views on adding it?
>
> Regards
> Manik Singla
> +91-9996008893
> +91-9665639677
>
> "Life doesn't consist in holding good cards but playing those you hold
> well."
>


[jira] [Created] (PARQUET-1596) PARQUET-1375 broke parquet-cli's to-avro command

2019-06-12 Thread Kengo Seki (JIRA)
Kengo Seki created PARQUET-1596:
---

 Summary: PARQUET-1375 broke parquet-cli's to-avro command
 Key: PARQUET-1596
 URL: https://issues.apache.org/jira/browse/PARQUET-1596
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cli
Reporter: Kengo Seki


Given the following JSON file:

{code}
$ cat /tmp/sample.json 
{ "id": 1, "name": "Alice" }
{ "id": 2, "name": "Bob" }
{ "id": 3, "name": "Carol" }
{ "id": 4, "name": "Dave" }
{code}

using {{to-avro}} on the master branch to convert this into Avro fails with an 
NPE:

{code}
$ git branch -v
* master 47398be7 PARQUET-1375: Upgrade to Jackson 2.9.9 (#616)
$ mvn clean install -DskipTests

(snip)

[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ parquet-cli ---
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT.jar
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/pom.xml to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT.pom
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT-tests.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT-tests.jar
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT-runtime.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT-runtime.jar
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time:  14.769 s
[INFO] Finished at: 2019-06-12T23:52:57+09:00
[INFO] 
$ mvn dependency:copy-dependencies

(snip)

$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main to-avro /tmp/sample.json -o /tmp/sample.avro
Unknown error
java.lang.RuntimeException: Failed on record 0
    at org.apache.parquet.cli.commands.ToAvroCommand.run(ToAvroCommand.java:120)
    at org.apache.parquet.cli.Main.run(Main.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.parquet.cli.Main.main(Main.java:177)
Caused by: java.lang.NullPointerException
    at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:153)
    at org.apache.avro.file.DataFileWriter.create(DataFileWriter.java:145)
    at org.apache.parquet.cli.commands.ToAvroCommand.run(ToAvroCommand.java:112)
    ... 3 more
$ echo $?
1
{code}

But with its previous revision, it succeeds:

{code}
$ git checkout HEAD^
HEAD is now at 9d6fb45e PARQUET-1576 Bump Apache Avro to 1.9.0 (#638)
$ mvn clean install -DskipTests

(snip)

[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ parquet-cli ---
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT.jar
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/pom.xml to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT.pom
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT-tests.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT-tests.jar
[INFO] Installing /home/sekikn/repo/parquet-mr/parquet-cli/target/parquet-cli-1.12.0-SNAPSHOT-runtime.jar to /home/sekikn/.m2/repository/org/apache/parquet/parquet-cli/1.12.0-SNAPSHOT/parquet-cli-1.12.0-SNAPSHOT-runtime.jar
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time:  15.822 s
[INFO] Finished at: 2019-06-12T23:57:04+09:00
[INFO] 
$ mvn dependency:copy-dependencies

(snip)

$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main to-avro /tmp/sample.json -o /tmp/sample.avro
$ echo $?
0
$ java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main head /tmp/sample.avro
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
{"id": 3, "name": "Carol"}
{"id": 4, "name": "Dave"}
{code}

Reverting the following code

{code:title=AvroJson.java}
   public static Iterator parser(final InputStream stream) {
 try(JsonParser parser = FACTORY.createParser(stream)) {
{code}

to

{code}
   public static Iterator parser(final InputStream stream) {
 try {
  JsonParser parser = FACTORY.createParser(stream);

[jira] [Updated] (PARQUET-1499) [parquet-mr] Add Java 11 to Travis

2019-06-12 Thread Fokko Driesprong (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated PARQUET-1499:
--
Fix Version/s: format-2.7.0

> [parquet-mr] Add Java 11 to Travis
> --
>
> Key: PARQUET-1499
> URL: https://issues.apache.org/jira/browse/PARQUET-1499
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.7.0
>
>






[jira] [Updated] (PARQUET-1590) [parquet-format] Add Java 11 to Travis

2019-06-12 Thread Fokko Driesprong (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated PARQUET-1590:
--
Fix Version/s: format-2.7.0

> [parquet-format] Add Java 11 to Travis
> --
>
> Key: PARQUET-1590
> URL: https://issues.apache.org/jira/browse/PARQUET-1590
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.7.0
>
>






[jira] [Commented] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format

2019-06-12 Thread Fokko Driesprong (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861933#comment-16861933
 ] 

Fokko Driesprong commented on PARQUET-1588:
---

My mistake, thanks!

> Bump Apache Thrift to 0.12.0 in parquet-format
> --
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.7.0
>
>






[jira] [Resolved] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format

2019-06-12 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi resolved PARQUET-1588.

Resolution: Fixed

> Bump Apache Thrift to 0.12.0 in parquet-format
> --
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.7.0
>
>






[jira] [Updated] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format

2019-06-12 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1588:
---
Fix Version/s: format-2.7.0

> Bump Apache Thrift to 0.12.0 in parquet-format
> --
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.7.0
>
>






[jira] [Commented] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format

2019-06-12 Thread Zoltan Ivanfi (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861932#comment-16861932
 ] 

Zoltan Ivanfi commented on PARQUET-1588:


It already existed, just not as "2.7.0" but as "format-2.7.0" instead.

> Bump Apache Thrift to 0.12.0 in parquet-format
> --
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
> Fix For: format-2.7.0
>
>






[jira] [Commented] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format

2019-06-12 Thread Fokko Driesprong (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861930#comment-16861930
 ] 

Fokko Driesprong commented on PARQUET-1588:
---

[~zi] Should we target this to version 2.7.0?

> Bump Apache Thrift to 0.12.0 in parquet-format
> --
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Comment Edited] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format

2019-06-12 Thread Fokko Driesprong (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861930#comment-16861930
 ] 

Fokko Driesprong edited comment on PARQUET-1588 at 6/12/19 9:16 AM:


[~zi] Should we target this to version 2.7.0? I don't have permissions to 
create a version.


was (Author: fokko):
[~zi] Should we target this to version 2.7.0?

> Bump Apache Thrift to 0.12.0 in parquet-format
> --
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format

2019-06-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861891#comment-16861891
 ] 

ASF GitHub Bot commented on PARQUET-1588:
-

zivanfi commented on pull request #133: PARQUET-1588: Bump to Apache Thrift 0.12.0
URL: https://github.com/apache/parquet-format/pull/133

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Bump Apache Thrift to 0.12.0 in parquet-format
> --
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Reopened] (PARQUET-1588) Bump Apache Thrift to 0.12.0

2019-06-12 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi reopened PARQUET-1588:


As we discussed, let's stick to your original approach of separate JIRAs for 
parquet-mr and parquet-format to better track what gets released in which 
version.

> Bump Apache Thrift to 0.12.0
> 
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (PARQUET-1588) Bump Apache Thrift to 0.12.0 in parquet-format

2019-06-12 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1588:
---
Summary: Bump Apache Thrift to 0.12.0 in parquet-format  (was: Bump Apache Thrift to 0.12.0)

> Bump Apache Thrift to 0.12.0 in parquet-format
> --
>
> Key: PARQUET-1588
> URL: https://issues.apache.org/jira/browse/PARQUET-1588
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (PARQUET-1595) Parquet proto writer de-nest Protobuf wrapper classes

2019-06-12 Thread Ying Xu (JIRA)
Ying Xu created PARQUET-1595:


 Summary: Parquet proto writer de-nest Protobuf wrapper classes
 Key: PARQUET-1595
 URL: https://issues.apache.org/jira/browse/PARQUET-1595
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Ying Xu


The existing Parquet protobuf writer preserves the structure of any Protobuf 
Message object.  This works well in most cases. However, when dealing with 
[Protobuf wrapper messages|https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/wrappers.proto], 
users may prefer to write the de-nested value directly into the Parquet files, 
so that it is easier to query (in query engines such as Hive/Presto). 

Proposal: 
 * Implement a control flag, e.g., enableDenestingProtoWrappers, to control 
whether or not to de-nest Protobuf wrapper classes. 
 * When this flag is set to true, write the Protobuf wrapper classes as single 
primitive fields, based on the type of the wrapped *value* field.
 
||Protobuf Type||Parquet Type||
|BoolValue|boolean|
|BytesValue|binary|
|DoubleValue|double|
|FloatValue|float|
|Int32Value|int64 (32-bit, signed)|
|Int64Value|int64 (64-bit, signed)|
|StringValue|binary (string)|
|UInt32Value|int64 (32-bit, unsigned)|
|UInt64Value|int64 (64-bit, unsigned)|
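
The flag and the table could be expressed roughly as the lookup below. This is only a sketch under assumptions: the class {{WrapperTypes}} and the method {{parquetTypeFor}} are hypothetical names, the type strings are copied verbatim from the table above rather than taken from parquet-mr's schema API, and a real implementation would live in the proto schema converter and write support.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup mirroring the proposal table; not parquet-mr code.
final class WrapperTypes {
  private static final Map<String, String> PARQUET_TYPE = new LinkedHashMap<>();

  static {
    // Entries copied verbatim from the table in the proposal.
    PARQUET_TYPE.put("BoolValue", "boolean");
    PARQUET_TYPE.put("BytesValue", "binary");
    PARQUET_TYPE.put("DoubleValue", "double");
    PARQUET_TYPE.put("FloatValue", "float");
    PARQUET_TYPE.put("Int32Value", "int64 (32-bit, signed)");
    PARQUET_TYPE.put("Int64Value", "int64 (64-bit, signed)");
    PARQUET_TYPE.put("StringValue", "binary (string)");
    PARQUET_TYPE.put("UInt32Value", "int64 (32-bit, unsigned)");
    PARQUET_TYPE.put("UInt64Value", "int64 (64-bit, unsigned)");
  }

  /**
   * Returns the proposed primitive type for a wrapper message, or null when
   * de-nesting is disabled or the message is not one of the wrappers.
   */
  static String parquetTypeFor(String protoTypeName, boolean enableDenestingProtoWrappers) {
    if (!enableDenestingProtoWrappers) {
      return null; // keep the existing nested representation
    }
    return PARQUET_TYPE.get(protoTypeName);
  }

  public static void main(String[] args) {
    System.out.println(parquetTypeFor("StringValue", true));  // prints "binary (string)"
    System.out.println(parquetTypeFor("StringValue", false)); // prints "null"
  }
}
```

Returning null for the disabled case keeps the default behaviour unchanged, so existing files and readers would not be affected unless the flag is turned on.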

 


