[jira] [Commented] (PARQUET-1666) Remove Unused Modules

2020-12-02 Thread Julien Le Dem (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242864#comment-17242864
 ] 

Julien Le Dem commented on PARQUET-1666:


that sounds good to me too

> Remove Unused Modules 
> --
>
> Key: PARQUET-1666
> URL: https://issues.apache.org/jira/browse/PARQUET-1666
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> In the last two meetings, Ryan Blue proposed to remove some unused Parquet 
> modules. This is to open a task to track it. 
> Here are the related meeting notes for the discussion on this. 
> Remove old Parquet modules
> Hive modules - sounds good
> Scrooge - Julien will reach out to Twitter
> Tools - undecided - Cloudera may still use the parquet-tools according to 
> Gabor.
> Cascading - undecided
> We can mark the modules as deprecated in their descriptions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1777) add Parquet logo vector files to repo

2020-01-24 Thread Julien Le Dem (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-1777:
--

Assignee: Julien Le Dem

> add Parquet logo vector files to repo
> -
>
> Key: PARQUET-1777
> URL: https://issues.apache.org/jira/browse/PARQUET-1777
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1777) add Parquet logo vector files to repo

2020-01-24 Thread Julien Le Dem (Jira)
Julien Le Dem created PARQUET-1777:
--

 Summary: add Parquet logo vector files to repo
 Key: PARQUET-1777
 URL: https://issues.apache.org/jira/browse/PARQUET-1777
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Julien Le Dem






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-968.
---
   Resolution: Fixed
Fix Version/s: 1.11

> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Assignee: Constantin Muraru
>Priority: Major
> Fix For: 1.11
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-968:
-

Assignee: Constantin Muraru  (was: Julien Le Dem)

> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Assignee: Constantin Muraru
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-26 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453972#comment-16453972
 ] 

Julien Le Dem commented on PARQUET-968:
---

merged in 
https://github.com/apache/parquet-mr/commit/f84938441be49c665595c936ac631c3e5f171bf9

> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Assignee: Constantin Muraru
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-04-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-968:
-

Assignee: Julien Le Dem

> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Assignee: Julien Le Dem
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1281) Jackson dependency

2018-04-24 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450163#comment-16450163
 ] 

Julien Le Dem commented on PARQUET-1281:


parquet-hadoop should have its build include shading like parquet thrift:

https://github.com/apache/parquet-mr/blob/master/parquet-thrift/pom.xml#L174
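For illustration, a quick classpath probe (a sketch, not part of any fix) that 
shows which Jackson copy is visible at runtime; the relocated prefix 
shaded.parquet.org.codehaus.jackson is the one parquet-jackson already produces:

{code:java}
public class ShadingProbe {
    private static boolean present(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // If parquet-hadoop referenced the relocated classes (as parquet-thrift
        // does via the maven-shade-plugin), only the shaded name would need to
        // resolve at runtime.
        System.out.println("shaded:   "
                + present("shaded.parquet.org.codehaus.jackson.map.ObjectMapper"));
        System.out.println("unshaded: "
                + present("org.codehaus.jackson.map.ObjectMapper"));
    }
}
{code}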

> Jackson dependency
> --
>
> Key: PARQUET-1281
> URL: https://issues.apache.org/jira/browse/PARQUET-1281
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Qinghui Xu
>Priority: Major
>
> Currently we shaded jackson in parquet-jackson module (org.codehaus.jackson 
> --> shaded.parquet.org.codehaus.jackson), but in fact we do not use the 
> shaded jackson in parquet-hadoop code. Is that a mistake? (see 
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java#L26)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1259) Parquet-protobuf support both protobuf 2 and protobuf 3

2018-04-04 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-1259.

Resolution: Workaround

Supporting more than one version adds complexity.

It sounds like people can use protobuf 2 syntax with the protobuf 3 library;

I would recommend that instead.

I'll close this for now.

Please re-open if this is not satisfactory.

> Parquet-protobuf support both protobuf 2 and protobuf 3
> ---
>
> Key: PARQUET-1259
> URL: https://issues.apache.org/jira/browse/PARQUET-1259
> Project: Parquet
>  Issue Type: New Feature
>Affects Versions: 1.10.0, 1.9.1
>Reporter: Qinghui Xu
>Priority: Major
>
> With the merge of pull request 
> [https://github.com/apache/parquet-mr/pull/407], parquet-protobuf now uses 
> protobuf 3, which implies that it cannot work in an environment where people 
> use protobuf 2 in their own dependencies, because protobuf 3 introduces new 
> APIs and breaking changes. People will face a dependency version conflict 
> with the next parquet-protobuf release (e.g. 1.9.1 or 1.10.0).
> What if we support both protobuf 2 and protobuf 3 by providing 
> parquet-protobuf and parquet-protobuf2?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1222) Definition of float and double sort order is ambiguous

2018-03-13 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-1222:
---
Summary: Definition of float and double sort order is ambiguous  (was: 
Definition of float and double sort order is ambigious)

> Definition of float and double sort order is ambiguous
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
> Fix For: format-2.5.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is \+0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain \+0 values as well.
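To see the two orderings side by side, a small plain-Java demonstration (Java's 
Double.compare already implements a total order of the kind proposed: -0 below 
\+0, NaN above everything):

{code:java}
public class FloatOrderDemo {
    public static void main(String[] args) {
        // IEEE 754 partial order: -0 and +0 compare equal, and every
        // comparison involving NaN is false.
        System.out.println(-0.0 < 0.0);               // false
        System.out.println(-0.0 == 0.0);              // true
        System.out.println(Double.NaN == Double.NaN); // false

        // Total order as used for sorting in Java: -0 < +0, and NaN
        // sorts above even positive infinity.
        System.out.println(Double.compare(-0.0, 0.0));                            // -1
        System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY)); // 1
    }
}
{code}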



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1135) upgrade thrift and protobuf dependencies

2018-03-09 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-1135.

Resolution: Fixed

merged in:

https://github.com/apache/parquet-mr/commit/3d2d4fd1588c8eb3f67f34d75b66967d0c7b06b6

> upgrade thrift and protobuf dependencies
> 
>
> Key: PARQUET-1135
> URL: https://issues.apache.org/jira/browse/PARQUET-1135
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.9.1
>
>
> thrift 0.7.0 -> 0.9.3
>  protobuf 3.2 -> 3.5.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1135) upgrade thrift and protobuf dependencies

2018-03-09 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-1135:
---
Fix Version/s: 1.9.1
  Description: 
thrift 0.7.0 -> 0.9.3
 protobuf 3.2 -> 3.5.1

  was:
thrift 0.7.0 -> 0.9.3
protobuf 3.2 -> 3.4


> upgrade thrift and protobuf dependencies
> 
>
> Key: PARQUET-1135
> URL: https://issues.apache.org/jira/browse/PARQUET-1135
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.9.1
>
>
> thrift 0.7.0 -> 0.9.3
>  protobuf 3.2 -> 3.5.1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1133) INT96 types and Maps without OriginalType cause exceptions in PigSchemaConverter

2017-10-10 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-1133.

   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 422
[https://github.com/apache/parquet-mr/pull/422]

> INT96 types and Maps without OriginalType cause exceptions in 
> PigSchemaConverter
> 
>
> Key: PARQUET-1133
> URL: https://issues.apache.org/jira/browse/PARQUET-1133
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-pig
>Affects Versions: 1.9.1
>Reporter: Addisu Feyissa
>Assignee: Addisu Feyissa
> Fix For: 1.9.0
>
>
> Trying to load parquet files in Pig that contain the following causes an 
> exception and parsing to fail:
> * INT96 fields, for example:
> {noformat}
> message spark_schema {
>   optional int96 datetime;
> }
> {noformat}
> The Exception thrown is:
> {noformat}
> Failed to parse: can't convert optional int96 myInt96
>   at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:201)
>   at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1791)
>   at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1764)
>   at org.apache.pig.PigServer.registerQuery(PigServer.java:707)
>   at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1075)
>   at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
>   at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
>   at org.apache.pig.Main.run(Main.java:564)
>   at org.apache.pig.Main.main(Main.java:176)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
> Caused by: org.apache.parquet.pig.SchemaConversionException: can't convert 
> optional int96 myInt96
>   at 
> org.apache.parquet.pig.PigSchemaConverter.convertFields(PigSchemaConverter.java:202)
>   at 
> org.apache.parquet.pig.PigSchemaConverter.convert(PigSchemaConverter.java:178)
>   at 
> org.apache.parquet.pig.TupleReadSupport.getPigSchemaFromMultipleFiles(TupleReadSupport.java:95)
>   at 
> org.apache.parquet.pig.ParquetLoader.initSchema(ParquetLoader.java:300)
>   at org.apache.parquet.pig.ParquetLoader.setInput(ParquetLoader.java:183)
>   at 
> org.apache.parquet.pig.ParquetLoader.getSchema(ParquetLoader.java:285)
>   at 
> org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
>   at 
> org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
>   at 
> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:901)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
>   at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
>   ... 16 more
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: NYI
>   at 
> org.apache.parquet.pig.PigSchemaConverter$1.convertINT96(PigSchemaConverter.java:242)
>   at 
> org.apache.parquet.pig.PigSchemaConverter$1.convertINT96(PigSchemaConverter.java:214)
>   at 
> org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223)
>   at 
> org.apache.parquet.pig.PigSchemaConverter.getSimpleFieldSchema(PigSchemaConverter.java:213)
>   at 
> org.apache.parquet.pig.PigSchemaConverter.getFieldSchema(PigSchemaConverter.java:320)
>   at 
> org.apache.parquet.pig.PigSchemaConverter.convertFields(PigSchemaConverter.java:193)
>   ... 30 more
> {noformat}
> * Map Types without OriginalType, for example:
>  {noformat}
> message spark_schema {
>   optional binary a;
>   optional group b (MAP) {
> repeated group map {
>   required binary key;
>   optional group value {
> optional fixed_len_byte_array(5) 

[jira] [Assigned] (PARQUET-1135) upgrade thrift and protobuf dependencies

2017-10-10 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-1135:
--

Assignee: Julien Le Dem

> upgrade thrift and protobuf dependencies
> 
>
> Key: PARQUET-1135
> URL: https://issues.apache.org/jira/browse/PARQUET-1135
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>
> thrift 0.7.0 -> 0.9.3
> protobuf 3.2 -> 3.4



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1135) upgrade thrift and protobuf dependencies

2017-10-10 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-1135:
--

 Summary: upgrade thrift and protobuf dependencies
 Key: PARQUET-1135
 URL: https://issues.apache.org/jira/browse/PARQUET-1135
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Julien Le Dem


thrift 0.7.0 -> 0.9.3
protobuf 3.2 -> 3.4



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PARQUET-1133) INT96 types and Maps without OriginalType cause exceptions in PigSchemaConverter

2017-10-10 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-1133:
--

Assignee: Addisu Feyissa

> INT96 types and Maps without OriginalType cause exceptions in 
> PigSchemaConverter
> 
>
> Key: PARQUET-1133
> URL: https://issues.apache.org/jira/browse/PARQUET-1133
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-pig
>Affects Versions: 1.9.1
>Reporter: Addisu Feyissa
>Assignee: Addisu Feyissa
>
> Trying to load parquet files in Pig that contain the following causes an 
> exception and parsing to fail:
> * INT96 fields, for example:
> {noformat}
> message spark_schema {
>   optional int96 datetime;
> }
> {noformat}
> The Exception thrown is:
> {noformat}
> Failed to parse: can't convert optional int96 myInt96
>   at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:201)
>   at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1791)
>   at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1764)
>   at org.apache.pig.PigServer.registerQuery(PigServer.java:707)
>   at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1075)
>   at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
>   at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
>   at org.apache.pig.Main.run(Main.java:564)
>   at org.apache.pig.Main.main(Main.java:176)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
> Caused by: org.apache.parquet.pig.SchemaConversionException: can't convert 
> optional int96 myInt96
>   at 
> org.apache.parquet.pig.PigSchemaConverter.convertFields(PigSchemaConverter.java:202)
>   at 
> org.apache.parquet.pig.PigSchemaConverter.convert(PigSchemaConverter.java:178)
>   at 
> org.apache.parquet.pig.TupleReadSupport.getPigSchemaFromMultipleFiles(TupleReadSupport.java:95)
>   at 
> org.apache.parquet.pig.ParquetLoader.initSchema(ParquetLoader.java:300)
>   at org.apache.parquet.pig.ParquetLoader.setInput(ParquetLoader.java:183)
>   at 
> org.apache.parquet.pig.ParquetLoader.getSchema(ParquetLoader.java:285)
>   at 
> org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
>   at 
> org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
>   at 
> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:901)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
>   at 
> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
>   at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
>   ... 16 more
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: NYI
>   at 
> org.apache.parquet.pig.PigSchemaConverter$1.convertINT96(PigSchemaConverter.java:242)
>   at 
> org.apache.parquet.pig.PigSchemaConverter$1.convertINT96(PigSchemaConverter.java:214)
>   at 
> org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223)
>   at 
> org.apache.parquet.pig.PigSchemaConverter.getSimpleFieldSchema(PigSchemaConverter.java:213)
>   at 
> org.apache.parquet.pig.PigSchemaConverter.getFieldSchema(PigSchemaConverter.java:320)
>   at 
> org.apache.parquet.pig.PigSchemaConverter.convertFields(PigSchemaConverter.java:193)
>   ... 30 more
> {noformat}
> * Map Types without OriginalType, for example:
>  {noformat}
> message spark_schema {
>   optional binary a;
>   optional group b (MAP) {
> repeated group map {
>   required binary key;
>   optional group value {
> optional fixed_len_byte_array(5) c;
> optional fixed_len_byte_array(7) d;
>   }
> }
>   }
> }
> {noformat}
> The Exception thrown is:
> {noformat}
> 

[jira] [Resolved] (PARQUET-1024) allow for case insensitive parquet-xxx prefix in PR title

2017-06-09 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-1024.

   Resolution: Fixed
Fix Version/s: 1.9.1

Issue resolved by pull request 415
[https://github.com/apache/parquet-mr/pull/415]

> allow for case insensitive parquet-xxx prefix in PR title
> -
>
> Key: PARQUET-1024
> URL: https://issues.apache.org/jira/browse/PARQUET-1024
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-783) H2SeekableInputStream does not close its underlying FSDataInputStream, leading to connection leaks

2017-06-09 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044715#comment-16044715
 ] 

Julien Le Dem commented on PARQUET-783:
---

Hi [~fuka], I created a JIRA ticket to track a 1.9.1 release: PARQUET-1027.
We should link to it any JIRAs we think should be included, and get the release 
started soon.


> H2SeekableInputStream does not close its underlying FSDataInputStream, 
> leading to connection leaks
> --
>
> Key: PARQUET-783
> URL: https://issues.apache.org/jira/browse/PARQUET-783
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>Priority: Critical
> Fix For: 1.10.0
>
>
> {{ParquetFileReader}} opens a {{SeekableInputStream}} to read a footer. In 
> the process, it opens a new {{FSDataInputStream}} and wraps it. However, 
> {{H2SeekableInputStream}} does not override the {{close}} method. Therefore, 
> when {{ParquetFileReader}} closes it, the underlying {{FSDataInputStream}} is 
> not closed. As a result, these stale connections can exhaust a cluster's data 
> nodes' connection resources and lead to mysterious HDFS read failures in HDFS 
> clients, e.g.
> {noformat}
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: 
> BP-905337612-172.16.70.103-1444328960665:blk_1720536852_646811517
> {noformat}
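The fix is the standard delegating-close pattern; a minimal sketch with 
illustrative names (not the actual parquet-mr classes):

{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// A wrapper must propagate close() to the stream it wraps; without the
// override, closing the wrapper leaves the underlying connection open.
class DelegatingStream implements Closeable {
    private final InputStream delegate;

    DelegatingStream(InputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
{code}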



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-1027) releas Parquet-mr 1.9.1

2017-06-09 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-1027:
--

 Summary: releas Parquet-mr 1.9.1
 Key: PARQUET-1027
 URL: https://issues.apache.org/jira/browse/PARQUET-1027
 Project: Parquet
  Issue Type: Task
Reporter: Julien Le Dem






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-1027) release Parquet-mr 1.9.1

2017-06-09 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-1027:
---
Summary: release Parquet-mr 1.9.1  (was: releas Parquet-mr 1.9.1)

> release Parquet-mr 1.9.1
> 
>
> Key: PARQUET-1027
> URL: https://issues.apache.org/jira/browse/PARQUET-1027
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-884) Add support for Decimal datatype to Parquet-Pig record reader

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-884:
--
Fix Version/s: (was: 1.9.0)
   1.10.0

> Add support for Decimal datatype to Parquet-Pig record reader
> -
>
> Key: PARQUET-884
> URL: https://issues.apache.org/jira/browse/PARQUET-884
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-pig
>Reporter: Ellen Kletscher
>Assignee: Ellen Kletscher
>Priority: Minor
> Fix For: 1.10.0
>
>
> parquet.pig.ParquetLoader defaults the Parquet decimal datatype to bytearray. 
>  Would like to add support to convert to BigDecimal instead, which will turn 
> garbage bytearrays into actual numbers.
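For reference, a minimal sketch of the requested conversion, assuming the 
standard Parquet DECIMAL representation (big-endian two's-complement unscaled 
value, with the scale taken from the schema annotation):

{code:java}
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalDecode {
    // Decode a DECIMAL stored as a byte array: the bytes are the big-endian
    // two's-complement unscaled value; the scale comes from the column type.
    static BigDecimal fromBinary(byte[] unscaled, int scale) {
        return new BigDecimal(new BigInteger(unscaled), scale);
    }

    public static void main(String[] args) {
        // 0x04D2 = 1234 unscaled, scale 2 -> 12.34
        System.out.println(fromBinary(new byte[] {0x04, (byte) 0xD2}, 2));
    }
}
{code}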



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-392) Release Parquet-mr 1.9.0

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-392.
---
Resolution: Delivered

> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-392) Release Parquet-mr 1.9.0

2017-06-07 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041807#comment-16041807
 ] 

Julien Le Dem commented on PARQUET-392:
---

[~zi] done. Thanks for checking


> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-392) Release Parquet-mr 1.9.0

2017-06-07 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041805#comment-16041805
 ] 

Julien Le Dem commented on PARQUET-392:
---

[~djiangxu] I have updated PARQUET-686: jiras should be closed automatically 
when we merge the PR. This one fell through: 
https://github.com/apache/parquet-mr/commit/de99127d77dabfc6c8134b3c58e0b9a0b74e5f37

> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-686) Allow for Unsigned Statistics in Binary Type

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-686.
---
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 1.9.0

> Allow for Unsigned Statistics in Binary Type
> 
>
> Key: PARQUET-686
> URL: https://issues.apache.org/jira/browse/PARQUET-686
> Project: Parquet
>  Issue Type: Bug
>Reporter: Andrew Duffy
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> BinaryStatistics currently only have a min/max, which are compared as signed 
> {{byte[]}}. However, for real UTF8-friendly lexicographic comparison, e.g. 
> for string columns, we would want to calculate the BinaryStatistics based off 
> of a comparator that treats the bytes as unsigned.
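A minimal sketch of the unsigned lexicographic comparison the issue asks for 
(plain Java, not the comparator that ended up in the codebase):

{code:java}
import java.util.Arrays;

public class UnsignedBytes {
    // Compare byte arrays lexicographically, treating each byte as 0..255.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] ascii = "a".getBytes();            // 0x61
        byte[] high = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 "é"
        // Signed byte comparison puts 0xC3 (negative as a signed byte) first;
        // unsigned comparison preserves UTF-8 lexicographic order.
        System.out.println(compareUnsigned(ascii, high) < 0); // true
        System.out.println(Arrays.compare(ascii, high) < 0);  // false (signed)
    }
}
{code}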



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-686) Allow for Unsigned Statistics in Binary Type

2017-06-07 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041802#comment-16041802
 ] 

Julien Le Dem commented on PARQUET-686:
---

The first issue of not returning bad stats was solved in: 
https://github.com/apache/parquet-mr/commit/de99127d77dabfc6c8134b3c58e0b9a0b74e5f37

> Allow for Unsigned Statistics in Binary Type
> 
>
> Key: PARQUET-686
> URL: https://issues.apache.org/jira/browse/PARQUET-686
> Project: Parquet
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> BinaryStatistics currently only have a min/max, which are compared as signed 
> {{byte[]}}. However, for real UTF8-friendly lexicographic comparison, e.g. 
> for string columns, we would want to calculate the BinaryStatistics based off 
> of a comparator that treats the bytes as unsigned.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-1024) allow for case insensitive parquet-xxx prefix in PR title

2017-06-07 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-1024:
--

 Summary: allow for case insensitive parquet-xxx prefix in PR title
 Key: PARQUET-1024
 URL: https://issues.apache.org/jira/browse/PARQUET-1024
 Project: Parquet
  Issue Type: Improvement
Reporter: Julien Le Dem
Assignee: Julien Le Dem






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-884) Add support for Decimal datatype to Parquet-Pig record reader

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-884.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 404
[https://github.com/apache/parquet-mr/pull/404]

> Add support for Decimal datatype to Parquet-Pig record reader
> -
>
> Key: PARQUET-884
> URL: https://issues.apache.org/jira/browse/PARQUET-884
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-pig
>Reporter: Ellen Kletscher
>Priority: Minor
> Fix For: 1.9.0
>
>
> parquet.pig.ParquetLoader defaults the Parquet decimal datatype to bytearray. 
>  Would like to add support to convert to BigDecimal instead, which will turn 
> garbage bytearrays into actual numbers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (PARQUET-906) add logical type timestamp with timezone (per SQL)

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-906:
-

   Assignee: Julien Le Dem
Description: 
timestamp with timezone (per SQL)
timestamps are adjusted to UTC and stored as integers.
metadata in logical types PR:
See discussion here: 
https://github.com/apache/parquet-format/pull/51#discussion_r109667837



  was:
We need to clarify the spec here.
TODO: validate the following points.
timestamp with timezone (per SQL)
- each value has timezone
- TZ can be different for each value



> add logical type timestamp with timezone (per SQL)
> --
>
> Key: PARQUET-906
> URL: https://issues.apache.org/jira/browse/PARQUET-906
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Minor
>
> timestamp with timezone (per SQL)
> timestamps are adjusted to UTC and stored as integers.
> metadata in logical types PR:
> See discussion here: 
> https://github.com/apache/parquet-format/pull/51#discussion_r109667837
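A small sketch of the "adjusted to UTC, stored as integers" convention, using 
microsecond precision as an example (plain java.time, not the parquet-mr API):

{code:java}
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

public class TimestampTzDemo {
    public static void main(String[] args) {
        // Writers in different zones normalize to UTC before encoding,
        // so the same instant always yields the same stored integer.
        OffsetDateTime paris = OffsetDateTime.of(2017, 6, 7, 12, 0, 0, 0,
                ZoneOffset.ofHours(2));
        OffsetDateTime utc = paris.withOffsetSameInstant(ZoneOffset.UTC);

        long micros = ChronoUnit.MICROS.between(Instant.EPOCH, paris.toInstant());
        System.out.println(utc);    // 2017-06-07T10:00Z
        System.out.println(micros); // micros since epoch, zone-independent
    }
}
{code}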



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-839) Min-max should be computed based on logical type

2017-06-07 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-839.
---
Resolution: Duplicate

> Min-max should be computed based on logical type
> 
>
> Key: PARQUET-839
> URL: https://issues.apache.org/jira/browse/PARQUET-839
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: format-2.3.1
>Reporter: Tim Armstrong
>
> The min/max stats are currently underspecified - it is not clear in any case 
> from the spec what the expected ordering is.
> There are some related issues, like PARQUET-686 to fix specific problems, but 
> there seems to be a general assumption that the min/max should be defined 
> based on the primitive type instead of the logical type.
> However, this makes the stats nearly useless for some logical types. E.g. 
> consider a DECIMAL encoded into a (variable-length) BINARY. The min-max of 
> the underlying binary type is based on the lexical order of the byte string, 
> but that does not correspond to any reasonable ordering of the decimal 
> values. E.g. 16 (0x1 0x0) will be ordered between 1 (0x1) and 2 (0x2). This 
> makes min-max filtering a lot less effective and would force query engines 
> using parquet to implement workarounds to produce correct results (e.g. 
> custom comparators).
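To make the byte values concrete (the two-byte string 0x1 0x0 is 256 as a 
big-endian unscaled integer), a small demonstration of the anomaly:

{code:java}
import java.math.BigInteger;
import java.util.Arrays;

public class DecimalByteOrder {
    public static void main(String[] args) {
        // Unscaled decimal values 1, 2 and 256 as big-endian bytes.
        byte[] one = BigInteger.valueOf(1).toByteArray();   // {0x01}
        byte[] two = BigInteger.valueOf(2).toByteArray();   // {0x02}
        byte[] big = BigInteger.valueOf(256).toByteArray(); // {0x01, 0x00}

        // Lexicographic byte order places 256 between 1 and 2, so byte-level
        // min/max cannot be used to prune DECIMAL ranges.
        System.out.println(Arrays.compare(one, big)); // negative: 1 < 256 (ok)
        System.out.println(Arrays.compare(big, two)); // negative: 256 < 2 (wrong)
    }
}
{code}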



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (PARQUET-990) More detailed error messages in footer parsing

2017-05-16 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-990:
-

Assignee: Andrew Ash

> More detailed error messages in footer parsing
> --
>
> Key: PARQUET-990
> URL: https://issues.apache.org/jira/browse/PARQUET-990
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>Priority: Minor
> Fix For: 1.10.0
>
>
> Include invalid values in exception messages when reading footer for two 
> situations:
> - too-short files (include file length)
> - files with corrupted footer lengths (include calculated footer start index)
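For context, a sketch of the two checks against the published Parquet file 
layout (a 4-byte little-endian footer length followed by the 4-byte "PAR1" 
magic at the end of the file); the message wording is illustrative, not the 
patch's actual text:

{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FooterCheck {
    private static final int MAGIC = 4;                   // "PAR1"
    private static final int MIN_LEN = MAGIC + 4 + MAGIC; // head + length + tail

    static long footerStart(String path) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            long len = f.length();
            if (len < MIN_LEN) {
                throw new IOException(path + " is not a Parquet file"
                        + " (too small, length: " + len + ")");
            }
            f.seek(len - MAGIC - 4);
            byte[] buf = new byte[4];
            f.readFully(buf);
            int footerLen =
                    ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getInt();
            long start = len - MAGIC - 4 - footerLen;
            if (start < MAGIC) {
                throw new IOException("corrupted footer in " + path
                        + ": calculated footer start index " + start
                        + " out of bounds (file length: " + len + ")");
            }
            return start;
        }
    }
}
{code}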



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-990) More detailed error messages in footer parsing

2017-05-16 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-990.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

Issue resolved by pull request 408
[https://github.com/apache/parquet-mr/pull/408]

> More detailed error messages in footer parsing
> --
>
> Key: PARQUET-990
> URL: https://issues.apache.org/jira/browse/PARQUET-990
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Andrew Ash
>Priority: Minor
> Fix For: 1.10.0
>
>
> Include invalid values in exception messages when reading footer for two 
> situations:
> - too-short files (include file length)
> - files with corrupted footer lengths (include calculated footer start index)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-852) Slowly ramp up sizes of byte[] in ByteBasedBitPackingEncoder

2017-05-12 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-852.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

Issue resolved by pull request 401
[https://github.com/apache/parquet-mr/pull/401]

> Slowly ramp up sizes of byte[] in ByteBasedBitPackingEncoder
> 
>
> Key: PARQUET-852
> URL: https://issues.apache.org/jira/browse/PARQUET-852
> Project: Parquet
>  Issue Type: Improvement
>Reporter: John Jenkins
>Priority: Minor
> Fix For: 1.10.0
>
>
> The current allocation policy for ByteBasedBitPackingEncoder is to allocate 
> 64KB * #bits up-front. As similarly observed in [PARQUET-580], this can lead 
> to significant memory overheads for high-fanout scenarios (many columns 
> and/or open files, in my case using BooleanPlainValuesWriter).
> As done in [PARQUET-585], I'll follow up with a PR that starts with a smaller 
> buffer and works its way up to a max.
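A minimal sketch of the proposed growth policy (constants are illustrative, 
not the values in the PR): start small, double on each new slab, and cap at 
the maximum instead of allocating the maximum up front.

{code:java}
import java.util.ArrayList;
import java.util.List;

public class RampingSlabAllocator {
    private static final int INITIAL_SLAB = 1024;  // illustrative
    private static final int MAX_SLAB = 64 * 1024; // illustrative cap

    private final List<byte[]> slabs = new ArrayList<>();
    private int nextSize = INITIAL_SLAB;

    // Each new slab doubles in size until the cap, so small writers stay
    // small while large writers still reach full-size buffers quickly.
    byte[] newSlab() {
        byte[] slab = new byte[nextSize];
        slabs.add(slab);
        nextSize = Math.min(nextSize * 2, MAX_SLAB);
        return slab;
    }
}
{code}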



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-196) parquet-tools command to get rowcount & size

2017-05-12 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-196.
---
   Resolution: Fixed
Fix Version/s: (was: 1.6.0)
   1.10.0

Issue resolved by pull request 406
[https://github.com/apache/parquet-mr/pull/406]

> parquet-tools command to get rowcount & size
> 
>
> Key: PARQUET-196
> URL: https://issues.apache.org/jira/browse/PARQUET-196
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Swapnil
>Priority: Minor
>  Labels: features
> Fix For: 1.10.0
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> Parquet files contain metadata about rowcount & file size. We should have new 
> commands to get rows count & size.
> These command can be added in parquet-tools:
> 1. rowcount : This should add number of rows in all footers to give total 
> rows in data. 
> 2. size : This should give compresses size in bytes and human readable format 
> too.
> These command helps us to avoid parsing job logs or loading data once again 
> to find number of rows in data. This comes very handy in complex processes, 
> stats generation, QA etc..
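The same numbers can also be pulled from the footers programmatically; a 
sketch against parquet-mr's metadata API (assuming a version where 
ParquetFileReader.open(InputFile) is available; the path argument is a 
placeholder):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowCountAndSize {
    public static void main(String[] args) throws Exception {
        long rows = 0, compressedBytes = 0;
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                rows += block.getRowCount();
                for (ColumnChunkMetaData column : block.getColumns()) {
                    compressedBytes += column.getTotalSize(); // compressed size
                }
            }
        }
        System.out.println("rows: " + rows);
        System.out.println("compressed bytes: " + compressedBytes);
    }
}
{code}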



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-969) Decimal datatype support for parquet-tools output

2017-05-12 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-969.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

Issue resolved by pull request 412
[https://github.com/apache/parquet-mr/pull/412]

> Decimal datatype support for parquet-tools output
> -
>
> Key: PARQUET-969
> URL: https://issues.apache.org/jira/browse/PARQUET-969
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Dan Fowler
>Priority: Minor
> Fix For: 1.10.0
>
>
> parquet-tools cat outputs decimal datatypes in binary/bytearray format. I 
> would like to have the decimal datatypes converted to their actual number 
> representation so that when parquet data is output from parquet-tools 
> decimals will be numbers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-973) Corrupt statistics test should include int96

2017-05-03 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-973:
-

 Summary: Corrupt statistics test should include int96
 Key: PARQUET-973
 URL: https://issues.apache.org/jira/browse/PARQUET-973
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Julien Le Dem


int96 are treated as byte arrays internally and were affected by the same bug.

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L56
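A sketch of what the extended test would assert, assuming the public 
CorruptStatistics.shouldIgnoreStatistics(createdBy, type) helper keeps its 
current shape (the created_by string here is illustrative):

{code:java}
import org.apache.parquet.CorruptStatistics;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

public class Int96StatsCheck {
    public static void main(String[] args) {
        // Writers affected by the stats bug wrote bad min/max for
        // byte-array-backed columns; INT96 should be ignored like BINARY.
        String affectedWriter = "parquet-mr version 1.5.0 (build abcd)"; // illustrative
        System.out.println(CorruptStatistics.shouldIgnoreStatistics(
                affectedWriter, PrimitiveTypeName.INT96));
        System.out.println(CorruptStatistics.shouldIgnoreStatistics(
                affectedWriter, PrimitiveTypeName.BINARY));
    }
}
{code}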



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-973) Corrupt statistics test should include int96

2017-05-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-973:
--
Description: 
int96 are treated as byte arrays internally and were affected by the same bug.

https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L56

  was:
int96 are treated as byte arrays internally and were affected by the same bug.

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L56


> Corrupt statistics test should include int96
> 
>
> Key: PARQUET-973
> URL: https://issues.apache.org/jira/browse/PARQUET-973
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Julien Le Dem
>
> int96 are treated as byte arrays internally and were affected by the same bug.
> https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java#L56



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-26 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985131#comment-15985131
 ] 

Julien Le Dem commented on PARQUET-964:
---

Thanks for getting to the bottom of it.
Let us know when your project is working.

> Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: 
> totalValueCount '0' <= 0
> -
>
> Key: PARQUET-964
> URL: https://issues.apache.org/jira/browse/PARQUET-964
> Project: Parquet
>  Issue Type: Bug
>Reporter: Constantin Muraru
> Attachments: ListOfList.proto, ListOfListProtoParquetConverter.java, 
> parquet_totalValueCount.png
>
>
> Hi folks!
> We're working on adding support for ProtoParquet to work with Hive / AWS 
> Athena (Presto) \[1\]. The problem we've encountered appears whenever we 
> declare a repeated field (array) or a map in the protobuf schema and we then 
> try to convert it to parquet. The conversion works fine, but when we try to 
> query the data with Hive/Presto, we get some freaky errors.
> We've noticed though that AvroToParquet works great, even when we declare 
> such fields (arrays, maps)! 
> Comparing the parquet schema generated by protobuf vs avro, we've noticed a 
> few differences.
> Take the simple schema below (protobuf):
> {code}
> message ListOfList {
> string top_field = 1;
> repeated MyInnerMessage first_array = 2;
> }
> message MyInnerMessage {
> int32 inner_field = 1;
> repeated int32 second_array = 2;
> }
> {code}
> After using ProtoParquetWriter, the resulting parquet schema is the following:
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   repeated group first_array {
> optional int32 inner_field;
> repeated int32 second_array;
>   }
> }
> {code}
> When we try to query this data, we get parsing errors from Hive/Athena. The 
> parsing errors are related to the array/map fields.
> However, if we create a similar avro schema, the parquet result of the 
> AvroParquetWriter is the following:
> {code}
> message TestProtobuf.ListOfList {
>   required binary top_field (UTF8);
>   required group first_array (LIST) {
> repeated group array {
>   required int32 inner_field;
>   required group second_array (LIST) {
> repeated int32 array;
>   }
> }
>   }
> }
> {code}
> This works beautifully with Hive/Athena. Too bad our systems are stuck with 
> protobuf :-) .
> You can see the additional wrappers which are missing from protobuf: 
> {{required group first_array (LIST)}}.
> Our goal is to make the ProtoParquetWriter generate a parquet schema similar 
> to what Avro is doing. We basically want to add these wrappers around 
> lists/maps.
> Everything seemed to work great, until we've bumped into an issue. We tuned 
> ProtoParquetWriter to generate the same parquet schema as AvroParquetWriter. 
> However, one difference between protobuf and avro is that in protobuf we can 
> have a bunch of Optional fields. 
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   required group first_array (LIST) {
> repeated group array {
>   optional int32 inner_field;
>   required group second_array (LIST) {
> repeated int32 array;
>   }
> }
>   }
> }
> {code}
> Notice the: *optional* int32 inner_field (for avro that was *required*).
> When testing with some real proto-parquet data, we get an error every time 
> inner_field is not populated, but the second_array is.
> {noformat}
> parquet-tools cat /tmp/test23.parquet
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file file:/tmp/test23.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:122)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:126)
>   at 
> org.apache.parquet.tools.command.CatCommand.execute(CatCommand.java:79)
>   at org.apache.parquet.proto.tools.Main.main(Main.java:214)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: org.apache.parquet.io.ParquetDecodingException: totalValueCount 
> '0' <= 0
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:349)
>   at 
> org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:82)
> 

[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-26 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985126#comment-15985126
 ] 

Julien Le Dem commented on PARQUET-964:
---

Nice. 
I had made this ValidatingRecordConsumer to catch those:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/ValidatingRecordConsumer.java
It is turned off by default because it is relatively expensive.
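For completeness, a sketch of switching it on (assuming the "parquet.validation" 
key read by ParquetOutputFormat; check the setters in your parquet-mr version):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class EnableValidation {
    public static void main(String[] args) {
        // Validation is off by default because it is relatively expensive;
        // this flag makes the write path check records against the schema.
        Configuration conf = new Configuration();
        conf.setBoolean("parquet.validation", true);
        System.out.println(conf.get("parquet.validation")); // true
    }
}
{code}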

> Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: 
> totalValueCount '0' <= 0
> -
>
> Key: PARQUET-964
> URL: https://issues.apache.org/jira/browse/PARQUET-964
> Project: Parquet
>  Issue Type: Bug
>Reporter: Constantin Muraru
> Attachments: ListOfList.proto, ListOfListProtoParquetConverter.java, 
> parquet_totalValueCount.png
>
>
> Hi folks!
> We're working on adding support for ProtoParquet to work with Hive / AWS 
> Athena (Presto) \[1\]. The problem we've encountered appears whenever we 
> declare a repeated field (array) or a map in the protobuf schema and we then 
> try to convert it to parquet. The conversion works fine, but when we try to 
> query the data with Hive/Presto, we get some freaky errors.
> We've noticed though that AvroToParquet works great, even when we declare 
> such fields (arrays, maps)! 
> Comparing the parquet schema generated by protobuf vs avro, we've noticed a 
> few differences.
> Take the simple schema below (protobuf):
> {code}
> message ListOfList {
> string top_field = 1;
> repeated MyInnerMessage first_array = 2;
> }
> message MyInnerMessage {
> int32 inner_field = 1;
> repeated int32 second_array = 2;
> }
> {code}
> After using ProtoParquetWriter, the resulting parquet schema is the following:
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   repeated group first_array {
> optional int32 inner_field;
> repeated int32 second_array;
>   }
> }
> {code}
> When we try to query this data, we get parsing errors from Hive/Athena. The 
> parsing errors are related to the array/map fields.
> However, if we create a similar avro schema, the parquet result of the 
> AvroParquetWriter is the following:
> {code}
> message TestProtobuf.ListOfList {
>   required binary top_field (UTF8);
>   required group first_array (LIST) {
> repeated group array {
>   required int32 inner_field;
>   required group second_array (LIST) {
> repeated int32 array;
>   }
> }
>   }
> }
> {code}
> This works beautifully with Hive/Athena. Too bad our systems are stuck with 
> protobuf :-) .
> You can see the additional wrappers which are missing from protobuf: 
> {{required group first_array (LIST)}}.
> Our goal is to make the ProtoParquetWriter generate a parquet schema similar 
> to what Avro is doing. We basically want to add these wrappers around 
> lists/maps.
> Everything seemed to work great, until we've bumped into an issue. We tuned 
> ProtoParquetWriter to generate the same parquet schema as AvroParquetWriter. 
> However, one difference between protobuf and avro is that in protobuf we can 
> have a bunch of Optional fields. 
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   required group first_array (LIST) {
> repeated group array {
>   optional int32 inner_field;
>   required group second_array (LIST) {
> repeated int32 array;
>   }
> }
>   }
> }
> {code}
> Notice the: *optional* int32 inner_field (for avro that was *required*).
> When testing with some real proto-parquet data, we get an error every time 
> inner_field is not populated, but the second_array is.
> {noformat}
> parquet-tools cat /tmp/test23.parquet
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file file:/tmp/test23.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:122)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:126)
>   at 
> org.apache.parquet.tools.command.CatCommand.execute(CatCommand.java:79)
>   at org.apache.parquet.proto.tools.Main.main(Main.java:214)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: org.apache.parquet.io.ParquetDecodingException: totalValueCount 
> '0' <= 0
>   at 
> 

[jira] [Resolved] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-964.
---
Resolution: Not A Problem

> Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: 
> totalValueCount '0' <= 0
> -
>
> Key: PARQUET-964
> URL: https://issues.apache.org/jira/browse/PARQUET-964
> Project: Parquet
>  Issue Type: Bug
>Reporter: Constantin Muraru
> Attachments: ListOfList.proto, ListOfListProtoParquetConverter.java, 
> parquet_totalValueCount.png
>
>
> Hi folks!
> We're working on adding support for ProtoParquet to work with Hive / AWS 
> Athena (Presto) \[1\]. The problem we've encountered appears whenever we 
> declare a repeated field (array) or a map in the protobuf schema and we then 
> try to convert it to parquet. The conversion works fine, but when we try to 
> query the data with Hive/Presto, we get some freaky errors.
> We've noticed though that AvroToParquet works great, even when we declare 
> such fields (arrays, maps)! 
> Comparing the parquet schema generated by protobuf vs avro, we've noticed a 
> few differences.
> Take the simple schema below (protobuf):
> {code}
> message ListOfList {
> string top_field = 1;
> repeated MyInnerMessage first_array = 2;
> }
> message MyInnerMessage {
> int32 inner_field = 1;
> repeated int32 second_array = 2;
> }
> {code}
> After using ProtoParquetWriter, the resulting parquet schema is the following:
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   repeated group first_array {
> optional int32 inner_field;
> repeated int32 second_array;
>   }
> }
> {code}
> When we try to query this data, we get parsing errors from Hive/Athena. The 
> parsing errors are related to the array/map fields.
> However, if we create a similar avro schema, the parquet result of the 
> AvroParquetWriter is the following:
> {code}
> message TestProtobuf.ListOfList {
>   required binary top_field (UTF8);
>   required group first_array (LIST) {
> repeated group array {
>   required int32 inner_field;
>   required group second_array (LIST) {
> repeated int32 array;
>   }
> }
>   }
> }
> {code}
> This works beautifully with Hive/Athena. Too bad our systems are stuck with 
> protobuf :-) .
> You can see the additional wrappers which are missing from protobuf: 
> {{required group first_array (LIST)}}.
> Our goal is to make the ProtoParquetWriter generate a parquet schema similar 
> to what Avro is doing. We basically want to add these wrappers around 
> lists/maps.
> Everything seemed to work great, until we've bumped into an issue. We tuned 
> ProtoParquetWriter to generate the same parquet schema as AvroParquetWriter. 
> However, one difference between protobuf and avro is that in protobuf we can 
> have a bunch of Optional fields. 
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   required group first_array (LIST) {
> repeated group array {
>   optional int32 inner_field;
>   required group second_array (LIST) {
> repeated int32 array;
>   }
> }
>   }
> }
> {code}
> Notice the: *optional* int32 inner_field (for avro that was *required*).
> When testing with some real proto-parquet data, we get an error every time 
> inner_field is not populated, but the second_array is.
> {noformat}
> parquet-tools cat /tmp/test23.parquet
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file file:/tmp/test23.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:122)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:126)
>   at 
> org.apache.parquet.tools.command.CatCommand.execute(CatCommand.java:79)
>   at org.apache.parquet.proto.tools.Main.main(Main.java:214)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: org.apache.parquet.io.ParquetDecodingException: totalValueCount 
> '0' <= 0
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:349)
>   at 
> org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:82)
>   at 
> 

[jira] [Commented] (PARQUET-964) Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: totalValueCount '0' <= 0

2017-04-25 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983784#comment-15983784
 ] 

Julien Le Dem commented on PARQUET-964:
---

totalValueCount includes null values, so it should never be 0 unless you're 
creating empty parquet files.
Separately, it should also not be negative (that would indicate an overflow, 
since the underlying metadata stores a long).
Could you look into why totalValueCount == 0? It should be the sum of the 
value counts of all pages in that column chunk.
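One way to investigate is to dump the per-column value counts straight from 
the footer; a sketch against parquet-mr's metadata API (the path argument is 
a placeholder):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ValueCounts {
    public static void main(String[] args) throws Exception {
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    // This is the number the reader checks: the sum of the
                    // value counts (nulls included) of all pages in the chunk.
                    System.out.println(column.getPath() + ": "
                            + column.getValueCount());
                }
            }
        }
    }
}
{code}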


> Using ProtoParquet with Hive / AWS Athena: ParquetDecodingException: 
> totalValueCount '0' <= 0
> -
>
> Key: PARQUET-964
> URL: https://issues.apache.org/jira/browse/PARQUET-964
> Project: Parquet
>  Issue Type: Bug
>Reporter: Constantin Muraru
> Attachments: ListOfList.proto, ListOfListProtoParquetConverter.java
>
>
> Hi folks!
> We're working on adding support for ProtoParquet to work with Hive / AWS 
> Athena (Presto) \[1\]. The problem we've encountered appears whenever we 
> declare a repeated field (array) or a map in the protobuf schema and we then 
> try to convert it to parquet. The conversion works fine, but when we try to 
> query the data with Hive/Presto, we get some freaky errors.
> We've noticed though that AvroToParquet works great, even when we declare 
> such fields (arrays, maps)! 
> Comparing the parquet schema generated by protobuf vs avro, we've noticed a 
> few differences.
> Take the simple schema below (protobuf):
> {code}
> message ListOfList {
> string top_field = 1;
> repeated MyInnerMessage first_array = 2;
> }
> message MyInnerMessage {
> int32 inner_field = 1;
> repeated int32 second_array = 2;
> }
> {code}
> After using ProtoParquetWriter, the resulting parquet schema is the following:
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   repeated group first_array {
> optional int32 inner_field;
> repeated int32 second_array;
>   }
> }
> {code}
> When we try to query this data, we get parsing errors from Hive/Athena. The 
> parsing errors are related to the array/map fields.
> However, if we create a similar avro schema, the parquet result of the 
> AvroParquetWriter is the following:
> {code}
> message TestProtobuf.ListOfList {
>   required binary top_field (UTF8);
>   required group first_array (LIST) {
> repeated group array {
>   required int32 inner_field;
>   required group second_array (LIST) {
> repeated int32 array;
>   }
> }
>   }
> }
> {code}
> This works beautifully with Hive/Athena. Too bad our systems are stuck with 
> protobuf :-) .
> You can see the additional wrappers which are missing from protobuf: 
> {{required group first_array (LIST)}}.
> Our goal is to make the ProtoParquetWriter generate a parquet schema similar 
> to what Avro is doing. We basically want to add these wrappers around 
> lists/maps.
> Everything seemed to work great, until we bumped into an issue. We tuned 
> ProtoParquetWriter to generate the same parquet schema as AvroParquetWriter. 
> However, one difference between protobuf and avro is that in protobuf we can 
> have a bunch of Optional fields. 
> {code}
> message TestProtobuf.ListOfList {
>   optional binary top_field (UTF8);
>   required group first_array (LIST) {
> repeated group array {
>   optional int32 inner_field;
>   required group second_array (LIST) {
> repeated int32 array;
>   }
> }
>   }
> }
> {code}
> Notice the *optional* int32 inner_field (for avro it was *required*).
> When testing with some real proto-parquet data, we get an error every time 
> inner_field is not populated, but the second_array is.
> {noformat}
> parquet-tools cat /tmp/test23.parquet
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file file:/tmp/test23.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:122)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:126)
>   at 
> org.apache.parquet.tools.command.CatCommand.execute(CatCommand.java:79)
>   at org.apache.parquet.proto.tools.Main.main(Main.java:214)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
> Caused by: 

[jira] [Created] (PARQUET-922) Define Index pages when a Parquet file is sorted

2017-03-24 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-922:
-

 Summary: Define Index pages when a Parquet file is sorted
 Key: PARQUET-922
 URL: https://issues.apache.org/jira/browse/PARQUET-922
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem
Assignee: Marcel Kornacker


When a Parquet file is sorted, we can define an index consisting of the 
boundary values for the pages of the columns sorted on, as well as the offsets 
and lengths of those pages in the file.
The goal is to optimize lookup and range-scan queries, using this index to read 
only the pages containing data matching the filter.
We'd require the pages to be aligned across columns.

[~marcelk] will add a link to the google doc to discuss the spec
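
To make the lookup idea concrete, a small sketch with hypothetical names 
(boundary value, file offset, and length per page; none of these are an 
existing Parquet API):

{code}
// Hypothetical index entry: boundary (min) value of the sort column in a
// page, plus where that page lives in the file.
class PageIndexEntry {
  long firstValue;
  long fileOffset;
  int length;
}

class PageIndexLookup {
  // Given entries sorted by firstValue, return the single candidate page
  // that could contain `key`: the last page whose boundary value <= key.
  // Only that page then needs to be read and decompressed.
  static PageIndexEntry lookup(PageIndexEntry[] index, long key) {
    int lo = 0, hi = index.length - 1, match = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (index[mid].firstValue <= key) {
        match = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return match < 0 ? null : index[match];
  }
}
{code}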



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-907) Optionally store Page level metadata in the footer to enable predicate pushdowns

2017-03-06 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-907:
-

 Summary: Optionally store Page level metadata in the footer to 
enable predicate pushdowns
 Key: PARQUET-907
 URL: https://issues.apache.org/jira/browse/PARQUET-907
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-906) add logical type timestamp with timezone (per SQL)

2017-03-06 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-906:
-

 Summary: add logical type timestamp with timezone (per SQL)
 Key: PARQUET-906
 URL: https://issues.apache.org/jira/browse/PARQUET-906
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem
Priority: Minor


We need to clarify the spec here.
TODO: validate the following points.
timestamp with timezone (per SQL)
- each value has timezone
- TZ can be different for each value
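
As a strawman for the representation (purely illustrative, not a proposed 
encoding): a UTC instant plus a per-value offset, which covers the two points 
above:

{code}
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

class TimestampTz {
  long epochMillisUtc;   // the instant, normalized to UTC
  int offsetMinutes;     // per-value zone offset, may differ between values

  OffsetDateTime toOffsetDateTime() {
    return Instant.ofEpochMilli(epochMillisUtc)
        .atOffset(ZoneOffset.ofTotalSeconds(offsetMinutes * 60));
  }
}
{code}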




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-905) Add "Floating Timestamp" logical type

2017-03-06 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-905:
-

 Summary: Add "Floating Timestamp" logical type
 Key: PARQUET-905
 URL: https://issues.apache.org/jira/browse/PARQUET-905
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem


Unlike the current Parquet timestamp, which is stored in UTC, a "floating 
timestamp" has no timezone; it is up to the reader to interpret the timestamps 
based on their own timezone.
This is the behavior of a TIMESTAMP in the SQL standard.
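
A sketch of the semantics (illustration only): the stored value fixes the 
wall-clock fields, and each reader maps them to an instant using its own zone:

{code}
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

class FloatingTimestamp {
  // The stored value carries no zone: two readers see the same wall-clock
  // time but may map it to different instants.
  static Instant interpret(LocalDateTime stored, ZoneId readerZone) {
    return stored.atZone(readerZone).toInstant();
  }
}
{code}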





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PARQUET-904) Define INT96 ordering

2017-03-06 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-904:
-

 Summary: Define INT96 ordering
 Key: PARQUET-904
 URL: https://issues.apache.org/jira/browse/PARQUET-904
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem


Currently int96 binary ordering doesn't match its natural ordering.
We should either specify this or declare int96 not ordered and link to the type 
replacing it.
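
For context, a sketch of the layout as commonly written by Impala/Hive: 8 
little-endian bytes of nanos-within-day followed by 4 little-endian bytes of 
Julian day. Comparing the 12 raw bytes lexicographically compares the 
byte-reversed nanos field before the day field, which is why byte order 
disagrees with chronological order:

{code}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Int96Timestamp {
  // Decode the 12-byte INT96 value into (julianDay, nanosOfDay).
  static long[] decode(byte[] v) {
    ByteBuffer buf = ByteBuffer.wrap(v).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();  // bytes 0..7, little-endian
    int julianDay = buf.getInt();     // bytes 8..11, little-endian
    return new long[] { julianDay, nanosOfDay };
  }
}
{code}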



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-323) INT96 should be marked as deprecated

2017-03-06 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-323:
--
Description: 
As discussed in the mailing list, {{INT96}} is only used to represent nanosec 
timestamp in Impala for some historical reasons, and should be deprecated. 
Since nanosec precision is rarely a real requirement, one possible and simple 
solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
{{INT64 (TIMESTAMP_MICROS)}}.

Several projects (Impala, Hive, Spark, ...) support INT96.
We need a clear spec of the replacement and the path to deprecation.

  was:As discussed in the mailing list, {{INT96}} is only used to represent 
nanosec timestamp in Impala for some historical reasons, and should be 
deprecated. Since nanosec precision is rarely a real requirement, one possible 
and simple solution would be replacing {{INT96}} with {{INT64 
(TIMESTAMP_MILLIS)}} or {{INT64 (TIMESTAMP_MICROS)}}.


> INT96 should be marked as deprecated
> 
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec 
> timestamp in Impala for some historical reasons, and should be deprecated. 
> Since nanosec precision is rarely a real requirement, one possible and simple 
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.
> Several projects (Impala, Hive, Spark, ...) support INT96.
> We need a clear spec of the replacement and the path to deprecation.
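
For illustration (a sketch, not the migration spec): converting an INT96-style 
nanosecond timestamp to the proposed INT64 representations drops the sub-unit 
precision:

{code}
class TimestampDowncast {
  // Nanoseconds since the epoch to TIMESTAMP_MICROS / TIMESTAMP_MILLIS.
  // floorDiv keeps pre-epoch (negative) values rounding consistently.
  static long toMicros(long epochNanos) {
    return Math.floorDiv(epochNanos, 1_000L);
  }

  static long toMillis(long epochNanos) {
    return Math.floorDiv(epochNanos, 1_000_000L);
  }
}
{code}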



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PARQUET-371) Bumps Thrift version to 0.9.3

2017-02-22 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-371:
--
Summary: Bumps Thrift version to 0.9.3  (was: Bumps Thrift version to 0.9.0)

> Bumps Thrift version to 0.9.3
> -
>
> Key: PARQUET-371
> URL: https://issues.apache.org/jira/browse/PARQUET-371
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Cheng Lian
> Fix For: format-2.4.0
>
>
> Thrift 0.7.0 is too old a version, and it doesn't compile on Mac. Would be 
> nice to bump the Thrift version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (PARQUET-786) parquet-tools README incorrectly has 'java jar' instead of 'java -jar'

2016-12-05 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-786.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

Merged in:
https://github.com/apache/parquet-mr/pull/386
https://github.com/apache/parquet-mr/commit/7987a544cce59537467621114b400f670c71d722

> parquet-tools README incorrectly has 'java jar' instead of 'java -jar'
> --
>
> Key: PARQUET-786
> URL: https://issues.apache.org/jira/browse/PARQUET-786
> Project: Parquet
>  Issue Type: Bug
>Reporter: Mark Nelson
>Assignee: Mark Nelson
> Fix For: 1.10.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-786) parquet-tools README incorrectly has 'java jar' instead of 'java -jar'

2016-12-05 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-786:
--
Assignee: Mark Nelson

> parquet-tools README incorrectly has 'java jar' instead of 'java -jar'
> --
>
> Key: PARQUET-786
> URL: https://issues.apache.org/jira/browse/PARQUET-786
> Project: Parquet
>  Issue Type: Bug
>Reporter: Mark Nelson
>Assignee: Mark Nelson
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-774) Release parquet-cpp 0.1

2016-11-08 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-774:
-

 Summary: Release parquet-cpp 0.1
 Key: PARQUET-774
 URL: https://issues.apache.org/jira/browse/PARQUET-774
 Project: Parquet
  Issue Type: Task
  Components: parquet-cpp
Reporter: Julien Le Dem
Assignee: Uwe L. Korn






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-757) Add NULL type to Bring Parquet logical types to par with Arrow

2016-11-04 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-757.
---
   Resolution: Fixed
Fix Version/s: format-2.4.0

Issue resolved by pull request 45
[https://github.com/apache/parquet-format/pull/45]

> Add NULL type to Bring Parquet logical types to par with Arrow
> --
>
> Key: PARQUET-757
> URL: https://issues.apache.org/jira/browse/PARQUET-757
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: format-2.4.0
>
>
> Missing:
>  - Null
>  - Interval types
>  - Union
>  - half precision float



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-758) HALF precision FLOAT Logical type

2016-10-28 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-758:
-

 Summary: HALF precision FLOAT Logical type
 Key: PARQUET-758
 URL: https://issues.apache.org/jira/browse/PARQUET-758
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (PARQUET-757) Bring Parquet logical types to par with Arrow

2016-10-26 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609566#comment-15609566
 ] 

Julien Le Dem edited comment on PARQUET-757 at 10/26/16 8:26 PM:
-

Those differences came up in https://github.com/apache/parquet-mr/pull/381


was (Author: julienledem):
Those difference came up in https://github.com/apache/parquet-mr/pull/381

> Bring Parquet logical types to par with Arrow
> -
>
> Key: PARQUET-757
> URL: https://issues.apache.org/jira/browse/PARQUET-757
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>
> Missing:
>  - Null
>  - Interval types
>  - Union
>  - half precision float



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PARQUET-675) Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types

2016-10-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-675:
-

Assignee: Julien Le Dem

> Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types
> ---
>
> Key: PARQUET-675
> URL: https://issues.apache.org/jira/browse/PARQUET-675
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>
> For completeness and compatibility with Arrow and SQL types.
> Those are related to the existing INTERVAL type.
> some references:
>  - https://msdn.microsoft.com/en-us/library/ms716506(v=vs.85).aspx
>  - 
> http://www.techrepublic.com/article/sql-basics-datetime-and-interval-data-types/
>  - https://www.postgresql.org/docs/9.3/static/datatype-datetime.html
>  - https://docs.oracle.com/html/E26088_01/sql_elements001.htm
>  - 
> http://www.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.sqlr.doc/ids_sqr_123.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-757) Bring Parquet logical types to par with Arrow

2016-10-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-757:
--
Description: 
Missing:
 - Null
 - Interval types
 - Union
 - half precision float


  was:
Missing:
 - Null
 - Interval types
 - Union
 - Short float



> Bring Parquet logical types to par with Arrow
> -
>
> Key: PARQUET-757
> URL: https://issues.apache.org/jira/browse/PARQUET-757
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>
> Missing:
>  - Null
>  - Interval types
>  - Union
>  - half precision float



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-757) Bring Parquet logical types to par with Arrow

2016-10-26 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-757:
-

 Summary: Bring Parquet logical types to par with Arrow
 Key: PARQUET-757
 URL: https://issues.apache.org/jira/browse/PARQUET-757
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem
Assignee: Julien Le Dem


Missing:
 - Null
 - Interval types
 - Union
 - Short float




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-723) parquet is not storing the type for the column.

2016-10-26 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609272#comment-15609272
 ] 

Julien Le Dem commented on PARQUET-723:
---

It looks like a bug/missing feature in Hive.
[~spena] What do you think?

> parquet is not storing the type for the column.
> ---
>
> Key: PARQUET-723
> URL: https://issues.apache.org/jira/browse/PARQUET-723
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Narasimha
>
> 1. Create Text file format table 
>   CREATE EXTERNAL TABLE IF NOT EXISTS emp(
>   id INT,
>   first_name STRING,
>   last_name STRING,
>   dateofBirth STRING,
>   join_date INT
>   )
>   COMMENT 'This is Employee Table Date Of Birth of type String'
>   ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
>   LINES TERMINATED BY '\n'
>   STORED AS TEXTFILE
>   LOCATION '/user/employee/beforePartition';
> 2. Load the data into table
>   load data inpath '/user/somupoc_timestamp/employeeData_partitioned.csv' 
> into table emp;
>   select * from emp;
> 3. Create Partitioned table with file format as Parquet (dateofBirth STRING)
>   create external table emp_afterpartition(
>   id int, first_name STRING, last_name STRING, dateofBirth STRING)
>   COMMENT 'Employee partitioned table with dateofBirth of type string'
>   partitioned by (join_date int)
>   STORED as parquet
>   LOCATION '/user/employee/afterpartition';
> 4.  Fetch the data from Partitioned column
>   set hive.exec.dynamic.partition=true;  
>   set hive.exec.dynamic.partition.mode=nonstrict; 
>   insert overwrite table emp_afterpartition partition (join_date) select 
> * from emp;
>   select * from emp_afterpartition;
> 5. Create Partitioned table with file format as Parquet (dateofBirth 
> TIMESTAMP)
>   CREATE EXTERNAL TABLE IF NOT EXISTS 
> employee_afterpartition_timestamp_parq(
>   id INT,first_name STRING,last_name STRING,dateofBirth TIMESTAMP)
>   COMMENT 'employee partitioned table with dateofBirth of type TIMESTAMP'
>   PARTITIONED BY (join_date INT)
>   STORED AS PARQUET
>   LOCATION '/user/employee/afterpartition';
>   select * from employee_afterpartition_timestamp_parq;
> -- 0 records returned
>   impala ::   alter table employee_afterpartition_timestamp_parq 
> RECOVER PARTITIONS;
>   Hive :: MSCK REPAIR TABLE 
> employee_afterpartition_timestamp_parq;
>   -- MSCK works in Hive and  RECOVER PARTITIONS works in Impala -- 
> metastore check command with the repair table option:
>   select * from employee_afterpartition_timestamp_parq;
> Actual Result :: Failed with exception 
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to 
> org.apache.hadoop.hive.serde2.io.TimestampWritable
> Expected Result :: Data should display
> Note: if file format is text file instead of Parquet then I am able to fetch 
> the data.
> Observation : Two tables having different column type pointing to same 
> location(HDFS ).
> sample Data
> =
> 1,Joyce,Garza,2016-07-17 14:42:18,201607
> 2,Jerry,Ortiz,2016-08-17 21:36:54,201608
> 3,Steven,Ryan,2016-09-10 01:32:40,201609
> 4,Lisa,Black,2015-10-12 15:05:13,201610
> 5,Jose,Turner,2015-011-10 06:38:40,201611
> 6,Joyce,Garza,2016-08-02,201608
> 7,Jerry,Ortiz,2016-01-01,201601
> 8,Steven,Ryan,2016/08/20,201608
> 9,Lisa,Black,2016/09/12,201609
> 10,Jose,Turner,09/19/2016,201609
> 11,Jose,Turner,20160915,201609



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-756) Add Union Logical type

2016-10-26 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-756:
-

 Summary: Add Union Logical type
 Key: PARQUET-756
 URL: https://issues.apache.org/jira/browse/PARQUET-756
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Julien Le Dem
Assignee: Julien Le Dem


Add a union type annotation for Group types that represent a Union rather than 
a struct.
Models like Avro or Arrow would make use of it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-753) GroupType.union() doesn't merge the original type

2016-10-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-753.
---
Resolution: Fixed

https://github.com/apache/parquet-mr/commit/e5cd652aeb3305ef2b82a7925cce3a132bf6f5ae

> GroupType.union() doesn't merge the original type
> -
>
> Key: PARQUET-753
> URL: https://issues.apache.org/jira/browse/PARQUET-753
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Deneche A. Hakim
>
> When merging two GroupTypes, the union() method doesn't merge their original 
> type, which is lost after the union.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PARQUET-727) Ensure correct version of thrift is used

2016-09-27 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem reassigned PARQUET-727:
-

Assignee: Julien Le Dem

> Ensure correct version of thrift is used
> 
>
> Key: PARQUET-727
> URL: https://issues.apache.org/jira/browse/PARQUET-727
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Niels Basjes
>Assignee: Julien Le Dem
>
> I found that if you have the wrong version of thrift in your path during the 
> build, the errors you get are very obscure and verbose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (PARQUET-722) Building with JDK 8 fails over a maven bug

2016-09-21 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem closed PARQUET-722.
-

> Building with JDK 8 fails over a maven bug
> --
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
>  Issue Type: Bug
>Reporter: Niels Basjes
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
> on project parquet-generator: Error rendering velocity resource. 
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in 
> Maven in combination with Java 8:
> At this page 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
>  
> Now this bug has been solved at the Maven end in maven-filtering 1.2
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest 
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-722) Building with JDK 8 fails over a maven bug

2016-09-21 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511173#comment-15511173
 ] 

Julien Le Dem commented on PARQUET-722:
---

Thanks for spending the time!

> Building with JDK 8 fails over a maven bug
> --
>
> Key: PARQUET-722
> URL: https://issues.apache.org/jira/browse/PARQUET-722
> Project: Parquet
>  Issue Type: Bug
>Reporter: Niels Basjes
>
> When I build parquet on my system I get this error during the build:
> {quote}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
> on project parquet-generator: Error rendering velocity resource. 
> NullPointerException -> [Help 1]
> {quote}
> About a year ago [~julienledem] responded that this is caused by a bug in 
> Maven in combination with Java 8:
> At this page 
> http://stackoverflow.com/questions/31229445/build-failure-apache-parquet-mr-source-mvn-install-failure/33360512#33360512
>  
> Now this bug has been solved at the Maven end in maven-filtering 1.2
> https://issues.apache.org/jira/browse/MSHARED-319
> The problem is that this fix has not yet been integrated into the latest 
> available Maven versions.
> I'll put up a pull request with a proposed fix for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-715) clean up abandoned PRs

2016-09-08 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-715:
-

 Summary: clean up abandoned PRs
 Key: PARQUET-715
 URL: https://issues.apache.org/jira/browse/PARQUET-715
 Project: Parquet
  Issue Type: Task
Reporter: Julien Le Dem


parquet-mr: #333
parquet-format: #38, #33



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-392) Release Parquet-mr 1.9.0

2016-09-08 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475130#comment-15475130
 ] 

Julien Le Dem commented on PARQUET-392:
---

[~rdblue] is making an RC soon

> Release Parquet-mr 1.9.0
> 
>
> Key: PARQUET-392
> URL: https://issues.apache.org/jira/browse/PARQUET-392
> Project: Parquet
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-655) The LogicalTypes.md link in README.md points to the old Parquet GitHub repository

2016-09-08 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-655:
--
Assignee: Cheng Lian

> The LogicalTypes.md link in README.md points to the old Parquet GitHub 
> repository
> -
>
> Key: PARQUET-655
> URL: https://issues.apache.org/jira/browse/PARQUET-655
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Trivial
> Fix For: format-2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-655) The LogicalTypes.md link in README.md points to the old Parquet GitHub repository

2016-09-08 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-655.
---
   Resolution: Fixed
Fix Version/s: format-2.4.0

Issue resolved by pull request 41
[https://github.com/apache/parquet-format/pull/41]

> The LogicalTypes.md link in README.md points to the old Parquet GitHub 
> repository
> -
>
> Key: PARQUET-655
> URL: https://issues.apache.org/jira/browse/PARQUET-655
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Trivial
> Fix For: format-2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-696) Move travis download from google code (defunct) to github

2016-08-29 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-696.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 364
[https://github.com/apache/parquet-mr/pull/364]

> Move travis download from google code (defunct) to github
> -
>
> Key: PARQUET-696
> URL: https://issues.apache.org/jira/browse/PARQUET-696
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-696) Move travis download from google code (defunct) to github

2016-08-29 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-696:
-

 Summary: Move travis download from google code (defunct) to github
 Key: PARQUET-696
 URL: https://issues.apache.org/jira/browse/PARQUET-696
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Julien Le Dem
Assignee: Julien Le Dem






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-682) Configure the encoding used by ValueWriters

2016-08-23 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434083#comment-15434083
 ] 

Julien Le Dem commented on PARQUET-682:
---

In general I think we have 2 use cases:

1) The users have specific knowledge of the data that makes them pick a better 
encoding for a given column.
For this we want the override to be by column name rather than by type (see 
the sketch after this comment). Because, for example:
 - the user knows that a field will not dictionary-encode well but will perform 
well with prefix coding; it saves time/memory to skip the fallback from 
dictionary coding and do prefix coding right away.
 - the user knows that a specific encoding will do better on a given column and 
wants to try it first.
 - the user wants to force dictionary encoding on a certain field (and fail if 
it gets too big) for perf reasons.

2) Tweaking a general heuristic to pick a good encoding unsupervised.
Your suggestion (override by type) seems to apply to this in particular.
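
To make use case 1 concrete, a hypothetical per-column property mirroring the 
per-type syntax quoted below (this property name does not exist; it is only an 
illustration):

{code}
import org.apache.hadoop.conf.Configuration;

class EncodingOverrideExample {
  static Configuration perColumnOverride() {
    Configuration conf = new Configuration();
    // Keyed by column path instead of physical type (hypothetical key):
    conf.set("parquet.writer.encoding-override.column.user_id",
        "delta_byte_array");
    return conf;
  }
}
{code}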


> Configure the encoding used by ValueWriters
> ---
>
> Key: PARQUET-682
> URL: https://issues.apache.org/jira/browse/PARQUET-682
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Piyush Narang
>
> This was supposed to be tackled by jira: 
> https://issues.apache.org/jira/browse/PARQUET-601 but that ended up being 
> just the work done to refactor the ValuesWriter factory code out of 
> ParquetProperties. As that is now merged, it would be nice to revisit the 
> original purpose - being able to configure which type of ValuesWriters to be 
> used for writing out columns. 
> Background: Parquet is currently structured to choose the appropriate value 
> writer based on the type of the column as well as the Parquet version. Value 
> writers are responsible for writing out values with the appropriate encoding. 
> As an example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) 
> or RunLengthBitPackingHybridValuesWriter (v2.0). Code to do this is in the 
> [DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L31]
>  and the 
> [DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L35].
>  
> Would be nice to support being able to override the encodings in some way. 
> That allows users to experiment with various encoding strategies manually as 
> well as enables them to override the hardcoded defaults if they don't suit 
> their use case.
> Couple of options I can think of:
> Specifying encoding by type (or column):
> {code}
> parquet.writer.encoding-override. = "encoding1[,encoding2]"
> As an example:
> "parquet.writer.encoding-override.int32" = "plain"
> {code}
> Chooses Plain encoding and hence the PlainValuesWriter.
> When a primary + fallback need to be specified, we can do the following:
> {code}
> "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
> {code}
> Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY 
> encoding as the fallback and hence creates a 
> FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter). 
> In such cases we can mandate that the first encoding listed must allow for 
> Fallbacks by implementing 
> [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
>  
> Another option suggested by [~alexlevenson], was to allow overriding of the 
> ValuesWriterFactory using reflection:
> {code}
> parquet.writer.factory-override = 
> "org.apache.parquet.hadoop.MyValuesWriterFactory"
> {code}
> This creates a factory, MyValuesWriterFactory which is then invoked for every 
> ColumnDescriptor to get a ValueWriter. This provides the flexibility to the 
> user to implement a ValuesWriterFactory that can read configuration for per 
> type / column encoding overrides. Can also be used to plug-in a more 
> sophisticated approach where we choose the appropriate encoding based on the 
> data being seen. A concern raised by [~rdblue] regarding this approach was 
> that ValuesWriters are supposed to be internal classes in Parquet. So we 
> shouldn't be allowing users to configure the ValuesWriter factories via 
> config.
> cc [~julienledem] / [~rdblue] / [~alexlevenson] for your thoughts / other 
> ideas. We could also explore other ideas based on any other potential use 
> cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-685) Deprecated ParquetInputSplit constructor passes parameters in the wrong order.

2016-08-23 Thread Julien Le Dem (JIRA)
Julien Le Dem created PARQUET-685:
-

 Summary: Deprecated ParquetInputSplit constructor passes 
parameters in the wrong order.
 Key: PARQUET-685
 URL: https://issues.apache.org/jira/browse/PARQUET-685
 Project: Parquet
  Issue Type: Bug
Reporter: Julien Le Dem


https://github.com/apache/parquet-mr/blob/255f10834a67cf13518316de0e2c8a345677ebbf/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L92
{noformat}
  @Deprecated
  public ParquetInputSplit(
  Path path,
  long start,
  long length,
  String[] hosts,
>   List<BlockMetaData> blocks,
  String requestedSchema,
  String fileSchema,
>   Map<String, String> extraMetadata,
>   Map<String, String> readSupportMetadata) {
this(path, start, length, end(blocks, requestedSchema), hosts, 
offsets(blocks));
  }
{noformat}
this() refers to the following constructor; note that {{length}} is passed 
where {{end}} is expected, and {{end(blocks, requestedSchema)}} where 
{{length}} is expected:
https://github.com/apache/parquet-mr/blob/255f10834a67cf13518316de0e2c8a345677ebbf/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java#L163
{noformat}
  public ParquetInputSplit(
  Path file, long start, long end, long length, String[] hosts,
  long[] rowGroupOffsets) {
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-677) Quoted identifiers in column names

2016-08-21 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429786#comment-15429786
 ] 

Julien Le Dem commented on PARQUET-677:
---

Parquet support for Hive has moved to the hive repo itself for more recent 
versions. You should find the same tests there.
The serde in the parquet repo is for older versions of Hive, so it won't be 
moving up.
This was to better support compatibility across Hive versions since Hive's API 
kept changing.


> Quoted identifiers in column names
> --
>
> Key: PARQUET-677
> URL: https://issues.apache.org/jira/browse/PARQUET-677
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Michael Styles
>Priority: Minor
>
> Add the ability to quote identifiers for columns in a table. This would allow 
> column names to contain arbitrary characters such as spaces. Hive supports 
> these types of identifiers using backquotes. For example,
> create table parquet_table (`Session Token` string) stored as parquetfile;
> However, attempting to insert a new row into this table results in an error.
> insert into parquet_table values ('1234-45')
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 
> 'token' at line 1:   optional string Session Token
> I would suggest using backquotes in Parquet as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-460) Parquet files concat tool

2016-08-16 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-460.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 327
[https://github.com/apache/parquet-mr/pull/327]

> Parquet files concat tool
> -
>
> Key: PARQUET-460
> URL: https://issues.apache.org/jira/browse/PARQUET-460
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.7.0, 1.8.0
>Reporter: flykobe cheng
>Assignee: flykobe cheng
> Fix For: 1.9.0
>
>
> Currently parquet file generation is time consuming; most of the time is 
> spent on serialization and compression. It costs about 10 minutes to generate 
> a ~100MB parquet file in our scenario. We want to improve write performance 
> without generating too many small files, which would impact read performance.
> We propose to:
> 1. generate several small parquet files concurrently
> 2. merge the small files into one file: concat the parquet blocks in binary 
> (without SerDe), merge the footers, and fix up the path and offset metadata.
> We created a ParquetFilesConcat class for step 2. It can be invoked via 
> parquet.tools.command.ConcatCommand. If this function is approved by the 
> parquet community, we will integrate it in spark.
> It will impact compression and introduce more dictionary pages, but this can 
> be improved by adjusting the concurrency of step 1.
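
For reference, parquet-mr's file-append API can express step 2; a sketch, 
assuming the 1.9-era ParquetFileWriter API (paths and schema are illustrative):

{code}
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

class ConcatSketch {
  static void concat(Configuration conf, MessageType schema,
      Path out, Path[] parts) throws Exception {
    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, out);
    writer.start();
    for (Path part : parts) {
      // Copies row groups in binary (no SerDe) and fixes up offsets.
      writer.appendFile(conf, part);
    }
    writer.end(new HashMap<String, String>());
  }
}
{code}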



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-146) make Parquet compile with java 7 instead of java 6

2016-08-15 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-146.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 231
[https://github.com/apache/parquet-mr/pull/231]

> make Parquet compile with java 7 instead of java 6
> --
>
> Key: PARQUET-146
> URL: https://issues.apache.org/jira/browse/PARQUET-146
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Julien Le Dem
>  Labels: beginner, noob, pick-me-up
> Fix For: 1.9.0
>
>
> currently Parquet is compatible with java 6. we should remove this constraint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-669) Allow reading file footers from input streams when writing metadata files

2016-08-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-669.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 357
https://github.com/apache/parquet-mr/pull/357


> Allow reading file footers from input streams when writing metadata files
> -
>
> Key: PARQUET-669
> URL: https://issues.apache.org/jira/browse/PARQUET-669
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Robert Kruszewski
>Assignee: Robert Kruszewski
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-669) Allow reading file footers from input streams when writing metadata files

2016-08-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-669:
--
Assignee: Robert Kruszewski

> Allow reading file footers from input streams when writing metadata files
> -
>
> Key: PARQUET-669
> URL: https://issues.apache.org/jira/browse/PARQUET-669
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Robert Kruszewski
>Assignee: Robert Kruszewski
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-668) Provide option to disable auto crop feature in DumpCommand output

2016-08-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-668.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 358
[https://github.com/apache/parquet-mr/pull/358]

> Provide option to disable auto crop feature in DumpCommand output
> -
>
> Key: PARQUET-668
> URL: https://issues.apache.org/jira/browse/PARQUET-668
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Daniel Harper
>Priority: Trivial
> Fix For: 1.9.0
>
>
> *Problem*
> When using the {{dump}} command in {{parquet-tools}}, the output will 
> sometimes be truncated based on the width of your console, especially on 
> smaller displays.
> Example:
> {code}
> row group 0
> 
> id:  INT32 SNAPPY DO:0 FPO:4 SZ:44668/920538/20.61 VC:7240100  
> [more]...
> name:BINARY SNAPPY DO:0 FPO:44672 SZ:89464018/1031768430/11.53 
> [more]...
> event_time:  INT64 SNAPPY DO:0 FPO:89508690 SZ:43600235/57923935/1.33 
> VC:7240100 [more]...
> id TV=7240100 RL=0 DL=0 DS: 2 DE:PLAIN_DICTIONARY
> 
> 
> page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLA 
> [more]... SZ:33291
> {code}
> This is especially annoying if you pipe the output to a file as the 
> truncation remains in place. 
> *Proposed fix*
> Provide the flag {{--disable-crop}} for the dump command. Truncation is 
> enabled by default and will only be disabled when this flag is provided,
> This will output the full content to standard out, for example:
> {code}
> row group 0
> 
> id:  INT32 SNAPPY DO:0 FPO:4 SZ:44668/920538/20.61 VC:7240100 
> ENC:BIT_PACKED,PLAIN_DICTIONARY
> name:BINARY SNAPPY DO:0 FPO:44672 SZ:89464018/1031768430/11.53 
> VC:7240100 ENC:PLAIN,BIT_PACKED
> event_time:  INT64 SNAPPY DO:0 FPO:89508690 SZ:43600235/57923935/1.33 
> VC:7240100 ENC:PLAIN,BIT_PACKED,RLE
> id TV=7240100 RL=0 DL=0 DS: 2 DE:PLAIN_DICTIONARY
> 
> 
> page 0:  DLE:BIT_PACKED RLE:BIT_PACKED 
> VLE:PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0] SZ:33291 VC:262146
> page 1:  DLE:BIT_PACKED RLE:BIT_PACKED 
> VLE:PLAIN_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0] SZ:33291 VC:262145
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (PARQUET-323) INT96 should be marked as deprecated

2016-08-03 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-323:
--
Comment: was deleted

(was: I think we should deprecate it and discourage its use. For backward 
compatibility, it has to stay.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md doesn't 
even refer to it.
)

> INT96 should be marked as deprecated
> 
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec 
> timestamp in Impala for some historical reasons, and should be deprecated. 
> Since nanosec precision is rarely a real requirement, one possible and simple 
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-323) INT96 should be marked as deprecated

2016-08-03 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406330#comment-15406330
 ] 

Julien Le Dem commented on PARQUET-323:
---

I think we should deprecate it and discourage its use. For backward 
compatibility, it has to stay.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md doesn't 
even refer to it.


> INT96 should be marked as deprecated
> 
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec 
> timestamp in Impala for some historical reasons, and should be deprecated. 
> Since nanosec precision is rarely a real requirement, one possible and simple 
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-671) Improve performance of RLE/bit-packed decoding in parquet-cpp

2016-08-01 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402877#comment-15402877
 ] 

Julien Le Dem commented on PARQUET-671:
---

Thanks [~edaniel]
Looking forward to seeing your PR

> Improve performance of RLE/bit-packed decoding in parquet-cpp
> -
>
> Key: PARQUET-671
> URL: https://issues.apache.org/jira/browse/PARQUET-671
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Eric Daniel
>Assignee: Eric Daniel
>
> There are steps that can dramatically improve decoding performance:
> - when decoding repeated values in the rle/dictionary encoding, do the 
> dictionary lookup only once
> - when decoding bit-packed sequences, do the decoding in batches so the bit 
> unpacker's state can be kept in registers (instead of updating members for 
> every decoded value)
> - use Daniel Lemire's fast unpacking routines whenever possible 
> (https://github.com/lemire/FrameOfReference/)
> I have a PR ready to implement these changes.
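
The first point amounts to hoisting the dictionary lookup out of a run; a 
sketch in Java (illustrative types, not the parquet-cpp code):

{code}
import java.util.Arrays;

class RunDecoder {
  // For an RLE run of `runLength` repeats of `dictionaryIndex`, resolve the
  // dictionary entry once and fill the output, instead of looking the same
  // index up once per value.
  static void decodeRun(long[] dictionary, int dictionaryIndex,
      int runLength, long[] out, int outPos) {
    long value = dictionary[dictionaryIndex];
    Arrays.fill(out, outPos, outPos + runLength, value);
  }
}
{code}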



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-612) Add compression to FileEncodingIT tests

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-612:
--
Assignee: Ryan Blue

> Add compression to FileEncodingIT tests
> ---
>
> Key: PARQUET-612
> URL: https://issues.apache.org/jira/browse/PARQUET-612
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> The {{FileEncodingsIT}} test validates that pages can be read independently 
> with all encodings, without compression. Pages should not depend on one 
> another for compression to be correct as well, so we should extend this test 
> to use the other compression codecs.
> This test is already expensive, so I propose adding an environment variable 
> to add more compression codecs. That way this results in no extra build/test 
> time, but we can turn on more validation in Travis CI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-642) Improve performance of ByteBuffer based read / write paths

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-642.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 347
[https://github.com/apache/parquet-mr/pull/347]

> Improve performance of ByteBuffer based read / write paths
> --
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
>  Issue Type: Bug
>Reporter: Piyush Narang
>Assignee: Piyush Narang
> Fix For: 1.9.0
>
>
> While trying out the newest Parquet version, we noticed that the changes to 
> start using ByteBuffers: 
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
>  and 
> https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
>  (mostly avro but a couple of ByteBuffer changes) caused our jobs to slow 
> down a bit. 
> Read overhead: 4-6% (in MB_Millis)
> Write overhead: 6-10% (MB_Millis). 
> This seems to be due to the encoding / decoding of Strings in the 
> Binary class 
> (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java)
>  - toStringUsingUTF8() - for reads
> encodeUTF8() - for writes
> In those methods we're using the nio Charsets for encode / decode:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
> return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
> throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
>   }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset+length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer, would have 
> to slice it
>   // which creates another ByteBuffer object or do what is done here to 
> adjust the
>   // limit/offset and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
> Tried out some micro / macro benchmarks and it seems like switching those out 
> to using the String class for the encoding / decoding improves performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
> try {
>   ret = new String(value.array(), value.arrayOffset() + offset, 
> length, "UTF-8");
> } catch (UnsupportedEncodingException e) {
>   throw new ParquetDecodingException("UTF-8 not supported");
> }
>   } else {
> int limit = value.limit();
> value.limit(offset+length);
> int position = value.position();
> value.position(offset);
> // no corresponding interface to read a subset of a buffer, would 
> have to slice it
> // which creates another ByteBuffer object or do what is done here to 
> adjust the
> // limit/offset and set them back after
> ret = UTF8.decode(value).toString();
> value.limit(limit);
> value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
> return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
> throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-642) Improve performance of ByteBuffer based read / write paths

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-642:
--
Assignee: Piyush Narang

> Improve performance of ByteBuffer based read / write paths
> --
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
>  Issue Type: Bug
>Reporter: Piyush Narang
>Assignee: Piyush Narang
> Fix For: 1.9.0
>
>
> While trying out the newest Parquet version, we noticed that the changes to 
> start using ByteBuffers: 
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
>  and 
> https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
>  (mostly avro but a couple of ByteBuffer changes) caused our jobs to slow 
> down a bit. 
> Read overhead: 4-6% (in MB_Millis)
> Write overhead: 6-10% (MB_Millis). 
> This seems to be due to the encoding / decoding of Strings in the 
> Binary class 
> (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java)
>  - toStringUsingUTF8() - for reads
> encodeUTF8() - for writes
> In those methods we're using the nio Charsets for encode / decode:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
> return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
> throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
>   }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset+length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer, would have 
> to slice it
>   // which creates another ByteBuffer object or do what is done here to 
> adjust the
>   // limit/offset and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
> Tried out some micro / macro benchmarks and it seems like switching those out 
> to using the String class for the encoding / decoding improves performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
> try {
>   ret = new String(value.array(), value.arrayOffset() + offset, 
> length, "UTF-8");
> } catch (UnsupportedEncodingException e) {
>   throw new ParquetDecodingException("UTF-8 not supported");
> }
>   } else {
> int limit = value.limit();
> value.limit(offset+length);
> int position = value.position();
> value.position(offset);
> // no corresponding interface to read a subset of a buffer, would 
> have to slice it
> // which creates another ByteBuffer object or do what is done here to 
> adjust the
> // limit/offset and set them back after
> ret = UTF8.decode(value).toString();
> value.limit(limit);
> value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
> return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
> throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-645) DictionaryFilter incorrectly handles null

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-645.
---
Resolution: Fixed

Issue resolved by pull request 348
[https://github.com/apache/parquet-mr/pull/348]

> DictionaryFilter incorrectly handles null
> -
>
> Key: PARQUET-645
> URL: https://issues.apache.org/jira/browse/PARQUET-645
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> DictionaryFilter checks whether a column can match a query and filters out 
> row groups that can't match. Equality checks don't currently handle null 
> correctly: null is never in the dictionary and is encoded by the definition 
> level instead. This is causing row groups to be filtered out when they should 
> not be, because "col is null" is always true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-544) ParquetWriter.close() throws NullPointerException on second call, improper implementation of Closeable contract

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-544.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 345
[https://github.com/apache/parquet-mr/pull/345]

> ParquetWriter.close() throws NullPointerException on second call, improper 
> implementation of Closeable contract
> ---
>
> Key: PARQUET-544
> URL: https://issues.apache.org/jira/browse/PARQUET-544
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Michal Turek
>Assignee: Michal Turek
>Priority: Minor
> Fix For: 1.9.0
>
>
> {{org.apache.parquet.hadoop.ParquetWriter}} implements 
> {{java.util.Closeable}}, but its {{close()}} method doesn't follow its 
> contract properly. The interface defines "If the stream is already closed 
> then invoking this method has no effect.", but {{ParquetWriter}} instead 
> throws {{NullPointerException}}.
> Its source is quite obvious: {{columnStore}} is set to null and then 
> accessed again. There is no "if already closed" check to prevent it.
> {noformat}
> java.lang.NullPointerException: null
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:157)
>  ~[parquet-hadoop-1.8.1.jar:1.8.1]
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>  ~[parquet-hadoop-1.8.1.jar:1.8.1]
>   at 
> org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297) 
> ~[parquet-hadoop-1.8.1.jar:1.8.1]
> {noformat}
> {noformat}
>   private void flushRowGroupToStore()
>       throws IOException {
>     LOG.info(format("Flushing mem columnStore to file. allocated memory: %,d",
>         columnStore.getAllocatedSize()));
>     if (columnStore.getAllocatedSize() > (3 * rowGroupSizeThreshold)) {
>       LOG.warn("Too much memory used: " + columnStore.memUsageString());
>     }
>     if (recordCount > 0) {
>       parquetFileWriter.startBlock(recordCount);
>       columnStore.flush();
>       pageStore.flushToFileWriter(parquetFileWriter);
>       recordCount = 0;
>       parquetFileWriter.endBlock();
>       this.nextRowGroupSize = Math.min(
>           parquetFileWriter.getNextRowGroupSize(),
>           rowGroupSizeThreshold);
>     }
>     columnStore = null;
>     pageStore = null;
>   }
> {noformat}
> A known workaround is to prevent the second and subsequent closes explicitly 
> in the application code.
> {noformat}
> private final ParquetWriter writer;
> private boolean closed;
> private void closeWriterOnlyOnce() throws IOException {
>   if (!closed) {
>     closed = true;
>     writer.close();
>   }
> }
> {noformat}
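
The proper fix follows the same shape inside the writer itself, making close() idempotent per the Closeable contract. A minimal sketch with hypothetical names (not the actual parquet-mr patch):

{code}
import java.io.Closeable;
import java.io.IOException;

public class IdempotentWriter implements Closeable {
  private boolean closed = false;

  @Override
  public void close() throws IOException {
    if (closed) {
      return; // "If the stream is already closed ... has no effect."
    }
    closed = true;
    flushRowGroupToStore(); // tears down internal state; must run only once
  }

  private void flushRowGroupToStore() throws IOException {
    // flush buffered data and null out columnStore/pageStore,
    // as in the quoted code
  }
}
{code}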



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-594) Support CRC checksums in pages

2016-05-13 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283433#comment-15283433
 ] 

Julien Le Dem commented on PARQUET-594:
---

This has been defined in the spec but is not implemented.
Sometimes you may want to turn off the HDFS CRC check to speed things up 
(CRCs on pages are finer-grained and require less checking, depending on what 
you are doing).
I agree that it's low on the parquet-cpp priority list.
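
For illustration, a minimal Java sketch of a page-level checksum, assuming (as the format spec describes) a 32-bit CRC computed over the serialized page bytes and carried in the page header:

{code}
import java.util.zip.CRC32;

public class PageChecksum {
  static int crcOfPage(byte[] pageBytes) {
    CRC32 crc = new CRC32();
    crc.update(pageBytes, 0, pageBytes.length);
    return (int) crc.getValue(); // stored as a 32-bit value
  }

  // Readers can verify each page independently, at a finer granularity
  // than a filesystem block-level checksum.
  static void verify(byte[] pageBytes, int expectedCrc) {
    if (crcOfPage(pageBytes) != expectedCrc) {
      throw new IllegalStateException("page CRC mismatch: corrupt page");
    }
  }
}
{code}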

> Support CRC checksums in pages
> --
>
> Key: PARQUET-594
> URL: https://issues.apache.org/jira/browse/PARQUET-594
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-616) C++: WriteBatch should accept const arrays

2016-05-13 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283431#comment-15283431
 ] 

Julien Le Dem commented on PARQUET-616:
---

done

> C++: WriteBatch should accept const arrays
> --
>
> Key: PARQUET-616
> URL: https://issues.apache.org/jira/browse/PARQUET-616
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
> Fix For: cpp-0.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-482) Organize src code file structure to have a very clear folder with public headers.

2016-03-01 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-482.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 70
[https://github.com/apache/parquet-cpp/pull/70]

> Organize src code file structure to have a very clear folder with public 
> headers.
> -
>
> Key: PARQUET-482
> URL: https://issues.apache.org/jira/browse/PARQUET-482
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Nong Li
>Assignee: Wes McKinney
> Fix For: cpp-0.1
>
>
> We should organize the source code structure to have a folder where all the 
> public headers are and nothing else. This makes it easy to understand what 
> the public API is and which APIs need to be reviewed for compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-519) Disable compiler warning supressions and fix all DEBUG build warnings

2016-03-01 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-519.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 69
[https://github.com/apache/parquet-cpp/pull/69]

> Disable compiler warning supressions and fix all DEBUG build warnings
> -
>
> Key: PARQUET-519
> URL: https://issues.apache.org/jira/browse/PARQUET-519
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: cpp-0.1
>
>
> Related to PARQUET-447



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-537) LocalFileSource leaks resources

2016-03-01 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-537.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 68
[https://github.com/apache/parquet-cpp/pull/68]

> LocalFileSource leaks resources
> ---
>
> Key: PARQUET-537
> URL: https://issues.apache.org/jira/browse/PARQUET-537
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-0.1
>Reporter: Aliaksei Sandryhaila
>Assignee: Aliaksei Sandryhaila
> Fix For: cpp-0.1
>
>
> As a result of modifications introduced in PARQUET-497, LocalFileSource never 
> gets deleted and the associated memory and file handle are leaked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-545) Improve API to support Decimal type

2016-02-29 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-545.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 65
[https://github.com/apache/parquet-cpp/pull/65]

> Improve API to support Decimal type
> ---
>
> Key: PARQUET-545
> URL: https://issues.apache.org/jira/browse/PARQUET-545
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
> Fix For: cpp-0.1
>
>
> Extend the `ColumnDescriptor` API to return `precision` and `scale` values 
> from DecimalMetadata. Implement necessary checks if the `LogicalType` is 
> Decimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-494) Implement PLAIN_DICTIONARY encoding and decoding

2016-02-26 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-494.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 64
[https://github.com/apache/parquet-cpp/pull/64]

> Implement PLAIN_DICTIONARY encoding and decoding
> 
>
> Key: PARQUET-494
> URL: https://issues.apache.org/jira/browse/PARQUET-494
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: cpp-0.1
>
>
> parquet-cpp currently only supports {{Encoding::RLE_DICTIONARY}}. Some 
> implementations of Parquet still use {{Encoding::PLAIN_DICTIONARY}} (the 
> dictionary indices are not RLE-encoded). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-525) Test coverage for malformed file failure modes on the read path

2016-02-22 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-525.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 60
[https://github.com/apache/parquet-cpp/pull/60]

> Test coverage for malformed file failure modes on the read path
> ---
>
> Key: PARQUET-525
> URL: https://issues.apache.org/jira/browse/PARQUET-525
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-0.1
>
>
> These code paths do not have test coverage. We should construct test cases 
> that cover each possible kind of malformation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-541) Portable build scripts

2016-02-20 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-541.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 61
[https://github.com/apache/parquet-cpp/pull/61]

> Portable build scripts
> --
>
> Key: PARQUET-541
> URL: https://issues.apache.org/jira/browse/PARQUET-541
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Dmitry Bushev
>Priority: Minor
> Fix For: cpp-0.1
>
>
> Shebangs in build scripts should be portable, because some systems (e.g. 
> NixOS) do not have bash at the absolute path /bin/bash.
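
The usual portable form resolves the interpreter through the environment instead of hard-coding its location:

{noformat}
#!/usr/bin/env bash
{noformat}

/usr/bin/env is present on virtually all Unix-like systems, including NixOS, which does not install bash at /bin/bash.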



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-533) Simplify RandomAccessSource API to combine Seek/Read

2016-02-20 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-533.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 59
[https://github.com/apache/parquet-cpp/pull/59]

> Simplify RandomAccessSource API to combine Seek/Read 
> -
>
> Key: PARQUET-533
> URL: https://issues.apache.org/jira/browse/PARQUET-533
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: cpp-0.1
>
>
> In situations where memory-mapping is available, copying bytes into a 
> newly-allocated memory buffer may be unnecessary.
> I propose to generally simplify the interface to random-access capable data 
> sources to instead return a {{Buffer}} object (that I'll define) whose 
> subclasses can be responsible for RAII memory-allocation/deallocation if it 
> is necessary. This way, users of {{RandomAccessSource}} need not necessarily 
> be responsible for memory allocation and object lifetime management. 
> Not an urgent matter but will get a patch together sometime in the next 
> several weeks (most likely at the same time as adding a memory-mapped file 
> input source).
> As an aside, it would be useful to have this same kind of abstraction 
> available in the context of compressed data pages (note the decompression 
> buffer member variable in {{ColumnReader}})
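
A rough Java transliteration of the proposed shape (hypothetical names; the actual interface is C++, where Buffer subclasses would handle RAII deallocation):

{code}
import java.io.IOException;
import java.nio.ByteBuffer;

public interface RandomAccessSource {
  interface Buffer {
    // A view over the bytes: may be a zero-copy memory-mapped region or an
    // owned allocation; the subclass decides and manages the lifetime.
    ByteBuffer data();
  }

  // Seek and read combined into one positioned call; the caller no longer
  // allocates or manages the destination memory.
  Buffer readAt(long offset, int length) throws IOException;
}
{code}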



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-514) Automate coveralls.io updates in Travis CI

2016-02-19 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-514.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 57
[https://github.com/apache/parquet-cpp/pull/57]

> Automate coveralls.io updates in Travis CI
> --
>
> Key: PARQUET-514
> URL: https://issues.apache.org/jira/browse/PARQUET-514
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: cpp-0.1
>
>
> The repo has been enabled in INFRA-11273, so all that's left is to work on 
> the Travis CI build matrix and add coveralls to one of the builds (rather 
> than running it for all of them)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-471) Use the same environment setup script for Travis CI as local sandbox development

2016-02-18 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-471.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 54
[https://github.com/apache/parquet-cpp/pull/54]

> Use the same environment setup script for Travis CI as local sandbox 
> development
> 
>
> Key: PARQUET-471
> URL: https://issues.apache.org/jira/browse/PARQUET-471
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: cpp-0.1
>
>
> Currently the environment setups are slightly different, and so a passing 
> Travis CI build might have a problem with the sandbox build and vice versa.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-499) Complete PlainEncoder implementation for all primitive types and test end to end

2016-02-18 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-499.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

resolved by:
https://github.com/apache/parquet-cpp/pull/52

> Complete PlainEncoder implementation for all primitive types and test end to 
> end
> 
>
> Key: PARQUET-499
> URL: https://issues.apache.org/jira/browse/PARQUET-499
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
> Fix For: cpp-0.1
>
>
> As part of PARQUET-485, I added a partial {{Encoding::PLAIN}} encoder 
> implementation. This needs to be finished, with a test suite that validates 
> data round-trips across all primitive types. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-515) Add "Reset" to LevelEncoder and LevelDecoder

2016-02-16 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-515.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 51
[https://github.com/apache/parquet-cpp/pull/51]

> Add "Reset" to LevelEncoder and LevelDecoder
> 
>
> Key: PARQUET-515
> URL: https://issues.apache.org/jira/browse/PARQUET-515
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
> Fix For: cpp-0.1
>
>
> The rle-encoder and rle-decoder classes have a "Reset" method as a quick way 
> to initialize the objects. This method resets the encoder and decoder state 
> to work on a new buffer without the need to create a new object for every 
> data page.
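
A sketch of the pattern (hypothetical class; the real parquet-cpp LevelDecoder is C++): one long-lived object re-pointed at each new page's bytes instead of a fresh allocation per page:

{code}
public class ReusableLevelDecoder {
  private byte[] buffer;
  private int offset;
  private int end;

  // Discard all prior state and re-point the decoder at a new data page's
  // level bytes; no allocation, so one instance serves every page.
  public void reset(byte[] pageBytes, int offset, int length) {
    this.buffer = pageBytes;
    this.offset = offset;
    this.end = offset + length;
  }

  public boolean hasMore() {
    return offset < end;
  }
}
{code}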



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-431) Make ParquetOutputFormat.memoryManager volatile

2016-02-15 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-431.
---
Resolution: Fixed

Issue resolved by pull request 313
[https://github.com/apache/parquet-mr/pull/313]

> Make ParquetOutputFormat.memoryManager volatile
> ---
>
> Key: PARQUET-431
> URL: https://issues.apache.org/jira/browse/PARQUET-431
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1
>Reporter: Liwei Lin
>Assignee: Liwei Lin
> Fix For: 1.9.0
>
>
> Currently ParquetOutputFormat.getRecordWriter() contains an unsynchronized 
> lazy initialization of the non-volatile static field *memoryManager*.
> Because the compiler or processor may reorder instructions, threads are not 
> guaranteed to see a completely initialized object when 
> ParquetOutputFormat.getRecordWriter() is called by multiple threads.
> This ticket proposes to make *memoryManager* volatile to correct the problem.
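
A minimal sketch of the corrected initialization (hypothetical skeleton, not the actual patch): volatile plus double-checked locking guarantees every thread observes a fully constructed manager:

{code}
public class OutputFormatSketch {
  private static volatile MemoryManager memoryManager;

  static MemoryManager getMemoryManager(long totalMemory) {
    if (memoryManager == null) {              // fast path, no lock
      synchronized (OutputFormatSketch.class) {
        if (memoryManager == null) {          // re-check under the lock
          // volatile write: publishes a fully constructed object; without
          // volatile, reordering could expose partially built state
          memoryManager = new MemoryManager(totalMemory);
        }
      }
    }
    return memoryManager;
  }

  static class MemoryManager {
    MemoryManager(long totalMemory) { /* set up pools, thresholds, ... */ }
  }
}
{code}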



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

