Re: [VOTE] Release Apache Parquet MR 1.8.3 RC0

2018-05-08 Thread Ryan Blue
+0

The signature is good and I was able to build and test.

The release doesn't conform to the recently updated policy for checksums.
Specifically, the sha file should be named sha1 (though sha512 is
recommended), and there should not be an md5 checksum:
http://www.apache.org/dev/release-distribution#sigs-and-sums

Could you guys create a sha512 file and delete the other two checksums?
That would change my vote to a +1.
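
For reference, here is a minimal sketch of how a sha512 checksum could be
produced using only plain JDK classes. The class name Sha512Sketch and its
method are illustrative assumptions and not part of the Parquet release
tooling; the actual release process may well use a command-line tool instead.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the Parquet release scripts.
final class Sha512Sketch {

  // Returns the lowercase hex SHA-512 digest of the given bytes.
  static String sha512Hex(byte[] data) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-512").digest(data);
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Not expected on a standard JDK; treated as a configuration error here.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) throws java.io.IOException {
    // With a path argument, hash that file (read fully into memory, which is
    // fine for a sketch); otherwise hash a fixed sample string.
    byte[] data = args.length > 0
        ? Files.readAllBytes(Paths.get(args[0]))
        : "sample".getBytes(StandardCharsets.UTF_8);
    System.out.println(sha512Hex(data));
  }
}
```

The printed digest could then be stored next to the tarball in a file with
the .sha512 extension.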

rb

On Tue, May 8, 2018 at 7:26 AM, Zoltan Ivanfi  wrote:

> +1 (binding)
>
> built and tested
> verified signature
>
> I agree with Uwe that a verification script would be useful.
>
> Zoltan
>
> On Mon, May 7, 2018 at 5:37 PM Uwe L. Korn  wrote:
>
> > +1 (binding)
> >
> > * Built and tested on Debian 8
> > * verified sha1
> > * verified signature
> >
> > It was quite a hassle to build, since protobuf and thrift had to be
> > installed manually. For newer releases, there definitely needs to be a
> > verification script; otherwise voting is quite a labor-intensive process.
> >
> > Uwe
> >
> > On Mon, May 7, 2018, at 9:58 AM, Gabor Szadovszky wrote:
> > > Hi Uwe,
> > >
> > > > I guess this is because you are building it with java8. The 1.8.3 branch
> > > > is still on 1.6 (source and target) and travis is configured to use
> > > > jdk7. We also used jdk7 for the build.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > > On 7 May 2018, at 09:46, Uwe L. Korn  wrote:
> > > >
> > > > Hello,
> > > >
> > > > > the build is failing for me with "[ERROR] Failed to execute goal
> > > > > org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process
> > > > > (default) on project parquet-generator: Error rendering velocity resource.:
> > > > > NullPointerException", extended stacktrace:
> > > > > https://gist.github.com/xhochy/fd62748ba8c300a5f238a80e8bacfc90
> > > >
> > > > > I can provide more information if you can tell me what you would need.
> > > >
> > > > Uwe
> > > >
> > > > On Fri, May 4, 2018, at 2:12 PM, Gabor Szadovszky wrote:
> > > >> Hi everyone,
> > > >>
> > > > >> Zoltan and I propose the following RC to be released as official
> > > > >> Apache Parquet MR 1.8.3 release.
> > > > >>
> > > > >> The commit id is aef7230e114214b7cc962a8f3fc5aeed6ce80828
> > > > >> * This corresponds to the tag: apache-parquet-1.8.3
> > > > >> * https://github.com/apache/parquet-mr/tree/aef7230e114214b7cc962a8f3fc5aeed6ce80828
> > > > >>
> > > > >> The release tarball, signature, and checksums are here:
> > > > >> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.8.3-rc0/
> > > > >>
> > > > >> You can find the KEYS file here:
> > > > >> * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > > >>
> > > > >> Binary artifacts are staged in Nexus here:
> > > > >> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > >>
> > > > >> This is a maintenance release created mainly for Spark containing 2 bug
> > > > >> fixes related to the statistics handling.
> > > > >> See
> > > > >> https://github.com/apache/parquet-mr/blob/aef7230e114214b7cc962a8f3fc5aeed6ce80828/CHANGES.md
> > > > >> for details.
> > > >>
> > > >> Please download, verify, and test.
> > > >>
> > > >> [ ] +1 Release this as Apache Parquet MR 1.8.3
> > > >> [ ] +0
> > > >> [ ] -1 Do not release this because…
> > >
> >
>



-- 
Ryan Blue
Software Engineer
Netflix


[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-05-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467616#comment-16467616
 ] 

ASF GitHub Bot commented on PARQUET-1253:
-

gszadovszky commented on a change in pull request #463: PARQUET-1253: Support 
for new logical type representation
URL: https://github.com/apache/parquet-mr/pull/463#discussion_r186779505
 
 

 ##
 File path: 
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
 ##
 @@ -36,42 +36,152 @@
 import org.apache.parquet.format.TimeType;
 import org.apache.parquet.format.TimestampType;
 
+import java.util.List;
 import java.util.Objects;
 
-public interface LogicalTypeAnnotation {
+public abstract class LogicalTypeAnnotation {
+  public enum LogicalTypes {
+    MAP {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return mapType();
+      }
+    },
+    LIST {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return listType();
+      }
+    },
+    UTF8 {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return stringType();
+      }
+    },
+    MAP_KEY_VALUE {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return MapKeyValueTypeAnnotation.getInstance();
+      }
+    },
+    ENUM {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return enumType();
+      }
+    },
+    DECIMAL {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        if (params.size() != 2) {
+          throw new RuntimeException("Expecting 2 parameters for decimal logical type, got " + params.size());
+        }
+        return decimalType(Integer.valueOf(params.get(1)), Integer.valueOf(params.get(0)));
+      }
+    },
+    DATE {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return dateType();
+      }
+    },
+    TIME {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        if (params.size() != 2) {
+          throw new RuntimeException("Expecting 2 parameters for time logical type, got " + params.size());
+        }
+        return timeType(Boolean.parseBoolean(params.get(1)), TimeUnit.valueOf(params.get(0)));
+      }
+    },
+    TIMESTAMP {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        if (params.size() != 2) {
+          throw new RuntimeException("Expecting 2 parameters for timestamp logical type, got " + params.size());
+        }
+        return timestampType(Boolean.parseBoolean(params.get(1)), TimeUnit.valueOf(params.get(0)));
+      }
+    },
+    INT {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        if (params.size() != 2) {
+          throw new RuntimeException("Expecting 2 parameters for integer logical type, got " + params.size());
+        }
+        return intType(Integer.valueOf(params.get(0)), Boolean.parseBoolean(params.get(1)));
+      }
+    },
+    JSON {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return jsonType();
+      }
+    },
+    BSON {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return bsonType();
+      }
+    },
+    INTERVAL {
+      @Override
+      protected LogicalTypeAnnotation fromString(List<String> params) {
+        return IntervalLogicalTypeAnnotation.getInstance();
+      }
+    };
+
+    protected abstract LogicalTypeAnnotation fromString(List<String> params);
+  }
+
   /**
    * Convert this parquet-mr logical type to parquet-format LogicalType.
    *
    * @return the parquet-format LogicalType representation of this logical type implementation
    */
-  LogicalType toLogicalType();
+  public abstract LogicalType toLogicalType();
 
   /**
    * Convert this parquet-mr logical type to parquet-format ConvertedType.
    *
    * @return the parquet-format ConvertedType representation of this logical type implementation
    */
-  ConvertedType toConvertedType();
+  public abstract ConvertedType toConvertedType();
 
   /**
    * Convert this logical type to old logical type representation in parquet-mr (if there's any).
    * Those logical type implementations, which don't have a corresponding mapping should return null.
    *
    * @return the OriginalType representation of the new logical type, or null if there's none
    */
-  OriginalType toOriginalType();
+  public abstract OriginalType toOriginalType();
 
   /**
    * Visits this logical type with the given visitor
    *
    * @param logicalTypeAnnotationVisitor the visitor to visit this type
    */
-  void accept(LogicalTypeAnnotationVisitor logicalTypeAnnotationVisitor);
+  public abstract void accept(LogicalTypeAnnotationVisitor 

[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-05-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467613#comment-16467613
 ] 

ASF GitHub Bot commented on PARQUET-1253:
-

gszadovszky commented on a change in pull request #463: PARQUET-1253: Support 
for new logical type representation
URL: https://github.com/apache/parquet-mr/pull/463#discussion_r186777588
 
 

 ##
 File path: 
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
 ##
 @@ -36,42 +36,152 @@
 import org.apache.parquet.format.TimeType;
 import org.apache.parquet.format.TimestampType;
 
+import java.util.List;
 import java.util.Objects;
 
-public interface LogicalTypeAnnotation {
+public abstract class LogicalTypeAnnotation {
+  public enum LogicalTypes {
 
 Review comment:
   This enum is used only for parsing/printing and we don't want the users to 
really use them. So, I would suggest using a name that suggests its use e.g. 
`LogicalTypeParseHelper`?
   Also, it would be nice if we could annotate/comment that this one is not 
part of the public API.
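
To illustrate the suggestion, here is a minimal, self-contained sketch. It is
not the parquet-mr code: AnnotationSketch, ParseHelper, named() and parse()
are hypothetical names, and making the helper enum private (rather than only
commenting it) is an assumption about one way to keep it out of the public
API. The decimal parameter order (precision, then scale) is also assumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-in for the class under review.
abstract class AnnotationSketch {

  public abstract String describe();

  private static AnnotationSketch named(final String text) {
    return new AnnotationSketch() {
      @Override
      public String describe() {
        return text;
      }
    };
  }

  /** Internal parsing helper. Not part of the public API. */
  private enum ParseHelper {
    UTF8 {
      @Override
      AnnotationSketch fromString(List<String> params) {
        return named("string");
      }
    },
    DECIMAL {
      @Override
      AnnotationSketch fromString(List<String> params) {
        if (params.size() != 2) {
          throw new IllegalArgumentException(
              "Expecting 2 parameters for decimal, got " + params.size());
        }
        // Assumed order: precision first, then scale.
        int precision = Integer.parseInt(params.get(0));
        int scale = Integer.parseInt(params.get(1));
        return named("decimal(" + precision + "," + scale + ")");
      }
    };

    abstract AnnotationSketch fromString(List<String> params);
  }

  // Public entry point; callers never reference ParseHelper directly.
  public static AnnotationSketch parse(String typeName, List<String> params) {
    return ParseHelper.valueOf(typeName.toUpperCase(Locale.ROOT)).fromString(params);
  }

  public static void main(String[] args) {
    System.out.println(parse("decimal", Arrays.asList("10", "2")).describe());
  }
}
```

With this shape the helper's name and visibility both signal that it exists
only for parsing, while users go through the single parse() method.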


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support for new logical type representation
> ---
>
> Key: PARQUET-1253
> URL: https://issues.apache.org/jira/browse/PARQUET-1253
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>
> Latest parquet-format 
> [introduced|https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe#diff-0f9d1b5347959e15259da7ba8f4b6252]
>  a new representation for logical types. As of now this is not yet supported 
> in parquet-mr, thus there's no way to use parametrized UTC normalized 
> timestamp data types. When reading and writing Parquet files, besides 
> 'converted_type' parquet-mr should use the new 'logicalType' field in 
> SchemaElement to tell the current logical type annotation. To maintain 
> backward compatibility, the semantic of converted_type shouldn't change.
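
A rough sketch of the compatibility idea described above, under stated
assumptions: Element is a hypothetical stand-in for the Thrift-generated
SchemaElement, type names are plain strings, and the rule "prefer logicalType,
fall back to converted_type" is an assumed reading of the description rather
than the confirmed parquet-mr behavior.

```java
// Hypothetical stand-ins; the real SchemaElement is a Thrift-generated class.
final class LogicalTypeFieldSketch {

  static final class Element {
    final String convertedType; // legacy field, null when unset
    final String logicalType;   // new field, null when unset

    Element(String convertedType, String logicalType) {
      this.convertedType = convertedType;
      this.logicalType = logicalType;
    }
  }

  // Writer side: always set the new field, and keep writing the legacy field
  // whenever an equivalent exists so that older readers see unchanged semantics.
  static Element write(String logicalType, String legacyEquivalentOrNull) {
    return new Element(legacyEquivalentOrNull, logicalType);
  }

  // Reader side: use the new field when present, otherwise the legacy one.
  // Returns null when the element carries no annotation at all.
  static String read(Element element) {
    return element.logicalType != null ? element.logicalType : element.convertedType;
  }

  public static void main(String[] args) {
    Element written = write("TIMESTAMP(MILLIS, utc=true)", "TIMESTAMP_MILLIS");
    System.out.println(read(written));                    // new field wins
    System.out.println(read(new Element("UTF8", null)));  // file from an older writer
  }
}
```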



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Date and time for the next Parquet sync

2018-05-08 Thread Lars Volker
I sent an invite for the proposed time. Please let me know if you would
like to be added to the meeting but haven't received an invite.

Cheers, Lars


On Mon, May 7, 2018 at 9:27 AM, Lars Volker  wrote:

> Hi All,
>
> I'd like to propose to have a Parquet Sync this week on Wednesday, May
> 9th, at 6pm CET / 9 am PST. Last time we met on a Tuesday, so this time
> it should be Wednesday.
>
> Please speak up if that time does not work for you. Otherwise I will send
> out the MR tomorrow morning.
>
> Cheers, Lars
>
>


[jira] [Commented] (PARQUET-1211) Write column indexes: read/write API

2018-05-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467557#comment-16467557
 ] 

ASF GitHub Bot commented on PARQUET-1211:
-

zivanfi commented on a change in pull request #456: PARQUET-1211: Write column 
indexes: read/write API
URL: https://github.com/apache/parquet-mr/pull/456#discussion_r186768014
 
 

 ##
 File path: 
parquet-hadoop/src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java
 ##
 @@ -892,4 +898,60 @@ public void testColumnOrders() throws IOException {
     assertEquals(ColumnOrder.undefined(), columns.get(1).getPrimitiveType().columnOrder());
     assertEquals(ColumnOrder.undefined(), columns.get(2).getPrimitiveType().columnOrder());
   }
+
+  @Test
+  public void testOffsetIndexConversion() {
+    OffsetIndexBuilder builder = OffsetIndexBuilder.getBuilder();
+    builder.add(1000, 10000, 0);
+    builder.add(22000, 12000, 100);
+    OffsetIndex offsetIndex = ParquetMetadataConverter
+        .fromParquetOffsetIndex(ParquetMetadataConverter.toParquetOffsetIndex(builder.build(100000)));
+    assertEquals(2, offsetIndex.getPageCount());
+    assertEquals(101000, offsetIndex.getOffset(0));
+    assertEquals(10000, offsetIndex.getCompressedPageSize(0));
+    assertEquals(0, offsetIndex.getFirstRowIndex(0));
+    assertEquals(122000, offsetIndex.getOffset(1));
+    assertEquals(12000, offsetIndex.getCompressedPageSize(1));
+    assertEquals(100, offsetIndex.getFirstRowIndex(1));
+  }
+
+  @Test
+  public void testColumnIndexConversion() {
+    PrimitiveType type = Types.required(PrimitiveTypeName.INT64).named("test_int64");
+    ColumnIndexBuilder builder = ColumnIndexBuilder.getBuilder(type);
+    Statistics stats = Statistics.createStats(type);
+    stats.incrementNumNulls(16);
+    stats.updateStats(-100l);
+    stats.updateStats(100l);
+    builder.add(stats);
+    stats = Statistics.createStats(type);
+    stats.incrementNumNulls(111);
+    builder.add(stats);
+    stats = Statistics.createStats(type);
+    stats.updateStats(200l);
+    stats.updateStats(500l);
+    builder.add(stats);
+    org.apache.parquet.format.ColumnIndex parquetColumnIndex =
+        ParquetMetadataConverter.toParquetColumnIndex(type, builder.build());
+    ColumnIndex columnIndex = ParquetMetadataConverter.fromParquetColumnIndex(type, parquetColumnIndex);
+    assertEquals(BoundaryOrder.ASCENDING, columnIndex.getBoundaryOrder());
+    assertTrue(Arrays.asList(false, true, false).equals(columnIndex.getNullPages()));
+    assertTrue(Arrays.asList(16l, 111l, 0l).equals(columnIndex.getNullCounts()));
+    assertTrue(Arrays.asList(
+        ByteBuffer.wrap(BytesUtils.longToBytes(-100l)),
+        ByteBuffer.allocate(0),
+        ByteBuffer.wrap(BytesUtils.longToBytes(200l))).equals(columnIndex.getMinValues()));
+    assertTrue(Arrays.asList(
+        ByteBuffer.wrap(BytesUtils.longToBytes(100l)),
+        ByteBuffer.allocate(0),
+        ByteBuffer.wrap(BytesUtils.longToBytes(500l))).equals(columnIndex.getMaxValues()));
+
+    assertNull("Should handle null column index", ParquetMetadataConverter
+        .toParquetColumnIndex(Types.required(PrimitiveTypeName.INT32).named("test_int32"), null));
+    assertNull("Should handle unsupported types", ParquetMetadataConverter
 
 Review comment:
   Should ignore unsupported types.




> Write column indexes: read/write API
> 
>
> Key: PARQUET-1211
> URL: https://issues.apache.org/jira/browse/PARQUET-1211
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>






[jira] [Commented] (PARQUET-1211) Write column indexes: read/write API

2018-05-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467559#comment-16467559
 ] 

ASF GitHub Bot commented on PARQUET-1211:
-

zivanfi commented on a change in pull request #456: PARQUET-1211: Write column 
indexes: read/write API
URL: https://github.com/apache/parquet-mr/pull/456#discussion_r186768614
 
 

 ##
 File path: 
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestColumnChunkPageWriteStore.java
 ##
 @@ -66,6 +81,40 @@
 
 public class TestColumnChunkPageWriteStore {
 
+  // OutputFile implementation to reach out the PositionOutputStream internally used by the writer
 
 Review comment:
   s/reach out/expose/










[jira] [Commented] (PARQUET-1211) Write column indexes: read/write API

2018-05-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467558#comment-16467558
 ] 

ASF GitHub Bot commented on PARQUET-1211:
-

zivanfi commented on a change in pull request #456: PARQUET-1211: Write column 
indexes: read/write API
URL: https://github.com/apache/parquet-mr/pull/456#discussion_r186759031
 
 

 ##
 File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
 ##
 @@ -903,6 +906,38 @@ private DictionaryPage readCompressedDictionary(
 converter.getEncoding(dictHeader.getEncoding()));
   }
 
+  /**
+   * @param column
+   *  the column chunk which the column index is to be returned for
+   * @return the column index for the specified column chunk or {@code null} if the there is no index
 
 Review comment:
   (nit) s/the there/there/ (in this line and in another line below as well)










Re: [VOTE] Release Apache Parquet MR 1.8.3 RC0

2018-05-08 Thread Zoltan Ivanfi
+1 (binding)

built and tested
verified signature

I agree with Uwe that a verification script would be useful.
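
As a rough illustration of what the checksum part of such a verification
script might do (signature checking and the build itself are left out), here
is a small sketch. ChecksumVerifySketch and its methods are hypothetical
names, and the checksum file is assumed to use the "<hex digest>  <file name>"
layout; other layouts would need different parsing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper, not an existing Parquet script.
final class ChecksumVerifySketch {

  // Returns the lowercase hex digest of the data for the given JCA algorithm name.
  static String digestHex(String algorithm, byte[] data) {
    try {
      byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalArgumentException("Unsupported digest algorithm: " + algorithm, e);
    }
  }

  // Compares the first whitespace-separated token of the checksum line
  // (case-insensitively) with the digest computed from the artifact bytes.
  static boolean matches(String algorithm, byte[] artifact, String checksumLine) {
    String expected = checksumLine.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
    return expected.equals(digestHex(algorithm, artifact));
  }

  public static void main(String[] args) {
    byte[] artifact = "example artifact bytes".getBytes(StandardCharsets.UTF_8);
    String checksumLine = digestHex("SHA-512", artifact) + "  example.tar.gz";
    System.out.println(matches("SHA-512", artifact, checksumLine));    // prints "true"
    System.out.println(matches("SHA-512", new byte[0], checksumLine)); // prints "false"
  }
}
```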

Zoltan

On Mon, May 7, 2018 at 5:37 PM Uwe L. Korn  wrote:

> +1 (binding)
>
> * Built and tested on Debian 8
> * verified sha1
> * verified signature
>
> It was quite a hassle to build, since protobuf and thrift had to be
> installed manually. For newer releases, there definitely needs to be a
> verification script; otherwise voting is quite a labor-intensive process.
>
> Uwe
>
> On Mon, May 7, 2018, at 9:58 AM, Gabor Szadovszky wrote:
> > Hi Uwe,
> >
> > I guess this is because you are building it with java8. The 1.8.3 branch
> > is still on 1.6 (source and target) and travis is configured to use
> > jdk7. We also used jdk7 for the build.
> >
> > Cheers,
> > Gabor
> >
> > > On 7 May 2018, at 09:46, Uwe L. Korn  wrote:
> > >
> > > Hello,
> > >
> > > > the build is failing for me with "[ERROR] Failed to execute goal
> > > > org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process
> > > > (default) on project parquet-generator: Error rendering velocity resource.:
> > > > NullPointerException", extended stacktrace:
> > > > https://gist.github.com/xhochy/fd62748ba8c300a5f238a80e8bacfc90
> > >
> > > I can provide more information if you can tell me what you would need.
> > >
> > > Uwe
> > >
> > > On Fri, May 4, 2018, at 2:12 PM, Gabor Szadovszky wrote:
> > >> Hi everyone,
> > >>
> > > >> Zoltan and I propose the following RC to be released as official
> > > >> Apache Parquet MR 1.8.3 release.
> > > >>
> > > >> The commit id is aef7230e114214b7cc962a8f3fc5aeed6ce80828
> > > >> * This corresponds to the tag: apache-parquet-1.8.3
> > > >> * https://github.com/apache/parquet-mr/tree/aef7230e114214b7cc962a8f3fc5aeed6ce80828
> > > >>
> > > >> The release tarball, signature, and checksums are here:
> > > >> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.8.3-rc0/
> > > >>
> > > >> You can find the KEYS file here:
> > > >> * https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > >>
> > > >> Binary artifacts are staged in Nexus here:
> > > >> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > >>
> > > >> This is a maintenance release created mainly for Spark containing 2 bug
> > > >> fixes related to the statistics handling.
> > > >> See
> > > >> https://github.com/apache/parquet-mr/blob/aef7230e114214b7cc962a8f3fc5aeed6ce80828/CHANGES.md
> > > >> for details.
> > >>
> > >> Please download, verify, and test.
> > >>
> > >> [ ] +1 Release this as Apache Parquet MR 1.8.3
> > >> [ ] +0
> > >> [ ] -1 Do not release this because…
> >
>