[jira] [Resolved] (PARQUET-1923) parquet-tools 1.11.0: TestSimpleRecordConverter fails with ExceptionInInitializerError on openjdk 15

2020-10-17 Thread Alexander Bayandin (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Bayandin resolved PARQUET-1923.
-----------------------------------------
Resolution: Fixed

It works for me now! Great!

> parquet-tools 1.11.0: TestSimpleRecordConverter fails with 
> ExceptionInInitializerError on openjdk 15
> ---------------------------------------------------------
>
> Key: PARQUET-1923
> URL: https://issues.apache.org/jira/browse/PARQUET-1923
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.11.0
> Environment: {code}
> $ mvn --version
> Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
> Maven home: /usr/local/Cellar/maven/3.6.3_1/libexec
> Java version: 15, vendor: N/A, runtime: 
> /usr/local/Cellar/openjdk/15/libexec/openjdk.jdk/Contents/Home
> Default locale: en_GB, platform encoding: UTF-8
> OS name: "mac os x", version: "10.15.7", arch: "x86_64", family: "mac"
> $ java -version
> openjdk version "15" 2020-09-15
> OpenJDK Runtime Environment (build 15+36)
> OpenJDK 64-Bit Server VM (build 15+36, mixed mode, sharing)
> $ sw_vers 
> ProductName:  Mac OS X
> ProductVersion:   10.15.7
> BuildVersion: 19H2
> {code}
>  
>Reporter: Alexander Bayandin
>Assignee: Alexander Bayandin
>Priority: Major
>
> {{mvn clean package -Plocal}} for parquet-tools 1.11.1 fails on the test 
> {{testConverter(org.apache.parquet.tools.read.TestSimpleRecordConverter)}}.
> {{mvn clean -Dtest=TestSimpleRecordConverter "-Plocal" test}}:
> {code}
> -------------------------------------------------------------------------------
> Test set: org.apache.parquet.tools.read.TestSimpleRecordConverter
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.338 sec <<< FAILURE!
> testConverter(org.apache.parquet.tools.read.TestSimpleRecordConverter)  Time elapsed: 0.268 sec  <<< ERROR!
> java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2823)
>   at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2818)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2684)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:357)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at org.apache.parquet.hadoop.util.HadoopOutputFile.fromPath(HadoopOutputFile.java:58)
>   at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:227)
>   at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:192)
>   at org.apache.parquet.tools.read.TestSimpleRecordConverter.createTestParquetFile(TestSimpleRecordConverter.java:114)
>   at org.apache.parquet.tools.read.TestSimpleRecordConverter.setUp(TestSimpleRecordConverter.java:90)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
>   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
>   at ...
> {code}

[jira] [Created] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-17 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1927:
------------------------------------

 Summary: ColumnIndex should provide number of records skipped 
 Key: PARQUET-1927
 URL: https://issues.apache.org/jira/browse/PARQUET-1927
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Xinli Shang
 Fix For: 1.12.0


When integrating Parquet ColumnIndex, I found that we need to know from Parquet 
how many records were skipped due to ColumnIndex filtering. When rowCount is 0, 
readNextFilteredRowGroup() just advances to the next row group without telling 
the caller. See the code here: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]

 

In Iceberg, Parquet records are read through an iterator whose hasNext() uses 
the following check:

valuesRead + skippedValues < totalValues

See 
[https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115].

So without knowing the skipped values, it is hard to determine hasNext().

 

Currently, we can work around this with a flag: when readNextFilteredRowGroup() 
returns null, we consider the whole file done, and hasNext() simply returns 
false, as sketched below.
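
A minimal sketch of that flag-based workaround (names are illustrative, not 
Iceberg's actual code; assumes the caller consumes records from one 
PageReadStore at a time):

{code:java}
import java.io.IOException;

import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;

/** Iterator-style wrapper that treats a null readNextFilteredRowGroup() as end-of-file. */
class FilteredRowGroupIterator {
  private final ParquetFileReader reader;
  private PageReadStore current;   // row group currently being consumed
  private long remainingInGroup;   // records left in the current row group
  private boolean exhausted;       // the flag: no more filtered row groups

  FilteredRowGroupIterator(ParquetFileReader reader) {
    this.reader = reader;
  }

  boolean hasNext() throws IOException {
    // Advance until a row group with records is found or the file is drained;
    // no skipped-record count is needed for this approach.
    while (remainingInGroup == 0 && !exhausted) {
      current = reader.readNextFilteredRowGroup();
      if (current == null) {
        exhausted = true;          // whole file done
      } else {
        remainingInGroup = current.getRowCount();
      }
    }
    return !exhausted;
  }

  // A real next() would read one record from 'current' and decrement remainingInGroup.
}
{code}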


--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-10-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215801#comment-17215801
 ] 

ASF GitHub Bot commented on PARQUET-1883:
-----------------------------------------

anantdamle commented on pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821#issuecomment-710758060


   Hi team,
   How should I proceed? We have many old Parquet files still using INT96; this 
patch will at least help in reading those files using ParquetIO readers in 
Apache Beam.
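
   For context on what interpreting INT96 as LocalDateTime involves: an INT96 
timestamp is conventionally 12 bytes, little-endian nanoseconds-of-day in the 
first 8 bytes followed by a little-endian Julian day number in the last 4. A 
hedged sketch of that decoding (illustrative names, not the patch's actual code):

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDate;
import java.time.LocalDateTime;

class Int96Timestamps {
  // 1970-01-01 corresponds to Julian day 2440588
  private static final long JULIAN_EPOCH_OFFSET_DAYS = 2440588L;

  /** Decodes a 12-byte INT96 value into a LocalDateTime. */
  static LocalDateTime toLocalDateTime(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong(); // first 8 bytes: nanos within the day
    int julianDay = buf.getInt();    // last 4 bytes: Julian day number
    return LocalDate.ofEpochDay(julianDay - JULIAN_EPOCH_OFFSET_DAYS)
        .atStartOfDay()
        .plusNanos(nanosOfDay);
  }
}
{code}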



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> int96 support in parquet-avro
> -----------------------------
>
> Key: PARQUET-1883
> URL: https://issues.apache.org/jira/browse/PARQUET-1883
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: satish
>Priority: Major
>
> Hi,
> It looks like 'timestamp' is being converted to the 'int64' primitive type in 
> parquet-avro. This is incompatible with hive2; Hive throws the error below:
> {code:java}
> Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be 
> cast to org.apache.hadoop.hive.serde2.io.TimestampWritable (state=,code=0)
> {code}
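> For reference, this is the mapping in question. A minimal sketch (hypothetical 
> record and field names) of an Avro timestamp field, which parquet-avro stores 
> as an INT64 with a timestamp logical type rather than INT96:
> {code:java}
> import org.apache.avro.LogicalTypes;
> import org.apache.avro.Schema;
> import org.apache.avro.SchemaBuilder;
> 
> class TimestampMapping {
>   static Schema eventSchema() {
>     // 'ts' is an Avro long carrying the timestamp-millis logical type.
>     Schema tsSchema = LogicalTypes.timestampMillis()
>         .addToSchema(Schema.create(Schema.Type.LONG));
>     // parquet-avro maps this field to INT64 (TIMESTAMP_MILLIS), not INT96,
>     // which is what trips up Hive tables declared with a 'timestamp' column.
>     return SchemaBuilder.record("Event").fields()
>         .name("ts").type(tsSchema).noDefault()
>         .endRecord();
>   }
> }
> {code}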
> What does it take to write the timestamp field as 'int96'? 
> Hive seems to write timestamp fields as int96; see the example below:
> {code:java}
> $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://timestamp_test/00_0
> creator: parquet-mr version 1.10.6 (build 
> 098c6199a821edd3d6af56b962fd0f1558af849b)
> file schema: hive_schema
> --------------------------------------------------------------------------------
> ts:  OPTIONAL INT96 R:0 D:1
> 
> row group 1: RC:4 TS:88 OFFSET:4
> --------------------------------------------------------------------------------
> ts:   INT96 UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:4 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
> {code}
> Writing a spark dataframe into parquet format (without using avro) is also 
> using int96.
> {code:java}
> scala> testDS.printSchema()
> root
>  |-- ts: timestamp (nullable = true)
> scala> testDS.write.mode(Overwrite).save("/tmp/x");
> $ parquet-tools meta 
> /tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> file:
> file:/tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1) 
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
>  
> file schema: spark_schema 
> --------------------------------------------------------------------------------
> ts:  OPTIONAL INT96 R:0 D:1
> 
> row group 1: RC:4 TS:93 OFFSET:4 
> --------------------------------------------------------------------------------
> ts:   INT96 GZIP DO:0 FPO:4 SZ:130/93/0.72 VC:4 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED ST:[no stats for this column]
> {code}
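> Spark's behaviour here is configurable, for what it's worth. A hedged sketch 
> (assuming Spark 2.3+, where {{spark.sql.parquet.outputTimestampType}} controls 
> the physical type used for timestamps):
> {code:java}
> import org.apache.spark.sql.SparkSession;
> 
> class Int96WriteConfig {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder().getOrCreate();
>     // Keep writing timestamps as the deprecated INT96 for Hive compatibility;
>     // alternatives are TIMESTAMP_MICROS and TIMESTAMP_MILLIS.
>     spark.conf().set("spark.sql.parquet.outputTimestampType", "INT96");
>   }
> }
> {code}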
> I saw some explanation for deprecating int96 [support 
> here|https://issues.apache.org/jira/browse/PARQUET-1870?focusedCommentId=17127963&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17127963]
>  from [~gszadovszky]. But given that hive and serialization in other parquet 
> modules (non-avro) support int96, I'm trying to understand the reasoning for 
> not implementing it in parquet-avro.
> A bit more context: we are trying to migrate some of our data to the [hudi 
> format|https://hudi.apache.org/]. Hudi adds a lot of efficiency for our use 
> cases. But when we write data using hudi, hudi uses parquet-avro, and 
> timestamps are converted to int64. As mentioned earlier, this breaks 
> compatibility with hive. A lot of columns in our tables have 'timestamp' as 
> the type in hive DDL. It is almost impossible to change the DDL to long, as 
> there are a large number of tables and columns. 
> We are happy to contribute if there is a clear path forward to support int96 
> in parquet-avro. Please also let me know if you are aware of a workaround in 
> hive that can read int64 correctly as timestamp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] anantdamle commented on pull request #821: PARQUET-1883 Interpret deprecated INT96 as LocalDateTime in Avro conversion

2020-10-17 Thread GitBox


anantdamle commented on pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821#issuecomment-710758060


   Hi team,
   How should I proceed? We have many old Parquet files still using INT96; this 
patch will at least help in reading those files using ParquetIO readers in 
Apache Beam.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org