[jira] [Commented] (PARQUET-2170) Empty projection returns the wrong number of rows when column index is enabled

2022-08-04 Thread Ivan Sadikov (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575562#comment-17575562
 ] 

Ivan Sadikov commented on PARQUET-2170:
---

I will update the description later, and I would like to open a PR to fix the 
issue. I think we just need to check whether the column set is empty when 
checking paths in the ColumnIndexFilter, but I will need to confirm this.
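
A minimal sketch of the kind of check I have in mind (hedged: a hypothetical 
patch shape, with method and type names taken from the linked ColumnIndexFilter 
code; the confirmed fix may look different):
{code:java}
import java.util.Set;

import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter;
import org.apache.parquet.internal.filter2.columnindex.ColumnIndexStore;
import org.apache.parquet.internal.filter2.columnindex.RowRanges;

public class EmptyProjectionGuard {
  // Hypothetical guard, not the actual patch: with an empty projection no
  // column path can be in the set, so the per-column check would drop every
  // row range; short-circuit to "all rows" instead.
  static RowRanges filteredRowRanges(FilterCompat.Filter filter,
      ColumnIndexStore columnIndexStore, Set<ColumnPath> paths, long rowCount) {
    if (paths.isEmpty()) {
      return RowRanges.createSingle(rowCount); // assumes createSingle is accessible here
    }
    return ColumnIndexFilter.calculateRowRanges(filter, columnIndexStore, paths, rowCount);
  }
}
{code}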

> Empty projection returns the wrong number of rows when column index is enabled
> --
>
> Key: PARQUET-2170
> URL: https://issues.apache.org/jira/browse/PARQUET-2170
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>    Reporter: Ivan Sadikov
>Priority: Major
>
> Discovered in Spark: when reading an empty projection from a Parquet file 
> with filter pushdown enabled (typically when doing a filter + count), 
> parquet-mr returns the wrong number of rows when the column index is enabled. 
> When the column index feature is disabled, the result is correct.
>  
> This happens due to the following:
>  # ParquetFileReader::getFilteredRowCount() 
> ([https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851]) 
> selects row ranges to calculate the row count when the column index is enabled.
>  # In ColumnIndexFilter 
> ([https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80]) 
> we filter row ranges and pass the set of projected paths, which in this case 
> is empty.
>  # When evaluating the filter, if a column path is not in the set, we return 
> an empty list of rows 
> ([https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178]), 
> which is always the case for an empty projection.
>  # This results in an incorrect number of records reported by the library.
> I will provide the full repro later.
>  
>  





[jira] [Created] (PARQUET-2170) Empty projection returns the wrong number of rows when column index is enabled

2022-08-04 Thread Ivan Sadikov (Jira)
Ivan Sadikov created PARQUET-2170:
-

 Summary: Empty projection returns the wrong number of rows when 
column index is enabled
 Key: PARQUET-2170
 URL: https://issues.apache.org/jira/browse/PARQUET-2170
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Ivan Sadikov


Discovered in Spark: when reading an empty projection from a Parquet file 
with filter pushdown enabled (typically when doing a filter + count), 
parquet-mr returns the wrong number of rows when the column index is enabled. 
When the column index feature is disabled, the result is correct.

 

This happens due to the following:
 # ParquetFileReader::getFilteredRowCount() 
([https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L851]) 
selects row ranges to calculate the row count when the column index is enabled.
 # In ColumnIndexFilter 
([https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80]) 
we filter row ranges and pass the set of projected paths, which in this case 
is empty.
 # When evaluating the filter, if a column path is not in the set, we return 
an empty list of rows 
([https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L178]), 
which is always the case for an empty projection.
 # This results in an incorrect number of records reported by the library.

I will provide the full repro later.
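
Until the repro is posted, the failing call path can be sketched as follows 
(hedged: the reader calls are public parquet-mr APIs, but the filter and input 
file are placeholders):
{code:java}
import java.io.IOException;

import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.io.InputFile;

public class FilteredCountSketch {
  // Hedged sketch of a filter + count over an empty projection. With the
  // column index enabled, getFilteredRowCount() goes through
  // ColumnIndexFilter, which receives an empty path set, drops every row
  // range, and so reports 0 instead of the number of matching rows.
  static long filteredCount(InputFile file, FilterCompat.Filter filter) throws IOException {
    ParquetReadOptions options = ParquetReadOptions.builder()
        .withRecordFilter(filter)
        .useColumnIndexFilter(true)
        .build();
    try (ParquetFileReader reader = ParquetFileReader.open(file, options)) {
      return reader.getFilteredRowCount();
    }
  }
}
{code}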

 

 





[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-09 Thread Ivan Sadikov (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903899#comment-16903899
 ] 

Ivan Sadikov commented on PARQUET-1632:
---

Actually, setting the page.size configuration to smaller values results in 
PARQUET-1633, where one can't read the file one has just written.

Yes, I see what you mean: arrayOutput overflowed, which resulted in the size 
being negative; it passed the check on the maximum page size in the writePage 
method and instead failed when running buf.collect.
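
For reference, the narrowing itself is easy to reproduce in isolation (the 
numbers are illustrative; the size in the failing job differs slightly because 
of page headers):
{code:java}
public class OverflowDemo {
  public static void main(String[] args) {
    // Roughly 40 values of 64 MiB each buffered in one column chunk.
    long size = 40L * 64 * 1024 * 1024;  // 2_684_354_560, above Integer.MAX_VALUE
    int narrowed = (int) size;           // wraps around to -1_610_612_736
    System.out.println(narrowed);
    // ByteArrayOutputStream rejects the negative capacity; this is the
    // "Negative initial size" failure surfacing from buf.collect.
    new java.io.ByteArrayOutputStream(narrowed);
  }
}
{code}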

> Negative initial size when writing large values in parquet-mr
> -
>
> Key: PARQUET-1632
> URL: https://issues.apache.org/jira/browse/PARQUET-1632
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Assignee: Junjie Chen
>Priority: Major
>
> I encountered an issue when writing large string values to Parquet.
> Here is the code to reproduce the issue:
> {code:java}
> import org.apache.spark.sql.functions._
> def longString: String = "a" * (64 * 1024 * 1024)
> val long_string = udf(() => longString)
> val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())
> spark.conf.set("parquet.enable.dictionary", "false")
> df.write.option("compression", 
> "uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
>  
> This Spark job fails with the exception:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): 
> org.apache.spark.SparkException: Task failed while writing rows. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) 
> Caused by: java.lang.IllegalArgumentException: Negative initial size: 
> -1610612543 at 
> java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232) at 
> org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202) at 
> org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
>  at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:126)
>  at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
>  at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235) 
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
>  at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(Fi

[jira] [Comment Edited] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-08 Thread Ivan Sadikov (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903092#comment-16903092
 ] 

Ivan Sadikov edited comment on PARQUET-1632 at 8/8/19 3:57 PM:
---

Left a comment already. It looks like there is an issue of treating BytesInput 
as a byte array when it should be treated as a stream of bytes. Also, the 
bytes collector should gracefully handle adding large BytesInput instances.

The recommended configuration setting is the current mitigation, not a solution 
to the problem. Plus, the config is about tweaking the page size, not the 
column chunk size (see the comment above).
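
As a concrete illustration of the stream-versus-array point (writeAllTo and 
toByteArray are existing BytesInput methods; the helper itself is just a 
sketch):
{code:java}
import java.io.IOException;
import java.io.OutputStream;

import org.apache.parquet.bytes.BytesInput;

public class StreamNotArray {
  // Streams the content without materializing a single byte[], so there is
  // no Integer.MAX_VALUE limit on the total size.
  static void writeLarge(BytesInput bytes, OutputStream out) throws IOException {
    bytes.writeAllTo(out);
    // By contrast, bytes.toByteArray() allocates (int) bytes.size() in one
    // go; size() returns a long, and that narrowing cast is what overflows.
  }
}
{code}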


was (Author: sadikovi):
Left a comment already. It looks like there is an issue of treating BytesInput 
as a byte array when it should be treated as a stream of bytes. Also, the 
bytes collector should gracefully handle adding large BytesInput instances.

The recommended configuration setting is the current mitigation, not a solution 
to the problem. Plus, it is about the page size, not the column chunk size (see 
the comment above).

> Negative initial size when writing large values in parquet-mr
> -
>
> Key: PARQUET-1632
> URL: https://issues.apache.org/jira/browse/PARQUET-1632
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Assignee: Junjie Chen
>Priority: Major
>
> I encountered an issue when writing large string values to Parquet.
> Here is the code to reproduce the issue:
> {code:java}
> import org.apache.spark.sql.functions._
> def longString: String = "a" * (64 * 1024 * 1024)
> val long_string = udf(() => longString)
> val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())
> spark.conf.set("parquet.enable.dictionary", "false")
> df.write.option("compression", 
> "uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
>  
> This Spark job fails with the exception:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): 
> org.apache.spark.SparkException: Task failed while writing rows. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) 
> Caused by: java.lang.IllegalArgumentException: Negative initial size: 
> -1610612543 at 
> java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232) at 
> org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202) at 
> org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
>  at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:126)
>  at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
>  at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235) 
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
>  at 
> org.apache.parquet.hadoop.ParquetRec

[jira] [Reopened] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-08 Thread Ivan Sadikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov reopened PARQUET-1632:
---

Left a comment already. It looks like there is an issue of treating BytesInput 
as a byte array when it should be treated as a stream of bytes. Also, the 
bytes collector should gracefully handle adding large BytesInput instances.

The recommended configuration setting is the current mitigation, not a solution 
to the problem. Plus, it is about the page size, not the column chunk size (see 
the comment above).

> Negative initial size when writing large values in parquet-mr
> -
>
> Key: PARQUET-1632
> URL: https://issues.apache.org/jira/browse/PARQUET-1632
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Assignee: Junjie Chen
>Priority: Major
>
> I encountered an issue when writing large string values to Parquet.
> Here is the code to reproduce the issue:
> {code:java}
> import org.apache.spark.sql.functions._
> def longString: String = "a" * (64 * 1024 * 1024)
> val long_string = udf(() => longString)
> val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())
> spark.conf.set("parquet.enable.dictionary", "false")
> df.write.option("compression", 
> "uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
>  
> This Spark job fails with the exception:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): 
> org.apache.spark.SparkException: Task failed while writing rows. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) 
> Caused by: java.lang.IllegalArgumentException: Negative initial size: 
> -1610612543 at 
> java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232) at 
> org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202) at 
> org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
>  at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:126)
>  at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
>  at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235) 
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
>  at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(Fi

[jira] [Comment Edited] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-08 Thread Ivan Sadikov (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903082#comment-16903082
 ] 

Ivan Sadikov edited comment on PARQUET-1632 at 8/8/19 3:43 PM:
---

I don't think your assessment is correct. Yes, it overflows when we cast size() 
to an integer, even though the size can be a long.

It looks like the problem is in ConcatenatingByteArrayCollector and the 
toByteArray method. The ConcatenatingByteArrayCollector class should handle an 
arbitrary BytesInput by checking its size first and splitting the data across 
slabs (in this particular case, even when the total size is larger than the i32 
max value); I'd say that is the whole point of a concatenating byte array 
collector in the first place.

Note that the BytesInput instance itself did not fail with an overflow. It only 
failed when turning it into a byte array. If what you mentioned in the comment 
were true, I would expect a failure when creating the BytesInput instance.

The collector should fill up the last slab and allocate new ones as it requests 
more and more data from the BytesInput, similar to an output stream. BytesInput 
is treated like a stream based on the comments in the code, so this is a bug, 
IMHO.

Regarding your comment about the page size check: it is merely a mitigation 
strategy that gets past the error, not a fix. I already included it in 
PARQUET-1633.

My understanding is that parquet.page.size.row.check.min affects the page size 
itself. If the page size were larger than 2GB, I would expect the code to fail 
here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L116]

Or here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L123]

Those conditions explicitly check the page size, but the code did not fail 
there.
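
To make the slab-filling idea concrete, here is a hedged sketch of a 
size-aware collect (not the actual ConcatenatingByteArrayCollector code; the 
slab size and names are made up):
{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class SlabCollector {
  static final int SLAB_SIZE = 64 * 1024 * 1024; // 64 MiB, an arbitrary bound

  // Copies an arbitrarily large input into bounded slabs. The narrowing cast
  // is safe because each slab is capped at SLAB_SIZE, so the total size can
  // exceed Integer.MAX_VALUE without any single allocation doing so.
  static List<byte[]> collect(InputStream in, long totalSize) throws IOException {
    List<byte[]> slabs = new ArrayList<>();
    long remaining = totalSize;
    while (remaining > 0) {
      int slabSize = (int) Math.min(remaining, SLAB_SIZE);
      byte[] slab = new byte[slabSize];
      int off = 0;
      while (off < slabSize) {
        int n = in.read(slab, off, slabSize - off);
        if (n < 0) throw new EOFException("stream ended early");
        off += n;
      }
      slabs.add(slab);
      remaining -= slabSize;
    }
    return slabs;
  }
}
{code}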


was (Author: sadikovi):
I don't think your assessment is correct. Yes, it overflows when we cast size() 
to an integer, even though the size can be a long.

It looks like the problem is in ConcatenatingByteArrayCollector and the 
toByteArray method. The ConcatenatingByteArrayCollector class should handle an 
arbitrary BytesInput by checking its size first and splitting the data across 
slabs (in this particular case, even when the total size is larger than the i32 
max value); I'd say that is the whole point of a concatenating byte array 
collector in the first place.

Note that the BytesInput instance itself did not fail with an overflow. It only 
failed when turning it into a byte array. If what you mentioned in the comment 
were true, I would expect a failure when creating the BytesInput instance.

The collector should fill up the last slab and allocate new ones as it requests 
more and more data from the BytesInput, similar to an output stream.

Regarding your comment about the page size check: it is merely a mitigation 
strategy that gets past the error, not a fix. I already included it in 
PARQUET-1633.

My understanding is that parquet.page.size.row.check.min affects the page size 
itself. If the page size were larger than 2GB, I would expect the code to fail 
here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L116]

Or here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L123]

Those conditions explicitly check the page size, but the code did not fail 
there.

> Negative initial size when writing large values in parquet-mr
> -
>
> Key: PARQUET-1632
> URL: https://issues.apache.org/jira/browse/PARQUET-1632
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Assignee: Junjie Chen
>Priority: Major
>
> I encountered an issue when writing large string values to Parquet.
> Here is the code to reproduce the issue:
> {code:java}
> import org.apache.spark.sql.functions._
> def longString: String = "a" * (64 * 1024 * 1024)
> val long_string = udf(() => longString)
> val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())
> spark.conf.set("parquet.enable.dictionary", "false")
> df.write.option("compression", 
> "uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
>  
> This Spark job fails with the exception:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 10861.0 failed 4 time

[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-08 Thread Ivan Sadikov (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903082#comment-16903082
 ] 

Ivan Sadikov commented on PARQUET-1632:
---

I don't think your assessment is correct. Yes, it overflows when we cast size() 
to an integer, even though the size can be a long.

It looks like the problem is in ConcatenatingByteArrayCollector and the 
toByteArray method. The ConcatenatingByteArrayCollector class should handle an 
arbitrary BytesInput by checking its size first and splitting the data across 
slabs (in this particular case, even when the total size is larger than the i32 
max value); I'd say that is the whole point of a concatenating byte array 
collector in the first place.

Note that the BytesInput instance itself did not fail with an overflow. It only 
failed when turning it into a byte array. If what you mentioned in the comment 
were true, I would expect a failure when creating the BytesInput instance.

The collector should fill up the last slab and allocate new ones as it requests 
more and more data from the BytesInput, similar to an output stream.

Regarding your comment about the page size check: it is merely a mitigation 
strategy that gets past the error, not a fix. I already included it in 
PARQUET-1633.

My understanding is that parquet.page.size.row.check.min affects the page size 
itself. If the page size were larger than 2GB, I would expect the code to fail 
here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L116]

Or here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L123]

Those conditions explicitly check the page size, but the code did not fail 
there.

> Negative initial size when writing large values in parquet-mr
> -
>
> Key: PARQUET-1632
> URL: https://issues.apache.org/jira/browse/PARQUET-1632
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Assignee: Junjie Chen
>Priority: Major
>
> I encountered an issue when writing large string values to Parquet.
> Here is the code to reproduce the issue:
> {code:java}
> import org.apache.spark.sql.functions._
> def longString: String = "a" * (64 * 1024 * 1024)
> val long_string = udf(() => longString)
> val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())
> spark.conf.set("parquet.enable.dictionary", "false")
> df.write.option("compression", 
> "uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
>  
> This Spark job fails with the exception:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): 
> org.apache.spark.SparkException: Task failed while writing rows. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) 
> Caused by: java.lang.IllegalArgumentException: Negative initial size: 
> -1610612543 at 
> java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232) at 
> org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202) at 
> org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
>  at 
> org.apache.parquet.hadoop

[jira] [Updated] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-08 Thread Ivan Sadikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov updated PARQUET-1632:
--
Description: 
I encountered an issue when writing large string values to Parquet.

Here is the code to reproduce the issue:
{code:java}
import org.apache.spark.sql.functions._

def longString: String = "a" * (64 * 1024 * 1024)
val long_string = udf(() => longString)

val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())

spark.conf.set("parquet.enable.dictionary", "false")
df.write.option("compression", 
"uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
 

This Spark job fails with the exception:
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): 
org.apache.spark.SparkException: Task failed while writing rows. at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
org.apache.spark.scheduler.Task.run(Task.scala:112) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 

Caused by: java.lang.IllegalArgumentException: Negative initial size: 
-1610612543 at 
java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74) at 
org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234) at 
org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232) at 
org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202) at 
org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
 at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:126)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235) at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
 at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
 at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
 at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
 ... 11 more{code}
 

I would appreciate it if you could help address the problem. Thanks!

  was:
I encountered an issue when writing large string values to Parquet.

Here is the code to reproduce the issue:
{code:java}
import org.apache.spark.sql.functions._

def longString: String = "a" * (64 * 1024 * 1024)
val long_string = udf(() => longString)

val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())

spark.conf.set("parquet.enable.dictionary", "false")
df.write.option("compression", 
"uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
 

This Spark job fails with the exception:
{code:java}
Caused by:

[jira] [Commented] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-07 Thread Ivan Sadikov (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902098#comment-16902098
 ] 

Ivan Sadikov commented on PARQUET-1632:
---

Thank you.

> Negative initial size when writing large values in parquet-mr
> -
>
> Key: PARQUET-1632
> URL: https://issues.apache.org/jira/browse/PARQUET-1632
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Assignee: Junjie Chen
>Priority: Major
>
> I encountered an issue when writing large string values to Parquet.
> Here is the code to reproduce the issue:
> {code:java}
> import org.apache.spark.sql.functions._
> def longString: String = "a" * (64 * 1024 * 1024)
> val long_string = udf(() => longString)
> val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())
> spark.conf.set("parquet.enable.dictionary", "false")
> df.write.option("compression", 
> "uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
>  
> This Spark job fails with the exception:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): 
> org.apache.spark.SparkException: Task failed while writing rows. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> java.lang.IllegalArgumentException: Negative initial size: -1610612543 at 
> java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234) at 
> org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232) at 
> org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202) at 
> org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
>  at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:126)
>  at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
>  at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235) 
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
>  at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
>  ... 11 more{code}
>  
> I would appreciate it if you could help address the problem. Thanks!





[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList

2019-08-07 Thread Ivan Sadikov (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902096#comment-16902096
 ] 

Ivan Sadikov commented on PARQUET-1633:
---

I would say it is a bug. There are a few instances of overflow I found in 
parquet-mr, another example being 
https://issues.apache.org/jira/browse/PARQUET-1632.

I also have a simple repro with Spark; by the way, the problem still persists 
even if you make the records shorter:
{code:java}
import org.apache.spark.sql.functions._

val large_str = udf(() => "a" * (128 * 1024 * 1024))

val df = spark.range(0, 20, 1, 1).withColumn("large_str", large_str())

spark.conf.set("parquet.enable.dictionary", "false")
spark.conf.set("parquet.page.size.row.check.min", "1") // this is done so I 
don't hit PARQUET-1632
df.write.option("compression", 
"uncompressed").mode("overwrite").parquet("/mnt/large.parquet")

spark.read.parquet("/mnt/large.parquet").foreach(_ => Unit) // Fails{code}
Here is the stacktrace:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in 
stage 22.0 failed 4 times, most recent failure: Lost task 10.3 in stage 22.0 
(TID 121, 10.0.217.97, executor 1): java.lang.IllegalArgumentException: Illegal 
Capacity: -191
at java.util.ArrayList.<init>(ArrayList.java:157) at 
org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169)
 at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:811){code}
 

Well, there is no restriction on row group size in the Parquet format. Also, 
it is not a significant effort to patch this issue: making the ChunkDescriptor 
size field, as well as the ConsecutiveChunkList length, a long should fix this 
problem (the size already has a long type in the metadata).
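
A hedged sketch of what the long-typed allocation could look like (the names 
mirror the {{readAll}} snippet quoted below; this is not the actual patch):
{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class LongLengthAllocation {
  // Keep the chunk length as a long and narrow to int only per bounded
  // allocation, so a chunk list longer than Integer.MAX_VALUE bytes still
  // splits cleanly into maxAllocationSize-sized buffers.
  static List<ByteBuffer> allocate(long length, int maxAllocationSize) {
    long fullAllocations = length / maxAllocationSize;
    int lastAllocationSize = (int) (length % maxAllocationSize); // remainder < maxAllocationSize
    List<ByteBuffer> buffers = new ArrayList<>();
    for (long i = 0; i < fullAllocations; i++) {
      buffers.add(ByteBuffer.allocate(maxAllocationSize));
    }
    if (lastAllocationSize > 0) {
      buffers.add(ByteBuffer.allocate(lastAllocationSize));
    }
    return buffers;
  }
}
{code}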

> Integer overflow in ParquetFileReader.ConsecutiveChunkList
> --
>
> Key: PARQUET-1633
> URL: https://issues.apache.org/jira/browse/PARQUET-1633
>     Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Priority: Major
>
> When reading a large Parquet file (2.8GB), I encounter the following 
> exception:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
> dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
> ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
> at java.util.ArrayList.<init>(ArrayList.java:157)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169){code}
>  
> The file metadata is:
>  * block 1 (3 columns)
>  ** rowCount: 110,100
>  ** totalByteSize: 348,492,072
>  ** compressedSize: 165,689,649
>  * block 2 (3 columns)
>  ** rowCount: 90,054
>  ** totalByteSize: 3,243,165,541
>  ** compressedSize: 2,509,579,966
>  * block 3 (3 columns)
>  ** rowCount: 105,119
>  ** totalByteSize: 350,901,693
>  ** compressedSize: 144,952,177
>  * block 4 (3 columns)
>  ** rowCount: 48,741
>  ** totalByteSize: 1,275,995
>  ** compressedSize: 914,205
> I don't have the code to reproduce the issue, unfortunately; however, I 
> looked at the code and it seems that the integer {{length}} field in 
> ConsecutiveChunkList overflows, which results in a negative capacity for the 
> array list in the {{readAll}} method:
> {code:java}
> int fullAllocations = length / options.getMaxAllocationSize();
> int lastAllocationSize = length % options.getMaxAllocationSize();
>   
> int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
> List<ByteBuffer> buffers = new ArrayList<>(numAllocations);{code}
>  
> This is caused by the cast to integer in the {{readNextRowGroup}} method in 
> ParquetFileReader:
> {code:java}
> currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, 
> (int)mc.getTotalSize()));
> {code}
> which overflows when the total size of the column is larger than 
> Integer.MAX_VALUE.
> I would appreciate it if you could help address the issue. Thanks!
>  





[jira] [Updated] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-06 Thread Ivan Sadikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov updated PARQUET-1632:
--
Description: 
I encountered an issue when writing large string values to Parquet.

Here is the code to reproduce the issue:
{code:java}
import org.apache.spark.sql.functions._

def longString: String = "a" * (64 * 1024 * 1024)
val long_string = udf(() => longString)

val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())

spark.conf.set("parquet.enable.dictionary", "false")
df.write.option("compression", 
"uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
 

This Spark job fails with the exception:
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): 
org.apache.spark.SparkException: Task failed while writing rows. at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
org.apache.spark.scheduler.Task.run(Task.scala:112) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) Caused by: 
java.lang.IllegalArgumentException: Negative initial size: -1610612543 at 
java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74) at 
org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234) at 
org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232) at 
org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202) at 
org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
 at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:126)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235) at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
 at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
 at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
 at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
 ... 11 more{code}
 

I would appreciate it if you could help address the problem. Thanks!

  was:
I encountered an issue when writing large string values to Parquet.

Here is the code to reproduce the issue:
{code:java}
import org.apache.spark.sql.functions._

def longString: String = "a" * (64 * 1024 * 1024)
val long_string = udf(() => longString)

val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())

spark.conf.set("parquet.enable.dictionary", "false")
df.write.option("compression", 
"uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
 

This Spark job fails with the exception:
{code:java}
Caused by:

[jira] [Created] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList

2019-08-06 Thread Ivan Sadikov (JIRA)
Ivan Sadikov created PARQUET-1633:
-

 Summary: Integer overflow in ParquetFileReader.ConsecutiveChunkList
 Key: PARQUET-1633
 URL: https://issues.apache.org/jira/browse/PARQUET-1633
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.1
Reporter: Ivan Sadikov


When reading a large Parquet file (2.8GB), I encounter the following exception:
{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
at 0 in block -1 in file 
dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
... 14 more
Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
at java.util.ArrayList.<init>(ArrayList.java:157)
at 
org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169){code}
 

The file metadata is:
 * block 1 (3 columns)
 ** rowCount: 110,100
 ** totalByteSize: 348,492,072
 ** compressedSize: 165,689,649
 * block 2 (3 columns)
 ** rowCount: 90,054
 ** totalByteSize: 3,243,165,541
 ** compressedSize: 2,509,579,966
 * block 3 (3 columns)
 ** rowCount: 105,119
 ** totalByteSize: 350,901,693
 ** compressedSize: 144,952,177
 * block 4 (3 columns)
 ** rowCount: 48,741
 ** totalByteSize: 1,275,995
 ** compressedSize: 914,205

I don't have the code to reproduce the issue, unfortunately; however, I looked 
at the code and it seems that the integer {{length}} field in ConsecutiveChunkList 
overflows, which results in a negative capacity for the array list in the 
{{readAll}} method:
{code:java}
int fullAllocations = length / options.getMaxAllocationSize();
int lastAllocationSize = length % options.getMaxAllocationSize();

int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
List<ByteBuffer> buffers = new ArrayList<>(numAllocations);{code}
 

This is caused by the cast to integer in the {{readNextRowGroup}} method in 
ParquetFileReader:
{code:java}
currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, 
(int)mc.getTotalSize()));
{code}
which overflows when the total size of the column is larger than Integer.MAX_VALUE.
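
For intuition, the reported capacity is even consistent with this narrowing, 
assuming the ~2.5 GB of block 2 above flows through a single length and 
assuming a default max allocation size of 8 MiB (both are assumptions):
{code:java}
public class IllegalCapacityDemo {
  public static void main(String[] args) {
    long totalSize = 2_509_579_966L;          // compressedSize of block 2 above
    int length = (int) totalSize;             // narrows to -1_785_387_330
    int maxAllocationSize = 8 * 1024 * 1024;  // assumed 8 MiB default
    int fullAllocations = length / maxAllocationSize;    // -212
    int lastAllocationSize = length % maxAllocationSize; // negative, so no extra allocation
    int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
    System.out.println(numAllocations);       // -212, matching "Illegal Capacity: -212"
    new java.util.ArrayList<Object>(numAllocations); // throws IllegalArgumentException
  }
}
{code}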

I would appreciate it if you could help address the issue. Thanks!

 





[jira] [Created] (PARQUET-1632) Negative initial size when writing large values in parquet-mr

2019-08-06 Thread Ivan Sadikov (JIRA)
Ivan Sadikov created PARQUET-1632:
-

 Summary: Negative initial size when writing large values in 
parquet-mr
 Key: PARQUET-1632
 URL: https://issues.apache.org/jira/browse/PARQUET-1632
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.1
Reporter: Ivan Sadikov


I encountered an issue when writing large string values to Parquet.

Here is the code to reproduce the issue:
{code:java}
import org.apache.spark.sql.functions._

def longString: String = "a" * (64 * 1024 * 1024)
val long_string = udf(() => longString)

val df = spark.range(0, 40, 1, 1).withColumn("large_str", long_string())

spark.conf.set("parquet.enable.dictionary", "false")
df.write.option("compression", 
"uncompressed").mode("overwrite").parquet("/tmp/large.parquet"){code}
 

This Spark job fails with the exception:
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 10861.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 10861.0 (TID 671168, 10.0.180.13, executor 14656): 
org.apache.spark.SparkException: Task failed while writing rows. at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
org.apache.spark.scheduler.Task.run(Task.scala:112) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) Caused by: 
java.lang.IllegalArgumentException: Negative initial size: -1610612543 at 
java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74) at 
org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:234) at 
org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:232) at 
org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:202) at 
org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33)
 at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:126)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
 at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235) at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
 at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
 at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
 at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
 at 
org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
 ... 11 more{code}
 





[jira] [Commented] (PARQUET-458) [C++] Implement support for DataPageV2

2019-07-02 Thread Ivan Sadikov (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876738#comment-16876738
 ] 

Ivan Sadikov commented on PARQUET-458:
--

Hello,

If it helps, the Rust Parquet library in Apache Arrow already implements data 
page V2 for both writes and reads, so you can use it as a reference when 
working on the C++ code and check correctness in either Rust or C++.

Here are some links for column reader/writer:

https://github.com/apache/arrow/blob/master/rust/parquet/src/column/reader.rs#L342
https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L544

We also added a bunch of data page v2 encodings in Rust, and quite a few 
correctness tests to check that encoding/decoding works.

Hopefully, it helps, [~wesmckinn].

> [C++] Implement support for DataPageV2
> --
>
> Key: PARQUET-458
> URL: https://issues.apache.org/jira/browse/PARQUET-458
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: cpp-1.6.0
>
>






Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-06-03 Thread Ivan Sadikov
Hi Wes,

I think it’s the file that I added for parquet-rs to test data page v2 back
then - it is not used anywhere else.


Cheers,

Ivan

On Mon, 3 Jun 2019 at 10:15 PM, Wes McKinney  wrote:

> I took a quick look at this -- DataPageV2 has a slightly different
> structure from DataPageV1, as indicated here
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L555
>
> In DataPageV1, the encoded repetition/definition levels are compressed
> together with the values in the data page. In DataPageV2, only the
> values are compressed. I'll see if I can fashion a fix sufficient to
> read the test data file, but more extensive testing will be required
> to extend the other unit tests to test both reading and writing both
> types of data pages.
>
> On Tue, Apr 30, 2019 at 8:56 AM Curt Hagenlocher 
> wrote:
> >
> > Thanks! Either the documentation is a bit sparse for that level of
> detail,
> > or I haven't been looking in the right place. The factoring of the Java
> > implementation makes it hard for me to see what's going on there, but the
> > Rust implementation is straightforward enough despite my utter lack of
> > familiarity with the language.
> >
> > On Mon, Apr 29, 2019 at 10:41 AM Ivan Sadikov 
> > wrote:
> >
> > > Not in V2, in V1 the whole page is encoded, but in V2 it is only
> values, if
> > > I remember correctly. So we would have to extract repetition and
> definition
> > > levels bytes and then decode values.
> > >
> > > You can check out code in parquet rust module!
> > >
> > > I am not sure about parquet-cpp, we can use that implementation as
> > > reference there.
> > >
> > >
> > > On Mon, 29 Apr 2019 at 5:39 PM, Curt Hagenlocher  >
> > > wrote:
> > >
> > > > Would that be covered by PARQUET-458 (
> > > > https://issues.apache.org/jira/browse/PARQUET-458)?
> > > >
> > > > On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney 
> > > wrote:
> > > >
> > > > > Is there a JIRA issue about data page v2 issues in parquet-cpp?
> > > > >
> > > > > On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher <
> c...@hagenlocher.org
> > > >
> > > > > wrote:
> > > > > >
> > > > > > But the data page is decoded only after it is decompressed, so I
> > > > > wouldn’t expect an unsupported data page to cause a decompression
> > > > failure.
> > > > > >
> > > > > > (I am playing with adding V2 support to Parquet.Net.)
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov <
> ivan.sadi...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > If you are referring to the file in Apache/parquet-testing
> > > > repository,
> > > > > it
> > > > > > > is a valid Parquet file with data encoded into data page v2.
> > > > > > >
> > > > > > > You can easily test it with “cargo install parquet” and
> > > “parquet-read
> > > > > > > filepath”.
> > > > > > >
> > > > > > > I am not sure what kind of code you have written, but the
> error you
> > > > > have
> > > > > > > encountered could be related to the fact that parquet-cpp does
> not
> > > > > support
> > > > > > > decoding of data page v2.
> > > > > > >
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Ivan
> > > > > > >
> > > > > > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher <
> > > > c...@hagenlocher.org
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> To the best of my ability to tell, there is invalid Snappy
> data in
> > > > > the file
> > > > > > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither
> > > read
> > > > > it with
> > > > > > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > > > > > >>
> > > > > > >> Thanks!
> > > > > > >> -Curt
> > > > > > >>
> > > > >
> > > >
> > >
>


Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Ivan Sadikov
Yeah, you are right. Looks like the right JIRA ticket.

On Mon, 29 Apr 2019 at 5:39 PM, Curt Hagenlocher 
wrote:

> Would that be covered by PARQUET-458 (
> https://issues.apache.org/jira/browse/PARQUET-458)?
>
> On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney  wrote:
>
> > Is there a JIRA issue about data page v2 issues in parquet-cpp?
> >
> > On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher 
> > wrote:
> > >
> > > But the data page is decoded only after it is decompressed, so I
> > wouldn’t expect an unsupported data page to cause a decompression
> failure.
> > >
> > > (I am playing with adding V2 support to Parquet.Net.)
> > >
> > > Sent from my iPhone
> > >
> > > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov 
> > wrote:
> > > >
> > > > If you are referring to the file in Apache/parquet-testing
> repository,
> > it
> > > > is a valid Parquet file with data encoded into data page v2.
> > > >
> > > > You can easily test it with “cargo install parquet” and “parquet-read
> > > > filepath”.
> > > >
> > > > I am not sure what kind of code you have written, but the error you
> > have
> > > > encountered could be related to the fact that parquet-cpp does not
> > support
> > > > decoding of data page v2.
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Ivan
> > > >
> > > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher <
> c...@hagenlocher.org
> > >
> > > > wrote:
> > > >
> > > >> To the best of my ability to tell, there is invalid Snappy data in
> > the file
> > > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read
> > it with
> > > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > > >>
> > > >> Thanks!
> > > >> -Curt
> > > >>
> >
>


Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Ivan Sadikov
Not in V2; in V1 the whole page is compressed, but in V2 only the values
are, if I remember correctly. So we would have to extract the repetition and
definition level bytes and then decode the values.

You can check out the code in the parquet Rust module!

I am not sure about parquet-cpp; we can use that implementation as a
reference there.
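
(To make the layout concrete, a rough sketch follows; the field names come
from the Thrift DataPageHeaderV2 definition, while split_data_page_v2 and
decompress here are hypothetical, not the parquet-cpp or parquet-rs API.)

```
// A v2 page stores rep levels, then def levels, then values; only the
// values section is (optionally) compressed.
struct DataPageHeaderV2 {
    repetition_levels_byte_length: i32,
    definition_levels_byte_length: i32,
    is_compressed: bool,
}

fn split_data_page_v2(h: &DataPageHeaderV2, page: &[u8]) -> (Vec<u8>, Vec<u8>, Vec<u8>) {
    let rep_len = h.repetition_levels_byte_length as usize;
    let def_len = h.definition_levels_byte_length as usize;

    // Levels come first and are stored uncompressed in a v2 page.
    let rep_levels = page[..rep_len].to_vec();
    let def_levels = page[rep_len..rep_len + def_len].to_vec();

    // Only the trailing values section goes through the codec.
    let values = &page[rep_len + def_len..];
    let values = if h.is_compressed { decompress(values) } else { values.to_vec() };

    (rep_levels, def_levels, values)
}

fn decompress(buf: &[u8]) -> Vec<u8> {
    buf.to_vec() // placeholder for the configured codec (e.g. Snappy)
}
```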


On Mon, 29 Apr 2019 at 5:39 PM, Curt Hagenlocher 
wrote:

> Would that be covered by PARQUET-458 (
> https://issues.apache.org/jira/browse/PARQUET-458)?
>
> On Mon, Apr 29, 2019 at 8:18 AM Wes McKinney  wrote:
>
> > Is there a JIRA issue about data page v2 issues in parquet-cpp?
> >
> > On Mon, Apr 29, 2019 at 9:57 AM Curt Hagenlocher 
> > wrote:
> > >
> > > But the data page is decoded only after it is decompressed, so I
> > wouldn’t expect an unsupported data page to cause a decompression
> failure.
> > >
> > > (I am playing with adding V2 support to Parquet.Net.)
> > >
> > > Sent from my iPhone
> > >
> > > > On Apr 29, 2019, at 7:30 AM, Ivan Sadikov 
> > wrote:
> > > >
> > > > If you are referring to the file in Apache/parquet-testing
> repository,
> > it
> > > > is a valid Parquet file with data encoded into data page v2.
> > > >
> > > > You can easily test it with “cargo install parquet” and “parquet-read
> > > > filepath”.
> > > >
> > > > I am not sure what kind of code you have written, but the error you
> > have
> > > > encountered could be related to the fact that parquet-cpp does not
> > support
> > > > decoding of data page v2.
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Ivan
> > > >
> > > > On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher <
> c...@hagenlocher.org
> > >
> > > > wrote:
> > > >
> > > >> To the best of my ability to tell, there is invalid Snappy data in
> > the file
> > > >> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read
> > it with
> > > >> my own code nor with pyarrow 0.13.0. Is this expected to work?
> > > >>
> > > >> Thanks!
> > > >> -Curt
> > > >>
> >
>


Re: Error in parquet-testing/data/datapage_v2.snappy.parquet?

2019-04-29 Thread Ivan Sadikov
If you are referring to the file in Apache/parquet-testing repository, it
is a valid Parquet file with data encoded into data page v2.

You can easily test it with “cargo install parquet” and “parquet-read
filepath”.
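
Concretely (the file path below is a placeholder):

```
cargo install parquet
parquet-read /path/to/datapage_v2.snappy.parquet
```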

I am not sure what kind of code you have written, but the error you have
encountered could be related to the fact that parquet-cpp does not support
decoding of data page v2.


Cheers,

Ivan

On Mon, 29 Apr 2019 at 3:36 PM, Curt Hagenlocher 
wrote:

> To the best of my ability to tell, there is invalid Snappy data in the file
> parquet-testing/data/datapage_v2.snappy.parquet. I can neither read it with
> my own code nor with pyarrow 0.13.0. Is this expected to work?
>
> Thanks!
> -Curt
>


Re: [DISCUSS] Rust add adapter for parquet

2018-11-20 Thread Ivan Sadikov
Hello!

That would be great! I agree with Chao and Wes: we should do it similarly to
parquet-cpp, as long as it does not make it difficult for others to work
with the Arrow repository. :)

Ha, Arrow data source sounds interesting. I will also catch up on the Arrow
development.


Cheers,

Ivan
On Tue, 20 Nov 2018 at 9:49 PM, Andy Grove  wrote:

> This sounds like a great idea.
>
> With support for both CSV and Parquet in the Arrow crate, it would be nice
> to design a standard interface for Arrow data sources. Maybe this is as
> simple as implementing `Iterator`.
>
> Andy.
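
(A minimal sketch of the interface Andy describes, using a hypothetical
RecordBatch type rather than the actual Arrow crate API:)

```
// Hypothetical stand-in for Arrow's record batch type, for illustration only.
struct RecordBatch;

/// Any Arrow data source (CSV, Parquet, ...) is just an iterator of batches.
trait DataSource: Iterator<Item = Result<RecordBatch, String>> {}

struct CsvSource { remaining: usize }

impl Iterator for CsvSource {
    type Item = Result<RecordBatch, String>;
    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining == 0 {
            return None; // end of input
        }
        self.remaining -= 1;
        Some(Ok(RecordBatch)) // real code would parse the next chunk of rows
    }
}

impl DataSource for CsvSource {}
```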
>
> On Tue, Nov 20, 2018 at 11:46 AM Chao Sun  wrote:
>
>> Yes, we'd be interested to move forward. I'm inclined to merge this into
>> Arrow because of the issues that you pointed out with parquet c++ merge,
>> and I do see a tight relationship between the two projects, and potential
>> sharing of common libraries. @Ivan Sadikov  what
>> do you think?
>>
>> Chao
>>
>> On Tue, Nov 20, 2018 at 10:23 AM Wes McKinney 
>> wrote:
>>
>>> hi folks,
>>>
>>> Would you all be interested in moving forward the parquet-rs project?
>>> I have a little more bandwidth to help with the code donation in the
>>> next month or two.
>>>
>>> I know we voted on the Parquet mailing list about the donation
>>> already. One big question is whether you want to create an
>>> apache/parquet-rs repository or whether you want to co-develop
>>> parquet-rs together with Arrow in Rust, similar to what we are doing
>>> with C++. It's possible you might run into the same kinds of issues
>>> that led us to consider the monorepo arrangement.
>>>
>>> Thanks
>>> Wes
>>> On Sun, Aug 19, 2018 at 11:11 PM Renjie Liu 
>>> wrote:
>>> >
>>> > Hi, Chao:
>>> > I've opened a JIRA issue for that and am planning to work on it.
>>> >
>>> > On Mon, Aug 20, 2018 at 11:03 AM Renjie Liu 
>>> wrote:
>>> >
>>> > > Yes, it's a mistake, sorry for that
>>> > >
>>> > >
>>> > > On Mon, Aug 20, 2018 at 10:57 AM Chao Sun 
>>> wrote:
>>> > >
>>> > >> (s/flink/arrow - it is a mistake?)
>>> > >>
>>> > >> Thanks Renjie for your interest. Yes, one of the next step in
>>> parquet-rs
>>> > >> is to integrate with Apache Arrow. Actually we just had a discussion
>>> > >> <https://github.com/sunchao/parquet-rs/issues/140> about this
>>> recently.
>>> > >> Feel free to share your comments on the github.
>>> > >>
>>> > >> Best,
>>> > >> Chao
>>> > >>
>>> > >> On Sun, Aug 19, 2018 at 7:39 PM, Renjie Liu <
>>> liurenjie2...@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >>> cc:Sunchao and Any
>>> > >>>
>>> > >>>
>>> > >>> -- Forwarded message -
>>> > >>> From: Uwe L. Korn 
>>> > >>> Date: Sun, Aug 19, 2018 at 5:08 PM
>>> > >>> Subject: Re: [DISCUSS] Rust add adapter for parquet
>>> > >>> To: 
>>> > >>>
>>> > >>>
>>> > >>> Hello,
>>> > >>>
>>> > >>> you might also want to raise this with the
>>> > >>> https://github.com/sunchao/parquet-rs project. The overlap
>>> between the
>>> > >>> developers of this project and the Arrow Rust implementation is
>>> quite large
>>> > >>> but still it may make sense to also start a discussion there.
>>> > >>>
>>> > >>> Uwe
>>> > >>>
>>> > >>> On Thu, Aug 16, 2018, at 9:14 AM, Renjie Liu wrote:
>>> > >>> > Hi, all:
>>> > >>> >
>>> > >>> > Now the rust component is approaching a stable state and the rust
>>> > >>> > reader for parquet is ready. I think it may be a good time to start
>>> > >>> > an adapter for parquet, just like the adapter for ORC in cpp. What
>>> > >>> > do you guys think about it?
>>> > >>> > --
>>> > >>> > Liu, Renjie
>>> > >>> > Software Engineer, MVAD
>>> > >>> --
>>> > >>> Liu, Renjie
>>> > >>> Software Engineer, MVAD
>>> > >>>
>>> > >>
>>> > >> --
>>> > > Liu, Renjie
>>> > > Software Engineer, MVAD
>>> > >
>>> > --
>>> > Liu, Renjie
>>> > Software Engineer, MVAD
>>>
>>


Parquet-cpp schema does not conform to the Thrift definition

2018-11-01 Thread Ivan Sadikov
Hi all,

When debugging files written by parquet-cpp, I found that the library does
not set the num_children field to None for PrimitiveType when serialising to
Thrift. This causes all fields to have Some(0), but the definition says that
num_children should be set only for GroupType.

Another odd behaviour I found is that the repetition level is set (or kept)
for the message type, with no corresponding check when reading the schema
from Thrift. In the example file, the root type has REQUIRED repetition, but
according to the Thrift definition it should not be set at all.
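
For illustration, the two checks amount to something like this (a sketch
against a hypothetical mirror of the Thrift SchemaElement, not the
parquet-cpp or parquet-rs types):

```
// Hypothetical mirror of the relevant Thrift SchemaElement fields.
struct SchemaElement {
    name: String,
    num_children: Option<i32>,    // should be set only for group types
    repetition_type: Option<i32>, // should not be set on the root message
}

fn validate(elements: &[SchemaElement]) -> Result<(), String> {
    let root = elements.first().ok_or("empty schema")?;
    if root.repetition_type.is_some() {
        return Err(format!("root '{}' must not set repetition_type", root.name));
    }
    for e in &elements[1..] {
        // parquet-cpp serialises Some(0) here for primitives; per the Thrift
        // definition, num_children should be left unset for non-group types.
        if e.num_children == Some(0) {
            return Err(format!("primitive '{}' sets num_children = 0", e.name));
        }
    }
    Ok(())
}
```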

I looked at the code and it seems like those inconsistencies still exist in
the current master branch. Example file and discussion are here:
https://github.com/sunchao/parquet-rs/issues/178.

It may not be an issue at all; I would appreciate any suggestions and
feedback. Thanks!

It looks like our parsing is not as robust as parquet-cpp's or parquet-mr's;
I am going to update that as well.


Kind regards,

Ivan


Re: Small malloc at file open and metadata parsing

2018-07-30 Thread Ivan Sadikov
Sorry to jump in like this, but I was wondering whether parquet-rs can read
the file correctly, or whether the issue also happens there.
Alex, could you give it a go and see if the file and metadata can be read with
parquet-rs (https://github.com/sunchao/parquet-rs; you can run cargo
install parquet to install the parquet tools)?


Cheers,

Ivan

On Mon, 30 Jul 2018 at 21:49 ALeX Wang  wrote:

> Thanks for the quick reply @Wes,
>
> Too bad this is causing a lot of delays (due to page fault handling) for
> light queries (ones that query only a few rows/columns).
>
> Will try to use jemalloc and see...
>
> One more question: when I upgrade to 1.4.0 or later code and use the same
> cmake options and environment, OpenFile results in a segfault:
>
> ```
> awake@ev003:/tmp$ cat tmpfile
> (gdb) where
> #0  0x7fc542eebc3c in free () from /lib64/libc.so.6
> #1  0x00f13cb1 in arrow::DefaultMemoryPool::Free (this=0x16e71e0
> , buffer=0x7fc52f425040
> , size=616512)
> at
>
> /opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/memory_pool.cc:147
> #2  0x00f117b6 in arrow::PoolBuffer::~PoolBuffer (this=0x34b5fb8,
> __in_chrg=) at
> /opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/buffer.cc:70
> #3  0x00e364b7 in
> __gnu_cxx::new_allocator::destroy
> (this=0x34b5fb0, __p=0x34b5fb8) at
> /usr/include/c++/4.8.2/ext/new_allocator.h:124
> #4  0x00e35e10 in
> std::allocator_traits
> >::_S_destroy (__a=..., __p=0x34b5fb8) at
> /usr/include/c++/4.8.2/bits/alloc_traits.h:281
> #5  0x00e34ea3 in
> std::allocator_traits
> >::destroy (__a=..., __p=0x34b5fb8) at
> /usr/include/c++/4.8.2/bits/alloc_traits.h:405
> #6  0x00e33f01 in std::_Sp_counted_ptr_inplace std::allocator, (__gnu_cxx::_Lock_policy)2>::_M_dispose
> (this=0x34b5fa0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:407
> #7  0x00e27748 in
> std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
> (this=0x34b5fa0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
> #8  0x00e255bb in
> std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count
> (this=0x7ffea5fffc88, __in_chrg=) at
> /usr/include/c++/4.8.2/bits/shared_ptr_base.h:546
> #9  0x00e23eae in std::__shared_ptr (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7ffea5fffc80,
> __in_chrg=) at
> /usr/include/c++/4.8.2/bits/shared_ptr_base.h:781
> #10 0x00e23ec8 in std::shared_ptr::~shared_ptr
> (this=0x7ffea5fffc80, __in_chrg=) at
> /usr/include/c++/4.8.2/bits/shared_ptr.h:93
> #11 0x00e875a4 in parquet::SerializedFile::ParseMetaData
> (this=0x34b5f60) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
> #12 0x00e858d4 in parquet::ParquetFileReader::Contents::Open
> (source=std::unique_ptr containing 0x0,
> props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:247
> ---Type <return> to continue, or q <return> to quit---
> #13 0x00e85a6f in parquet::ParquetFileReader::Open
> (source=std::unique_ptr containing 0x0,
> props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:265
> #14 0x00e859ba in parquet::ParquetFileReader::Open
> (source=std::shared_ptr (count 2, weak 0) 0x34b5e50, props=...,
> metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:259
> #15 0x00e85df4 in parquet::ParquetFileReader::OpenFile
>
> (path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030",
> memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at
> /opt/parquet-cpp/src/parquet/file_reader.cc:287
> ```
>
> Is this a known issue?
>
> Thanks,
> Alex Wang,
>
>
>
> On Mon, Jul 30, 2018, 11:22 AM Wes McKinney  wrote:
>
> > hi Alex,
> >
> > It looks like the mallocs are coming from Thrift
> > (parquet/parquet_types.cpp is generated by Thrift). I'm not sure if we
> > can do much about this. I'm curious if it's possible to pass a custom
> > STL allocator to Thrift so we could use a different allocation
> > strategy than the default STL allocator
> >
> > - Wes
> >
> > On Mon, Jul 30, 2018 at 1:54 PM, ALeX Wang  wrote:
> > > Hi,
> > >
> > > I'm reading a Parquet file (generated by the Java Parquet library). Our
> > > schema has 400 columns (including non-array elements and 1-dimensional
> > > array elements).
> > >
> > > I'm using release 1.3.1, gcc 4.8.5, boost static library 1.53,
> > >
> > > I compile parquet-cpp with the following cmake options:
> > > ```
> > > cmake3 -DCMAKE_BUILD_TYPE=Debug -DPARQUET_BUILD_EXAMPLES=OFF
> > >  -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW_LINKAGE="static"
> > >  -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF .
> > > ```
> > >
> > > One thing we noticed is that the cpp library performs a lot of small
> > > mallocs during the file-open and metadata-parsing phases, as shown
> > > below:
> > >
> > > ```
> > > (gdb) where
> > > #0  0x7fdf40594801 in malloc () from /lib64/libc.so.6
> > > #1  0x7fdf40e52ecd in operator 

Re: Contributing parquet-rs to Apache?

2018-02-28 Thread Ivan Sadikov
Hello,

Yes, same for me. Developed on my own time and laptop.


Cheers,

Ivan
On Thu, 1 Mar 2018 at 3:26 PM, Chao Sun <sunc...@apache.org> wrote:

> Correct. On my side, this is solely developed on my own time and equipment.
> No employer will be able to claim the copyright.
> @Ivan: how about your side?
>
> Thanks,
> Chao
>
> On Wed, Feb 28, 2018 at 9:48 AM, Ryan Blue <rb...@netflix.com> wrote:
>
> > Well, that was easy. No employers that might be able to claim copyright?
> >
> > It sounds like we will just need to start a resolution on the Incubator's
> > general list to accept the code and have you and Ivan sign ICLAs.
> >
> > rb
> >
> > On Mon, Feb 26, 2018 at 11:45 AM, Chao Sun <sunc...@apache.org> wrote:
> >
> >> Thanks Ryan. The code is currently owned by me, and is being worked on by
> >> Ivan and myself. Yes, I believe both of us will continue working on it
> >> after moving to Apache.
> >>
> >> Best,
> >> Chao
> >>
> >> On Mon, Feb 26, 2018 at 10:36 AM, Ryan Blue <rb...@netflix.com.invalid>
> >> wrote:
> >>
> >>> Chao,
> >>>
> >>> From looking at the doc that Wes sent, I think the first thing to do is
> >>> to
> >>> find out who owns copyright for the code that would be imported. Then we
> >>> would plan how to get grants or license agreements from copyright
> >>> holders,
> >>> and finally start a resolution in the incubator to accept the code.
> >>>
> >>> So I guess the question is: who owns the code and will those people or
> >>> organizations work on it once it is moved to Apache?
> >>>
> >>> rb
> >>>
> >>> On Fri, Feb 23, 2018 at 11:06 PM, Chao Sun <chao.apa...@gmail.com>
> >>> wrote:
> >>>
> >>> > Thanks Wes. We are ready to start the IP clearance - can some PMC
> help
> >>> to
> >>> > guide through the process?
> >>> >
> >>> > Best,
> >>> > Chao
> >>> >
> >>> > On Wed, Feb 21, 2018 at 12:43 PM, Wes McKinney <wesmck...@gmail.com>
> >>> > wrote:
> >>> >
> >>> > > hi Ivan and Chao,
> >>> > >
> >>> > > Since this work is ongoing for more than a year, it would be best
> to
> >>> > > conduct an IP clearance to import it into the Apache Parquet
> project
> >>> > > (http://incubator.apache.org/ip-clearance/).
> >>> > >
> >>> > > One or more members of the PMC will need to assist with this to
> >>> > > prepare the documentation for the IP review and to initiate the
> >>> > > process.
> >>> > >
> >>> > > It is OK for the library to not be feature complete.
> >>> > >
> >>> > > Thanks
> >>> > > Wes
> >>> > >
> >>> > > On Fri, Feb 16, 2018 at 3:39 AM, Ivan Sadikov <
> >>> ivan.sadi...@gmail.com>
> >>> > > wrote:
> >>> > > > Hi Chao,
> >>> > > >
> >>> > > > Great to hear that you are pushing this forward.
> >>> > > > Apologies for not forwarding the email thread earlier.
> >>> > > >
> >>> > > > I will try fixing some issues of the milestone 1, so that we
> could
> >>> have
> >>> > > the
> >>> > > > read part complete.
> >>> > > >
> >>> > > >
> >>> > > > Cheers,
> >>> > > >
> >>> > > > Ivan
> >>> > > > On Fri, 16 Feb 2018 at 5:33 PM, Chao Sun <sunc...@apache.org>
> >>> wrote:
> >>> > > >
> >>> > > >> Hi,
> >>> > > >>
> >>> > > >> Just joined this mailing list. Ivan and me have been working on
> a
> >>> Rust
> >>> > > >> implementation of Parquet <
> https://github.com/sunchao/parquet-rs>
> >>> for
> >>> > > some
> >>> > > >> time. It still lacks many features but the eventual goal is to
> >>> > > contribute
> >>> > > >> it to the Apache community.
> >>> > > >>
> >>> > > >> I saw a few weeks ago there's a discussion between Ivan and Wes
> >>> about
> >>> > > this
> >>> > > >> topic, and wonder what is the required steps to realize this. Is
> >>> it OK
> >>> > > if
> >>> > > >> it is not feature complete yet?
> >>> > > >>
> >>> > > >> Thanks,
> >>> > > >> Chao
> >>> > > >>
> >>> > >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Ryan Blue
> >>> Software Engineer
> >>> Netflix
> >>>
> >>
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>


Re: Contributing parquet-rs to Apache?

2018-02-16 Thread Ivan Sadikov
Hi Chao,

Great to hear that you are pushing this forward.
Apologies for not forwarding the email thread earlier.

I will try to fix some of the milestone 1 issues, so that we can have the
read part complete.


Cheers,

Ivan
On Fri, 16 Feb 2018 at 5:33 PM, Chao Sun  wrote:

> Hi,
>
> Just joined this mailing list. Ivan and I have been working on a Rust
> implementation of Parquet <https://github.com/sunchao/parquet-rs> for some
> time. It still lacks many features but the eventual goal is to contribute
> it to the Apache community.
>
> I saw a few weeks ago there's a discussion between Ivan and Wes about this
> topic, and wonder what the required steps are to realize this. Is it OK if
> it is not feature complete yet?
>
> Thanks,
> Chao
>


Help with Parquet record materialiser

2018-02-11 Thread Ivan Sadikov
Hello,

Are there any docs or blogs on how the Parquet record materialiser works? It
would be great if someone could explain to me what the input is and how the
record materialiser traverses the schema and reconstructs a record.

I understand how the API is structured and how to use it; I am more after
help with the content of RecordReaderImplementation.java or its equivalent in
parquet-cpp. I would appreciate an explanation of the init and read methods.

Thank you in advance!
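
(A note for the archive, since this comes up often: the input to the record
materialiser is, per column, a stream of (repetition level, definition level,
value) triplets, and a repetition level of 0 marks the start of a new record.
A very rough sketch of that core idea for a single flat repeated column
follows; this is the Dremel assembly notion in miniature, not the parquet-mr
algorithm itself.)

```
// One flat repeated column: value is None when def < max_def (a null or a
// missing ancestor). rep == 0 means "start a new record"; rep > 0 means
// "append to the current repeated field".
struct Triplet { rep: i16, def: i16, value: Option<i64> }

fn assemble(triplets: &[Triplet], max_def: i16) -> Vec<Vec<Option<i64>>> {
    let mut records: Vec<Vec<Option<i64>>> = Vec::new();
    for t in triplets {
        if t.rep == 0 {
            records.push(Vec::new()); // repetition level 0 starts a record
        }
        if let Some(rec) = records.last_mut() {
            rec.push(if t.def == max_def { t.value } else { None });
        }
    }
    records
}
```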


Cheers,

Ivan


Re: Parquet DeltaLengthByteArrayDecoder question

2018-01-29 Thread Ivan Sadikov
Link: https://github.com/sunchao/parquet-rs

I think @sunchao is in the Apache community already; there is an email on the
GitHub profile page.

We are just trying to bring it up to speed with the other Parquet
implementations, but there is still a lot of work to do. :) Would appreciate
any help!

We are currently adding encodings and decodings; I think only the delta byte
array encodings and decodings are left - I will be adding them shortly.


Cheers,

Ivan
On Tue, 30 Jan 2018 at 11:12 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> Cool. Where is this development happening? Would you like to join the
> Apache Parquet community?
>
> - Wes
>
> On Mon, Jan 29, 2018 at 4:20 PM, Ivan Sadikov <ivan.sadi...@gmail.com>
> wrote:
> > Thanks Wes. It is okay, I fixed the issues, so everything is great.
> >
> > We are currently pushing parquet-rs to be feature compatible with
> > parquet-mr and parquet-cpp.
> > On Tue, 30 Jan 2018 at 9:57 AM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> >> hi Ivan -- note that this code has not been actively maintained
> >> because this encoding is not in wide use yet (so there could be
> >> discrepancies vs. what is in parquet-mr).
> >>
> >> thanks,
> >> Wes
> >>
> >> On Sun, Jan 28, 2018 at 10:50 PM, Ivan Sadikov <ivan.sadi...@gmail.com>
> >> wrote:
> >> > Hello,
> >> >
> >> > I am currently trying to debug DeltaLengthByteArrayDecoder in
> >> > parquet-cpp and cannot understand how it knows where the encoded
> >> > lengths part ends (for the delta bit-packing decoder) and the actual
> >> > byte array data begins.
> >> >
> >> > I can see that parquet-mr simply loads all the data to reach the end
> >> > of the encoded lengths, but it looks like parquet-cpp does it
> >> > differently.
> >> >
> >> > Would appreciate any help with this!
> >> > Thanks!
> >> >
> >> >
> >> > Cheers,
> >> >
> >> > Ivan
> >>
>


Re: Parquet DeltaLengthByteArrayDecoder question

2018-01-29 Thread Ivan Sadikov
Thanks Wes. It is okay, I fixed the issues, so everything is great.

We are currently pushing parquet-rs to be feature compatible with
parquet-mr and parquet-cpp.
On Tue, 30 Jan 2018 at 9:57 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Ivan -- note that this code has not been actively maintained
> because this encoding is not in wide use yet (so there could be
> discrepancies vs. what is in parquet-mr).
>
> thanks,
> Wes
>
> On Sun, Jan 28, 2018 at 10:50 PM, Ivan Sadikov <ivan.sadi...@gmail.com>
> wrote:
> > Hello,
> >
> > I am currently trying to debug DeltaLengthByteArrayDecoder in parquet-cpp
> > and cannot understand how it knows where the encoded lengths part ends
> > (for the delta bit-packing decoder) and the actual byte array data begins.
> >
> > I can see that parquet-mr simply loads all the data to reach the end of
> > the encoded lengths, but it looks like parquet-cpp does it differently.
> >
> > Would appreciate any help with this!
> > Thanks!
> >
> >
> > Cheers,
> >
> > Ivan
>


Parquet DeltaLengthByteArrayDecoder question

2018-01-28 Thread Ivan Sadikov
Hello,

I am currently trying to debug DeltaLengthByteArrayDecoder in parquet-cpp and
cannot understand how it knows where the encoded lengths part ends (for the
delta bit-packing decoder) and the actual byte array data begins.

I can see that parquet-mr simply loads all the data to reach the end of the
encoded lengths, but it looks like parquet-cpp does it differently.

Would appreciate any help with this!
Thanks!
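
(A note for the archive: the DELTA_BINARY_PACKED lengths block is
self-delimiting, because its header carries total_value_count, so decoding it
consumes a known span of bytes and the concatenated byte array data starts
exactly where it ends. Below is a toy sketch of that boundary logic, with the
real delta decoder faked as one count byte plus one length byte per value.)

```
// Toy sketch only: the "lengths" section here is one count byte followed by
// one length byte per value, standing in for the self-delimiting
// DELTA_BINARY_PACKED block. The point is the boundary arithmetic: once the
// lengths decoder finishes, `offset` is the start of the byte array data.
fn decode_delta_length_byte_array(page: &[u8]) -> Vec<Vec<u8>> {
    let count = page[0] as usize;
    let lens: Vec<usize> = page[1..1 + count].iter().map(|&b| b as usize).collect();
    let mut offset = 1 + count; // data begins where the lengths block ended

    lens.iter()
        .map(|&n| {
            let s = page[offset..offset + n].to_vec();
            offset += n;
            s
        })
        .collect()
}

fn main() {
    // two strings: "hi" (2 bytes) and "rust" (4 bytes)
    let page = [2u8, 2, 4, b'h', b'i', b'r', b'u', b's', b't'];
    assert_eq!(
        decode_delta_length_byte_array(&page),
        vec![b"hi".to_vec(), b"rust".to_vec()]
    );
}
```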


Cheers,

Ivan


[jira] [Commented] (PARQUET-750) Parquet tools cat/head parity, small fixes

2016-10-14 Thread Ivan Sadikov (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576394#comment-15576394
 ] 

Ivan Sadikov commented on PARQUET-750:
--

PR: https://github.com/apache/parquet-mr/pull/378

> Parquet tools cat/head parity, small fixes
> --
>
> Key: PARQUET-750
> URL: https://issues.apache.org/jira/browse/PARQUET-750
> Project: Parquet
>  Issue Type: Bug
>        Reporter: Ivan Sadikov
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-750) Parquet tools cat/head parity, small fixes

2016-10-14 Thread Ivan Sadikov (JIRA)
Ivan Sadikov created PARQUET-750:


 Summary: Parquet tools cat/head parity, small fixes
 Key: PARQUET-750
 URL: https://issues.apache.org/jira/browse/PARQUET-750
 Project: Parquet
  Issue Type: Bug
Reporter: Ivan Sadikov
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-748) Update assembly for parquet tools

2016-10-14 Thread Ivan Sadikov (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Sadikov resolved PARQUET-748.
--
Resolution: Won't Fix

> Update assembly for parquet tools
> -
>
> Key: PARQUET-748
> URL: https://issues.apache.org/jira/browse/PARQUET-748
> Project: Parquet
>  Issue Type: Improvement
>        Reporter: Ivan Sadikov
>Priority: Trivial
>
> Small update for the {{parquet-tools}} module to be able to create a
> distribution including shell scripts and all dependencies. Also a
> documentation update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-748) Update assembly for parquet tools

2016-10-11 Thread Ivan Sadikov (JIRA)
Ivan Sadikov created PARQUET-748:


 Summary: Update assembly for parquet tools
 Key: PARQUET-748
 URL: https://issues.apache.org/jira/browse/PARQUET-748
 Project: Parquet
  Issue Type: Improvement
Reporter: Ivan Sadikov
Priority: Trivial


Small update for the {{parquet-tools}} module to be able to create a
distribution including shell scripts and all dependencies. Also a
documentation update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-748) Update assembly for parquet tools

2016-10-11 Thread Ivan Sadikov (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566544#comment-15566544
 ] 

Ivan Sadikov commented on PARQUET-748:
--

I will work on this; it is fairly trivial and will get me familiarized with
the code base a little bit.

> Update assembly for parquet tools
> -
>
> Key: PARQUET-748
> URL: https://issues.apache.org/jira/browse/PARQUET-748
> Project: Parquet
>  Issue Type: Improvement
>        Reporter: Ivan Sadikov
>Priority: Trivial
>
> Small update for the {{parquet-tools}} module to be able to create a
> distribution including shell scripts and all dependencies. Also a
> documentation update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)