[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568428#comment-16568428
 ] 

Robert Gruener commented on PARQUET-1370:
-

I see; in my case I am using the file handle from the pyarrow HDFS class, which 
does not seem to implement the RawIOBase API. It should be pretty easy to work 
around, though.
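
For what it's worth, a minimal sketch of such a workaround could look like the 
following. This is purely illustrative: the {{RawAdapter}} class and the file 
path are made-up names, and it assumes the HDFS handle exposes read/seek/tell.
{code:python}
import io

import pyarrow as pa
from pyarrow import parquet as pq


class RawAdapter(io.RawIOBase):
    """Hypothetical adapter exposing a pyarrow file handle as a RawIOBase."""

    def __init__(self, handle):
        self._handle = handle

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        # Assumes the wrapped handle accepts a whence argument like io's seek.
        return self._handle.seek(offset, whence)

    def tell(self):
        return self._handle.tell()

    def readinto(self, buf):
        # Read at most len(buf) bytes and copy them into the caller's buffer.
        data = self._handle.read(len(buf))
        buf[:len(data)] = data
        return len(data)


fs = pa.hdfs.connect()
with fs.open('/some/path/example.parquet') as handle:
    # Buffer reads in 512 KB blocks so many small page reads become few scans.
    reader = io.BufferedReader(RawAdapter(handle), 512 * 1024)
    parquet_file = pq.ParquetFile(reader)
    table = parquet_file.read()
{code}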

> Read consecutive column chunks in a single scan
> ---
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page; 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The Java implementation already handles this and will read 
> consecutive column chunks (and the resulting pages) in a single scan; see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads concerned 
> with optimal read performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568403#comment-16568403
 ] 

Uwe L. Korn commented on PARQUET-1370:
--

I'm doing the same; my code looks as follows:
{code:python}
import io

from pyarrow.parquet import ParquetFile

reader = …some file handle…
reader = io.BufferedReader(reader, 512 * 1024)
parquet_file = ParquetFile(reader){code}
This was so simple that I thought it might not be relevant for now. Having a 
general C++ equivalent of {{io.BufferedReader}} in Arrow C++ might be a simpler 
approach to our problem. Using {{io.BufferedReader}} probably involves some 
additional memory copies and overhead, as we have to switch between Python and 
C++ often.

(In my case, the file handle is coming from [https://github.com/mbr/simplekv] / 
[https://github.com/blue-yonder/storefact] )

 

> Read consecutive column chunks in a single scan
> ---
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page; 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The Java implementation already handles this and will read 
> consecutive column chunks (and the resulting pages) in a single scan; see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads concerned 
> with optimal read performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Thread Robert Gruener (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568384#comment-16568384
 ] 

Robert Gruener commented on PARQUET-1370:
-

Thanks for the tip [~xhochy]! We are reading Parquet using pyarrow, though, so I 
don't think it would be as straightforward as adding that buffer. Unless there 
is something I am not seeing?

 

Either way, it would be nice not to have to worry about this as a user of the 
library.

> Read consecutive column chunks in a single scan
> ---
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page; 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The Java implementation already handles this and will read 
> consecutive column chunks (and the resulting pages) in a single scan; see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads concerned 
> with optimal read performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568368#comment-16568368
 ] 

Uwe L. Korn commented on PARQUET-1370:
--

[~rgruener] I was also plagued by this issue, but I wrapped my Python file 
handle in [https://docs.python.org/3/library/io.html#io.BufferedReader] and this 
gave me sufficient performance. This was especially useful for me as I'm working 
with object stores like S3 or Azure Blob, where consecutive reads of 40 KB or 
512 KB make nearly no difference but the HTTP request overhead is the main 
bottleneck. 

> Read consecutive column chunks in a single scan
> ---
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem scan for every single data page; 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The Java implementation already handles this and will read 
> consecutive column chunks (and the resulting pages) in a single scan; see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads concerned 
> with optimal read performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Thread Robert Gruener (JIRA)
Robert Gruener created PARQUET-1370:
---

 Summary: Read consecutive column chunks in a single scan
 Key: PARQUET-1370
 URL: https://issues.apache.org/jira/browse/PARQUET-1370
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Robert Gruener


Currently parquet-cpp issues a filesystem scan for every single data page; see 
[https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]

For remote filesystems this can be very inefficient when reading many small 
columns. The Java implementation already handles this and will read consecutive 
column chunks (and the resulting pages) in a single scan; see 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]

 

This might be a bit difficult to do, as it would require changing a lot of the 
code structure, but it would certainly be valuable for workloads concerned with 
optimal read performance.
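
As a rough illustration of the intended behaviour (not actual parquet-cpp or 
parquet-mr code), the sketch below merges the byte ranges of consecutive column 
chunks so that adjacent chunks are fetched with one read instead of one read per 
page; the helper name and the example offsets are made up.
{code:python}
def coalesce_ranges(ranges):
    """Merge adjacent (offset, length) byte ranges into single reads."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset == merged[-1][0] + merged[-1][1]:
            # This chunk starts exactly where the previous one ends: extend it.
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            merged.append((offset, length))
    return merged


# Hypothetical column-chunk layout within one row group.
chunks = [(4, 100), (104, 50), (154, 200), (1000, 80)]
print(coalesce_ranges(chunks))  # [(4, 350), (1000, 80)] -> two reads, not four
{code}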



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16568247#comment-16568247
 ] 

Uwe L. Korn commented on PARQUET-1369:
--

[~rgruener] Moved it.

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: PARQUET-1369
> URL: https://issues.apache.org/jira/browse/PARQUET-1369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.5.0
>
>
> I have a dataset generated by Spark which shows statistics for the 
> string column when using the Java parquet-mr code (shown by using 
> `parquet-tools meta`); however, reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize the statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I filed this here and not 
> as a JIRA since I wanted to be sure this is actually an issue and there wasn't 
> a ticket already made there (I couldn't find one, but I wanted to be sure). 
> Either way, I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Moved] (PARQUET-1369) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-08-03 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn moved ARROW-2800 to PARQUET-1369:
-

Fix Version/s: (was: 0.11.0)
   cpp-1.5.0
Affects Version/s: (was: 0.9.0)
   cpp-1.4.0
  Component/s: (was: Python)
   parquet-cpp
 Workflow: patch-available, re-open possible  (was: jira)
  Key: PARQUET-1369  (was: ARROW-2800)
  Project: Parquet  (was: Apache Arrow)

> [Python] Unavailable Parquet column statistics from Spark-generated file
> 
>
> Key: PARQUET-1369
> URL: https://issues.apache.org/jira/browse/PARQUET-1369
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet
> Fix For: cpp-1.5.0
>
>
> I have a dataset generated by Spark which shows statistics for the 
> string column when using the Java parquet-mr code (shown by using 
> `parquet-tools meta`); however, reading from pyarrow shows that the statistics 
> for that column are not set. I should note the column only has a single 
> value, though it still seems like a problem that pyarrow can't recognize it 
> (it can recognize the statistics set for the long and double types).
> See https://github.com/apache/arrow/files/2161147/metadata.zip for file 
> example.
> Pyarrow Code To Check Statistics:
> {code}
> from pyarrow import parquet as pq
> meta = pq.read_metadata('/tmp/metadata.parquet')
> # No Statistics For String Column, prints false and statistics object is None
> print(meta.row_group(0).column(1).is_stats_set)
> {code}
> Example parquet-meta output:
> {code}
> file schema: spark_schema 
> 
> int: REQUIRED INT64 R:0 D:0
> string:  OPTIONAL BINARY O:UTF8 R:0 D:1
> float:   REQUIRED DOUBLE R:0 D:0
> row group 1: RC:8333 TS:76031 OFFSET:4 
> 
> int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
> string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
> ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 
> 4192]
> float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
> ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
> num_nulls: 0]
> {code}
> I realize the column only has a single value, though it still seems like 
> pyarrow should be able to read the statistics set. I filed this here and not 
> as a JIRA since I wanted to be sure this is actually an issue and there wasn't 
> a ticket already made there (I couldn't find one, but I wanted to be sure). 
> Either way, I would like to understand why this is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: num_level in Parquet Cpp library & how to add a JSON field?

2018-08-03 Thread Uwe L. Korn
Hello Ivy,

"primitive binary" means `Type::BYTE_ARRAY`, so you're correct. I have not yet 
seen anyone use the JSON field with parquet-cpp but the JSON type is simply a 
binary string with an annotation so I would expect everything to just work.

Uwe

On Thu, Aug 2, 2018, at 7:59 PM, ivywu...@gmail.com wrote:
> Hi, 
> I’m creating a parquet file using the parquet C++ library. I’ve been 
> looking for answers online but still can’t figure out the following 
> questions.
> 
> 1. What does num_levels mean in the WriteBatch method?
>  WriteBatch(int64_t num_levels, const int16_t* def_levels,
> const int16_t* rep_levels,
> const typename ParquetType::c_type* values)
> 
> 2. How to create a field for the JSON datatype? By looking at this link 
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, it 
> seems JSON is not considered a nested datatype. To create a field 
> for JSON data, what primitive type should it be? According to the link, 
> it says "binary primitive type", does it mean "Type::BYTE_ARRAY"?
>   PrimitiveNode::Make("JSON_field", Repetition::REQUIRED, Type:: ?, 
> LogicalType::JSON))
>   
> Any help is appreciated! 
> Thanks,
> Ivy
> 


[jira] [Updated] (PARQUET-1367) upgrade libraries to work around security issues

2018-08-03 Thread Matt Darwin (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Darwin updated PARQUET-1367:
-
Description: 
There are a number of libraries which need updating.  Among other reasons, 
there are several security issues filed in CVE for 
[Hadoop|https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hadoop] and 
[guava|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-10237]

 

 

  was:
There are a number of libraries which need updating.  Among other reasons, 
there are [several security issues filed in CVE for 
[Hadoop|https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hadoop] and 
[guava|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-10237]

 

 


> upgrade libraries to work around security issues
> 
>
> Key: PARQUET-1367
> URL: https://issues.apache.org/jira/browse/PARQUET-1367
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Matt Darwin
>Priority: Major
>  Labels: pull-request-available, security
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> There are a number of libraries which need updating.  Among other reasons, 
> there are several security issues filed in CVE for 
> [Hadoop|https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hadoop] and 
> [guava|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-10237]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1310) Column indexes: Filtering

2018-08-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567913#comment-16567913
 ] 

ASF GitHub Bot commented on PARQUET-1310:
-

HyukjinKwon opened a new pull request #510: PARQUET-1310: ParquetFileReader 
should close its input stream for the failure in constructor
URL: https://github.com/apache/parquet-mr/pull/510
 
 
   This PR proposes to close the stream opened in ParquetFileReader's constructor 
when it throws an error while being used (`readFooter`), which causes a resource 
leak. Otherwise, it looks like there's no way to close it from outside.
   
   For more details, please see the JIRA ticket 
https://issues.apache.org/jira/browse/PARQUET-1368.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Column indexes: Filtering
> -
>
> Key: PARQUET-1310
> URL: https://issues.apache.org/jira/browse/PARQUET-1310
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1364) Column Indexes: Invalid row indexes for pages starting with nulls

2018-08-03 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PARQUET-1364:
---
Fix Version/s: 1.11.0

> Column Indexes: Invalid row indexes for pages starting with nulls
> -
>
> Key: PARQUET-1364
> URL: https://issues.apache.org/jira/browse/PARQUET-1364
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> The current implementation for managing row indexes for the pages is 
> not reliable. There is logic in 
> [MessageColumnIO|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L153]
>  which caches null values and flushes them just *before* opening a new group. 
> This logic might cause pages to start with these cached nulls, which are not 
> correctly counted in the written rows, so the rowIndexes are incorrect. It 
> does not cause any issues if all the pages are read continuously, but it is a 
> huge problem for column-index-based filtering.
> The implementation described above is really complicated, and we would not 
> like to redesign it because of the mentioned issue. It is easier to simply 
> count the {{0}} repetition levels as record boundaries at the column writer 
> level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1368) ParquetFileReader should close its input stream for the failure in constructor

2018-08-03 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created PARQUET-1368:
-

 Summary: ParquetFileReader should close its input stream for the 
failure in constructor
 Key: PARQUET-1368
 URL: https://issues.apache.org/jira/browse/PARQUET-1368
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.0
Reporter: Hyukjin Kwon


I was trying to replace the deprecated usage of {{readFooter}} with 
{{ParquetFileReader.open}} according to the note:

{code}

[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:368:
 method readFooter in object ParquetFileReader is deprecated: see corresponding 
Javadoc for more information.
[warn] ParquetFileReader.readFooter(sharedConf, filePath, 
SKIP_ROW_GROUPS).getFileMetaData
[warn]   ^

[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:545:
 method readFooter in object ParquetFileReader is deprecated: see corresponding 
Javadoc for more information.
[warn] ParquetFileReader.readFooter(
[warn]   ^
{code}

Then, I realised some test suites report a resource leak:

{code}
java.lang.Throwable
at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:687)
at 
org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:595)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.createParquetReader(ParquetUtils.scala:67)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.readFooter(ParquetUtils.scala:46)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:544)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:539)
at 
scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132)
at 
scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62)
at 
scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at 
scala.collection.parallel.ParIterableLike$FlatMap.tryLeaf(ParIterableLike.scala:1068)
at 
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:159)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443)
at 
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
at 
scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
at 
scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
at 
scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
at 
scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56)
at 
scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:958)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at 
scala.collection.parallel.ParIterableLike$ResultMapping.tryLeaf(ParIterableLike.scala:953)
at