[jira] [Commented] (PARQUET-2045) ConsecutiveChunkList's length field should be long instead of int

2021-05-10 Thread Xianjin YE (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341789#comment-17341789
 ] 

Xianjin YE commented on PARQUET-2045:
-

I'd like to propose a fix that changes the length field in 
ConsecutiveChunkList to long.

> ConsecutiveChunkList's length field should be long instead of int
> -
>
> Key: PARQUET-2045
> URL: https://issues.apache.org/jira/browse/PARQUET-2045
> Project: Parquet
>  Issue Type: Bug
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Major
> Attachments: image-2021-05-10-17-12-00-083.png, 
> image-2021-05-10-17-14-45-401.png
>
>
> Hi, we encountered some read failures for large column chunks (size > 
> Int.MaxValue). After some debugging, we found that ConsecutiveChunkList's 
> length field is an int, which overflows when the uncompressed size of one 
> ColumnChunk is larger than Int.MaxValue.
> Below is the exception stack:
> !image-2021-05-10-17-12-00-083.png!
> The column sizes are as follows:
> !image-2021-05-10-17-14-45-401.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2045) ConsecutiveChunkList's length field should be long instead of int

2021-05-10 Thread Xianjin YE (Jira)
Xianjin YE created PARQUET-2045:
---

 Summary: ConsecutiveChunkList's length field should be long 
instead of int
 Key: PARQUET-2045
 URL: https://issues.apache.org/jira/browse/PARQUET-2045
 Project: Parquet
  Issue Type: Bug
Reporter: Xianjin YE
Assignee: Xianjin YE
 Attachments: image-2021-05-10-17-12-00-083.png, 
image-2021-05-10-17-14-45-401.png

Hi, we encountered some read failures for large column chunks (size > 
Int.MaxValue). After some debugging, we found that ConsecutiveChunkList's 
length field is an int, which overflows when the uncompressed size of one 
ColumnChunk is larger than Int.MaxValue.

Below is the exception stack:

!image-2021-05-10-17-12-00-083.png!

The column sizes are as follows:

!image-2021-05-10-17-14-45-401.png!





[jira] [Created] (PARQUET-1257) GetRecordBatchReader in parquet/arrow/reader.h should be able to specify chunksize

2018-03-28 Thread Xianjin YE (JIRA)
Xianjin YE created PARQUET-1257:
---

 Summary: GetRecordBatchReader in parquet/arrow/reader.h should be 
able to specify chunksize
 Key: PARQUET-1257
 URL: https://issues.apache.org/jira/browse/PARQUET-1257
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Xianjin YE


see the comments on [https://github.com/apache/parquet-cpp/pull/445]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2017-11-27 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266954#comment-16266954
 ] 

Xianjin YE commented on PARQUET-1166:
-

All right then, I will send a PR soon and will try to reuse Arrow's code 
whenever possible.

> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is that we can selectively read RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_indices are 
> sorted and unique, e.g. [1, 3, 5]).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> 
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be 
> processed by multiple tasks (possibly on different hosts, like Map tasks in 
> MapReduce).
> [~wesmckinn][~xhochy] What do you think?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2017-11-26 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE updated PARQUET-1166:

Description: 
Hi, I'd like to propose a new API to better support splittable reading of 
Parquet files.

The intent of this API is that we can selectively read RowGroups (normally 
contiguous, but they can be arbitrary as long as the row_group_indices are 
sorted and unique, e.g. [1, 3, 5]).

The proposed API would be something like this:

{code:java}
::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);

::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);
{code}

With the new API, we can split a Parquet file into RowGroups that can be 
processed by multiple tasks (possibly on different hosts, like Map tasks in 
MapReduce).

[~wesmckinn]@xch

  was:
Hi, I'd like to propose a new API to better support splittable reading of 
Parquet files.

The intent of this API is that we can selectively read RowGroups (normally 
contiguous, but they can be arbitrary as long as the row_group_indices are 
sorted and unique, e.g. [1, 3, 5]).

The proposed API would be something like this:

{code:java}
::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);

::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);
{code}

With the new API, we can split a Parquet file into RowGroups that can be 
processed by multiple tasks (possibly on different hosts, like Map tasks in 
MapReduce).


> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is that we can selectively read RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_indices are 
> sorted and unique, e.g. [1, 3, 5]).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> 
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be 
> processed by multiple tasks (possibly on different hosts, like Map tasks in 
> MapReduce).
> [~wesmckinn]@xch





[jira] [Updated] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2017-11-26 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE updated PARQUET-1166:

Description: 
Hi, I'd like to propose a new API to better support splittable reading of 
Parquet files.

The intent of this API is that we can selectively read RowGroups (normally 
contiguous, but they can be arbitrary as long as the row_group_indices are 
sorted and unique, e.g. [1, 3, 5]).

The proposed API would be something like this:

{code:java}
::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);

::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);
{code}

With the new API, we can split a Parquet file into RowGroups that can be 
processed by multiple tasks (possibly on different hosts, like Map tasks in 
MapReduce).

[~wesmckinn][~xhochy] What do you think?

  was:
Hi, I'd like to propose a new API to better support splittable reading of 
Parquet files.

The intent of this API is that we can selectively read RowGroups (normally 
contiguous, but they can be arbitrary as long as the row_group_indices are 
sorted and unique, e.g. [1, 3, 5]).

The proposed API would be something like this:

{code:java}
::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);

::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);
{code}

With the new API, we can split a Parquet file into RowGroups that can be 
processed by multiple tasks (possibly on different hosts, like Map tasks in 
MapReduce).

[~wesmckinn]@xch


> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is that we can selectively read RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_indices are 
> sorted and unique, e.g. [1, 3, 5]).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> 
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be 
> processed by multiple tasks (possibly on different hosts, like Map tasks in 
> MapReduce).
> [~wesmckinn][~xhochy] What do you think?





[jira] [Created] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2017-11-26 Thread Xianjin YE (JIRA)
Xianjin YE created PARQUET-1166:
---

 Summary: [API Proposal] Add GetRecordBatchReader in 
parquet/arrow/reader.h
 Key: PARQUET-1166
 URL: https://issues.apache.org/jira/browse/PARQUET-1166
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Xianjin YE


Hi, I'd like to propose a new API to better support splittable reading of 
Parquet files.

The intent of this API is that we can selectively read RowGroups (normally 
contiguous, but they can be arbitrary as long as the row_group_indices are 
sorted and unique, e.g. [1, 3, 5]).

The proposed API would be something like this:

{code:java}
::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);

::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);
{code}

With the new API, we can split a Parquet file into RowGroups that can be 
processed by multiple tasks (possibly on different hosts, like Map tasks in 
MapReduce).





[jira] [Updated] (PARQUET-970) Add Add Lz4 and Zstd compression codecs

2017-11-22 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE updated PARQUET-970:
---
Description: 
https://github.com/facebook/zstd looks quite promising, I'd like to add a 
compressor in parquet-cpp.


Lz4 and Zstd codecs are added since parquet-format has already included these codecs.

  was:https://github.com/facebook/zstd looks quite promising, I'd like to add a 
compressor in parquet-cpp.


> Add Add Lz4 and Zstd compression codecs
> ---
>
> Key: PARQUET-970
> URL: https://issues.apache.org/jira/browse/PARQUET-970
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>
> https://github.com/facebook/zstd looks quite promising, I'd like to add a 
> compressor in parquet-cpp.
> Lz4 and Zstd codecs are added since parquet-format has already included these 
> codecs.





[jira] [Updated] (PARQUET-970) Add Add Lz4 and Zstd compression codecs

2017-11-22 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE updated PARQUET-970:
---
Summary: Add Add Lz4 and Zstd compression codecs  (was: Add ZstdCompressor 
for parquet-cpp compressor interface)

> Add Add Lz4 and Zstd compression codecs
> ---
>
> Key: PARQUET-970
> URL: https://issues.apache.org/jira/browse/PARQUET-970
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>
> https://github.com/facebook/zstd looks quite promising, I'd like to add a 
> compressor in parquet-cpp.





[jira] [Commented] (PARQUET-970) Add ZstdCompressor for parquet-cpp compressor interface

2017-07-24 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099507#comment-16099507
 ] 

Xianjin YE commented on PARQUET-970:


Great. I will do some experiments when I get some spare time...

> Add ZstdCompressor for parquet-cpp compressor interface
> ---
>
> Key: PARQUET-970
> URL: https://issues.apache.org/jira/browse/PARQUET-970
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>
> https://github.com/facebook/zstd looks quite promising, I'd like to add a 
> compressor in parquet-cpp.





[jira] [Commented] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-06-02 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035787#comment-16035787
 ] 

Xianjin YE commented on PARQUET-1012:
-

[~mdeepak] I think this issue is fixed by PARQUET-349, but only in parquet-mr 
1.9.
We may need to add a build hash to make parquet-mr 1.8.2 (Spark 2.1) happy.

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Deepak Majeti
>
> Spark 2.1 uses parquet-mr (common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read a 
> parquet file generated by parquet-cpp:
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-06-02 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035787#comment-16035787
 ] 

Xianjin YE edited comment on PARQUET-1012 at 6/3/17 3:52 AM:
-

[~mdeepak] I think this issue is fixed by PARQUET-349, but only in parquet-mr 
1.9.
We may need to add a build hash to make parquet-mr 1.8.2 (Spark 2.1) happy.


was (Author: advancedxy):
[~mdeepak] I think this issued is fixed by PARQUET-349, but only in parquet-mr 
1.9.
We may need to add a build hash yo make parquet-mr 1.8.2(Spark 2.1) happy.

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Deepak Majeti
>
> Spark 2.1 uses parquet-mr (common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read a 
> parquet file generated by parquet-cpp:
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.





[jira] [Commented] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-05-31 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031451#comment-16031451
 ] 

Xianjin YE commented on PARQUET-1012:
-

Thanks.

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Deepak Majeti
>
> Spark 2.1 uses parquet-mr (common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read a 
> parquet file generated by parquet-cpp:
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.





[jira] [Created] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-05-31 Thread Xianjin YE (JIRA)
Xianjin YE created PARQUET-1012:
---

 Summary: parquet-cpp and parquet-mr version parse inconsistent
 Key: PARQUET-1012
 URL: https://issues.apache.org/jira/browse/PARQUET-1012
 Project: Parquet
  Issue Type: Improvement
Reporter: Xianjin YE


Spark 2.1 uses parquet-mr (common) 1.8.2, which requires created_by to match 
a certain pattern. I found the following exception when using Spark to read a 
parquet file generated by parquet-cpp:

17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
org.apache.parquet.VersionParser$VersionParseException: Could not parse 
created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
)?\(build ?(.*)\)

Proposal to fix this issue: set created_by to match the expected pattern.





[jira] [Updated] (PARQUET-1012) parquet-cpp and parquet-mr version parse inconsistent

2017-05-31 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE updated PARQUET-1012:

Component/s: parquet-cpp

> parquet-cpp and parquet-mr version parse inconsistent
> -
>
> Key: PARQUET-1012
> URL: https://issues.apache.org/jira/browse/PARQUET-1012
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>
> Spark 2.1 uses parquet-mr (common) 1.8.2, which requires created_by to match 
> a certain pattern. I found the following exception when using Spark to read a 
> parquet file generated by parquet-cpp:
> 17/05/31 16:33:53 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-cpp version 1.0.0
> org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.0.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> Proposal to fix this issue: set created_by to match the expected pattern.





[jira] [Commented] (PARQUET-995) [C++] Int96 reader in parquet_arrow uses size of Int96Type instead of Int96

2017-05-22 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019758#comment-16019758
 ] 

Xianjin YE commented on PARQUET-995:


Hi, [~wesmckinn], how was this issue found?

I was haunted by this bug all day today. In our application, the segfault in 
the core dump pointed to some innocent code (not related to parquet) because of 
the bad memory access. I tried everything, and only when I found this issue did 
it occur to me that this might be the root cause. Indeed, the problem was 
solved when I merged the latest parquet code.

> [C++] Int96 reader in parquet_arrow uses size of Int96Type instead of Int96
> ---
>
> Key: PARQUET-995
> URL: https://issues.apache.org/jira/browse/PARQUET-995
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: cpp-1.1.0
>
>
> This produces a segfault when reading {{alltypes_plain.parquet}} with 
> parquet::arrow. I will see if I can reproduce with a test case.





[jira] [Assigned] (PARQUET-924) [C++] Persist original type metadata from Arrow schemas

2017-05-04 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE reassigned PARQUET-924:
--

Assignee: Xianjin YE

> [C++] Persist original type metadata from Arrow schemas
> ---
>
> Key: PARQUET-924
> URL: https://issues.apache.org/jira/browse/PARQUET-924
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Xianjin YE
>
> This will enable us to convert back to the original type in some cases 
> (DictionaryArray, Time with seconds)





[jira] [Assigned] (PARQUET-914) [C++] Throw more informative exception when user writes too many values to a column in a row group

2017-05-04 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE reassigned PARQUET-914:
--

Assignee: Xianjin YE

> [C++] Throw more informative exception when user writes too many values to a 
> column in a row group
> --
>
> Key: PARQUET-914
> URL: https://issues.apache.org/jira/browse/PARQUET-914
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Xianjin YE
>
> In 
> https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
>  if the user writes more values than the size of the row group, the message 
> in the exception raised is misleading





[jira] [Commented] (PARQUET-970) Add ZstdCompressor for parquet-cpp compressor interface

2017-05-02 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994098#comment-15994098
 ] 

Xianjin YE commented on PARQUET-970:


Thanks for your input. The ZstdCompressor interface in Arrow works for me; I 
can work on that after [~wesmckinn] moves the interfaces.
It would also be more convincing if we applied it in the Arrow lib before 
proposing it to parquet-format.

> Add ZstdCompressor for parquet-cpp compressor interface
> ---
>
> Key: PARQUET-970
> URL: https://issues.apache.org/jira/browse/PARQUET-970
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>
> https://github.com/facebook/zstd looks quite promising, I'd like to add a 
> compressor in parquet-cpp.





[jira] [Updated] (PARQUET-970) Add ZstdCompressor for parquet-cpp compressor interface

2017-05-02 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE updated PARQUET-970:
---
Description: https://github.com/facebook/zstd looks quite promising, I'd 
like to add a compressor in parquet-cpp.  (was: 
https://github.com/facebook/zstd looks quite promising, I'd like a compressor 
support in parquet-cpp.)

> Add ZstdCompressor for parquet-cpp compressor interface
> ---
>
> Key: PARQUET-970
> URL: https://issues.apache.org/jira/browse/PARQUET-970
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>
> https://github.com/facebook/zstd looks quite promising, I'd like to add a 
> compressor in parquet-cpp.





[jira] [Created] (PARQUET-970) Add ZstdCompressor for parquet-cpp compressor interface

2017-05-02 Thread Xianjin YE (JIRA)
Xianjin YE created PARQUET-970:
--

 Summary: Add ZstdCompressor for parquet-cpp compressor interface
 Key: PARQUET-970
 URL: https://issues.apache.org/jira/browse/PARQUET-970
 Project: Parquet
  Issue Type: New Feature
Reporter: Xianjin YE
Assignee: Xianjin YE


https://github.com/facebook/zstd looks quite promising, I'd like a compressor 
support in parquet-cpp.





[jira] [Updated] (PARQUET-970) Add ZstdCompressor for parquet-cpp compressor interface

2017-05-02 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE updated PARQUET-970:
---
Component/s: parquet-cpp

> Add ZstdCompressor for parquet-cpp compressor interface
> ---
>
> Key: PARQUET-970
> URL: https://issues.apache.org/jira/browse/PARQUET-970
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>
> https://github.com/facebook/zstd looks quite promising, I'd like a compressor 
> support in parquet-cpp.





[jira] [Commented] (PARQUET-924) [C++] Persist original type metadata from Arrow schemas

2017-05-02 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993129#comment-15993129
 ] 

Xianjin YE commented on PARQUET-924:


Sounds reasonable. [~wesmckinn], you can assign this to me if no one is working 
on it.

I will work on this tomorrow if I have some spare time.

> [C++] Persist original type metadata from Arrow schemas
> ---
>
> Key: PARQUET-924
> URL: https://issues.apache.org/jira/browse/PARQUET-924
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> This will enable us to convert back to the original type in some cases 
> (DictionaryArray, Time with seconds)





[jira] [Commented] (PARQUET-924) [C++] Persist original type metadata from Arrow schemas

2017-05-02 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993007#comment-15993007
 ] 

Xianjin YE commented on PARQUET-924:


How should we persist the original Arrow schema? As a FlatBuffers message in 
the key-value metadata?

> [C++] Persist original type metadata from Arrow schemas
> ---
>
> Key: PARQUET-924
> URL: https://issues.apache.org/jira/browse/PARQUET-924
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>
> This will enable us to convert back to the original type in some cases 
> (DictionaryArray, Time with seconds)





[jira] [Commented] (PARQUET-936) [C++] parquet::arrow::WriteTable can enter infinite loop if chunk_size is 0

2017-05-02 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992993#comment-15992993
 ] 

Xianjin YE commented on PARQUET-936:


Is anyone working on this? If not, [~wesmckinn], you can assign it to me to 
avoid duplicate work.

> [C++] parquet::arrow::WriteTable can enter infinite loop if chunk_size is 0
> ---
>
> Key: PARQUET-936
> URL: https://issues.apache.org/jira/browse/PARQUET-936
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-1.1.0
>
>
> See also ARROW-723


