[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Jonathan Underwood (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585292#comment-16585292
 ] 

Jonathan Underwood commented on PARQUET-1241:
----------------------------------------------

Here are some questions that we should probably answer.

Referring to my list of 3 implementations above:
 * Should Arrow have implementations (1) and (2), or should (1) be replaced by 
(2)? Replacing (1) with (2) would break backwards compatibility and wouldn't be 
very customer friendly.
 * Referring to the frame implementation (3), which of the following do we wish 
to store in the compressed frame: (a) the uncompressed column data size; (b) an 
uncompressed data checksum; (c) per-block uncompressed data checksums? (See the 
sketch after this list.) In Parquet, items (a) and (b) are usually stored 
externally to the column data as column metadata, I believe, so there would be 
redundancy. On the other hand, storing (a) and (b) in the column data directly 
simplifies the use of the LZ4 library.
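
For concreteness, here is a minimal sketch, assuming liblz4 >= 1.8 and its 
lz4frame.h API (nothing Parquet-specific), of how the frame format lets a 
writer opt in or out of items (a)-(c):

#include <lz4frame.h>

#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<uint8_t> CompressColumnChunk(const uint8_t* data, size_t size) {
  LZ4F_preferences_t prefs = {};       // zeroed defaults: no optional fields
  prefs.frameInfo.contentSize = size;  // (a) store the uncompressed size
  prefs.frameInfo.contentChecksumFlag =
      LZ4F_contentChecksumEnabled;     // (b) whole-content checksum
  prefs.frameInfo.blockChecksumFlag =
      LZ4F_blockChecksumEnabled;       // (c) per-block checksums

  std::vector<uint8_t> out(LZ4F_compressFrameBound(size, &prefs));
  size_t n = LZ4F_compressFrame(out.data(), out.size(), data, size, &prefs);
  if (LZ4F_isError(n)) throw std::runtime_error(LZ4F_getErrorName(n));
  out.resize(n);
  return out;
}

Leaving any of the three flags at its zeroed default simply omits that field 
from the frame, which is exactly the decision point raised above.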

> Use LZ4 frame format
> --------------------
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp, parquet-format
>Reporter: Lawrence Chan
>Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not interoperable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314





[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Jonathan Underwood (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585291#comment-16585291
 ] 

Jonathan Underwood commented on PARQUET-1241:
----------------------------------------------

[~ee07b291] - not sure there's any need for separate tickets (I have no strong 
feeling either way; I'm not a contributor!), I just wanted to make sure everyone 
was on the same page.



[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585290#comment-16585290
 ] 

Alex Wang commented on PARQUET-1241:


Thanks a lot [~jonathan.underw...@gmail.com] for the clarification. I did not 
mean to cross-post; I saw there was a discussion about how the Hadoop codec 
works.

 

If needed, I could create another ticket for "parquet-mr using Hadoop codec not 
compatible with arrow/cpp codec".



[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585285#comment-16585285
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/19/18 11:26 PM:
--------------------------------------------------------------

[~wesmckinn] sorry for the delayed reply,

 

-I'd like to add a lz4-hadoop (framed) format to arrow, which aligns with my 
work interest.- For the official LZ4-framed format, I'd like to help with that 
as well, but it depends on my work schedule.

 

Sorry, on second thought I meant to add an LZ4 compressor (which uses the open 
source github/lz4-java) to parquet-mr, like SnappyCompressor.java: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyCompressor.java]

 

The reason being that even if I added a new LZ4 Hadoop codec to arrow/cpp, 
parquet-mr still writes the Hadoop LZ4 format and sets the compression type to 
LZ4 in the file's metadata.

 


was (Author: ee07b291):
[~wesmckinn] sorry for the delayed reply,

 

I'd like to add a lz4-hadoop (framed) format to arrow, which aligns with my 
work interest. For the official LZ4-framed format, I'd like to help with that 
as well, but it depends on my work schedule.



[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Jonathan Underwood (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585288#comment-16585288
 ] 

Jonathan Underwood edited comment on PARQUET-1241 at 8/19/18 11:10 PM:
-----------------------------------------------------------------------

I think there's a danger here of misunderstanding, as a few different things 
are being discussed. There are 3 possible implementations:
 # The current arrow implementation, which uses the LZ4 +block+ storage format 
for column data, with the size of the +_uncompressed_+ data stored as external 
column metadata.
 # The current Hadoop LZ4 implementation, which uses the LZ4 +block+ storage 
format with the sizes of the +_compressed_+ and +_uncompressed_+ data 
prepended to the column data as 8 bytes of extra data.
 # A proposed new storage format, which would use the LZ4 +frame+ format to 
store the compressed column data. The frame format allows the size of the 
uncompressed data to be stored inside the compressed frame as frame metadata, 
but does not require it; a decision would be needed as to whether the LZ4 
frame-compressed column data should include the size of the uncompressed data, 
a checksum of the data, etc. Those extra pieces of data are already stored as 
column metadata, so this would duplicate information, but harmlessly.

I thought it might be helpful to add this clarification so that wires don't 
become crossed.


was (Author: jonathan.underw...@gmail.com):
I think there's a danger here of misunderstanding, as a few different things 
are being discussed. There are 3 possible implementations:
 # The current arrow implementation, which uses the LZ4 +block+ storage format 
for column data, with the size of the +_uncompressed_+ data stored as external 
column metadata.
 # The current Hadoop LZ4 implementation, which uses the LZ4 +block+ storage 
format with the sizes of the +_compressed_+ and +_uncompressed_+ data 
prepended to the column data as 8 bytes of extra data.
 # A proposed new storage format, which would use the LZ4 +frame+ format to 
store the compressed column data. The frame format allows the size of the 
uncompressed data to be stored inside the compressed frame as frame metadata, 
but does not require it; a decision would be needed as to whether the LZ4 
frame-compressed column data should include the size of the uncompressed data, 
a checksum of the data, etc. Those extra pieces of data are already stored as 
column metadata, so this would duplicate information, but harmlessly.

I thought it might be helpful to add this clarification so that wires don't 
become crossed.



[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Jonathan Underwood (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585288#comment-16585288
 ] 

Jonathan Underwood edited comment on PARQUET-1241 at 8/19/18 11:10 PM:
-----------------------------------------------------------------------

I think there's a danger here of misunderstanding, as a few different things 
are being discussed. There are 3 possible implementations:
 # The current arrow implementation, which uses the LZ4 +block+ storage format 
for column data, with the size of the +_uncompressed_+ data stored as external 
column metadata.
 # The current Hadoop LZ4 implementation, which uses the LZ4 +block+ storage 
format with the sizes of the +_compressed_+ and +_uncompressed_+ data 
prepended to the column data as 8 bytes of extra data.
 # A proposed new storage format, which would use the LZ4 +frame+ format to 
store the compressed column data. The frame format allows the size of the 
uncompressed data to be stored inside the compressed frame as frame metadata, 
but does not require it; a decision would be needed as to whether the LZ4 
frame-compressed column data should include the size of the uncompressed data, 
a checksum of the data, etc. Those extra pieces of data are already stored as 
column metadata, so this would duplicate information, but harmlessly.

I thought it might be helpful to add this clarification so that wires don't 
become crossed.


was (Author: jonathan.underw...@gmail.com):
I think there's a danger here of misunderstanding, as a few different things 
are being discussed. There are 3 possible implementations:
 # The current arrow implementation, which uses the LZ4 +block+ storage format 
for column data, with the size of the +_uncompressed_+ data stored as external 
column metadata.
 # The current Hadoop LZ4 implementation, which uses the LZ4 +block+ storage 
format with the sizes of the +_compressed_+ and +_uncompressed_+ data 
prepended to the column data as 8 bytes of extra data.
 # A proposed new storage format, which would use the LZ4 +frame+ format to 
store the compressed column data. The frame format allows the size of the 
uncompressed data to be stored inside the compressed frame as frame metadata, 
but does not require it; a decision would be needed as to whether the LZ4 
frame-compressed column data should include the size of the uncompressed data, 
a checksum of the data, etc. Those extra pieces of data are already stored as 
column metadata, so this would duplicate information, but harmlessly.



[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Jonathan Underwood (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585288#comment-16585288
 ] 

Jonathan Underwood commented on PARQUET-1241:
----------------------------------------------

I think there's a danger here of misunderstanding, as a few different things 
are being discussed. There are 3 possible implementations:
 # The current arrow implementation, which uses the LZ4 +block+ storage format 
for column data, with the size of the +_uncompressed_+ data stored as external 
column metadata.
 # The current Hadoop LZ4 implementation, which uses the LZ4 +block+ storage 
format with the sizes of the +_compressed_+ and +_uncompressed_+ data 
prepended to the column data as 8 bytes of extra data (sketched below).
 # A proposed new storage format, which would use the LZ4 +frame+ format to 
store the compressed column data. The frame format allows the size of the 
uncompressed data to be stored inside the compressed frame as frame metadata, 
but does not require it; a decision would be needed as to whether the LZ4 
frame-compressed column data should include the size of the uncompressed data, 
a checksum of the data, etc. Those extra pieces of data are already stored as 
column metadata, so this would duplicate information, but harmlessly.
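
A hedged sketch of the 8-byte Hadoop framing described in implementation (2) 
above; the helper names here are illustrative, not taken from any existing 
codec class:

#include <lz4.h>

#include <cstdint>
#include <stdexcept>
#include <vector>

// Hadoop writes the two length prefixes as big-endian 32-bit integers.
static void PutBE32(std::vector<uint8_t>* out, uint32_t v) {
  out->push_back(static_cast<uint8_t>(v >> 24));
  out->push_back(static_cast<uint8_t>(v >> 16));
  out->push_back(static_cast<uint8_t>(v >> 8));
  out->push_back(static_cast<uint8_t>(v));
}

std::vector<uint8_t> HadoopLz4Compress(const char* src, int size) {
  std::vector<char> block(LZ4_compressBound(size));
  int n = LZ4_compress_default(src, block.data(), size,
                               static_cast<int>(block.size()));
  if (n <= 0) throw std::runtime_error("LZ4 compression failed");

  std::vector<uint8_t> out;
  PutBE32(&out, static_cast<uint32_t>(size));  // uncompressed size
  PutBE32(&out, static_cast<uint32_t>(n));     // compressed size
  out.insert(out.end(), block.begin(), block.begin() + n);  // raw LZ4 block
  return out;
}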



[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585285#comment-16585285
 ] 

Alex Wang commented on PARQUET-1241:


[~wesmckinn] sorry for the delayed reply,

 

I'd like to add a lz4-hadoop (framed) format to arrow, which aligns with my 
work interest. For the official LZ4-framed format, I'd like to help with that 
as well, but it depends on my work schedule.



Re: Doing a 1.5.0 C++ release

2018-08-19 Thread Deepak Majeti
Uwe,

I would like to get https://issues.apache.org/jira/browse/PARQUET-1372 into
this release as well. There is a PR already open for this JIRA and I got
some feedback. I will address the feedback in the next couple of days.

On Sun, Aug 19, 2018 at 8:48 AM Uwe L. Korn  wrote:

> Hello,
>
> as we are in the process of doing/voting on a repo merge with the Arrow
> project and also because some time has passed since the last release, I
> would like to proceed with a 1.5.0 release soon. Please have a look over
> the issues at
> https://issues.apache.org/jira/projects/PARQUET/versions/12342373 and
> move the non-critical ones to 1.6.0 or help in fixing those that should go
> into 1.5.0. Is there anything else currently in progress that should be
> merged before we release?
>
> Uwe
>


-- 
regards,
Deepak Majeti


[jira] [Updated] (PARQUET-1372) [C++] Add an API to allow writing RowGroups based on their size rather than num_rows

2018-08-19 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti updated PARQUET-1372:
-----------------------------------
Fix Version/s: (was: 1.5.0)
   cpp-1.5.0

> [C++] Add an API to allow writing RowGroups based on their size rather than 
> num_rows
> --------------------
>
> Key: PARQUET-1372
> URL: https://issues.apache.org/jira/browse/PARQUET-1372
> Project: Parquet
>  Issue Type: Task
>Reporter: Anatoli Shein
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> The current API allows writing RowGroups with a specified number of rows; 
> however, it does not allow writing RowGroups of a specified size. In order to 
> write RowGroups of a specified size, we need to write rows in chunks while 
> checking the total_bytes_written after each chunk is written. This is 
> currently impossible because the call to NextColumn() closes the current 
> column writer.
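
A minimal sketch of the write loop this issue wants to enable. The 
RowGroupWriterLike/FileWriterLike interfaces below are hypothetical stand-ins, 
not the current parquet-cpp API (which, as noted, closes a column writer on 
NextColumn()):

#include <algorithm>
#include <cstdint>

struct RowGroupWriterLike {
  // Hypothetical: write num_rows rows across all columns of this row group.
  virtual void WriteRowChunk(int64_t num_rows) = 0;
  // Hypothetical: bytes written to this row group so far.
  virtual int64_t total_bytes_written() const = 0;
  virtual ~RowGroupWriterLike() = default;
};

struct FileWriterLike {
  virtual RowGroupWriterLike* AppendRowGroup() = 0;  // hypothetical
  virtual ~FileWriterLike() = default;
};

void WriteBySize(FileWriterLike* file, int64_t total_rows,
                 int64_t chunk_rows, int64_t target_bytes) {
  RowGroupWriterLike* rg = file->AppendRowGroup();
  for (int64_t done = 0; done < total_rows; done += chunk_rows) {
    rg->WriteRowChunk(std::min(chunk_rows, total_rows - done));
    // Roll over to a new row group once the byte target is crossed.
    if (rg->total_bytes_written() >= target_bytes) {
      rg = file->AppendRowGroup();
    }
  }
}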





Re: [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-08-19 Thread Wes McKinney
OK. I'm a bit -0 on doing anything that results in Arrow having a
nonlinear git history (and rebasing is not really an option) but we
can discuss that more later

On Sun, Aug 19, 2018 at 8:50 AM, Uwe L. Korn  wrote:
> +1 on this but also see my comments in the mail on the discussions.
>
> We should also keep the git history of parquet-cpp; that should not be hard 
> with git and there is probably a StackOverflow answer out there that gives 
> you the commands to do the merge.
>
> Uwe
>
> On Fri, Aug 17, 2018, at 12:57 AM, Wes McKinney wrote:
>> In case any are interested: my estimate of the work involved in the
>> migration to be about a full day of total work, possibly less. As soon
>> as the migration plan is decided upon I intend to execute ASAP so that
>> ongoing development efforts are not disrupted.
>>
>> Additionally, in flight patches do not all need to be merged. Patches
>> can be easily edited to apply against the modified repository
>> structure
>>
>> On Wed, Aug 15, 2018 at 6:04 PM, Wes McKinney  wrote:
>> > hi all,
>> >
>> > As discussed on the mailing list [1] I am proposing to undertake a
>> > restructuring of the development process for parquet-cpp and its
>> > consumption in the Arrow ecosystem to benefit the developers and users
>> > of both communities.
>> >
>> > The specific actions we would take would be:
>> >
>> > 1) Move the source code currently located at src/ in the
>> > apache/parquet-cpp repository [2] to the cpp/src/ directory located in
>> > apache/arrow [3]
>> >
>> > 2) The parquet code tree would remain separate from the Arrow code
>> > tree, though the two projects will continue to share code as they do
>> > now
>> >
>> > 3) The build system in apache/parquet-cpp would be effectively
>> > deprecated and can be mostly discarded, as it is largely redundant and
>> > duplicated from the build system in apache/arrow
>> >
>> > 4) The Parquet and Arrow C++ communities will collaborate to provide
>> > development workflows to enable contributors working exclusively on
>> > the Parquet core functionality to be able to work unencumbered with
>> > unnecessary build or test dependencies from the rest of the Arrow
>> > codebase. Note that parquet-cpp already builds a significant portion
>> > of Apache Arrow en route to creating its libraries
>> >
>> > 5) The Parquet community can create scripts to "cut" Parquet C++
>> > releases by packaging up the appropriate components and ensuring that
>> > they can be built and installed independently as now
>> >
>> > 6) The CI processes would be merged -- since we already build the
>> > Parquet libraries in Arrow's CI workflow, this would amount to
>> > building the Parquet unit tests and running them.
>> >
>> > 7) Patches contributed that do not involve Arrow-related functionality
>> > could use the PARQUET-XXX marking, though some ARROW-XXX patches may
>> > span both codebases
>> >
>> > 8) Parquet C++ committers can be given push rights on apache/arrow
>> > subject to ongoing good citizenry (e.g. not merging patches that break
>> > builds). The Arrow PMC may need to vote on the procedure for offering
>> > pass-through commit rights to anyone who has been invited to be a
>> > committer for Apache Parquet
>> >
>> > 9) The contributors who work on both Arrow and Parquet will work in
>> > good faith to ensure that the needs of Parquet-only developers (i.e.
>> > who consume Parquet files in some way unrelated to the Arrow columnar
>> > standard) are accommodated
>> >
>> > There are a number of particular details we will need to discuss
>> > further (such as the specific logistics of the codebase surgery; e.g.
>> > how to manage the commit history in apache/parquet-cpp -- do we care
>> > about git blame?)
>> >
>> > This vote is to determine if the Parquet PMC is in favor of working in
>> > good faith to execute on the above plan. I will inquire with the Arrow
>> > PMC to see if we need to have a corresponding vote there, and also how
>> > to handle the management of commit rights.
>> >
>> > [ ] +1: In favor of implementing the proposed monorepo plan
>> > [ ] +0: . . .
>> > [ ] -1: Not in favor because . . .
>> >
>> > Here is my vote: +1.
>> >
>> > Thank you,
>> > Wes
>> >
>> > [1]: 
>> > https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
>> > [2]: https://github.com/apache/parquet-cpp/tree/master/src/parquet
>> > [3]: https://github.com/apache/arrow/tree/master/cpp/src


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-19 Thread Wes McKinney
hi Uwe,

I agree with your points. Currently we have 3 software artifacts:

1. Arrow C++ libraries
2. Parquet C++ libraries with Arrow columnar integration
3. C++ interop layer for Python + Cython bindings

Changes in #1 prompt an awkward workflow involving multiple PRs; as a
result of this, we just recently jumped the pinned version of Arrow in
parquet-cpp forward by 8 months. This obviously is an antipattern. If
we had a much larger group of core developers, this might be more
maintainable.

Of course changes in #2 also impact #3; a lot of our bug reports and
feature requests are coming inbound because of #3, and we have
struggled to be able to respond to the needs of users (and other
developers like Robert Gruener who are trying to use this software in
a large data warehouse)

There is also the release coordination issue where having users
simultaneously using a released version of both projects hasn't really
happened, so we effectively already have been treating Parquet like a
vendored component in our packaging process.

Realistically, I think once #2 has become more functionally complete
and, as a result, a more slowly moving piece of software, we can
contemplate splitting out all or parts of its development process back
into another repository. I think we have a ton of work to do yet on
Parquet core, particularly optimizing for high latency storage (HDFS,
S3, GCP, etc.), and it wouldn't really make sense to do such platform
level work anywhere but #1

- Wes

On Sun, Aug 19, 2018 at 8:37 AM, Uwe L. Korn  wrote:
> Back from vacation, I also want to finally raise my voice.
>
> With the current state of the Parquet<->Arrow development, I see a benefit in 
> merging the code base for now, but not necessarily forever.
>
> Parquet C++ is the main code base of an artefact for which an Arrow C++ 
> adapter is built and that uses some of the more standard-library features of 
> Arrow. It is the go-to place where also the same toolchain and CI setup is 
> used. Here we also directly apply all improvements that we make in Arrow 
> itself. These are the points that make it special in comparison to other 
> tools providing Arrow adapters like Turbodbc.
>
> Thus, I think that the current move to merge the code bases is ok for me. I 
> must say that I'm not 100% certain that this is the best move but currently I 
> lack better alternatives. As previously mentioned, we should take extra care 
> that we can still do separate releases and also provide a path for a future 
> where we split parquet-cpp into its own project/repository again.
>
> An important point that we should keep in mind (and why I was a bit concerned 
> in the previous times this discussion was raised) is that we have to be 
> careful to not pull everything that touches Arrow into the Arrow repository. 
> Having separate repositories for projects, each with its own release cycle, 
> is for me still the aim for the long term. I expect that there will be many 
> more projects that will use Arrow's I/O libraries as well as emit Arrow 
> structures. These libraries should also be usable in Python/C++/Ruby/R/… 
> These libraries are then hopefully not all developed by the same core group 
> of Arrow/Parquet developers we have currently. For this to function really 
> well, we will need a more stable API in Arrow as well as a good set of build 
> tooling that other libraries can build upon when using Arrow functionality. 
> In addition to being stable, the API must also provide a good UX in the 
> abstraction layers in which the Arrow functions are provided, so that 
> high-performance applications are not high-maintenance due to frequent API 
> changes in Arrow. That said, this is currently a wish for the future. We are 
> currently building and iterating heavily on these APIs to form a good basis 
> for future developments. Thus the repo merge will hopefully improve the 
> development speed so that we have to spend less time on toolchain 
> maintenance and can focus on the user-facing APIs.
>
> Uwe
>
> On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
>> Thanks Ryan, will do. The people I'd still like to hear from are:
>>
>> * Phillip Cloud
>> * Uwe Korn
>>
>> As ASF contributors we are responsible to both be pragmatic as well as
>> act in the best interests of the community's health and productivity.
>>
>>
>>
>> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue  wrote:
>> > I don't have an opinion here, but could someone send a summary of what is
>> > decided to the dev list once there is consensus? This is a long thread for
>> > parts of the project I don't work on, so I haven't followed it very 
>> > closely.
>> >
>> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney  wrote:
>> >
>> >> > It will be difficult to track parquet-cpp changes if they get mixed with
>> >> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
>> >> Can we enforce that parquet-cpp changes will not be committed without a
>> >> corresponding Parquet JIRA?
>> >>
>> >> I think we would 

[jira] [Created] (PARQUET-1393) [C++] Change parquet::arrow::FileReader::ReadRowGroups to read into continuous arrays

2018-08-19 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1393:


 Summary: [C++] Change parquet::arrow::FileReader::ReadRowGroups to 
read into continuous arrays
 Key: PARQUET-1393
 URL: https://issues.apache.org/jira/browse/PARQUET-1393
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cpp
Reporter: Uwe L. Korn
 Fix For: cpp-1.6.0


Instead of creating a chunk per RowGroup, we should read, at least for 
primitive types, into a single, pre-allocated Array. This needs some new 
functionality in the RecordReader classes and thus should be done after 
https://github.com/apache/parquet-cpp/pull/462 is merged.





[jira] [Created] (PARQUET-1392) [C++] Supply row group indices to parquet::arrow::FileReader::ReadTable

2018-08-19 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1392:


 Summary: [C++] Supply row group indices to 
parquet::arrow::FileReader::ReadTable
 Key: PARQUET-1392
 URL: https://issues.apache.org/jira/browse/PARQUET-1392
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-cpp
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: cpp-1.5.0


By looking at the Parquet statistics, a user can already determine with their 
own logic which RowGroups are interesting to them. Currently we only provide 
functions to read the whole file or individual RowGroups. By supplying 
{{parquet::arrow}} with all the RowGroups at once, it can better optimize its 
memory allocations as well as make better use of the underlying thread pool.
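
A usage sketch of the proposed call; the ReadRowGroups signature below is this 
ticket's proposal, not a guaranteed existing API:

#include <memory>
#include <vector>

#include <arrow/table.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

// Read only the row groups selected (e.g. via statistics) in a single call,
// so the reader can size its allocations and schedule its thread pool for the
// whole selection at once.
std::shared_ptr<arrow::Table> ReadSelected(parquet::arrow::FileReader* reader,
                                           const std::vector<int>& row_groups) {
  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadRowGroups(row_groups, &table));  // proposed
  return table;
}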





[jira] [Assigned] (PARQUET-1158) [C++] Basic RowGroup filtering

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1158:


Assignee: Uwe L. Korn

> [C++] Basic RowGroup filtering
> ------------------------------
>
> Key: PARQUET-1158
> URL: https://issues.apache.org/jira/browse/PARQUET-1158
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> See 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300
> We should be able to translate this into C++ enums and apply them in the 
> Arrow read methods.





[jira] [Updated] (PARQUET-1158) [C++] Basic RowGroup filtering

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1158:
---------------------------------
Summary: [C++] Basic RowGroup filtering  (was: C++: Basic RowGroup 
filtering)



[jira] [Updated] (PARQUET-1158) [C++] Basic RowGroup filtering

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1158:
---------------------------------
Fix Version/s: (was: cpp-1.5.0)
   cpp-1.6.0



Re: [VOTE] Moving Apache Parquet C++ development process to a monorepo structure with Apache Arrow C++

2018-08-19 Thread Uwe L. Korn
+1 on this but also see my comments in the mail on the discussions.

We should also keep the git history of parquet-cpp; that should not be hard 
with git and there is probably a StackOverflow answer out there that gives you 
the commands to do the merge.

Uwe

On Fri, Aug 17, 2018, at 12:57 AM, Wes McKinney wrote:
> In case any are interested: my estimate of the work involved in the
> migration to be about a full day of total work, possibly less. As soon
> as the migration plan is decided upon I intend to execute ASAP so that
> ongoing development efforts are not disrupted.
> 
> Additionally, in flight patches do not all need to be merged. Patches
> can be easily edited to apply against the modified repository
> structure
> 
> On Wed, Aug 15, 2018 at 6:04 PM, Wes McKinney  wrote:
> > hi all,
> >
> > As discussed on the mailing list [1] I am proposing to undertake a
> > restructuring of the development process for parquet-cpp and its
> > consumption in the Arrow ecosystem to benefit the developers and users
> > of both communities.
> >
> > The specific actions we would take would be:
> >
> > 1) Move the source code currently located at src/ in the
> > apache/parquet-cpp repository [2] to the cpp/src/ directory located in
> > apache/arrow [3]
> >
> > 2) The parquet code tree would remain separate from the Arrow code
> > tree, though the two projects will continue to share code as they do
> > now
> >
> > 3) The build system in apache/parquet-cpp would be effectively
> > deprecated and can be mostly discarded, as it is largely redundant and
> > duplicated from the build system in apache/arrow
> >
> > 4) The Parquet and Arrow C++ communities will collaborate to provide
> > development workflows to enable contributors working exclusively on
> > the Parquet core functionality to be able to work unencumbered with
> > unnecessary build or test dependencies from the rest of the Arrow
> > codebase. Note that parquet-cpp already builds a significant portion
> > of Apache Arrow en route to creating its libraries
> >
> > 5) The Parquet community can create scripts to "cut" Parquet C++
> > releases by packaging up the appropriate components and ensuring that
> > they can be built and installed independently as now
> >
> > 6) The CI processes would be merged -- since we already build the
> > Parquet libraries in Arrow's CI workflow, this would amount to
> > building the Parquet unit tests and running them.
> >
> > 7) Patches contributed that do not involve Arrow-related functionality
> > could use the PARQUET-XXX marking, though some ARROW-XXX patches may
> > span both codebases
> >
> > 8) Parquet C++ committers can be given push rights on apache/arrow
> > subject to ongoing good citizenry (e.g. not merging patches that break
> > builds). The Arrow PMC may need to vote on the procedure for offering
> > pass-through commit rights to anyone who has been invited to be a
> > committer for Apache Parquet
> >
> > 9) The contributors who work on both Arrow and Parquet will work in
> > good faith to ensure that the needs of Parquet-only developers (i.e.
> > who consume Parquet files in some way unrelated to the Arrow columnar
> > standard) are accommodated
> >
> > There are a number of particular details we will need to discuss
> > further (such as the specific logistics of the codebase surgery; e.g.
> > how to manage the commit history in apache/parquet-cpp -- do we care
> > about git blame?)
> >
> > This vote is to determine if the Parquet PMC is in favor of working in
> > good faith to execute on the above plan. I will inquire with the Arrow
> > PMC to see if we need to have a corresponding vote there, and also how
> > to handle the management of commit rights.
> >
> > [ ] +1: In favor of implementing the proposed monorepo plan
> > [ ] +0: . . .
> > [ ] -1: Not in favor because . . .
> >
> > Here is my vote: +1.
> >
> > Thank you,
> > Wes
> >
> > [1]: 
> > https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E
> > [2]: https://github.com/apache/parquet-cpp/tree/master/src/parquet
> > [3]: https://github.com/apache/arrow/tree/master/cpp/src


Doing a 1.5.0 C++ release

2018-08-19 Thread Uwe L. Korn
Hello,

as we are in the process of doing/voting on a repo merge with the Arrow project 
and also because some time has passed since the last release, I would like to 
proceed with a 1.5.0 release soon. Please have a look over the issues at 
https://issues.apache.org/jira/projects/PARQUET/versions/12342373 and move the 
non-critical ones to 1.6.0 or help in fixing those that should go into 1.5.0. 
Is there anything else currently in progress that should be merged before we 
release?

Uwe


[jira] [Updated] (PARQUET-1122) [C++] Support 2-level list encoding in Arrow decoding

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1122:
---------------------------------
Fix Version/s: (was: cpp-1.5.0)
   cpp-1.6.0

> [C++] Support 2-level list encoding in Arrow decoding
> -----------------------------------------------------
>
> Key: PARQUET-1122
> URL: https://issues.apache.org/jira/browse/PARQUET-1122
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: centos 7.3, Anaconda 4.4.0 python 3.6.1
>Reporter: Luke Higgins
>Priority: Minor
> Fix For: cpp-1.6.0
>
>
> While trying to read a Parquet file (written by NiFi), I am getting an error.
> code:
> import pyarrow.parquet as pq
> t = pq.read_table('test.parq')
> error:
> Traceback (most recent call last):
>   File "parquet_reader.py", line 2, in 
> t = pq.read_table('test.parq')
>   File "/opt/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/opt/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 119, in read
> nthreads=nthreads)
>   File "pyarrow/_parquet.pyx", line 466, in 
> pyarrow._parquet.ParquetReader.read_all 
> (/arrow/python/build/temp.linux-x86_64-3.6/_parquet.cxx:9181)
>   File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8115)
> pyarrow.lib.ArrowNotImplementedError: No support for reading columns of type 
> list





Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-19 Thread Uwe L. Korn
Back from vacation, I also want to finally raise my voice.

With the current state of the Parquet<->Arrow development, I see a benefit in 
merging the code base for now, but not necessarily forever.

Parquet C++ is the main code base of an artefact for which an Arrow C++ adapter 
is built and that uses some of the more standard-library features of Arrow. It 
is the go-to place where also the same toolchain and CI setup is used. Here we 
also directly apply all improvements that we make in Arrow itself. These are 
the points that make it special in comparison to other tools providing Arrow 
adapters like Turbodbc.

Thus, I think that the current move to merge the code bases is ok for me. I 
must say that I'm not 100% certain that this is the best move but currently I 
lack better alternatives. As previously mentioned, we should take extra care 
that we can still do separate releases and also provide a path for a future 
where we split parquet-cpp into its own project/repository again.

An important point that we should keep in mind (and why I was a bit concerned in 
the previous times this discussion was raised) is that we have to be careful to 
not pull everything that touches Arrow into the Arrow repository. Having separate 
repositories for projects, each with its own release cycle, is for me still the 
aim for the long term. I expect that there will be many more projects that will 
use Arrow's I/O libraries as well as emit Arrow structures. These libraries 
should also be usable in Python/C++/Ruby/R/… These libraries are then hopefully 
not all developed by the same core group of Arrow/Parquet developers we have 
currently. For this to function really well, we will need a more stable API in 
Arrow as well as a good set of build tooling that other libraries can build upon 
when using Arrow functionality. In addition to being stable, the API must also 
provide a good UX in the abstraction layers in which the Arrow functions are 
provided, so that high-performance applications are not high-maintenance due to 
frequent API changes in Arrow. That said, this is currently a wish for the 
future. We are currently building and iterating heavily on these APIs to form a 
good basis for future developments. Thus the repo merge will hopefully improve 
the development speed so that we have to spend less time on toolchain 
maintenance and can focus on the user-facing APIs.

Uwe

On Tue, Aug 7, 2018, at 10:45 PM, Wes McKinney wrote:
> Thanks Ryan, will do. The people I'd still like to hear from are:
> 
> * Phillip Cloud
> * Uwe Korn
> 
> As ASF contributors we are responsible to both be pragmatic as well as
> act in the best interests of the community's health and productivity.
> 
> 
> 
> On Tue, Aug 7, 2018 at 12:12 PM, Ryan Blue  wrote:
> > I don't have an opinion here, but could someone send a summary of what is
> > decided to the dev list once there is consensus? This is a long thread for
> > parts of the project I don't work on, so I haven't followed it very closely.
> >
> > On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney  wrote:
> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> >> Can we enforce that parquet-cpp changes will not be committed without a
> >> corresponding Parquet JIRA?
> >>
> >> I think we would use the following policy:
> >>
> >> * use PARQUET-XXX for issues relating to Parquet core
> >> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
> >> core (e.g. changes that are in parquet/arrow right now)
> >>
> >> We've already been dealing with annoyances relating to issues
> >> straddling the two projects (debugging an issue on Arrow side to find
> >> that it has to be fixed on Parquet side); this would make things
> >> simpler for us
> >>
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> simplify forking later (if needed) and be able to maintain the commit
> >> history.  I don't know if its possible to squash parquet-cpp commits and
> >> arrow commits separately before merging.
> >>
> >> This seems rather onerous for both contributors and maintainers and
> >> not in line with the goal of improving productivity. In the event that
> >> we fork I see it as a traumatic event for the community. If it does
> >> happen, then we can write a script (using git filter-branch and other
> >> such tools) to extract commits related to the forked code.
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti 
> >> wrote:
> >> > I have a few more logistical questions to add.
> >> >
> >> > It will be difficult to track parquet-cpp changes if they get mixed with
> >> > Arrow changes. Will we establish some guidelines for filing Parquet
> >> JIRAs?
> >> > Can we enforce that parquet-cpp changes will not be committed without a
> >> > corresponding Parquet JIRA?
> >> >
> >> > I would also like to keep changes to parquet-cpp on a separate commit to
> >> > 

Re: num_level in Parquet Cpp library & how to add a JSON field?

2018-08-19 Thread Uwe L. Korn
Hello Ivy,

> Is there any way to read the data in logical format? Because I want to 
> check if my final output is correct.

I usually use the parquet-cli from the parquet-mr project to check if my file 
is written correctly. This should give you much more informative output.

Simple usage:

git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn -DskipTests=true package
cd parquet-cli
mvn dependency:copy-dependencies
java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main meta <parquet-file>


Note that these commands may not all work out-of-the-box for you. In case 
anything breaks, I can highly recommend reading parquet-mr's READMEs.

Uwe

> 
> Thanks!
> -Ivy
> 
> On 2018/08/03 13:46:15, "Uwe L. Korn"  wrote: 
> > Hello Ivy,
> > 
> > "primitive binary" means `Type::BYTE_ARRAY`, so you're correct. I have not 
> > yet seen anyone use the JSON field with parquet-cpp but the JSON type is 
> > simply a binary string with an annotation so I would expect everything to 
> > just work.
> > 
> > Uwe
> > 
> > On Thu, Aug 2, 2018, at 7:59 PM, ivywu...@gmail.com wrote:
> > > Hi, 
> > > I’m creating a parquet file using the parquet C++ library. I’ve been 
> > > looking for answers online but still can’t figure out the following 
> > > questions.
> > > 
> > > 1. What does num_level mean in the WriteBatch method?
> > >  WriteBatch(int64_t num_levels, const int16_t* def_levels,
> > > const int16_t* rep_levels,
> > > const typename ParquetType::c_type* values)
> > > 
> > > 2. How to create a field for the JSON datatype?  By looking at this link 
> > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, it 
> > > seems JSON is not considered as a nested datatype.  To create a field 
> > > for JSON data, what primitive type should it be? According to the link, 
> > > it says “binary primitive type”,  does it mean "Type::BYTE_ARRAY”?
> > >   PrimitiveNode::Make(“JSON_field", Repetition::REQUIRED, Type:: ?, 
> > > LogicalType::JSON))
> > >   
> > > Any help is appreciated! 
> > > Thanks,
> > > Ivy
> > > 
> > 
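
Putting the answer above into code: a minimal sketch against the parquet-cpp 
1.x-era schema API (the same one the question quotes), declaring a JSON column 
as a BYTE_ARRAY primitive annotated with the JSON logical type:

#include <parquet/schema.h>
#include <parquet/types.h>

using parquet::schema::PrimitiveNode;

// JSON is just an annotation on the binary primitive type.
auto json_field = PrimitiveNode::Make(
    "JSON_field", parquet::Repetition::REQUIRED,
    parquet::Type::BYTE_ARRAY, parquet::LogicalType::JSON);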


[jira] [Resolved] (PARQUET-1390) [Java] Upgrade to Arrow 0.10.0

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1390.
----------------------------------
Resolution: Fixed

Issue resolved by pull request 516
[https://github.com/apache/parquet-mr/pull/516]

> [Java] Upgrade to Arrow 0.10.0
> ------------------------------
>
> Key: PARQUET-1390
> URL: https://issues.apache.org/jira/browse/PARQUET-1390
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Andy Grove
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Parquet is using Arrow 0.8.0 but version 0.10.0 was recently released. There 
> are numerous bug fixes and improvements, including building with JDK 8.





[jira] [Assigned] (PARQUET-1390) [Java] Upgrade to Arrow 0.10.0

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1390:


Assignee: Andy Grove



[jira] [Updated] (PARQUET-1390) [Java] Upgrade to Arrow 0.10.0

2018-08-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1390:
---------------------------------
Fix Version/s: 1.11.0



[jira] [Commented] (PARQUET-1390) [Java] Upgrade to Arrow 0.10.0

2018-08-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585087#comment-16585087
 ] 

ASF GitHub Bot commented on PARQUET-1390:
-----------------------------------------

xhochy closed pull request #516: PARQUET-1390: Upgrade Arrow to 0.10.0
URL: https://github.com/apache/parquet-mr/pull/516
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/parquet-arrow/pom.xml b/parquet-arrow/pom.xml
index 232167ecb..e0f305acb 100644
--- a/parquet-arrow/pom.xml
+++ b/parquet-arrow/pom.xml
@@ -33,7 +33,7 @@
   https://parquet.apache.org
 
   
-0.8.0
+0.10.0
   
 
   
diff --git 
a/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java
 
b/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java
index a7df48cee..b0f122ce0 100644
--- 
a/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java
+++ 
b/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java
@@ -278,6 +278,11 @@ public TypeMapping visit(Interval type) {
 return primitiveFLBA(12, INTERVAL);
   }
 
+  @Override
+  public TypeMapping visit(ArrowType.FixedSizeBinary fixedSizeBinary) {
+return primitive(BINARY);
+  }
+
   private TypeMapping mapping(PrimitiveType parquetType) {
 return new PrimitiveTypeMapping(field, parquetType);
   }
@@ -663,6 +668,11 @@ public TypeMapping visit(Interval type) {
 return primitive();
   }
 
+  @Override
+  public TypeMapping visit(ArrowType.FixedSizeBinary fixedSizeBinary) {
+return primitive();
+  }
+
   private TypeMapping primitive() {
 if (!parquetField.isPrimitive()) {
   throw new IllegalArgumentException("Can not map schemas as one is 
primitive and the other is not: " + arrowField + " != " + parquetField);


 






Re: Status of column index in parquet-mr

2018-08-19 Thread Uwe L. Korn
Hello Gabor,

comment in-line

> The implementation was done based on the original design of column indexes,
> meaning that no row alignment is required between the pages (the only
> requirement is for the pages to respect row boundaries).
> As we described in the previous Parquet sync, the design/implementation would
> be much clearer (and might perform a bit better) if row alignment were also
> required. I would be happy to modify the implementation if we decide to align
> pages on rows. *I would like to have a final decision on this topic before
> merging this feature.*

I'm not 100% certain what "row alignment" could mean; I can think of two very 
different things.

1. It would mean that all columns in a RowGroup have the same number of pages, 
all aligned on the same set of rows.
2. It would mean that pages are only split on the highest nesting level, i.e. 
only split on what would be the horizontal boundaries of a 2D table, without 
splitting any cells of this table structure.

If the interpretation is 1, then I think this generates far too many pages 
for very sparse columns. But I'm guessing that the interpretation is rather 2, 
and there I would be more interested in the concerns that were raised in the 
sync. This type of alignment is also something that gave me some headaches 
when implementing things in parquet-cpp. From a Parquet developer's 
perspective, this would really ease the implementation, but I'm wondering if 
there are use-cases where a single cell of a table becomes larger than what we 
would normally put into a page.

Uwe


Status of Bloom filter

2018-08-19 Thread 俊杰陈
Hi,

Status as of the June sync-up:

The Bloom filter benchmark was uploaded to the PARQUET-41 JIRA.

PARQUET-41 was broken into several sub-tasks as follows:

   - parquet-format: PARQUET-319.
   - Add Bloom filter utility class: PARQUET-1342 for Java, PARQUET-1332
   for C++.
   - Read/write side implementation: PARQUET-1328 for Java, PARQUET-1327
   for C++.
   - Integrate Bloom filter logic: PARQUET-1391 for Java, PARQUET-1329 for
   C++.

We have patches available for PARQUET-319, PARQUET-1342 and PARQUET-1332.
PARQUET-1332 was committed, with some follow-up optimization JIRAs that
include moving the cross-compatibility test binary to the parquet-testing
repo and two C++ optimizations.

Now I'm pending on the PARQUET-1342 PR. It has been a very long time since
the last review; please provide your review comments :)

Next step: I have implemented the Java-side read/write logic locally for the
previous benchmark, and will submit a PR once PARQUET-1342 is resolved.


Thanks a lot.


[jira] [Created] (PARQUET-1391) [java] Integrate Bloom filter logic

2018-08-19 Thread Junjie Chen (JIRA)
Junjie Chen created PARQUET-1391:


 Summary: [java] Integrate Bloom filter logic
 Key: PARQUET-1391
 URL: https://issues.apache.org/jira/browse/PARQUET-1391
 Project: Parquet
  Issue Type: Sub-task
  Components: parquet-mr
Reporter: Junjie Chen








[jira] [Updated] (PARQUET-1329) [C++] Integrate Bloom filter into row group filter logic

2018-08-19 Thread Junjie Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junjie Chen updated PARQUET-1329:
---------------------------------
Summary: [C++] Integrate Bloom filter into row group filter logic  (was: 
integrate parquet bloom filter into row group filter logic)

> [C++] Integrate Bloom filter into row group filter logic
> --------------------------------------------------------
>
> Key: PARQUET-1329
> URL: https://issues.apache.org/jira/browse/PARQUET-1329
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Junjie Chen
>Priority: Major
>






[jira] [Updated] (PARQUET-1328) [java]Bloom filter read/write implementation

2018-08-19 Thread Junjie Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junjie Chen updated PARQUET-1328:
-
Summary: [java]Bloom filter read/write implementation  (was: parquet bloom 
filter writer implementation)

> [java]Bloom filter read/write implementation
> --------------------------------------------
>
> Key: PARQUET-1328
> URL: https://issues.apache.org/jira/browse/PARQUET-1328
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Junjie Chen
>Priority: Minor
>






[jira] [Assigned] (PARQUET-1328) [java]Bloom filter read/write implementation

2018-08-19 Thread Junjie Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junjie Chen reassigned PARQUET-1328:


Assignee: Junjie Chen

> [java]Bloom filter read/write implementation
> --------------------------------------------
>
> Key: PARQUET-1328
> URL: https://issues.apache.org/jira/browse/PARQUET-1328
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
>






[jira] [Updated] (PARQUET-1327) [C++]Bloom filter read/write implementation

2018-08-19 Thread Junjie Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junjie Chen updated PARQUET-1327:
---------------------------------
Summary: [C++]Bloom filter read/write implementation  (was: parquet bloom 
filter reader implementation)

> [C++]Bloom filter read/write implementation
> -------------------------------------------
>
> Key: PARQUET-1327
> URL: https://issues.apache.org/jira/browse/PARQUET-1327
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Junjie Chen
>Priority: Minor
>



