[jira] [Commented] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

2018-08-01 Thread Ken Terada (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566138#comment-16566138
 ] 

Ken Terada commented on PARQUET-1361:
-

Thank you for the responses, gentlemen.

To expand on the problem statement: the issue with allowing creation of a file 
like the one described above is that downstream readers are not consistent in 
how they consume it. For example, [parquet-mr|https://github.com/apache/parquet-mr] 
clearly disallows such a representation for {{col3}} and will throw an 
exception 
[here|https://github.com/apache/parquet-mr/blob/e9e36cdc44a68662885e35773187cca00d20239e/parquet-column/src/main/java/org/apache/parquet/schema/Types.java#L513].
 If the Parquet specification allows this behavior, is the recommendation to 
correct this conflict upstream (writer) or downstream (reader)?
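
For what it's worth, a writer-side workaround is possible today by not relying on type 
inference. The sketch below (assuming a reasonably recent PyArrow, where 
{{Table.from_pandas}} accepts a {{schema}} argument and {{ParquetSchema.to_arrow_schema}} / 
{{pyarrow.types}} are available) declares {{col3}} explicitly as a string column so that an 
all-null column is written with a concrete type instead of null/UNKNOWN:

{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Same input and dtypes as the repro script.
df = pd.read_csv('sample_w_null.csv',
                 dtype={'col1': np.int32, 'col2': np.int32, 'col3': str,
                        'col4': np.int32, 'col5': np.int32})

# Declare every column type explicitly; col3 is forced to string so the
# all-null column is not inferred as the null/UNKNOWN type.
schema = pa.schema([pa.field('col1', pa.int32()),
                    pa.field('col2', pa.int32()),
                    pa.field('col3', pa.string()),
                    pa.field('col4', pa.int32()),
                    pa.field('col5', pa.int32())])

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'int_unknown.parquet')

# Sanity check before handing the file to downstream readers: fail if any
# column still carries the null type.
written = pq.ParquetFile('int_unknown.parquet').schema.to_arrow_schema()
assert not any(pa.types.is_null(field.type) for field in written)
{code}

This only sidesteps the symptom on the writer side, though; it doesn't answer whether the 
spec-level conflict should be resolved in the writer or the reader.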

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT 
> types
> ---
>
> Key: PARQUET-1361
> URL: https://issues.apache.org/jira/browse/PARQUET-1361
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Ken Terada
>Priority: Major
> Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, 
> sample_w_null.csv
>
>
> The parquet-cpp v1.4.1 library allows generation of parquet files with NULL 
> values for INT type columns which causes unexpected parsing errors in 
> downstream systems ingesting those files.
> e.g.,
> {{Error parsing the parquet file: UNKNOWN can not be applied to a primitive 
> type}}
> *+Reproduction Steps+*
> OS: CentOS 7.5.1804
> Python: 3.4.8
> +Prerequisites:+
> * Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, 
> {{PyArrow: 0.9.0}}
> +Step 1+
> Generate the parquet file.
> {{sample_w_null.csv}}
> {code}
> col1,col2,col3,col4,col5
> 1,2,,4,5
> {code}
> {{parquet-1361-repro-1.py}}
> {code}
> #!/usr/bin/python
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> input_file = 'sample_w_null.csv'
> output_file = 'int_unknown.parquet'
> p_schema = {'col1': np.int32,
>             'col2': np.int32,
>             'col3': np.unicode_,
>             'col4': np.int32,
>             'col5': np.int32}
> df = pd.read_csv(input_file, dtype=p_schema)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, output_file)
> {code}
> +Step 2+
> Inspect the metadata of the generated file.
> {{parquet-1361-repro-2.py}}
> {code}
> #!/usr/bin/python
> import pyarrow.parquet as pq
> for filename in ['int_unknown.parquet']:
>     pq_file = pq.ParquetFile(filename)
>     print(pq_file.metadata)
>     print(pq_file.schema)
>     print(pq_file.num_row_groups)
>     print(pq.read_table(filename,
>                         columns=['col1','col2','col3','col4','col5']).to_pandas())
> {code}
> Results
> {code}
> 
>   created_by: parquet-cpp version 1.4.1-SNAPSHOT
>   num_columns: 6
>   num_rows: 1
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 1434
> 
> col1: INT32
> col2: INT32
> col3: INT32 UNKNOWN
> col4: INT32
> col5: INT32
> __index_level_0__: INT64
> 1
>    col1  col2  col3  col4  col5
> 0     1     2  None     4     5
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-01 Thread Wes McKinney
Thanks Tim.

Indeed, it's not very simple. Just today Antoine cleaned up some
platform code intending to improve the performance of bit-packing in
Parquet writes, and we ended up with 2 interdependent PRs:

* https://github.com/apache/parquet-cpp/pull/483
* https://github.com/apache/arrow/pull/2355

Changes that impact the Python interface to Parquet are even more complex.

Adding options to Arrow's CMake build system to build only the
Parquet-related code and its dependencies (in a monorepo framework) would
not be difficult; it would essentially amount to providing a "make parquet" target.

See e.g. https://stackoverflow.com/a/17201375. The desired commands to
build and install the Parquet core libraries and their dependencies
would be:

ninja parquet && ninja install

- Wes

On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
 wrote:
> I don't have a direct stake in this beyond wanting to see Parquet be
> successful, but I thought I'd give my two cents.
>
> For me, the thing that makes the biggest difference in contributing to a
> new codebase is the number of steps in the workflow for writing, testing,
> posting and iterating on a commit and also the number of opportunities for
> missteps. The size of the repo and build/test times matter but are
> secondary so long as the workflow is simple and reliable.
>
> I don't really know what the current state of things is, but it sounds like
> it's not as simple as check out -> build -> test if you're doing a
> cross-repo change. Circular dependencies are a real headache.
>
> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney  wrote:
>
>> hi,
>>
>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti 
>> wrote:
>> > I think the circular dependency can be broken if we build a new library
>> for
>> > the platform code. This will also make it easy for other projects such as
>> > ORC to use it.
>> > I also remember your proposal a while ago of having a separate project
>> for
>> > the platform code.  That project can live in the arrow repo. However, one
>> > has to clone the entire apache arrow repo but can just build the platform
>> > code. This will be temporary until we can find a new home for it.
>> >
>> > The dependency will look like:
>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> > libplatform(platform api)
>> >
>> > CI workflow will clone the arrow project twice, once for the platform
>> > library and once for the arrow-core/bindings library.
>>
>> This seems like an interesting proposal; the best place to work toward
>> this goal (if it is even possible; the build system interactions and
>> ASF release management are the hard problems) is to have all of the
>> code in a single repository. ORC could already be using Arrow if it
>> wanted, but the ORC contributors aren't active in Arrow.
>>
>> >
>> > There is no doubt that the collaborations between the Arrow and Parquet
>> > communities so far have been very successful.
>> > The reason to maintain this relationship moving forward is to continue to
>> > reap the mutual benefits.
>> > We should continue to take advantage of sharing code as well. However, I
>> > don't see any code sharing opportunities between arrow-core and the
>> > parquet-core. Both have different functions.
>>
>> I think you mean the Arrow columnar format. The Arrow columnar format
>> is only one part of a project that has become quite large already
>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-
>> platform-for-inmemory-data-105427919).
>>
>> >
>> > We are at a point where the parquet-cpp public API is pretty stable. We
>> > already passed that difficult stage. My take on arrow and parquet is to
>> > keep them nimble since we can.
>>
>> I believe that parquet-core still has progress to make ahead of it. We
>> have done little work in asynchronous IO and concurrency which would
>> yield both improved read and write throughput. This aligns well with
>> other concurrency and async-IO work planned in the Arrow platform. I
>> believe that more development will happen on parquet-core once the
>> development process issues are resolved by having a single codebase,
>> single build system, and a single CI framework.
>>
>> I have some gripes about design decisions made early in parquet-cpp,
>> like the use of C++ exceptions. So while "stability" is a reasonable
>> goal I think we should still be open to making significant changes in
>> the interest of long term progress.
>>
>> Having now worked on these projects for more than 2 and a half years
>> and been the most frequent contributor to both codebases, I'm sadly far
>> past the "breaking point" and not willing to continue contributing in
>> a significant way to parquet-cpp if the projects remain structured
>> as they are now. It's hampering progress and not serving the
>> community.
>>
>> - Wes
>>
>> >
>> >
>> >
>> >
>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney 
>> wrote:
>> >
>> >> > The current Arrow adaptor code for parquet should live in the arrow
>> >> repo. That will remove a majority of the 

[jira] [Resolved] (PARQUET-1366) [C++] Streamline use of Arrow bit-util.h

2018-08-01 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1366.
---
   Resolution: Fixed
Fix Version/s: cpp-1.5.0

Issue resolved by pull request 483
[https://github.com/apache/parquet-cpp/pull/483]

> [C++] Streamline use of Arrow bit-util.h
> 
>
> Key: PARQUET-1366
> URL: https://issues.apache.org/jira/browse/PARQUET-1366
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> Required for ARROW-2950: stop using certain bit-util APIs that will be 
> removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1366) [C++] Streamline use of Arrow bit-util.h

2018-08-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565841#comment-16565841
 ] 

ASF GitHub Bot commented on PARQUET-1366:
-

wesm closed pull request #483: PARQUET-1366: [C++] Streamline use of Arrow's 
bit-util.h APIs
URL: https://github.com/apache/parquet-cpp/pull/483
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/parquet/arrow/test-util.h b/src/parquet/arrow/test-util.h
index bfc78c87..2babacb8 100644
--- a/src/parquet/arrow/test-util.h
+++ b/src/parquet/arrow/test-util.h
@@ -368,7 +368,7 @@ Status MakeListArray(const std::shared_ptr<Array>& values, 
int64_t size,
   int32_t* offsets_ptr = reinterpret_cast<int32_t*>(offsets->mutable_data());
 
   auto null_bitmap = AllocateBuffer();
-  int64_t bitmap_size = ::arrow::BitUtil::CeilByte(size) / 8;
+  int64_t bitmap_size = ::arrow::BitUtil::BytesForBits(size);
   RETURN_NOT_OK(null_bitmap->Resize(bitmap_size));
   uint8_t* null_bitmap_ptr = null_bitmap->mutable_data();
   memset(null_bitmap_ptr, 0, bitmap_size);
diff --git a/src/parquet/column_reader.cc b/src/parquet/column_reader.cc
index bc3ee8aa..28d0dcb6 100644
--- a/src/parquet/column_reader.cc
+++ b/src/parquet/column_reader.cc
@@ -60,7 +60,7 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t 
max_level,
 }
 case Encoding::BIT_PACKED: {
   num_bytes =
-  static_cast(BitUtil::Ceil(num_buffered_values * bit_width_, 
8));
+  static_cast(BitUtil::BytesForBits(num_buffered_values * 
bit_width_));
   if (!bit_packed_decoder_) {
 bit_packed_decoder_.reset(new ::arrow::BitReader(data, num_bytes));
   } else {
diff --git a/src/parquet/column_writer-test.cc 
b/src/parquet/column_writer-test.cc
index aac582a2..6c0794a1 100644
--- a/src/parquet/column_writer-test.cc
+++ b/src/parquet/column_writer-test.cc
@@ -137,7 +137,7 @@ class TestPrimitiveWriter : public 
PrimitiveTypedTest {
bool enable_dictionary, bool 
enable_statistics,
int64_t num_rows) {
 std::vector valid_bits(
-BitUtil::RoundUpNumBytes(static_cast(this->values_.size())) 
+ 1, 255);
+BitUtil::BytesForBits(static_cast(this->values_.size())) + 
1, 255);
 ColumnProperties column_properties(encoding, compression, 
enable_dictionary,
enable_statistics);
 std::shared_ptr> writer =
diff --git a/src/parquet/column_writer.cc b/src/parquet/column_writer.cc
index 7d47d3f6..48fba555 100644
--- a/src/parquet/column_writer.cc
+++ b/src/parquet/column_writer.cc
@@ -50,7 +50,7 @@ void LevelEncoder::Init(Encoding::type encoding, int16_t 
max_level,
 }
 case Encoding::BIT_PACKED: {
   int num_bytes =
-  static_cast(BitUtil::Ceil(num_buffered_values * bit_width_, 8));
+  static_cast(BitUtil::BytesForBits(num_buffered_values * 
bit_width_));
   bit_packed_encoder_.reset(new BitWriter(data, num_bytes));
   break;
 }
@@ -72,7 +72,8 @@ int LevelEncoder::MaxBufferSize(Encoding::type encoding, 
int16_t max_level,
   break;
 }
 case Encoding::BIT_PACKED: {
-  num_bytes = static_cast(BitUtil::Ceil(num_buffered_values * 
bit_width, 8));
+  num_bytes =
+  static_cast(BitUtil::BytesForBits(num_buffered_values * 
bit_width));
   break;
 }
 default:
diff --git a/src/parquet/encoding-internal.h b/src/parquet/encoding-internal.h
index 98f9e4a8..2dfb9ff3 100644
--- a/src/parquet/encoding-internal.h
+++ b/src/parquet/encoding-internal.h
@@ -151,12 +151,17 @@ class PlainDecoder : public 
Decoder {
   int Decode(uint8_t* buffer, int max_values) {
 max_values = std::min(max_values, num_values_);
 bool val;
+::arrow::internal::BitmapWriter bit_writer(buffer, 0, max_values);
 for (int i = 0; i < max_values; ++i) {
   if (!bit_reader_.GetValue(1, &val)) {
 ParquetException::EofException();
   }
-  BitUtil::SetArrayBit(buffer, i, val);
+  if (val) {
+bit_writer.Set();
+  }
+  bit_writer.Next();
 }
+bit_writer.Finish();
 num_values_ -= max_values;
 return max_values;
   }
diff --git a/src/parquet/encoding-test.cc b/src/parquet/encoding-test.cc
index 60285ab2..50e1394c 100644
--- a/src/parquet/encoding-test.cc
+++ b/src/parquet/encoding-test.cc
@@ -43,7 +43,7 @@ namespace test {
 TEST(VectorBooleanTest, TestEncodeDecode) {
   // PARQUET-454
   int nvalues = 1;
-  int nbytes = static_cast<int>(BitUtil::Ceil(nvalues, 8));
+  int nbytes = static_cast<int>(BitUtil::BytesForBits(nvalues));
 
   // seed the prng so failure is deterministic
   vector draws = flip_coins_seed(nvalues, 0.5, 0);
@@ -252,7 +252,7 @@ class 

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-08-01 Thread Tim Armstrong
I don't have a direct stake in this beyond wanting to see Parquet be
successful, but I thought I'd give my two cents.

For me, the thing that makes the biggest difference in contributing to a
new codebase is the number of steps in the workflow for writing, testing,
posting and iterating on a commit and also the number of opportunities for
missteps. The size of the repo and build/test times matter but are
secondary so long as the workflow is simple and reliable.

I don't really know what the current state of things is, but it sounds like
it's not as simple as check out -> build -> test if you're doing a
cross-repo change. Circular dependencies are a real headache.

On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney  wrote:

> hi,
>
> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti 
> wrote:
> > I think the circular dependency can be broken if we build a new library
> for
> > the platform code. This will also make it easy for other projects such as
> > ORC to use it.
> > I also remember your proposal a while ago of having a separate project
> for
> > the platform code.  That project can live in the arrow repo. However, one
> > has to clone the entire apache arrow repo but can just build the platform
> > code. This will be temporary until we can find a new home for it.
> >
> > The dependency will look like:
> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> > libplatform(platform api)
> >
> > CI workflow will clone the arrow project twice, once for the platform
> > library and once for the arrow-core/bindings library.
>
> This seems like an interesting proposal; the best place to work toward
> this goal (if it is even possible; the build system interactions and
> ASF release management are the hard problems) is to have all of the
> code in a single repository. ORC could already be using Arrow if it
> wanted, but the ORC contributors aren't active in Arrow.
>
> >
> > There is no doubt that the collaborations between the Arrow and Parquet
> > communities so far have been very successful.
> > The reason to maintain this relationship moving forward is to continue to
> > reap the mutual benefits.
> > We should continue to take advantage of sharing code as well. However, I
> > don't see any code sharing opportunities between arrow-core and the
> > parquet-core. Both have different functions.
>
> I think you mean the Arrow columnar format. The Arrow columnar format
> is only one part of a project that has become quite large already
> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-
> platform-for-inmemory-data-105427919).
>
> >
> > We are at a point where the parquet-cpp public API is pretty stable. We
> > already passed that difficult stage. My take on arrow and parquet is to
> > keep them nimble since we can.
>
> I believe that parquet-core still has progress to make ahead of it. We
> have done little work in asynchronous IO and concurrency which would
> yield both improved read and write throughput. This aligns well with
> other concurrency and async-IO work planned in the Arrow platform. I
> believe that more development will happen on parquet-core once the
> development process issues are resolved by having a single codebase,
> single build system, and a single CI framework.
>
> I have some gripes about design decisions made early in parquet-cpp,
> like the use of C++ exceptions. So while "stability" is a reasonable
> goal I think we should still be open to making significant changes in
> the interest of long term progress.
>
> Having now worked on these projects for more than 2 and a half years
> and been the most frequent contributor to both codebases, I'm sadly far
> past the "breaking point" and not willing to continue contributing in
> a significant way to parquet-cpp if the projects remain structured
> as they are now. It's hampering progress and not serving the
> community.
>
> - Wes
>
> >
> >
> >
> >
> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney 
> wrote:
> >
> >> > The current Arrow adaptor code for parquet should live in the arrow
> >> repo. That will remove a majority of the dependency issues. Joshua's
> work
> >> would not have been blocked in parquet-cpp if that adapter was in the
> arrow
> >> repo.  This will be similar to the ORC adaptor.
> >>
> >> This has been suggested before, but I don't see how it would alleviate
> >> any issues because of the significant dependencies on other parts of
> >> the Arrow codebase. What you are proposing is:
> >>
> >> - (Arrow) arrow platform
> >> - (Parquet) parquet core
> >> - (Arrow) arrow columnar-parquet adapter interface
> >> - (Arrow) Python bindings
> >>
> >> To make this work, somehow Arrow core / libarrow would have to be
> >> built before invoking the Parquet core part of the build system. You
> >> would need to pass dependent targets across different CMake build
> >> systems; I don't know if it's possible (I spent some time looking into
> >> it earlier this year). This is what I meant by the lack of a "concrete
> >> 

[jira] [Commented] (PARQUET-1366) [C++] Streamline use of Arrow bit-util.h

2018-08-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565382#comment-16565382
 ] 

ASF GitHub Bot commented on PARQUET-1366:
-

pitrou opened a new pull request #483: PARQUET-1366: [C++] Streamline use of 
Arrow's bit-util.h APIs
URL: https://github.com/apache/parquet-cpp/pull/483
 
 
   This is required before we can remove some duplicate or little-used APIs in 
ARROW-2950.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Streamline use of Arrow bit-util.h
> 
>
> Key: PARQUET-1366
> URL: https://issues.apache.org/jira/browse/PARQUET-1366
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> Required for ARROW-2950: stop using certain bit-util APIs that will be 
> removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1366) [C++] Streamline use of Arrow bit-util.h

2018-08-01 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1366:

Labels: pull-request-available  (was: )

> [C++] Streamline use of Arrow bit-util.h
> 
>
> Key: PARQUET-1366
> URL: https://issues.apache.org/jira/browse/PARQUET-1366
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> Required for ARROW-2950: stop using certain bit-util APIs that will be 
> removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1366) [C++] Streamline use of Arrow bit-util.h

2018-08-01 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created PARQUET-1366:
---

 Summary: [C++] Streamline use of Arrow bit-util.h
 Key: PARQUET-1366
 URL: https://issues.apache.org/jira/browse/PARQUET-1366
 Project: Parquet
  Issue Type: Task
  Components: parquet-cpp
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Required for ARROW-2950: stop using certain bit-util APIs that will be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1357) [C++] FormatStatValue truncates binary statistics on zero character

2018-08-01 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1357.
--
Resolution: Fixed

Issue resolved by PR https://github.com/apache/parquet-cpp/pull/479

> [C++] FormatStatValue truncates binary statistics on zero character
> ---
>
> Key: PARQUET-1357
> URL: https://issues.apache.org/jira/browse/PARQUET-1357
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> As {{FormatStatValue}} is currently called with a C-style string, we cannot 
> pass the actual binary content with its length. Instead change the interface 
> to {{std::string}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1357) [C++] FormatStatValue truncates binary statistics on zero character

2018-08-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565329#comment-16565329
 ] 

ASF GitHub Bot commented on PARQUET-1357:
-

xhochy closed pull request #479: PARQUET-1357: FormatStatValue truncates binary 
statistics on zero character
URL: https://github.com/apache/parquet-cpp/pull/479
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/CMakeLists.txt b/CMakeLists.txt
index b53f5980..927f7289 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -84,7 +84,7 @@ enable_testing()
 set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake_modules")
 set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support")
 
-set(CLANG_FORMAT_VERSION "5.0")
+set(CLANG_FORMAT_VERSION "6.0")
 find_package(ClangTools)
 if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1" OR CLANG_TIDY_FOUND)
   # Generate a Clang compile_commands.json "compilation database" file for use
diff --git a/cmake_modules/FindClangTools.cmake 
b/cmake_modules/FindClangTools.cmake
index e9221ff2..215a5cd9 100644
--- a/cmake_modules/FindClangTools.cmake
+++ b/cmake_modules/FindClangTools.cmake
@@ -30,6 +30,12 @@
 #  CLANG_FORMAT_BIN, The path to the clang format binary
 #  CLANG_TIDY_FOUND, Whether clang format was found
 
+if (DEFINED ENV{HOMEBREW_PREFIX})
+  set(HOMEBREW_PREFIX "$ENV{HOMEBREW_PREFIX}")
+else()
+  set(HOMEBREW_PREFIX "/usr/local")
+endif()
+
 find_program(CLANG_TIDY_BIN
   NAMES clang-tidy-4.0
   clang-tidy-3.9
@@ -37,7 +43,7 @@ find_program(CLANG_TIDY_BIN
   clang-tidy-3.7
   clang-tidy-3.6
   clang-tidy
-  PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin
+  PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin 
"${HOMEBREW_PREFIX}/bin"
 NO_DEFAULT_PATH
 )
 
@@ -55,7 +61,7 @@ if (CLANG_FORMAT_VERSION)
   PATHS
 ${ClangTools_PATH}
 $ENV{CLANG_TOOLS_PATH}
-/usr/local/bin /usr/bin
+/usr/local/bin /usr/bin "${HOMEBREW_PREFIX}/bin"
 NO_DEFAULT_PATH
 )
 
@@ -67,16 +73,26 @@ if (CLANG_FORMAT_VERSION)
 if ("${CLANG_FORMAT_MINOR_VERSION}" STREQUAL "0")
 find_program(CLANG_FORMAT_BIN
   NAMES clang-format
-  PATHS /usr/local/opt/llvm@${CLANG_FORMAT_MAJOR_VERSION}/bin
+  PATHS 
"${HOMEBREW_PREFIX}/opt/llvm@${CLANG_FORMAT_MAJOR_VERSION}/bin"
 NO_DEFAULT_PATH
 )
 else()
 find_program(CLANG_FORMAT_BIN
   NAMES clang-format
-  PATHS /usr/local/opt/llvm@${CLANG_FORMAT_VERSION}/bin
+  PATHS "${HOMEBREW_PREFIX}/opt/llvm@${CLANG_FORMAT_VERSION}/bin"
 NO_DEFAULT_PATH
 )
 endif()
+
+if ("${CLANG_FORMAT_BIN}" STREQUAL "CLANG_FORMAT_BIN-NOTFOUND")
+  # binary was still not found, look into Cellar
+  file(GLOB CLANG_FORMAT_PATH 
"${HOMEBREW_PREFIX}/Cellar/llvm/${CLANG_FORMAT_VERSION}.*")
+  find_program(CLANG_FORMAT_BIN
+NAMES clang-format
+PATHS "${CLANG_FORMAT_PATH}/bin"
+  NO_DEFAULT_PATH
+  )
+endif()
 endif()
 else()
 find_program(CLANG_FORMAT_BIN
@@ -86,7 +102,7 @@ else()
   clang-format-3.7
   clang-format-3.6
   clang-format
-  PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin
+  PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin 
"${HOMEBREW_PREFIX}/bin"
 NO_DEFAULT_PATH
 )
 endif()
diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc 
b/src/parquet/arrow/arrow-reader-writer-test.cc
index 1c2f3225..8955b0ab 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -52,16 +52,16 @@ using arrow::Buffer;
 using arrow::ChunkedArray;
 using arrow::Column;
 using arrow::DataType;
+using arrow::default_memory_pool;
 using arrow::ListArray;
-using arrow::ResizableBuffer;
 using arrow::PrimitiveArray;
+using arrow::ResizableBuffer;
 using arrow::Status;
 using arrow::Table;
 using arrow::TimeUnit;
 using arrow::compute::Datum;
 using arrow::compute::DictionaryEncode;
 using arrow::compute::FunctionContext;
-using arrow::default_memory_pool;
 using arrow::io::BufferReader;
 
 using arrow::test::randint;
@@ -1453,13 +1453,13 @@ TEST(TestArrowReadWrite, ConvertedDateTimeTypes) {
 // Regression for ARROW-2802
 TEST(TestArrowReadWrite, CoerceTimestampsAndSupportDeprecatedInt96) {
   using ::arrow::Column;
+  using ::arrow::default_memory_pool;
   using ::arrow::Field;
   using ::arrow::Schema;
   using ::arrow::Table;
-  using ::arrow::TimeUnit;
   using 

[jira] [Resolved] (PARQUET-1356) Error when closing writer - Statistics comparator mismatched

2018-08-01 Thread Andres (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andres resolved PARQUET-1356.
-
Resolution: Not A Bug

> Error when closing writer - Statistics comparator mismatched
> 
>
> Key: PARQUET-1356
> URL: https://issues.apache.org/jira/browse/PARQUET-1356
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
> Environment: Mac OS Sierra 10.12.6
> IntelliJ 2018.1.6
> sbt 0.1
> scala-sdk-2.12.4
> java jdk1.8.0_144
>Reporter: Andres
>Priority: Blocker
>
> Hi all
> After getting some error in my custom implementation, I was trying to run a 
> test case copied from 
> [here|https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestReadWrite.java]
>  and I surprisingly got the same error.
> {code:java}
> val schema = new Schema.Parser().parse("{\n \"type\": \"record\",\n \"name\": 
> \"myrecord\",\n \"fields\": [ {\n \"name\": \"myarray\",\n \"type\": {\n 
> \"type\": \"array\",\n \"items\": \"int\"\n }\n } ]\n}")
> val tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
> tmp.deleteOnExit();
> tmp.delete();
> val file = new Path(tmp.getPath());
> val testConf = new Configuration();
> val writer = AvroParquetWriter
>  .builder[GenericRecord](file)
>  .withSchema(schema)
>  .withConf(testConf)
>  .build();
> // Write a record with an empty array.
> val emptyArray = new util.ArrayList[Integer]();
> val record = new GenericRecordBuilder(schema)
>  .set("myarray", emptyArray).build();
> writer.write(record);
> writer.close();
> val reader = new AvroParquetReader[GenericRecord](testConf, file);
> val nextRecord = reader.read(){code}
> The project is Scala + sbt with dependencies as follows:
>  
> {code:java}
> lazy val parquetVersion = "1.10.0"
> lazy val parquet = "org.apache.parquet" % "parquet" % Version.parquetVersion
> lazy val parquetAvro = "org.apache.parquet" % "parquet-avro" % 
> Version.parquetVersion
> {code}
>  
> And this is the stack trace:
>  
> {code:java}
> Statistics comparator mismatched: SIGNED_INT32_COMPARATOR vs. 
> SIGNED_INT32_COMPARATOR (39 milliseconds)
> [info] org.apache.parquet.column.statistics.StatisticsClassException: 
> Statistics comparator mismatched: SIGNED_INT32_COMPARATOR vs. 
> SIGNED_INT32_COMPARATOR
> [info] at 
> org.apache.parquet.column.statistics.StatisticsClassException.create(StatisticsClassException.java:42)
> [info] at 
> org.apache.parquet.column.statistics.Statistics.mergeStatistics(Statistics.java:327)
> [info] at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:119)
> [info] at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
> [info] at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235)
> [info] at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
> [info] at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:169)
> [info] at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
> [info] at 
> org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:301)
> [info] at 
> uk.co.mypackage.di.nrt.HdfsPipelineSpec.$anonfun$new$4(HdfsPipelineSpec.scala:132)
>  
> {code}
> As you can see, this is confusing. The error itself is strange because the 
> two comparators in the message are identical, so there is no actual mismatch. 
> Would really appreciate help with this issue.
> Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1356) Error when closing writer - Statistics comparator mismatched

2018-08-01 Thread Andres (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565059#comment-16565059
 ] 

Andres commented on PARQUET-1356:
-

Finally we have a fix; this is not related to the library. We had another 
dependency that uses a different version of the Hadoop library, so we excluded 
it and now parquet-mr is working.
{code:java}
lazy val kuduClient = "org.apache.kudu" % "kudu-client-tools" %
  Version.kuduClient excludeAll(ExclusionRule(organization = "org.apache.hadoop"))
{code}
Apologies for the inconvenience.

> Error when closing writer - Statistics comparator mismatched
> 
>
> Key: PARQUET-1356
> URL: https://issues.apache.org/jira/browse/PARQUET-1356
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
> Environment: Mac OS Sierra 10.12.6
> IntelliJ 2018.1.6
> sbt 0.1
> scala-sdk-2.12.4
> java jdk1.8.0_144
>Reporter: Andres
>Priority: Blocker
>
> Hi all
> After getting some error in my custom implementation, I was trying to run a 
> test case copied from 
> [here|https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestReadWrite.java]
>  and I surprisingly got the same error.
> {code:java}
> val schema = new Schema.Parser().parse("{\n \"type\": \"record\",\n \"name\": 
> \"myrecord\",\n \"fields\": [ {\n \"name\": \"myarray\",\n \"type\": {\n 
> \"type\": \"array\",\n \"items\": \"int\"\n }\n } ]\n}")
> val tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
> tmp.deleteOnExit();
> tmp.delete();
> val file = new Path(tmp.getPath());
> val testConf = new Configuration();
> val writer = AvroParquetWriter
>  .builder[GenericRecord](file)
>  .withSchema(schema)
>  .withConf(testConf)
>  .build();
> // Write a record with an empty array.
> val emptyArray = new util.ArrayList[Integer]();
> val record = new GenericRecordBuilder(schema)
>  .set("myarray", emptyArray).build();
> writer.write(record);
> writer.close();
> val reader = new AvroParquetReader[GenericRecord](testConf, file);
> val nextRecord = reader.read(){code}
> The project is Scala + sbt with dependencies as follows:
>  
> {code:java}
> lazy val parquetVersion = "1.10.0"
> lazy val parquet = "org.apache.parquet" % "parquet" % Version.parquetVersion
> lazy val parquetAvro = "org.apache.parquet" % "parquet-avro" % 
> Version.parquetVersion
> {code}
>  
> And this is the stack trace:
>  
> {code:java}
> Statistics comparator mismatched: SIGNED_INT32_COMPARATOR vs. 
> SIGNED_INT32_COMPARATOR (39 milliseconds)
> [info] org.apache.parquet.column.statistics.StatisticsClassException: 
> Statistics comparator mismatched: SIGNED_INT32_COMPARATOR vs. 
> SIGNED_INT32_COMPARATOR
> [info] at 
> org.apache.parquet.column.statistics.StatisticsClassException.create(StatisticsClassException.java:42)
> [info] at 
> org.apache.parquet.column.statistics.Statistics.mergeStatistics(Statistics.java:327)
> [info] at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:119)
> [info] at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
> [info] at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235)
> [info] at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
> [info] at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:169)
> [info] at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
> [info] at 
> org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:301)
> [info] at 
> uk.co.mypackage.di.nrt.HdfsPipelineSpec.$anonfun$new$4(HdfsPipelineSpec.scala:132)
>  
> {code}
> As you can see, this is confusing. The error itself is strange because the 
> two comparators in the message are identical, so there is no actual mismatch. 
> Would really appreciate help with this issue.
> Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)