[GitHub] [arrow] ggershinsky commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


ggershinsky commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486786803



##
File path: cpp/src/parquet/key_metadata.h
##
@@ -0,0 +1,91 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+
+#include "parquet/key_material.h"
+
+namespace parquet {
+namespace encryption {
+
+// Parquet encryption specification defines "key metadata" as an arbitrary byte array,
+// generated by file writers for each encryption key, and passed to the low-level API for
+// storage in the file footer. The "key metadata" field is made available to file readers
+// to enable recovery of the key. This simple interface can be utilized for implementation
+// of any key management scheme.
+//
+// The keytools package (PARQUET-1373) implements one approach, of many possible, to key
+// management and to generation of the "key metadata" fields. This approach, based on the
+// "envelope encryption" pattern, allows working with KMS servers. It keeps the actual
+// material, required to recover a key, in a "key material" object (see the KeyMaterial
+// class for details).
+//
+// The KeyMetadata class writes (and reads) the "key metadata" field as a flat JSON object,
+// with the following fields:
+// 1. "keyMaterialType" - a String, with the type of key material. In the current

Review comment:
   This indeed can be ambiguous. I'd suggest removing the second part of the comment entirely, starting from "In the current version..". The main purpose of the comment is served by its first part.
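
The "envelope encryption" pattern described in the quoted header comment can be sketched as follows. This is a toy illustration only: XOR stands in for real AES encryption, and the names (`KmsStub`, `WrapKey`, `UnwrapKey`) are hypothetical, not part of the actual Arrow/Parquet API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for encryption: XOR with a repeating key. Real envelope
// encryption would use AES; this only illustrates the data flow.
static std::string XorWith(const std::string& data, const std::string& key) {
  std::string out = data;
  for (size_t i = 0; i < out.size(); ++i) out[i] ^= key[i % key.size()];
  return out;
}

// Hypothetical KMS stub: holds master keys addressed by id, and wraps/unwraps
// data encryption keys (DEKs) without ever releasing a master key to callers.
struct KmsStub {
  std::map<std::string, std::string> master_keys;
  std::string WrapKey(const std::string& dek, const std::string& master_id) {
    return XorWith(dek, master_keys.at(master_id));
  }
  std::string UnwrapKey(const std::string& wrapped, const std::string& master_id) {
    return XorWith(wrapped, master_keys.at(master_id));
  }
};
```

In this scheme, a writer would store the master key id plus the wrapped DEK as the "key material", and a reader would recover the DEK by calling `UnwrapKey` against the KMS.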





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] tianchen92 commented on a change in pull request #7887: ARROW-9304: [C++] Add "AppendEmpty" builder APIs for use inside StructBuilder::AppendNull

2020-09-10 Thread GitBox


tianchen92 commented on a change in pull request #7887:
URL: https://github.com/apache/arrow/pull/7887#discussion_r486768419



##
File path: cpp/src/arrow/array/builder_base.h
##
@@ -98,6 +98,9 @@ class ARROW_EXPORT ArrayBuilder {
   virtual Status AppendNull() = 0;
   virtual Status AppendNulls(int64_t length) = 0;
 
+  virtual Status AppendEmptyValue() = 0;

Review comment:
   I'm OK with that; let's hold off on this for now.









[GitHub] [arrow] emkornfield commented on pull request #8156: ARROW-9810: [C++] Generalized nested reconstruction helpers

2020-09-10 Thread GitBox


emkornfield commented on pull request #8156:
URL: https://github.com/apache/arrow/pull/8156#issuecomment-690859782


   @pitrou I've removed the code for reconstructing list information based on bitmaps; I would appreciate keeping the rest of the changes in this PR.







[GitHub] [arrow] emkornfield commented on a change in pull request #7887: ARROW-9304: [C++] Add "AppendEmpty" builder APIs for use inside StructBuilder::AppendNull

2020-09-10 Thread GitBox


emkornfield commented on a change in pull request #7887:
URL: https://github.com/apache/arrow/pull/7887#discussion_r486759286



##
File path: cpp/src/arrow/array/builder_base.h
##
@@ -98,6 +98,9 @@ class ARROW_EXPORT ArrayBuilder {
   virtual Status AppendNull() = 0;
   virtual Status AppendNulls(int64_t length) = 0;
 
+  virtual Status AppendEmptyValue() = 0;

Review comment:
   Yes, the change meets the spec but potentially breaks parquet writing for use-cases going through the builder per https://issues.apache.org/jira/browse/ARROW-9603?filter=-2 (which relies on the behavior of forcing nulls). I would prefer not to change this behavior before fixing the referenced bug.









[GitHub] [arrow] github-actions[bot] commented on pull request #8167: ARROW-9973: [Java] JDBC DateConsumer does not allow dates before epoch

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8167:
URL: https://github.com/apache/arrow/pull/8167#issuecomment-690845681


   https://issues.apache.org/jira/browse/ARROW-9973







[GitHub] [arrow] pwoody opened a new pull request #8167: ARROW-9973: [Java] JDBC DateConsumer does not allow dates before epoch

2020-09-10 Thread GitBox


pwoody opened a new pull request #8167:
URL: https://github.com/apache/arrow/pull/8167


   Reported here: https://issues.apache.org/jira/browse/ARROW-9973
   
   Looks like a pretty straightforward fix. Let me know if you have any 
comments - thanks!







[GitHub] [arrow] tianchen92 commented on a change in pull request #7887: ARROW-9304: [C++] Add "AppendEmpty" builder APIs for use inside StructBuilder::AppendNull

2020-09-10 Thread GitBox


tianchen92 commented on a change in pull request #7887:
URL: https://github.com/apache/arrow/pull/7887#discussion_r486737805



##
File path: cpp/src/arrow/array/builder_base.h
##
@@ -98,6 +98,9 @@ class ARROW_EXPORT ArrayBuilder {
   virtual Status AppendNull() = 0;
   virtual Status AppendNulls(int64_t length) = 0;
 
+  virtual Status AppendEmptyValue() = 0;

Review comment:
   I've added documentation for this API. The change meets the [spec](http://arrow.apache.org/docs/format/Columnar.html#struct-layout) (there is no need for the child arrays to be null when the top-level struct slot is null).
   Appending an empty value and appending a null are semantically different operations.
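
The struct-layout distinction under discussion can be illustrated with a toy builder (hypothetical, not the actual Arrow `ArrayBuilder` API): appending a null clears the slot's validity bit, while appending an empty value marks the slot valid and stores default-initialized child values.

```cpp
#include <cassert>
#include <vector>

// Toy model of a struct column with one int child. Not the Arrow API:
// it only illustrates how null slots and empty slots differ.
struct ToyStructBuilder {
  std::vector<bool> validity;  // one validity bit per struct slot
  std::vector<int> child;      // one child value per slot

  void AppendNull() {          // slot is null; child content is irrelevant
    validity.push_back(false);
    child.push_back(0);
  }
  void AppendEmptyValue() {    // slot is valid; child holds a default value
    validity.push_back(true);
    child.push_back(0);
  }
  void Append(int v) {         // ordinary valid slot
    validity.push_back(true);
    child.push_back(v);
  }
};
```

Per the columnar spec, readers consult the validity bitmap, so a null struct slot does not require its children to be null; the child values under a null slot are simply undefined.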









[GitHub] [arrow] kou closed pull request #8166: ARROW-9972: [CI] Work around grpc-re2 clash on Homebrew

2020-09-10 Thread GitBox


kou closed pull request #8166:
URL: https://github.com/apache/arrow/pull/8166


   







[GitHub] [arrow] nealrichardson commented on pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

2020-09-10 Thread GitBox


nealrichardson commented on pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#issuecomment-690775030


   @bkietz is this done?







[GitHub] [arrow] eerhardt commented on a change in pull request #8146: ARROW-5034: [C#] ArrowStreamWriter implement sync WriteRecordBatch

2020-09-10 Thread GitBox


eerhardt commented on a change in pull request #8146:
URL: https://github.com/apache/arrow/pull/8146#discussion_r486678511



##
File path: csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs
##
@@ -212,6 +212,96 @@ private void CountSelfAndChildrenNodes(IArrowType type, ref int count)
 count++;
 }
 
+private protected void WriteRecordBatchInternal(RecordBatch recordBatch)
+{
+// TODO: Truncate buffers with extraneous padding / unused capacity
+
+if (!HasWrittenSchema)
+{
+WriteSchema(Schema);
+HasWrittenSchema = true;
+}
+
+Builder.Clear();

Review comment:
   Yep - this is what I had in mind.









[GitHub] [arrow] nealrichardson edited a comment on pull request #8166: ARROW-9972: [CI] Work around grpc-re2 clash on Homebrew

2020-09-10 Thread GitBox


nealrichardson edited a comment on pull request #8166:
URL: https://github.com/apache/arrow/pull/8166#issuecomment-690763719


   This solves the re2 clash--and fixes the C/Glib Ruby and Python builds--but 
in the C++ build there's a later cmake problem with `c-ares` that I suspect is 
`grpc`-related too: 
https://github.com/apache/arrow/pull/8166/checks?check_run_id=1098979190#step:5:153
 
   
   @kou @pitrou 







[GitHub] [arrow] nealrichardson commented on pull request #8166: ARROW-9972: [CI] Work around grpc-re2 clash on Homebrew

2020-09-10 Thread GitBox


nealrichardson commented on pull request #8166:
URL: https://github.com/apache/arrow/pull/8166#issuecomment-690763719


   This solves the re2 clash but there's a later cmake problem with `c-ares` 
that I suspect is `grpc`-related too: 
https://github.com/apache/arrow/pull/8166/checks?check_run_id=1098979190#step:5:153
 
   
   @kou @pitrou 







[GitHub] [arrow] github-actions[bot] commented on pull request #8166: ARROW-9972: [CI] Work around grpc-re2 clash on Homebrew

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8166:
URL: https://github.com/apache/arrow/pull/8166#issuecomment-690763223


   https://issues.apache.org/jira/browse/ARROW-9972







[GitHub] [arrow] suhsteve commented on a change in pull request #8146: ARROW-5034: [C#] ArrowStreamWriter implement sync WriteRecordBatch

2020-09-10 Thread GitBox


suhsteve commented on a change in pull request #8146:
URL: https://github.com/apache/arrow/pull/8146#discussion_r486665172



##
File path: csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs
##
@@ -212,6 +212,96 @@ private void CountSelfAndChildrenNodes(IArrowType type, ref int count)
 count++;
 }
 
+private protected void WriteRecordBatchInternal(RecordBatch recordBatch)
+{
+// TODO: Truncate buffers with extraneous padding / unused capacity
+
+if (!HasWrittenSchema)
+{
+WriteSchema(Schema);
+HasWrittenSchema = true;
+}
+
+Builder.Clear();

Review comment:
   Refactored this part out.  Let me know if this was what you were 
expecting.









[GitHub] [arrow] suhsteve commented on a change in pull request #8146: ARROW-5034: [C#] ArrowStreamWriter implement sync WriteRecordBatch

2020-09-10 Thread GitBox


suhsteve commented on a change in pull request #8146:
URL: https://github.com/apache/arrow/pull/8146#discussion_r486665032



##
File path: csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs
##
@@ -212,6 +212,96 @@ private void CountSelfAndChildrenNodes(IArrowType type, ref int count)
 count++;
 }
 
+private protected void WriteRecordBatchInternal(RecordBatch recordBatch)

Review comment:
   Added the `sync` version of `WriteRecordBatch` to `ArrowFileWriter`









[GitHub] [arrow] suhsteve commented on a change in pull request #8146: ARROW-5034: [C#] ArrowStreamWriter implement sync WriteRecordBatch

2020-09-10 Thread GitBox


suhsteve commented on a change in pull request #8146:
URL: https://github.com/apache/arrow/pull/8146#discussion_r486664790



##
File path: csharp/src/Apache.Arrow/Extensions/StreamExtensions.netcoreapp2.1.cs
##
@@ -25,5 +27,26 @@ public static int Read(this Stream stream, Memory buffer)
 {
 return stream.Read(buffer.Span);
 }
+
+public static void Write(this Stream stream, ReadOnlyMemory buffer)

Review comment:
   Thanks.  Modified.









[GitHub] [arrow] nealrichardson opened a new pull request #8166: ARROW-9972: [CI] Work around grpc-re2 clash on Homebrew

2020-09-10 Thread GitBox


nealrichardson opened a new pull request #8166:
URL: https://github.com/apache/arrow/pull/8166


   







[GitHub] [arrow] nealrichardson closed pull request #8058: ARROW-9854: [R] Support reading/writing data to/from S3

2020-09-10 Thread GitBox


nealrichardson closed pull request #8058:
URL: https://github.com/apache/arrow/pull/8058


   







[GitHub] [arrow] nealrichardson commented on a change in pull request #8058: ARROW-9854: [R] Support reading/writing data to/from S3

2020-09-10 Thread GitBox


nealrichardson commented on a change in pull request #8058:
URL: https://github.com/apache/arrow/pull/8058#discussion_r486656550



##
File path: r/R/filesystem.R
##
@@ -242,11 +242,31 @@ FileSystem <- R6Class("FileSystem", inherit = ArrowObject,
   )
 )
 FileSystem$from_uri <- function(uri) {
+  assert_that(is.string(uri))
   out <- fs___FileSystemFromUri(uri)
   out$fs <- shared_ptr(FileSystem, out$fs)$..dispatch()
   out
 }
 
+get_path_and_filesystem <- function(x, filesystem = NULL) {
+  # Wrapper around FileSystem$from_uri that handles local paths
+  # and an optional explicit filesystem
+  assert_that(is.string(x))
+  if (is_url(x)) {
+if (!is.null(filesystem)) {
+  # Stop? Can't have URL (which yields a fs) and another fs

Review comment:
   What happens here is that `filesystem` will be ignored if `x` is a URL, 
i.e. we'll use the filesystem that `from_uri(x)` returns. So that should either 
Just Work if the filesystems happen to be the same, or you're right, it should 
error with some kind of File Not Found later. 









[GitHub] [arrow] github-actions[bot] commented on pull request #8112: ARROW-9970: [Go] fix checkptr failure in sum methods

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8112:
URL: https://github.com/apache/arrow/pull/8112#issuecomment-690667598


   https://issues.apache.org/jira/browse/ARROW-9970







[GitHub] [arrow] pitrou closed pull request #8164: ARROW-9968: [C++] Fix UBSAN build

2020-09-10 Thread GitBox


pitrou closed pull request #8164:
URL: https://github.com/apache/arrow/pull/8164


   







[GitHub] [arrow] pitrou commented on pull request #8164: ARROW-9968: [C++] Fix UBSAN build

2020-09-10 Thread GitBox


pitrou commented on pull request #8164:
URL: https://github.com/apache/arrow/pull/8164#issuecomment-690583728


   The CI failures are the usual suspects, will merge.







[GitHub] [arrow] github-actions[bot] commented on pull request #8165: ARROW-9966: [Rust] Speedup kernels for sum,min,max by 10%-60%

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8165:
URL: https://github.com/apache/arrow/pull/8165#issuecomment-690568469


   https://issues.apache.org/jira/browse/ARROW-9966







[GitHub] [arrow] github-actions[bot] commented on pull request #8164: ARROW-9968: [C++] Fix UBSAN build

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8164:
URL: https://github.com/apache/arrow/pull/8164#issuecomment-690568470


   https://issues.apache.org/jira/browse/ARROW-9968







[GitHub] [arrow] pitrou commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


pitrou commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486518974



##
File path: cpp/src/parquet/key_toolkit.h
##
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+#include 
+
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+#include "parquet/platform.h"
+#include "parquet/two_level_cache_with_expiration.h"
+
+namespace parquet {
+namespace encryption {
+
+class KeyWithMasterID {

Review comment:
   You can use forward declarations to avoid exposing class definitions when only a reference or pointer is involved.
   
   Also, it is a good idea to hide implementation details by using the pimpl idiom. Another advantage is that it limits the growth of compile times.
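
A minimal sketch of the pimpl idiom being suggested (the `KeyToolkit` shape here is hypothetical, not the PR's actual class): clients see only a forward-declared `Impl`, so the implementation's members and their headers stay out of the public header.

```cpp
#include <cassert>
#include <memory>
#include <string>

// --- what would live in the public header ---
class KeyToolkit {
 public:
  KeyToolkit();
  ~KeyToolkit();  // defined in the .cc file, where Impl is complete
  std::string Describe() const;

 private:
  class Impl;                  // forward declaration only
  std::unique_ptr<Impl> impl_;
};

// --- what would live in the .cc file ---
class KeyToolkit::Impl {
 public:
  std::string Describe() const { return "toolkit-impl"; }
};

KeyToolkit::KeyToolkit() : impl_(new Impl()) {}
KeyToolkit::~KeyToolkit() = default;
std::string KeyToolkit::Describe() const { return impl_->Describe(); }
```

Note that the destructor must be defined where `Impl` is complete, otherwise `std::unique_ptr<Impl>` cannot generate its deleter; this is the one subtlety of combining pimpl with `unique_ptr`.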









[GitHub] [arrow] jorgecarleitao opened a new pull request #8165: ARROW-9966: [Rust] Speedup kernels for sum,min,max by 10%-60%

2020-09-10 Thread GitBox


jorgecarleitao opened a new pull request #8165:
URL: https://github.com/apache/arrow/pull/8165


   This PR speeds up some of the aggregations in arrow by 10-60% by simplifying 
their logic and overall allowing the optimizer to do its work.
   
   The first 3 commits (up to 29754d7) simply improve the benchmark itself by:
   * not taking the creation of the arrays into account, only the computation
   * moving it to another file
   * adding randomness to the data to reduce spurious results due to speculative execution and other effects
   * adding a case for data with nulls, since the kernels branch on that condition
   
   The last 3 commits are the optimizations themselves.
   
   ```
   sum 512 time:   [535.66 ns 536.11 ns 536.57 ns]  
   
   change: [-58.421% -58.222% -57.957%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
 4 (4.00%) high mild
 6 (6.00%) high severe
   
   min 512 time:   [766.77 ns 775.85 ns 788.35 ns]  
   
   change: [-41.555% -41.017% -40.388%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
 4 (4.00%) high mild
 6 (6.00%) high severe
   
   sum nulls 512   time:   [1.0968 us 1.1000 us 1.1038 us]  
 
   change: [-8.9918% -7.6232% -5.7130%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 15 outliers among 100 measurements (15.00%)
 3 (3.00%) high mild
 12 (12.00%) high severe
   
   min nulls 512   time:   [1.3208 us 1.3242 us 1.3286 us]  
 
   change: [-11.028% -10.240% -9.4581%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 11 outliers among 100 measurements (11.00%)
 3 (3.00%) high mild
 8 (8.00%) high severe
   ```
   
   Command:
   ```
   git checkout 29754d7 && cargo bench --bench aggregate_kernels && git 
checkout agg_arrow && cargo bench --bench aggregate_kernels
   ```
   







[GitHub] [arrow] pitrou opened a new pull request #8164: ARROW-9968: [C++] Fix UBSAN build

2020-09-10 Thread GitBox


pitrou opened a new pull request #8164:
URL: https://github.com/apache/arrow/pull/8164


   Native __int128_t gives link errors on clang with UBSAN enabled; revert to Boost in that case.







[GitHub] [arrow] pitrou commented on pull request #8143: ARROW-9949: [C++] Improve performance of Decimal128::FromString by 46%, and make the implementation reusable for Decimal256.

2020-09-10 Thread GitBox


pitrou commented on pull request #8143:
URL: https://github.com/apache/arrow/pull/8143#issuecomment-690545437


   See PR #8164.







[GitHub] [arrow] pitrou commented on pull request #8143: ARROW-9949: [C++] Improve performance of Decimal128::FromString by 46%, and make the implementation reusable for Decimal256.

2020-09-10 Thread GitBox


pitrou commented on pull request #8143:
URL: https://github.com/apache/arrow/pull/8143#issuecomment-690540405


   I'll post a PR for a quick fix.







[GitHub] [arrow] MingyuZhong commented on pull request #8143: ARROW-9949: [C++] Improve performance of Decimal128::FromString by 46%, and make the implementation reusable for Decimal256.

2020-09-10 Thread GitBox


MingyuZhong commented on pull request #8143:
URL: https://github.com/apache/arrow/pull/8143#issuecomment-690539773


   > Ouch, this accidentally broke the ASAN/UBSAN build and I hadn't seen it 
because of other (unrelated) CI failures:
   > https://github.com/apache/arrow/runs/1093237142
   > 
   > ```
   > src/arrow/util/CMakeFiles/arrow-utility-test.dir/decimal_test.cc.o: In 
function `arrow::Decimal128Test_Multiply_Test::TestBody()':
   > /arrow/cpp/src/arrow/util/decimal_test.cc:935: undefined reference to 
`__muloti4'
   > clang: error: linker command failed with exit code 1 (use -v to see 
invocation)
   > ```
   
   I think we hit https://stackoverflow.com/questions/49793632/clang-fsanitize-undefined-with-128-integer-operations-undefined-reference-to . Is there a way to use compiler-rt for ASAN/UBSAN? If not, we can remove that line or revert to always using Boost.







[GitHub] [arrow] pitrou commented on pull request #8143: ARROW-9949: [C++] Improve performance of Decimal128::FromString by 46%, and make the implementation reusable for Decimal256.

2020-09-10 Thread GitBox


pitrou commented on pull request #8143:
URL: https://github.com/apache/arrow/pull/8143#issuecomment-690526165


   Ouch, this accidentally broke the ASAN/UBSAN build and I hadn't seen it 
because of other (unrelated) CI failures:
   https://github.com/apache/arrow/runs/1093237142
   ```
   src/arrow/util/CMakeFiles/arrow-utility-test.dir/decimal_test.cc.o: In 
function `arrow::Decimal128Test_Multiply_Test::TestBody()':
   /arrow/cpp/src/arrow/util/decimal_test.cc:935: undefined reference to 
`__muloti4'
   clang: error: linker command failed with exit code 1 (use -v to see 
invocation)
   ```







[GitHub] [arrow] github-actions[bot] commented on pull request #8145: ARROW-9967: [Python] Add compute module docs

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8145:
URL: https://github.com/apache/arrow/pull/8145#issuecomment-690524147


   https://issues.apache.org/jira/browse/ARROW-9967







[GitHub] [arrow] emkornfield edited a comment on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

2020-09-10 Thread GitBox


emkornfield edited a comment on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-690411317


   > One gotcha I ran into while using a fixed size list type (nested within another list) was that in this case the extension type wasn't preserved. But I 
suppose this is because the fixed size list is not yet preserved on parquet 
roundtrip (read back as normal list type), and to add back the extension type, 
the storage type needs to be the same. So in hindsight, this behaviour is as 
expected (as long as fixed size list is not yet supported).
   
   @jorisvandenbossche Reading fixed size lists is not yet supported.  One 
use-case that will likely not be supported in the short term is reading parquet 
files with FixedSizeLists that have null elements.  Do your use-cases with 
FixedSizeList allow for null lists?  I would also add that writing 
FixedSizeLists with null elements only works under a very limited set of circumstances today as it is.







[GitHub] [arrow] emkornfield commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

2020-09-10 Thread GitBox


emkornfield commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-690411317


   > One gotcha I ran into while using a fixed size list type (nested within another list) was that in this case the extension type wasn't preserved. But I 
suppose this is because the fixed size list is not yet preserved on parquet 
roundtrip (read back as normal list type), and to add back the extension type, 
the storage type needs to be the same. So in hindsight, this behaviour is as 
expected (as long as fixed size list is not yet supported).
   
   @jorisvandenbossche Reading fixed size lists is not yet supported.  One 
use-case that will likely not be supported in the short term is reading parquet 
files with FixedSizeLists that have null elements.  Do your use-cases with 
FixedSizeList allow for null lists?







[GitHub] [arrow] emkornfield commented on a change in pull request #8156: ARROW-9810: [C++] Generalized nested reconstruction helpers

2020-09-10 Thread GitBox


emkornfield commented on a change in pull request #8156:
URL: https://github.com/apache/arrow/pull/8156#discussion_r486459473



##
File path: cpp/src/parquet/level_comparison.h
##
@@ -0,0 +1,91 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include 
+#include 
+
+#include "arrow/util/bit_util.h"
+
+namespace parquet {
+namespace internal {
+
+// These APIs are likely to be revised as part of ARROW-8494 to reduce 
duplicate code.
+// They currently represent minimal functionality for vectorized computation 
of definition
+// levels.
+
+/// Builds a bitmap by applying predicate to the level vector provided.
+///
+/// \param[in] levels Rep or def level array.
+/// \param[in] num_levels The number of levels to process (must be [0, 64])
+/// \param[in] predicate The predicate to apply (must have the signature `bool
+/// predicate(int16_t)`).
+/// \returns The bitmap using least significant "bit" ordering.
+///
+/// N.B. Correct byte ordering is dependent on little-endian architectures.
+///
+template 
+inline uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels,
+   Predicate predicate) {
+  // Both clang and GCC can vectorize this automatically with SSE4/AVX2.
+  uint64_t mask = 0;
+  for (int x = 0; x < num_levels; x++) {
+mask |= static_cast(predicate(levels[x]) ? 1 : 0) << x;
+  }
+  return ::arrow::BitUtil::ToLittleEndian(mask);
+}
+
+/// Builds a bitmap where each set bit indicates the corresponding level is 
greater

Review comment:
   This is rearranging code from a 
[prior PR](https://github.com/apache/arrow/pull/6985) which on my box showed a 20% 
end-to-end parquet-to-arrow read performance improvement on existing benchmarks, indicating 
that, while cheap, the comparisons take a considerable amount of time in decoding.
   
   **Why generalize in this way?**
   The generation of bitmaps can be made cheaper than it currently is.  Right 
now parquet RLE encodes rep/def levels.  We fully decode these levels into 
int16_t and then try to reconstruct nested metadata from them.  For places 
where there are actually runs, the bitmaps become much less expensive to 
generate (it's a simple mask of N bits).  For cases when the max def/rep level is 
one, I don't think the comparison would even be necessary for non-RLE encoded data 
(the data is already in bitmap form).
   
   Working at the bitmaps level allows taking advantage of bit-level 
parallelism at the cost of doing extra work for each column (the scalar version of 
the algorithm I include actually also does the same extra work when compared 
with the current [reconstruction 
algorithm](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader_internal.cc#L810)).
   
   It is also worth pointing out there are roughly three cases to consider:
   1.  Non-nested lists (in which case using this method will directly generate 
validity bitmaps and be a strict performance win).
   2.  Single list (when BMI2 is usable I believe this approach will also be 
faster than any other approach, I'm a little bit less confident when BMI2 isn't 
usable, but I would expect at least equal performance).
   3.  Nested lists. At a certain point using something similar to the 
[batch](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader_internal.cc#L810)
 approach will start to win versus the bitmap approach.  This could be as early 
as 2 lists, but I would guess it is likely >= 3 lists.
   
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] revit13 commented on pull request #8159: ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-10 Thread GitBox


revit13 commented on pull request #8159:
URL: https://github.com/apache/arrow/pull/8159#issuecomment-690501845


   @pitrou  Please ignore my previous comment. It seems that the branch is 
merged ok. Thanks again. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8163: ARROW-9465: [Python] Improve ergonomics of compute module

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8163:
URL: https://github.com/apache/arrow/pull/8163#issuecomment-690496204


   https://issues.apache.org/jira/browse/ARROW-9465



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] revit13 removed a comment on pull request #8159: ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-10 Thread GitBox


revit13 removed a comment on pull request #8159:
URL: https://github.com/apache/arrow/pull/8159#issuecomment-690469456


   @pitrou Thanks again for the review. I would like to merge the changes but 
this pull request seems to be closed. I am not sure how to resolve this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou opened a new pull request #8163: ARROW-9465: [Python] Improve ergonomics of compute module

2020-09-10 Thread GitBox


pitrou opened a new pull request #8163:
URL: https://github.com/apache/arrow/pull/8163


   Lots of assorted things here:
   * Automatically generate global wrappers for calling
 registered compute functions
   * Improve metadata of such wrappers (e.g. name, docstring)
   * Make it easier to pass options (for example via kwargs)
   * Type-check options
   * Add some docstrings
   * Expose more function attributes (e.g. arity)
   * Fix some crashes



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] revit13 commented on pull request #8159: ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-10 Thread GitBox


revit13 commented on pull request #8159:
URL: https://github.com/apache/arrow/pull/8159#issuecomment-690469456


   @pitrou Thanks again for the review. I would like to merge the changes but 
this pull request seems to be closed. I am not sure how to resolve this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #8156: ARROW-9810: [C++] Generalized nested reconstruction helpers

2020-09-10 Thread GitBox


emkornfield commented on pull request #8156:
URL: https://github.com/apache/arrow/pull/8156#issuecomment-690377163


   > Hmm... I don't know if this is easy to tear apart, but I'd appreciate if 
the PR concentrated on the functionality, and micro-optimizations (SIMD, BMI or 
otherwise) separated into later PRs. Would that be reasonably doable?
   
   How about I remove the net new bitmap related code?  The SIMD/BMI code was 
already present hidden behind compiler flags instead of runtime dispatched.  
The runtime dispatch is cleaning up technical debt to allow users to make use 
of these in prepackaged binaries (the 
[PR](https://github.com/apache/arrow/pull/6985) that introduced them showed a 
20% improvement on our end-to-end parquet->Arrow reading benchmarks).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

2020-09-10 Thread GitBox


emkornfield commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-690369288


   A few minor doc comments; otherwise looks ok to me.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

2020-09-10 Thread GitBox


emkornfield commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r486437260



##
File path: cpp/src/parquet/arrow/writer.cc
##
@@ -63,6 +66,10 @@ using arrow::Status;
 using arrow::Table;
 using arrow::TimeUnit;
 
+using ArrowTypeId = arrow::Type;

Review comment:
   nit: this actually makes the code less understandable to me and it seems 
like it only saves 1 character?  (not a big deal either way though).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

2020-09-10 Thread GitBox


emkornfield commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r486435699



##
File path: cpp/src/parquet/arrow/schema.cc
##
@@ -682,48 +689,78 @@ Status GetOriginSchema(const std::shared_ptr& metadata,
 // Restore original Arrow field information that was serialized as Parquet 
metadata
 // but that is not necessarily present in the field reconstituted from Parquet 
data
 // (for example, Parquet timestamp types don't carry timezone information).
-Status ApplyOriginalMetadata(std::shared_ptr field, const Field& 
origin_field,
- std::shared_ptr* out) {
+
+Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* 
inferred) {
   auto origin_type = origin_field.type();
-  if (field->type()->id() == ::arrow::Type::TIMESTAMP) {
+  auto inferred_type = inferred->field->type();
+
+  if (inferred_type->id() == ::arrow::Type::TIMESTAMP) {
 // Restore time zone, if any
-const auto& ts_type = static_cast(*field->type());
+const auto& ts_type = static_cast(*inferred_type);
 const auto& ts_origin_type = static_cast(*origin_type);
 
 // If the unit is the same and the data is tz-aware, then set the original
 // time zone, since Parquet has no native storage for timezones
 if (ts_type.unit() == ts_origin_type.unit() && ts_type.timezone() == "UTC" 
&&
 ts_origin_type.timezone() != "") {
-  field = field->WithType(origin_type);
+  inferred->field = inferred->field->WithType(origin_type);
 }
   }
+
   if (origin_type->id() == ::arrow::Type::DICTIONARY &&
-  field->type()->id() != ::arrow::Type::DICTIONARY &&
-  IsDictionaryReadSupported(*field->type())) {
+  inferred_type->id() != ::arrow::Type::DICTIONARY &&
+  IsDictionaryReadSupported(*inferred_type)) {
 const auto& dict_origin_type =
 static_cast(*origin_type);
-field = field->WithType(
-::arrow::dictionary(::arrow::int32(), field->type(), 
dict_origin_type.ordered()));
+inferred->field = inferred->field->WithType(
+::arrow::dictionary(::arrow::int32(), inferred_type, 
dict_origin_type.ordered()));
+  }
+
+  // Restore field metadata
+  std::shared_ptr field_metadata = 
origin_field.metadata();
+  if (field_metadata != nullptr) {
+if (inferred->field->metadata()) {
+  // Prefer the metadata keys (like field_id) from the current metadata
+  field_metadata = field_metadata->Merge(*inferred->field->metadata());
+}
+inferred->field = inferred->field->WithMetadata(field_metadata);
   }
 
   if (origin_type->id() == ::arrow::Type::EXTENSION) {
 // Restore extension type, if the storage type is as read from Parquet
 const auto& ex_type = checked_cast(*origin_type);
-if (ex_type.storage_type()->Equals(*field->type())) {
-  field = field->WithType(origin_type);
+if (ex_type.storage_type()->Equals(*inferred_type)) {
+  inferred->field = inferred->field->WithType(origin_type);
 }
   }
 
-  // Restore field metadata
-  std::shared_ptr field_metadata = 
origin_field.metadata();
-  if (field_metadata != nullptr) {
-if (field->metadata()) {
-  // Prefer the metadata keys (like field_id) from the current metadata
-  field_metadata = field_metadata->Merge(*field->metadata());
+  // TODO Should apply metadata recursively, but for that we need to move 
metadata
+  // application inside NodeToSchemaField (ARROW-9943)
+
+  return Status::OK();
+}
+
+Status ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred) 
{
+  auto origin_type = origin_field.type();
+  auto inferred_type = inferred->field->type();
+
+  if (origin_type->id() == ::arrow::Type::EXTENSION) {
+const auto& ex_type = checked_cast(*origin_type);
+auto origin_storage_field = origin_field.WithType(ex_type.storage_type());
+
+// Apply metadata recursively to storage type
+RETURN_NOT_OK(ApplyOriginalStorageMetadata(*origin_storage_field, 
inferred));
+inferred->storage_field = inferred->field;
+
+// Restore extension type, if the storage type is as read from Parquet

Review comment:
   isn't everything here read from Parquet (is this in contrast to user 
provided?)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

2020-09-10 Thread GitBox


emkornfield commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r486435150



##
File path: cpp/src/parquet/arrow/schema.cc
##
@@ -682,48 +689,78 @@ Status GetOriginSchema(const std::shared_ptr& metadata,
 // Restore original Arrow field information that was serialized as Parquet 
metadata
 // but that is not necessarily present in the field reconstituted from Parquet 
data
 // (for example, Parquet timestamp types don't carry timezone information).
-Status ApplyOriginalMetadata(std::shared_ptr field, const Field& 
origin_field,
- std::shared_ptr* out) {
+
+Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* 
inferred) {
   auto origin_type = origin_field.type();
-  if (field->type()->id() == ::arrow::Type::TIMESTAMP) {
+  auto inferred_type = inferred->field->type();
+
+  if (inferred_type->id() == ::arrow::Type::TIMESTAMP) {
 // Restore time zone, if any
-const auto& ts_type = static_cast(*field->type());
+const auto& ts_type = static_cast(*inferred_type);
 const auto& ts_origin_type = static_cast(*origin_type);
 
 // If the unit is the same and the data is tz-aware, then set the original
 // time zone, since Parquet has no native storage for timezones
 if (ts_type.unit() == ts_origin_type.unit() && ts_type.timezone() == "UTC" 
&&
 ts_origin_type.timezone() != "") {
-  field = field->WithType(origin_type);
+  inferred->field = inferred->field->WithType(origin_type);
 }
   }
+
   if (origin_type->id() == ::arrow::Type::DICTIONARY &&
-  field->type()->id() != ::arrow::Type::DICTIONARY &&
-  IsDictionaryReadSupported(*field->type())) {
+  inferred_type->id() != ::arrow::Type::DICTIONARY &&
+  IsDictionaryReadSupported(*inferred_type)) {
 const auto& dict_origin_type =
 static_cast(*origin_type);
-field = field->WithType(
-::arrow::dictionary(::arrow::int32(), field->type(), 
dict_origin_type.ordered()));
+inferred->field = inferred->field->WithType(
+::arrow::dictionary(::arrow::int32(), inferred_type, 
dict_origin_type.ordered()));
+  }
+
+  // Restore field metadata
+  std::shared_ptr field_metadata = 
origin_field.metadata();
+  if (field_metadata != nullptr) {
+if (inferred->field->metadata()) {
+  // Prefer the metadata keys (like field_id) from the current metadata
+  field_metadata = field_metadata->Merge(*inferred->field->metadata());
+}
+inferred->field = inferred->field->WithMetadata(field_metadata);
   }
 
   if (origin_type->id() == ::arrow::Type::EXTENSION) {
 // Restore extension type, if the storage type is as read from Parquet
 const auto& ex_type = checked_cast(*origin_type);
-if (ex_type.storage_type()->Equals(*field->type())) {
-  field = field->WithType(origin_type);
+if (ex_type.storage_type()->Equals(*inferred_type)) {
+  inferred->field = inferred->field->WithType(origin_type);
 }
   }
 
-  // Restore field metadata
-  std::shared_ptr field_metadata = 
origin_field.metadata();
-  if (field_metadata != nullptr) {
-if (field->metadata()) {
-  // Prefer the metadata keys (like field_id) from the current metadata
-  field_metadata = field_metadata->Merge(*field->metadata());
+  // TODO Should apply metadata recursively, but for that we need to move 
metadata
+  // application inside NodeToSchemaField (ARROW-9943)
+
+  return Status::OK();
+}
+
+Status ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred) 
{
+  auto origin_type = origin_field.type();
+  auto inferred_type = inferred->field->type();
+
+  if (origin_type->id() == ::arrow::Type::EXTENSION) {
+const auto& ex_type = checked_cast(*origin_type);
+auto origin_storage_field = origin_field.WithType(ex_type.storage_type());
+
+// Apply metadata recursively to storage type

Review comment:
   this seems to conflict with the TODO above.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486434226



##
File path: cpp/src/parquet/key_toolkit.h
##
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+#include 
+
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+#include "parquet/platform.h"
+#include "parquet/two_level_cache_with_expiration.h"
+
+namespace parquet {
+namespace encryption {
+
+class KeyWithMasterID {

Review comment:
   > Can you put internal-only stuff in separate header files?
   
   `KeyWithMasterId` is an internal thing; however, it appears in the public 
header file `FileKeyUnwrapper.h` (i.e. `KeyWithMasterId 
FileKeyUnwrapper::GetDataEncryptionKey(const KeyMaterial& key_material) const`, 
which will be used in KeyToolkit key rotation later). So it must be in a 
public `key_toolkit.h`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486420973



##
File path: cpp/src/parquet/key_toolkit.h
##
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+#include 
+
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+#include "parquet/platform.h"
+#include "parquet/two_level_cache_with_expiration.h"
+
+namespace parquet {
+namespace encryption {
+
+class KeyWithMasterID {
+ public:
+  KeyWithMasterID(const std::string& key_bytes, const std::string& master_id)
+  : key_bytes_(key_bytes), master_id_(master_id) {}
+
+  const std::string& data_key() const { return key_bytes_; }
+  const std::string& master_id() const { return master_id_; }
+
+ private:
+  std::string key_bytes_;
+  std::string master_id_;
+};
+
+class PARQUET_EXPORT KeyToolkit {
+ public:
+  class KmsClientCache {
+   public:
+static KmsClientCache& GetInstance() {
+  static KmsClientCache instance;
+  return instance;
+}
+TwoLevelCacheWithExpiration>& cache() { return 
cache_; }
+
+   private:
+TwoLevelCacheWithExpiration> cache_;
+  };
+
+  class KEKWriteCache {
+   public:
+static KEKWriteCache& GetInstance() {
+  static KEKWriteCache instance;
+  return instance;
+}
+TwoLevelCacheWithExpiration& cache() { return cache_; }
+
+   private:
+TwoLevelCacheWithExpiration cache_;
+  };
+
+  class KEKReadCache {
+   public:
+static KEKReadCache& GetInstance() {
+  static KEKReadCache instance;
+  return instance;
+}
+TwoLevelCacheWithExpiration& cache() { return cache_; }
+
+   private:
+TwoLevelCacheWithExpiration cache_;
+  };
+
+  // KMS client two level cache: token -> KMSInstanceId -> KmsClient
+  static TwoLevelCacheWithExpiration>&
+  kms_client_cache_per_token() {
+return KmsClientCache::GetInstance().cache();
+  }
+
+  // KEK two level cache for wrapping: token -> MEK_ID -> KeyEncryptionKey
+  static TwoLevelCacheWithExpiration& 
kek_write_cache_per_token() {
+return KEKWriteCache::GetInstance().cache();
+  }
+
+  // KEK two level cache for unwrapping: token -> KEK_ID -> KEK bytes
+  static TwoLevelCacheWithExpiration& kek_read_cache_per_token() {
+return KEKReadCache::GetInstance().cache();
+  }
+
+  static std::shared_ptr GetKmsClient(
+  std::shared_ptr kms_client_factory,

Review comment:
   > Why isn't this a method on KmsClientFactory?
   
   @pitrou because here we get the client from the cache; we create a new 
client from the factory only if the item is not in the cache (or the 
corresponding token has expired).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #8008: ARROW-9369: [Python] Support conversion from python sequence to dictionary type

2020-09-10 Thread GitBox


kszucs commented on pull request #8008:
URL: https://github.com/apache/arrow/pull/8008#issuecomment-690347767


   We can close this in favor of #8088



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs closed pull request #8008: ARROW-9369: [Python] Support conversion from python sequence to dictionary type

2020-09-10 Thread GitBox


kszucs closed pull request #8008:
URL: https://github.com/apache/arrow/pull/8008


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ggershinsky commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


ggershinsky commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486411073



##
File path: cpp/src/parquet/key_toolkit.h
##
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+#include 
+
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+#include "parquet/platform.h"
+#include "parquet/two_level_cache_with_expiration.h"
+
+namespace parquet {
+namespace encryption {
+
+class KeyWithMasterID {
+ public:
+  KeyWithMasterID(const std::string& key_bytes, const std::string& master_id)
+  : key_bytes_(key_bytes), master_id_(master_id) {}
+
+  const std::string& data_key() const { return key_bytes_; }
+  const std::string& master_id() const { return master_id_; }
+
+ private:
+  std::string key_bytes_;
+  std::string master_id_;
+};
+
+class PARQUET_EXPORT KeyToolkit {
+ public:
+  class KmsClientCache {
+   public:
+static KmsClientCache& GetInstance() {
+  static KmsClientCache instance;

Review comment:
   >  if the cache is a member of `PropertiesDrivenCryptoFactory`, users of 
API must keep the life time of `PropertiesDrivenCryptoFactory` object after 
getting `FileEncryptionProperties`/`FileDecryptionProperties`. If they forget 
that, it will be easy to get a crash in their programs.
   
   Nope, the lifetime is for internal use; the entries (cached by the token) 
are removed after that time, and automatically re-created upon the next call 
(the token is checked again via the KMS interaction). Users don't need to 
handle this.
   
   
   > let users manage the life time of the cache themselves and pass it into 
PropertiesDrivenCryptoFactory.
   > user have to keep an object alive along the time of their program
   
   I'm afraid this virtually guarantees most users will run with the 
default behavior - without caches - and will suffer from performance hits. On 
the other hand, if the caches are automatically created in the 
PropertiesDrivenCryptoFactory, users will get performance optimization out of 
box. Also, this doesn't involve singletons.
   
   > needs to be well documented
   
   Certainly





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486400053



##
File path: cpp/src/parquet/key_toolkit.h
##
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+#include 
+
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+#include "parquet/platform.h"
+#include "parquet/two_level_cache_with_expiration.h"
+
+namespace parquet {
+namespace encryption {
+
+class KeyWithMasterID {
+ public:
+  KeyWithMasterID(const std::string& key_bytes, const std::string& master_id)
+  : key_bytes_(key_bytes), master_id_(master_id) {}
+
+  const std::string& data_key() const { return key_bytes_; }
+  const std::string& master_id() const { return master_id_; }
+
+ private:
+  std::string key_bytes_;
+  std::string master_id_;
+};
+
+class PARQUET_EXPORT KeyToolkit {
+ public:
+  class KmsClientCache {
+   public:
+static KmsClientCache& GetInstance() {
+  static KmsClientCache instance;

Review comment:
   Eventually I think the two approaches above are identical: the user has to 
keep an object alive for the lifetime of their program, and both approaches 
need to be well documented.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486397802



##
File path: cpp/src/parquet/key_toolkit.h
##
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+#include 
+
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+#include "parquet/platform.h"
+#include "parquet/two_level_cache_with_expiration.h"
+
+namespace parquet {
+namespace encryption {
+
+class KeyWithMasterID {
+ public:
+  KeyWithMasterID(const std::string& key_bytes, const std::string& master_id)
+  : key_bytes_(key_bytes), master_id_(master_id) {}
+
+  const std::string& data_key() const { return key_bytes_; }
+  const std::string& master_id() const { return master_id_; }
+
+ private:
+  std::string key_bytes_;
+  std::string master_id_;
+};
+
+class PARQUET_EXPORT KeyToolkit {
+ public:
+  class KmsClientCache {
+   public:
+static KmsClientCache& GetInstance() {
+  static KmsClientCache instance;
+  return instance;
+}
+TwoLevelCacheWithExpiration<std::shared_ptr<KmsClient>>& cache() { return cache_; }
+
+   private:
+TwoLevelCacheWithExpiration<std::shared_ptr<KmsClient>> cache_;
+  };
+
+  class KEKWriteCache {
+   public:
+static KEKWriteCache& GetInstance() {
+  static KEKWriteCache instance;
+  return instance;
+}
+TwoLevelCacheWithExpiration<KeyEncryptionKey>& cache() { return cache_; }
+
+   private:
+TwoLevelCacheWithExpiration<KeyEncryptionKey> cache_;
+  };
+
+  class KEKReadCache {
+   public:
+static KEKReadCache& GetInstance() {
+  static KEKReadCache instance;
+  return instance;
+}
+TwoLevelCacheWithExpiration<std::string>& cache() { return cache_; }
+
+   private:
+TwoLevelCacheWithExpiration<std::string> cache_;
+  };
+
+  // KMS client two level cache: token -> KMSInstanceId -> KmsClient
+  static TwoLevelCacheWithExpiration<std::shared_ptr<KmsClient>>&
+  kms_client_cache_per_token() {
+return KmsClientCache::GetInstance().cache();
+  }
+
+  // KEK two level cache for wrapping: token -> MEK_ID -> KeyEncryptionKey
+  static TwoLevelCacheWithExpiration<KeyEncryptionKey>& kek_write_cache_per_token() {
+return KEKWriteCache::GetInstance().cache();
+  }
+
+  // KEK two level cache for unwrapping: token -> KEK_ID -> KEK bytes
+  static TwoLevelCacheWithExpiration<std::string>& kek_read_cache_per_token() {
+return KEKReadCache::GetInstance().cache();
+  }
+
+  static std::shared_ptr<KmsClient> GetKmsClient(
+  std::shared_ptr<KmsClientFactory> kms_client_factory,
+  const KmsConnectionConfig& kms_connection_config, bool is_wrap_locally,
+  uint64_t cache_entry_lifetime);
+
+  // Encrypts "key" with "master_key", using AES-GCM and the "aad"
+  static std::string EncryptKeyLocally(const std::string& key,

Review comment:
   I guess the purpose of putting the `EncryptKeyLocally`/`DecryptKeyLocally` 
methods in `KeyToolkit` is that users can make use of them to encrypt/decrypt 
a key locally without writing the code again. An example: we have used these 
methods in the `TestOnlyInMemoryKms` class. So it's not really internal stuff. 
cc @ggershinsky 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486382968



##
File path: cpp/src/parquet/key_material.h
##
@@ -0,0 +1,120 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+
+#include 
+
+namespace parquet {
+namespace encryption {
+
+// KeyMaterial class represents the "key material", keeping the information 
that allows
+// readers to recover an encryption key (see description of the KeyMetadata 
class). The
+// keytools package (PARQUET-1373) implements the "envelope encryption" 
pattern, in a
+// "single wrapping" or "double wrapping" mode. In the single wrapping mode, 
the key
+// material is generated by encrypting the "data encryption key" (DEK) by a 
"master key".
+// In the double wrapping mode, the key material is generated by encrypting 
the DEK by a
+// "key encryption key" (KEK), that in turn is encrypted by a "master key".
+//
+// Key material is kept in a flat json object, with the following fields:
+// 1. "keyMaterialType" - a String, with the type of key material. In the current
+// version, only one value is allowed - "PKMT1" (stands
+// for "parquet key management tools, version 1"). For external key 
material storage,
+// this field is written in both "key metadata" and "key material" jsons. 
For internal
+// key material storage, this field is written only once in the common 
json.
+// 2. "isFooterKey" - a boolean. If true, means that the material belongs to a 
file footer
+// key, and keeps additional information (such as
+// KMS instance ID and URL). If false, means that the material belongs to 
a column
+// key.
+// 3. "kmsInstanceID" - a String, with the KMS Instance ID. Written only in 
footer key
+// material.
+// 4. "kmsInstanceURL" - a String, with the KMS Instance URL. Written only in 
footer key
+// material.
+// 5. "masterKeyID" - a String, with the ID of the master key used to generate 
the
+// material.
+// 6. "wrappedDEK" - a String, with the wrapped DEK (base64 encoding).
+// 7. "doubleWrapping" - a boolean. If true, means that the material was 
generated in
+// double wrapping mode.
+// If false - in single wrapping mode.
+// 8. "keyEncryptionKeyID" - a String, with the ID of the KEK used to generate 
the
+// material. Written only in double wrapping mode.
+// 9. "wrappedKEK" - a String, with the wrapped KEK (base64 encoding). Written 
only in
+// double wrapping mode.
+class KeyMaterial {
+ public:
+  static constexpr char KEY_MATERIAL_TYPE_FIELD[] = "keyMaterialType";
+  static constexpr char KEY_MATERIAL_TYPE1[] = "PKMT1";
+
+  static constexpr char FOOTER_KEY_ID_IN_FILE[] = "footerKey";
+  static constexpr char COLUMN_KEY_ID_IN_FILE_PREFIX[] = "columnKey";
+
+  static constexpr char IS_FOOTER_KEY_FIELD[] = "isFooterKey";
+  static constexpr char DOUBLE_WRAPPING_FIELD[] = "doubleWrapping";
+  static constexpr char KMS_INSTANCE_ID_FIELD[] = "kmsInstanceID";
+  static constexpr char KMS_INSTANCE_URL_FIELD[] = "kmsInstanceURL";
+  static constexpr char MASTER_KEY_ID_FIELD[] = "masterKeyID";
+  static constexpr char WRAPPED_DEK_FIELD[] = "wrappedDEK";
+  static constexpr char KEK_ID_FIELD[] = "keyEncryptionKeyID";
+  static constexpr char WRAPPED_KEK_FIELD[] = "wrappedKEK";
+
+ public:
+  KeyMaterial() = default;
+
+  static KeyMaterial Parse(const std::string& key_material_string);
+
+  static KeyMaterial Parse(const rapidjson::Document& key_material_json);
+
+  static std::string CreateSerialized(bool is_footer_key,

Review comment:
   I updated it to `SerializeToJson`
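Putting the field list in this file together, a serialized key-material object for a double-wrapped footer key would look roughly like the following (all field values are made up for illustration):

```json
{
  "keyMaterialType": "PKMT1",
  "isFooterKey": true,
  "kmsInstanceID": "DEMO_KMS_INSTANCE_1",
  "kmsInstanceURL": "https://kms.example.com",
  "masterKeyID": "footer_master_key",
  "doubleWrapping": true,
  "keyEncryptionKeyID": "a2VrX2lkXzAx",
  "wrappedKEK": "bWFzdGVyLXdyYXBwZWQta2Vr",
  "wrappedDEK": "a2VrLXdyYXBwZWQtZGVr"
}
```

In single wrapping mode, "doubleWrapping" would be false and the "keyEncryptionKeyID"/"wrappedKEK" fields would be absent.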





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486382035



##
File path: cpp/src/parquet/file_key_wrapper.h
##
@@ -0,0 +1,68 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+
+#include "parquet/file_key_material_store.h"
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+
+namespace parquet {
+namespace encryption {
+
+class FileKeyWrapper {
+ public:
+  static constexpr int KEK_LENGTH = 16;
+  static constexpr int KEK_ID_LENGTH = 16;
+
+  static constexpr int RND_MAX_BYTES = 32;
+
+  FileKeyWrapper(std::shared_ptr<KmsClientFactory> kms_client_factory,
+                 const KmsConnectionConfig& kms_connection_config,
+                 std::shared_ptr<FileKeyMaterialStore> key_material_store,
+                 uint64_t cache_entry_lifetime, bool double_wrapping,
+                 bool is_wrap_locally);
+
+  std::string GetEncryptionKeyMetadata(const std::string& data_key,
+   const std::string& master_key_id,
+   bool is_footer_key);
+
+  std::string GetEncryptionKeyMetadata(const std::string& data_key,
+   const std::string& master_key_id,
+   bool is_footer_key, std::string key_id_in_file);
+
+ private:
+  KeyEncryptionKey CreateKeyEncryptionKey(const std::string& master_key_id);
+
+  // A map of MEK_ID -> KeyEncryptionKey, for the current token
+  std::map<std::string, KeyEncryptionKey> kek_per_master_key_id_;
+
+  std::shared_ptr<KmsClient> kms_client_;
+  KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<FileKeyMaterialStore> key_material_store_;
+  uint64_t cache_entry_lifetime_;
+  bool double_wrapping_;
+  uint16_t key_counter_;

Review comment:
   I removed it anyway, since it's for external storage.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486380916



##
File path: cpp/src/parquet/key_metadata.h
##
@@ -0,0 +1,91 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+
+#include "parquet/key_material.h"
+
+namespace parquet {
+namespace encryption {
+
+// Parquet encryption specification defines "key metadata" as an arbitrary 
byte array,
+// generated by file writers for each encryption key, and passed to the low 
level API for
+// storage in the file footer. The "key metadata" field is made available to 
file readers
+// to enable recovery of the key. This simple interface can be utilized for 
implementation
+// of any key management scheme.
+//
+// The keytools package (PARQUET-1373) implements one approach, of many 
possible, to key
+// management and to generation of the "key metadata" fields. This approach, 
based on the
+// "envelope encryption" pattern, allows working with KMS servers. It keeps 
the actual
+// material, required to recover a key, in a "key material" object (see the 
KeyMaterial
+// class for details).
+//
+// KeyMetadata class writes (and reads) the "key metadata" field as a flat 
json object,
+// with the following fields:
+// 1. "keyMaterialType" - a String, with the type of key material. In the current

Review comment:
   @ggershinsky any suggestion for the comment?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486372095



##
File path: cpp/src/parquet/file_key_wrapper.h
##
@@ -0,0 +1,68 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+
+#include "parquet/file_key_material_store.h"
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+
+namespace parquet {
+namespace encryption {
+
+class FileKeyWrapper {
+ public:
+  static constexpr int KEK_LENGTH = 16;
+  static constexpr int KEK_ID_LENGTH = 16;
+
+  static constexpr int RND_MAX_BYTES = 32;
+
+  FileKeyWrapper(std::shared_ptr<KmsClientFactory> kms_client_factory,
+                 const KmsConnectionConfig& kms_connection_config,
+                 std::shared_ptr<FileKeyMaterialStore> key_material_store,
+                 uint64_t cache_entry_lifetime, bool double_wrapping,
+                 bool is_wrap_locally);
+
+  std::string GetEncryptionKeyMetadata(const std::string& data_key,
+   const std::string& master_key_id,
+   bool is_footer_key);
+
+  std::string GetEncryptionKeyMetadata(const std::string& data_key,

Review comment:
   @ggershinsky I removed the method with `key_id_in_file`. It will be 
added back when we work on external storage.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486371291



##
File path: cpp/src/parquet/kms_client_factory.h
##
@@ -0,0 +1,42 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "parquet/kms_client.h"
+#include "parquet/platform.h"
+
+namespace parquet {
+namespace encryption {
+
+class PARQUET_EXPORT KmsClientFactory {
+ public:
+  explicit KmsClientFactory(bool wrap_locally) : wrap_locally_(wrap_locally) {}
+
+  KmsClientFactory() : KmsClientFactory(false) {}
+
+  virtual ~KmsClientFactory() {}
+
+  virtual std::shared_ptr<KmsClient> CreateKmsClient(

Review comment:
   > If this is only for internal use
   
   @pitrou it's for public use.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486370626



##
File path: cpp/src/parquet/kms_client_factory.h
##
@@ -0,0 +1,42 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "parquet/kms_client.h"
+#include "parquet/platform.h"
+
+namespace parquet {
+namespace encryption {
+
+class PARQUET_EXPORT KmsClientFactory {
+ public:
+  explicit KmsClientFactory(bool wrap_locally) : wrap_locally_(wrap_locally) {}
+
+  KmsClientFactory() : KmsClientFactory(false) {}
+
+  virtual ~KmsClientFactory() {}
+
+  virtual std::shared_ptr<KmsClient> CreateKmsClient(

Review comment:
   > The user can't create a KmsClientFactory directly?
   
   No, they can't. `KmsClientFactory` doesn't know the actual type of 
`KmsClient` yet. When users create their own KMS client class (inherited from 
`KmsClient`), they have to create their own `KmsClientFactory` as well.
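The subclassing pairing described here can be sketched in a self-contained way. The interfaces below are simplified stand-ins (hypothetical names and signatures, not the real `parquet::encryption` API), showing only why a custom client implies a custom factory:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Simplified stand-in for the abstract KMS client interface.
class DemoKmsClient {
 public:
  virtual ~DemoKmsClient() = default;
  virtual std::string WrapKey(const std::string& key,
                              const std::string& master_key_id) = 0;
};

// Simplified stand-in for the abstract factory: it cannot know which
// concrete client type to construct, so users must subclass it too.
class DemoKmsClientFactory {
 public:
  virtual ~DemoKmsClientFactory() = default;
  virtual std::shared_ptr<DemoKmsClient> CreateKmsClient(
      const std::string& kms_instance_id) = 0;
};

// User-provided pair: a concrete client plus the factory that creates it.
class InMemoryKmsClient : public DemoKmsClient {
 public:
  std::string WrapKey(const std::string& key,
                      const std::string& master_key_id) override {
    return master_key_id + ":" + key;  // placeholder "wrapping"
  }
};

class InMemoryKmsClientFactory : public DemoKmsClientFactory {
 public:
  std::shared_ptr<DemoKmsClient> CreateKmsClient(
      const std::string& /*kms_instance_id*/) override {
    return std::make_shared<InMemoryKmsClient>();
  }
};
```

The crypto factory only ever sees the abstract factory type, which is why the base class alone cannot be instantiated directly.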





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] thamht4190 commented on a change in pull request #8023: ARROW-9318: [C++] Parquet encryption key management

2020-09-10 Thread GitBox


thamht4190 commented on a change in pull request #8023:
URL: https://github.com/apache/arrow/pull/8023#discussion_r486367648



##
File path: cpp/src/parquet/key_toolkit.h
##
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+#include 
+#include 
+#include 
+
+#include "parquet/key_encryption_key.h"
+#include "parquet/kms_client.h"
+#include "parquet/kms_client_factory.h"
+#include "parquet/platform.h"
+#include "parquet/two_level_cache_with_expiration.h"
+
+namespace parquet {
+namespace encryption {
+
+class KeyWithMasterID {
+ public:
+  KeyWithMasterID(const std::string& key_bytes, const std::string& master_id)
+  : key_bytes_(key_bytes), master_id_(master_id) {}
+
+  const std::string& data_key() const { return key_bytes_; }
+  const std::string& master_id() const { return master_id_; }
+
+ private:
+  std::string key_bytes_;
+  std::string master_id_;
+};
+
+class PARQUET_EXPORT KeyToolkit {
+ public:
+  class KmsClientCache {
+   public:
+static KmsClientCache& GetInstance() {
+  static KmsClientCache instance;

Review comment:
   @ggershinsky I worry that if the cache is a member of 
`PropertiesDrivenCryptoFactory`, users of the API must keep the 
`PropertiesDrivenCryptoFactory` object alive after getting 
`FileEncryptionProperties`/`FileDecryptionProperties`. If they forget that, it 
will be easy to get a crash in their programs.
   I prefer to let users manage the lifetime of the cache themselves and pass 
it into `PropertiesDrivenCryptoFactory`.
   Any final decision?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #8162: ARROW-9962: [Python] Fix conversion to_pandas with tz-aware index column and fixed offset timezones

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8162:
URL: https://github.com/apache/arrow/pull/8162#issuecomment-690299825


   https://issues.apache.org/jira/browse/ARROW-9962



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche opened a new pull request #8162: ARROW-9962: [Python] Fix conversion to_pandas with tz-aware index column and fixed offset timezones

2020-09-10 Thread GitBox


jorisvandenbossche opened a new pull request #8162:
URL: https://github.com/apache/arrow/pull/8162


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] raduteo commented on pull request #8130: Parquet file writer snapshot API and proper ColumnChunk.file_path utilization

2020-09-10 Thread GitBox


raduteo commented on pull request #8130:
URL: https://github.com/apache/arrow/pull/8130#issuecomment-690275695


   RowGroup level file name is certainly supported by fastparquet:
   
https://github.com/dask/fastparquet/blob/0402257560e20b961a517ee6d770e0995e944163/fastparquet/api.py#L187
 

   
   and the java code does read file_path (again with the one-file-per-rowgroup 
constraint): 
   
https://github.com/apache/parquet-mr/blob/65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1410
   
   but it’s less clear how it is used (certainly ParquetFileReader class seems 
to stick to a single file)
   
   More than anything the feature is fully in line with the parquet spec 
(unless I am misreading something):
   
   
https://github.com/apache/parquet-format/blob/01971a532e20ff8e5eba9d440289bfb753f0cf0b/src/main/thrift/parquet.thrift#L769
   
   Also the code changes are not affecting any of the existing behavior, 
specifically even if one uses the proposed `Snapshot` method during file 
writing, the final file is still readable by the java and the fastparquet 
implementation.
   
   I am happy to open a discussion on the parquet list and push for broader 
support around this feature, but given that it is spec compliant and backward 
compatible with the existing code, I hope we can allow this PR to proceed 
independently.
   
   > On Sep 10, 2020, at 1:46 AM, emkornfield  wrote:
   > 
   > 
   > I don't think we should support this unless we can get consensus on 
dev@parquet mailing list that we want to support this across java and C++ (if 
java already supports it a pointer would be useful).
   > 
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou edited a comment on pull request #8156: ARROW-9810: [C++] Generalized nested reconstruction helpers

2020-09-10 Thread GitBox


pitrou edited a comment on pull request #8156:
URL: https://github.com/apache/arrow/pull/8156#issuecomment-690272681


   Hmm... I don't know if this is easy to tear apart, but I'd appreciate it if the 
PR concentrated on the functionality, with micro-optimizations (SIMD, BMI or 
otherwise) separated into later PRs. Would that be reasonably doable?







[GitHub] [arrow] pitrou commented on pull request #8156: ARROW-9810: [C++] Generalized nested reconstruction helpers

2020-09-10 Thread GitBox


pitrou commented on pull request #8156:
URL: https://github.com/apache/arrow/pull/8156#issuecomment-690272681


   Hmm... I don't know if this is easy to tear apart, but I'd appreciate it if the 
PR concentrated on the functionality, with micro-optimizations (SIMD or 
otherwise) separated into later PRs. Would that be reasonably doable?







[GitHub] [arrow] pitrou commented on a change in pull request #8156: ARROW-9810: [C++] Generalized nested reconstruction helpers

2020-09-10 Thread GitBox


pitrou commented on a change in pull request #8156:
URL: https://github.com/apache/arrow/pull/8156#discussion_r486317787



##
File path: cpp/src/parquet/level_comparison.h
##
@@ -0,0 +1,91 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include 
+#include 
+
+#include "arrow/util/bit_util.h"
+
+namespace parquet {
+namespace internal {
+
+// These APIs are likely to be revised as part of ARROW-8494 to reduce 
duplicate code.
+// They currently represent minimal functionality for vectorized computation 
of definition
+// levels.
+
+/// Builds a bitmap by applying predicate to the level vector provided.
+///
+/// \param[in] levels Rep or def level array.
+/// \param[in] num_levels The number of levels to process (must be [0, 64])
+/// \param[in] predicate The predicate to apply (must have the signature `bool
+/// predicate(int16_t)`).
+/// \returns The bitmap using least significant "bit" ordering.
+///
+/// N.B. Correct byte ordering is dependent on little-endian architectures.
+///
+template <typename Predicate>
+inline uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels,
+                               Predicate predicate) {
+  // Both clang and GCC can vectorize this automatically with SSE4/AVX2.
+  uint64_t mask = 0;
+  for (int x = 0; x < num_levels; x++) {
+    mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x;
+  }
+  return ::arrow::BitUtil::ToLittleEndian(mask);
+}
+
+/// Builds a bitmap where each set bit indicates the corresponding level is 
greater

Review comment:
   I don't understand why you're doing this. Integer comparisons are cheap, 
while writing and reading bitmaps is expensive. Why go through an intermediate 
bitmap (and then spend time trying to optimize it)?









[GitHub] [arrow] pitrou commented on a change in pull request #8156: ARROW-9810: [C++] Generalized nested reconstruction helpers

2020-09-10 Thread GitBox


pitrou commented on a change in pull request #8156:
URL: https://github.com/apache/arrow/pull/8156#discussion_r486317787



##
File path: cpp/src/parquet/level_comparison.h
##
@@ -0,0 +1,91 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include 
+#include 
+
+#include "arrow/util/bit_util.h"
+
+namespace parquet {
+namespace internal {
+
+// These APIs are likely to be revised as part of ARROW-8494 to reduce 
duplicate code.
+// They currently represent minimal functionality for vectorized computation 
of definition
+// levels.
+
+/// Builds a bitmap by applying predicate to the level vector provided.
+///
+/// \param[in] levels Rep or def level array.
+/// \param[in] num_levels The number of levels to process (must be [0, 64])
+/// \param[in] predicate The predicate to apply (must have the signature `bool
+/// predicate(int16_t)`).
+/// \returns The bitmap using least significant "bit" ordering.
+///
+/// N.B. Correct byte ordering is dependent on little-endian architectures.
+///
+template <typename Predicate>
+inline uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels,
+                               Predicate predicate) {
+  // Both clang and GCC can vectorize this automatically with SSE4/AVX2.
+  uint64_t mask = 0;
+  for (int x = 0; x < num_levels; x++) {
+    mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x;
+  }
+  return ::arrow::BitUtil::ToLittleEndian(mask);
+}
+
+/// Builds a bitmap where each set bit indicates the corresponding level is 
greater

Review comment:
   I don't understand why you're doing this. Integer comparisons are cheap, 
while writing and reading bitmaps is expensive. Why go through an intermediate 
bitmap (and then spend time trying to optimize it)?









[GitHub] [arrow] pitrou commented on pull request #8148: ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-10 Thread GitBox


pitrou commented on pull request #8148:
URL: https://github.com/apache/arrow/pull/8148#issuecomment-690267262


   It appears this was obsoleted by #8159, closing.







[GitHub] [arrow] pitrou closed pull request #8148: ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-10 Thread GitBox


pitrou closed pull request #8148:
URL: https://github.com/apache/arrow/pull/8148


   







[GitHub] [arrow] pitrou closed pull request #8143: ARROW-9949: [C++] Improve performance of Decimal128::FromString by 46%, and make the implementation reusable for Decimal256.

2020-09-10 Thread GitBox


pitrou closed pull request #8143:
URL: https://github.com/apache/arrow/pull/8143


   







[GitHub] [arrow] pitrou commented on pull request #8143: ARROW-9949: [C++] Improve performance of Decimal128::FromString by 46%, and make the implementation reusable for Decimal256.

2020-09-10 Thread GitBox


pitrou commented on pull request #8143:
URL: https://github.com/apache/arrow/pull/8143#issuecomment-690265815


   For the record, I get a ~20% speedup on AMD Ryzen.







[GitHub] [arrow] pitrou commented on a change in pull request #8143: ARROW-9949: [C++] Improve performance of Decimal128::FromString by 46%, and make the implementation reusable for Decimal256.

2020-09-10 Thread GitBox


pitrou commented on a change in pull request #8143:
URL: https://github.com/apache/arrow/pull/8143#discussion_r486307010



##
File path: cpp/src/arrow/util/int128_internal.h
##
@@ -0,0 +1,42 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#ifndef __SIZEOF_INT128__
+#include 
+#endif
+
+namespace arrow {
+namespace internal {
+
+// NOTE: __int128_t and boost::multiprecision::int128_t are not 
interchangeable.
+// For example, __int128_t does not have any member function, and does not have
+// operator<<(std::ostream, __int128_t). On the other hand, the behavior of
+// boost::multiprecision::int128_t might be surprising with some configs (e.g.,
+// static_cast<uint64_t>(boost::multiprecision::uint128_t) might return
+// ~uint64_t{0} instead of the lower 64 bits of the input).

Review comment:
   Fun :-)









[GitHub] [arrow] pitrou closed pull request #8159: ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-10 Thread GitBox


pitrou closed pull request #8159:
URL: https://github.com/apache/arrow/pull/8159


   







[GitHub] [arrow] pitrou commented on pull request #8159: ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-10 Thread GitBox


pitrou commented on pull request #8159:
URL: https://github.com/apache/arrow/pull/8159#issuecomment-690253989


   @revit13 This is good, thank you!







[GitHub] [arrow] github-actions[bot] commented on pull request #8159: ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's director

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8159:
URL: https://github.com/apache/arrow/pull/8159#issuecomment-690253470


   https://issues.apache.org/jira/browse/ARROW-9104







[GitHub] [arrow] pitrou commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

2020-09-10 Thread GitBox


pitrou commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-690250651


   @emkornfield Would you like to take a quick look?







[GitHub] [arrow] romainfrancois commented on a change in pull request #8150: ARROW-9271: [R] Preserve data frame metadata in round trip

2020-09-10 Thread GitBox


romainfrancois commented on a change in pull request #8150:
URL: https://github.com/apache/arrow/pull/8150#discussion_r486283116



##
File path: r/R/table.R
##
@@ -185,6 +185,36 @@ Table <- R6Class("Table", inherit = ArrowObject,
   )
 )
 
+bad_attributes <- function(x) {
+  UseMethod("bad_attributes")
+}
+
+bad_attributes.default <- function(x) character()
+bad_attributes.data.frame <- function(x) c("class", "row.names", "names")

Review comment:
   And if we do that, we get to reconstruct grouped data frames as well: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   df <- data.frame(x = 1:10, g = rep(1:2, each = 5)) %>% group_by(g)
   as.data.frame(Table$create(df))
   #> # A tibble: 10 x 2
   #> # Groups:   g [2]
   #>        x     g
   #>    <int> <int>
   #>  1     1     1
   #>  2     2     1
   #>  3     3     1
   #>  4     4     1
   #>  5     5     1
   #>  6     6     2
   #>  7     7     2
   #>  8     8     2
   #>  9     9     2
   #> 10    10     2
   ```
   
   Created on 2020-09-10 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)









[GitHub] [arrow] alamb commented on a change in pull request #8144: ARROW-9950: [Rust] [DataFusion] Made UDFs usable without a registry

2020-09-10 Thread GitBox


alamb commented on a change in pull request #8144:
URL: https://github.com/apache/arrow/pull/8144#discussion_r486271643



##
File path: rust/datafusion/examples/simple_udf.rs
##
@@ -111,20 +111,28 @@ fn main() -> Result<()> {
 pow,
 );
 
-// finally, register the UDF
-ctx.register_udf(pow);
+// at this point, we can use it or register it, depending on the use-case:
+// * if the UDF is expected to be used throughout the program in different 
contexts,
+//   we can register it, and call it later:
+ctx.register_udf(pow.clone()); // clone is only required in this example 
because we show both usages
 
-// at this point, we can use it. Note that the code below can be in a
-// scope on which we do not have access to `pow`.
+// * if the UDF is expected to be used directly in the scope, `.call` it 
directly:
+let expr = pow.call(vec![col("a"), col("b")]);
 
 // get a DataFrame from the context
 let df = ctx.table("t")?;
 
-// get the udf registry.
-let f = df.registry();
-
-// equivalent to `'SELECT pow(a, b) FROM t'`
-let df = df.select(vec![f.udf("pow", vec![col("a"), col("b")])?])?;
+// if we do not have `pow` in the scope and we registered it, we can get 
it from the registry
+let pow = df.registry().udf("pow")?;
+// equivalent to expr
+let expr1 = pow.call(vec![col("a"), col("b")]);
+

Review comment:
   ```suggestion
   assert_eq!(expr, expr1);
   ```

##
File path: rust/datafusion/src/logical_plan/mod.rs
##
@@ -1133,8 +1133,8 @@ pub trait FunctionRegistry {
 /// Set of all available udfs.
fn udfs(&self) -> HashSet<String>;
 
-/// Constructs a logical expression with a call to the udf.
-fn udf(&self, name: &str, args: Vec<Expr>) -> Result<Expr>;
+/// Returns a reference to the udf named `name`.
+fn udf(&self, name: &str) -> Result<&ScalarUDF>;

Review comment:
    









[GitHub] [arrow] github-actions[bot] commented on pull request #8161: ARROW-9961: [Rust][DataFusion] Make to_timestamp function parses timestamp without timezone offset as local

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8161:
URL: https://github.com/apache/arrow/pull/8161#issuecomment-690193622


   https://issues.apache.org/jira/browse/ARROW-9961







[GitHub] [arrow] alamb opened a new pull request #8161: ARROW-9961: [Rust][DataFusion] Make to_timestamp function parses timestamp without timezone offset as local

2020-09-10 Thread GitBox


alamb opened a new pull request #8161:
URL: https://github.com/apache/arrow/pull/8161


   The TO_TIMESTAMP function added in https://github.com/apache/arrow/pull/8142 
supports parsing timestamps without a specified timezone, such as 
`2020-09-08T13:42:29.190855`
   
   Such timestamps are supposed to be interpreted as being in the local 
timezone, but instead are interpreted as UTC.
   
   This PR corrects that logical error.







[GitHub] [arrow] revit13 commented on pull request #8159: "ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory"

2020-09-10 Thread GitBox


revit13 commented on pull request #8159:
URL: https://github.com/apache/arrow/pull/8159#issuecomment-690157916


   @pitrou Thanks for your feedback. I changed the code to use TemporaryDir as 
you suggested. I would appreciate it if you could review the new version of the 
code. (The CI failures so far look unrelated to the patch.)







[GitHub] [arrow] github-actions[bot] commented on pull request #8159: "ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directo

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8159:
URL: https://github.com/apache/arrow/pull/8159#issuecomment-690146389


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] github-actions[bot] commented on pull request #8160: ARROW-7302: [C++] CSV: allow dictionary types in explicit column types

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8160:
URL: https://github.com/apache/arrow/pull/8160#issuecomment-690146222


   https://issues.apache.org/jira/browse/ARROW-7302







[GitHub] [arrow] pitrou opened a new pull request #8160: ARROW-7302: [C++] CSV: allow dictionary types in explicit column types

2020-09-10 Thread GitBox


pitrou opened a new pull request #8160:
URL: https://github.com/apache/arrow/pull/8160


   Also implement conversion to dictionaries of numbers and decimals.







[GitHub] [arrow] revit13 opened a new pull request #8159: "ARROW-9104: [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory"

2020-09-10 Thread GitBox


revit13 opened a new pull request #8159:
URL: https://github.com/apache/arrow/pull/8159


   







[GitHub] [arrow] romainfrancois commented on a change in pull request #8150: ARROW-9271: [R] Preserve data frame metadata in round trip

2020-09-10 Thread GitBox


romainfrancois commented on a change in pull request #8150:
URL: https://github.com/apache/arrow/pull/8150#discussion_r486197783



##
File path: r/R/record-batch.R
##
@@ -278,7 +278,10 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, 
optional = FALSE, ...
 apply_arrow_r_metadata <- function(x, r_metadata) {
   tryCatch({
 if (!is.null(r_metadata$attributes)) {
-  attributes(x) <- r_metadata$attributes
+  attributes(x)[names(r_metadata$attributes)] <- r_metadata$attributes
+  if (inherits(x, "POSIXlt")) {
+attr(x, "row.names") <- NULL

Review comment:
   Alternatively the test could use `expect_equal()` instead of 
`expect_identical()` and we would leave the `row.names` attribute: 
   
   ```r
   test_that("Date/time type roundtrip", {
 rb <- record_batch(example_with_times)
 expect_is(rb$schema$posixlt$type, "StructType")
 expect_equal(as.data.frame(rb), example_with_times)
   })
   ```









[GitHub] [arrow] romainfrancois commented on a change in pull request #8150: ARROW-9271: [R] Preserve data frame metadata in round trip

2020-09-10 Thread GitBox


romainfrancois commented on a change in pull request #8150:
URL: https://github.com/apache/arrow/pull/8150#discussion_r486193205



##
File path: r/R/table.R
##
@@ -185,6 +185,36 @@ Table <- R6Class("Table", inherit = ArrowObject,
   )
 )
 
+bad_attributes <- function(x) {
+  UseMethod("bad_attributes")
+}
+
+bad_attributes.default <- function(x) character()
+bad_attributes.data.frame <- function(x) c("class", "row.names", "names")

Review comment:
   We might actually want to keep the class for data frames, so that we 
would faithfully roundtrip data frames, instead of enforcing a tibble: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   as.data.frame(record_batch(df = data.frame(x = 1)))$df
   #> # A tibble: 1 x 1
   #>       x
   #>   <dbl>
   #> 1     1
   ```
   
   Created on 2020-09-10 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)









[GitHub] [arrow] romainfrancois commented on a change in pull request #8150: ARROW-9271: [R] Preserve data frame metadata in round trip

2020-09-10 Thread GitBox


romainfrancois commented on a change in pull request #8150:
URL: https://github.com/apache/arrow/pull/8150#discussion_r486193425



##
File path: r/R/table.R
##
@@ -185,6 +185,36 @@ Table <- R6Class("Table", inherit = ArrowObject,
   )
 )
 
+bad_attributes <- function(x) {
+  UseMethod("bad_attributes")
+}
+
+bad_attributes.default <- function(x) character()
+bad_attributes.data.frame <- function(x) c("class", "row.names", "names")
+bad_attributes.factor <- function(x) c("class", "levels")
+bad_attributes.Date <- function(x) "class"
+bad_attributes.integer64 <- function(x) "class"
+bad_attributes.POSIXct <- function(x) c("class", "tzone")
+bad_attributes.hms <- function(x) c("class", "units")
+bad_attributes.difftime <- function(x) c("class", "units")
+
+arrow_attributes <- function(x, only_top_level = FALSE) {
+  att <- attributes(x)
+  att <- att[setdiff(names(att), bad_attributes(x))]
+  if (isTRUE(only_top_level)) {
+return(att)
+  }
+
+  if (is.data.frame(x)) {
+columns <- map(x, arrow_attributes)
+if (length(att) || !all(map_lgl(columns, is.null))) {
+  list(attributes = att, columns = columns)
+}
+  } else if (length(att)) {
+list(attributes = att, columns = NULL)
+  }

Review comment:
   Thanks. 









[GitHub] [arrow] romainfrancois commented on a change in pull request #8150: ARROW-9271: [R] Preserve data frame metadata in round trip

2020-09-10 Thread GitBox


romainfrancois commented on a change in pull request #8150:
URL: https://github.com/apache/arrow/pull/8150#discussion_r486193205



##
File path: r/R/table.R
##
@@ -185,6 +185,36 @@ Table <- R6Class("Table", inherit = ArrowObject,
   )
 )
 
+bad_attributes <- function(x) {
+  UseMethod("bad_attributes")
+}
+
+bad_attributes.default <- function(x) character()
+bad_attributes.data.frame <- function(x) c("class", "row.names", "names")

Review comment:
   We might actually want to keep the class for data frames, so that we 
would faithfully roundtrip data frames, instead of enforcing a tibble: 
   
   ``` r
   library(arrow, warn.conflicts = TRUE)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #> timestamp
   
   as.data.frame(record_batch(df = data.frame(x = 1)))$df
   #> # A tibble: 1 x 1
   #>       x
   #>   <dbl>
   #> 1     1
   ```
   
   Created on 2020-09-10 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)









[GitHub] [arrow] romainfrancois commented on a change in pull request #8150: ARROW-9271: [R] Preserve data frame metadata in round trip

2020-09-10 Thread GitBox


romainfrancois commented on a change in pull request #8150:
URL: https://github.com/apache/arrow/pull/8150#discussion_r486191107



##
File path: r/R/record-batch.R
##
@@ -278,7 +278,10 @@ as.data.frame.RecordBatch <- function(x, row.names = NULL, 
optional = FALSE, ...
 apply_arrow_r_metadata <- function(x, r_metadata) {
   tryCatch({
 if (!is.null(r_metadata$attributes)) {
-  attributes(x) <- r_metadata$attributes
+  attributes(x)[names(r_metadata$attributes)] <- r_metadata$attributes
+  if (inherits(x, "POSIXlt")) {
+attr(x, "row.names") <- NULL

Review comment:
   Perhaps this would be better handled by converting the POSIXlt to a 
Timestamp array rather than a struct type, because on the way back the struct 
type makes a data frame which needs the `row.names` attribute. 
   
   ``` r
   library(arrow, warn.conflicts = TRUE)
   
   example_with_times <- tibble::tibble(
 posixlt = as.POSIXlt(lubridate::ymd_hms("2018-10-07 19:04:05") + 1:10),
   )
   rb <- record_batch(example_with_times)
   
   attributes(as.data.frame(rb)$posixlt)
   #> $row.names
   #>  [1]  1  2  3  4  5  6  7  8  9 10
   #> 
   #> $names
   #> [1] "sec"   "min"   "hour"  "mday"  "mon"   "year"  "wday"  "yday"  
"isdst"
   #> 
   #> $class
   #> [1] "POSIXlt" "POSIXt" 
   #> 
   #> $tzone
   #> [1] "UTC"
   attributes(example_with_times$posixlt)
   #> $names
   #> [1] "sec"   "min"   "hour"  "mday"  "mon"   "year"  "wday"  "yday"  
"isdst"
   #> 
   #> $class
   #> [1] "POSIXlt" "POSIXt" 
   #> 
   #> $tzone
   #> [1] "UTC"
   ```
   
   Created on 2020-09-10 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)









[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8058: ARROW-9854: [R] Support reading/writing data to/from S3

2020-09-10 Thread GitBox


jorisvandenbossche commented on a change in pull request #8058:
URL: https://github.com/apache/arrow/pull/8058#discussion_r486167891



##
File path: r/R/filesystem.R
##
@@ -242,11 +242,31 @@ FileSystem <- R6Class("FileSystem", inherit = ArrowObject,
   )
 )
 FileSystem$from_uri <- function(uri) {
+  assert_that(is.string(uri))
   out <- fs___FileSystemFromUri(uri)
   out$fs <- shared_ptr(FileSystem, out$fs)$..dispatch()
   out
 }
 
+get_path_and_filesystem <- function(x, filesystem = NULL) {
+  # Wrapper around FileSystem$from_uri that handles local paths
+  # and an optional explicit filesystem
+  assert_that(is.string(x))
+  if (is_url(x)) {
+if (!is.null(filesystem)) {
+  # Stop? Can't have URL (which yields a fs) and another fs
+}
+FileSystem$from_uri(x)
+  } else {
+list(
+  fs = filesystem %||% LocalFileSystem$create(),
+  path = clean_path_abs(x)
+)
+  }
+}
+
+is_url <- function(x) grepl("://", x)

Review comment:
   You might want to double check that the `get_path_and_filesystem` 
doesn't break local relative file paths (as I did in the python version .. 
;-)), but I suppose this `is_url` check before calling `from_uri` will indeed 
prevent that issue.

##
File path: r/R/filesystem.R
##
@@ -242,11 +242,31 @@ FileSystem <- R6Class("FileSystem", inherit = ArrowObject,
   )
 )
 FileSystem$from_uri <- function(uri) {
+  assert_that(is.string(uri))
   out <- fs___FileSystemFromUri(uri)
   out$fs <- shared_ptr(FileSystem, out$fs)$..dispatch()
   out
 }
 
+get_path_and_filesystem <- function(x, filesystem = NULL) {
+  # Wrapper around FileSystem$from_uri that handles local paths
+  # and an optional explicit filesystem
+  assert_that(is.string(x))
+  if (is_url(x)) {
+    if (!is.null(filesystem)) {
+      # Stop? Can't have URL (which yields a fs) and another fs

Review comment:
   Indeed, the user shouldn't pass both a filesystem object and a URI, as 
then things like `filesystem$OpenInputFile(uri)` will fail (that was the root 
cause of the python parquet s3 test crashes -> 
https://github.com/apache/arrow/pull/8139). So I think we can either let it 
pass here and rely on a later error in e.g. `OpenInputFile`, or otherwise 
raise a (potentially more informative) error message here.
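
The two options weighed here — fail fast with an informative message, or defer to whatever a later call raises — can be sketched in Python (a hedged stand-in; `resolve`, its `fail_fast` flag, and the error text are hypothetical, not Arrow's API):

```python
def resolve(source, filesystem=None, fail_fast=True):
    # A URI already encodes its filesystem; combining it with an explicit
    # filesystem object is the conflict discussed in the review.
    if "://" in source and filesystem is not None:
        if fail_fast:
            raise ValueError(
                "Pass either a URI or a filesystem object, not both: "
                "the URI already determines the filesystem."
            )
        # Deferred option: return the pair unchanged and let a later call
        # (e.g. filesystem.open_input_file(source)) fail less informatively.
    return source, filesystem
```

The trade-off is purely about where the user sees the error: the eager check costs one extra branch up front but names the actual mistake, while the deferred path surfaces a lower-level failure from inside the open call.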









[GitHub] [arrow] xhochy commented on pull request #8127: WIP: ARROW-8359: [C++/Python] Enable linux-aarch64 builds

2020-09-10 Thread GitBox


xhochy commented on pull request #8127:
URL: https://github.com/apache/arrow/pull/8127#issuecomment-690078286


   Drone needs credentials to be set up:
   
   ```
   github3.exceptions.AuthenticationFailed: 401 Requires authentication
   ```







[GitHub] [arrow] github-actions[bot] commented on pull request #8158: ARROW-7215: [C++][Gandiva] Implement castVARCHAR(numeric_type) functions

2020-09-10 Thread GitBox


github-actions[bot] commented on pull request #8158:
URL: https://github.com/apache/arrow/pull/8158#issuecomment-690076770


   https://issues.apache.org/jira/browse/ARROW-7215







[GitHub] [arrow] projjal opened a new pull request #8158: ARROW-7215: [C++][Gandiva] Implement castVARCHAR(numeric_type) functions

2020-09-10 Thread GitBox


projjal opened a new pull request #8158:
URL: https://github.com/apache/arrow/pull/8158


   







[GitHub] [arrow] nevi-me opened a new pull request #8157: ARROW-9957: [Rust] Replace tempdir with tempfile

2020-09-10 Thread GitBox


nevi-me opened a new pull request #8157:
URL: https://github.com/apache/arrow/pull/8157


   tempdir is deprecated (https://crates.io/crates/tempdir)


