[GitHub] [arrow] cyb70289 opened a new pull request #7476: ARROW-9168: [C++][Flight] configure TCP connection sharing in benchmark

2020-06-17 Thread GitBox


cyb70289 opened a new pull request #7476:
URL: https://github.com/apache/arrow/pull/7476


   Flight benchmark performs worse when working threads increases. It can
   be improved by setting gRPC option GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL
   to make each client creating its own TCP connection to server, without
   sharing one connection.
   
   This patch adds a cmdline parameter to configure TCP connection sharing.
   Default behaviour is to separate connection, which performs better when
   working threads increases.
   
   This patch also changes parameter "records_per_stream" from int32 to
   int64, to ease profiling by setting it to a very large value and keep
   benchmark running.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #7287: ARROW-8771: [C++] Add boost/process library to build support

2020-06-17 Thread GitBox


liyafan82 commented on pull request #7287:
URL: https://github.com/apache/arrow/pull/7287#issuecomment-645762806


   > That's odd. I wonder what's different about your build setup from the jobs 
we run on CI because I haven't seen that before. Do you think you could add a 
crossbow job that captures this build setup (is it just bundled boost and 
thrift, plus ARROW_FLIGHT=ON and tests on?)
   > 
   > FWIW this boost include has been removed in Thrift 0.13: 
[apache/thrift@1f34504#diff-73a92ed6f7bb65d6e8f29f74ae6c94bf](https://github.com/apache/thrift/commit/1f34504f43a7a409364d4114f180762bf2679e57#diff-73a92ed6f7bb65d6e8f29f74ae6c94bf)
   > 
   > So if/whenever we can [upgrade to 
0.13](https://issues.apache.org/jira/browse/ARROW-8049), this particular header 
won't ever be invoked.
   
   I agree with you that a crossbow job for this build setup is extremely 
useful, as even if this problem is fixed, there are other problems. I will open 
an issue to track it.
   
   I was using the following flags:
   
   -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON 
-DARROW_FLIGHT=ON -DARROW_DATASET=ON -DBOOST_SOURCE=BUNDLED -DARROW_PARQUET=ON 
-DARROW_JNI=ON -DARROW_BUILD_EXAMPLES=ON 
-DCMAKE_CXX_COMPILER=.../gcc-7.2.0/bin/c++ 
-DCMAKE_C_COMPILER=.../gcc-7.2.0/bin/gcc



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-17 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r441931433



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -31,19 +28,18 @@
 
   public final long chunkSize;
 
+  /**
+   * default chunk size from Netty.
+   */
+  private static final long DEFAULT_CHUNK_SIZE = 16777216L;

Review comment:
   Maybe it is beneficial to add above discussion to the JavaDoc





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-17 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r441930746



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java
##
@@ -78,6 +77,55 @@ public Object run() {
   Field addressField = java.nio.Buffer.class.getDeclaredField("address");
   addressField.setAccessible(true);
   BYTE_BUFFER_ADDRESS_OFFSET = UNSAFE.objectFieldOffset(addressField);
+
+  Constructor directBufferConstructor;
+  long address = -1;
+  final ByteBuffer direct = ByteBuffer.allocateDirect(1);
+  try {
+
+final Object maybeDirectBufferConstructor =
+AccessController.doPrivileged(new PrivilegedAction() {
+  @Override
+  public Object run() {
+try {
+  final Constructor constructor =
+  direct.getClass().getDeclaredConstructor(long.class, 
int.class);
+  constructor.setAccessible(true);
+  return constructor;
+} catch (NoSuchMethodException e) {
+  return e;
+} catch (SecurityException e) {
+  return e;
+}
+  }
+});
+
+if (maybeDirectBufferConstructor instanceof Constructor) {
+  address = UNSAFE.allocateMemory(1);
+  // try to use the constructor now
+  try {
+((Constructor) 
maybeDirectBufferConstructor).newInstance(address, 1);
+directBufferConstructor = (Constructor) 
maybeDirectBufferConstructor;
+logger.debug("direct buffer constructor: available");
+  } catch (InstantiationException e) {
+directBufferConstructor = null;

Review comment:
   looks good. thanks.
   since the exception handling logic is identical, does it make the code 
conciser to use the following semantics?
   ```
   catch (InstantiationException | IllegalAccessException | 
InvocationTargetException e) {
 ...
   }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-17 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r441929412



##
File path: java/memory/src/main/java/io/netty/buffer/NettyArrowBuf.java
##
@@ -404,7 +407,7 @@ protected int _getUnsignedMediumLE(int index) {
 this.chk(index, 3);
 long addr = this.addr(index);
 return PlatformDependent.getByte(addr) & 255 |
-(Short.reverseBytes(PlatformDependent.getShort(addr + 1L)) & 
'\u') << 8;
+(Short.reverseBytes(PlatformDependent.getShort(addr + 1L)) & '\u') 
<< 8;

Review comment:
   Thanks for your effort. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7396: ARROW-9092: [C++][TRIAGE] Do not enable TestRoundFunctions when using LLVM 9 until gandiva-decimal-test is fixed

2020-06-17 Thread GitBox


kou commented on pull request #7396:
URL: https://github.com/apache/arrow/pull/7396#issuecomment-645726115


   Sorry. I couldn't reproduce this on my local environment including 
`docker-compose run conda-cpp` on my local machine: 
https://github.com/apache/arrow/pull/7280#issuecomment-635634955
   
   But it's reproduced on GitHub Actions with conda 
https://github.com/apache/arrow/runs/721758545 and without conda 
https://github.com/apache/arrow/runs/721758560 .
   
   I looked into this a bit on GitHub Actions with 
https://github.com/mxschmitt/action-tmate . But I couldn't find what is the 
problem.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on a change in pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions

2020-06-17 Thread GitBox


kou commented on a change in pull request #7459:
URL: https://github.com/apache/arrow/pull/7459#discussion_r441221973



##
File path: cpp/cmake_modules/SetupCxxFlags.cmake
##
@@ -62,12 +62,15 @@ endif()
 # Support C11
 set(CMAKE_C_STANDARD 11)
 
-# This ensures that things like gnu++11 get passed correctly
-set(CMAKE_CXX_STANDARD 11)
+# This ensures that things like c++11 get passed correctly
+set(CMAKE_CXX_STANDARD ${ARROW_CXX_STANDARD})

Review comment:
   How about using `CMAKE_CXX_STANDARD` directly instead of introducing 
`ARROW_CXX_STANDARD`?
   
   ```diff
   diff --git a/cpp/cmake_modules/DefineOptions.cmake 
b/cpp/cmake_modules/DefineOptions.cmake
   index 660caf91c..f916781d7 100644
   --- a/cpp/cmake_modules/DefineOptions.cmake
   +++ b/cpp/cmake_modules/DefineOptions.cmake
   @@ -82,8 +82,6 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL 
"${CMAKE_CURRENT_SOURCE_DIR}")

  define_option_string(ARROW_CXXFLAGS "Compiler flags to append when 
compiling Arrow" "")

   -  define_option(ARROW_CXX_STANDARD "C++ standard to target (default C++11)" 
11)
   -
  define_option(ARROW_BUILD_STATIC "Build static libraries" ON)

  define_option(ARROW_BUILD_SHARED "Build shared libraries" ON)
   diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake 
b/cpp/cmake_modules/SetupCxxFlags.cmake
   index bf7a60138..ac7518eba 100644
   --- a/cpp/cmake_modules/SetupCxxFlags.cmake
   +++ b/cpp/cmake_modules/SetupCxxFlags.cmake
   @@ -63,7 +63,9 @@ endif()
set(CMAKE_C_STANDARD 11)

# This ensures that things like c++11 get passed correctly
   -set(CMAKE_CXX_STANDARD ${ARROW_CXX_STANDARD})
   +if(NOT DEFINED CMAKE_CXX_STANDARD)
   +  set(CMAKE_CXX_STANDARD 11)
   +endif()

# We require a C++11 compliant compiler
set(CMAKE_CXX_STANDARD_REQUIRED ON)
   ```
   
   The https://github.com/apache/arrow/pull/7459#issuecomment-645047309 error 
will be fixed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions

2020-06-17 Thread GitBox


kou commented on pull request #7459:
URL: https://github.com/apache/arrow/pull/7459#issuecomment-645720173


   Should we link to ARROW-6848 instead of ARROW-6800?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ursabot commented on pull request #7475: ARROW-8500: [C++] Add benchmark for using Filter on RecordBatch

2020-06-17 Thread GitBox


ursabot commented on pull request #7475:
URL: https://github.com/apache/arrow/pull/7475#issuecomment-645714003


   [AMD64 Ubuntu 18.04 C++ Benchmark 
(#113134)](https://ci.ursalabs.org/#builders/73/builds/84) builder failed with 
an exception.
   
   Revision: 999865b042c3131920b52b40a2387535168f3a08
   
   Archery: `'archery benchmark ...'` step's traceback:
   ```pycon
   Traceback (most recent call last):
 File 
"/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/internet/defer.py",
 line 654, in _runCallbacks
   current.result = callback(current.result, *args, **kw)
 File 
"/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/internet/defer.py",
 line 1475, in gotResult
   _inlineCallbacks(r, g, status)
 File 
"/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/internet/defer.py",
 line 1416, in _inlineCallbacks
   result = result.throwExceptionIntoGenerator(g)
 File 
"/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/python/failure.py",
 line 512, in throwExceptionIntoGenerator
   return g.throw(self.type, self.value, self.tb)
   ---  ---
 File 
"/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/buildbot/process/buildstep.py",
 line 566, in startStep
   self.results = yield self.run()
 File 
"/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/internet/defer.py",
 line 1418, in _inlineCallbacks
   result = g.send(result)
 File "/home/ursabot/ursabot/ursabot/steps.py", line 67, in run
   await log.addContent(content)
 File 
"/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/buildbot/process/log.py",
 line 130, in addContent
   return self.lbf.append(text)
 File 
"/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/buildbot/util/lineboundaries.py",
 line 62, in append
   text = self.newline_re.sub('\n', text)
   builtins.TypeError: expected string or bytes-like object
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7475: ARROW-8500: [C++] Add benchmark for using Filter on RecordBatch

2020-06-17 Thread GitBox


github-actions[bot] commented on pull request #7475:
URL: https://github.com/apache/arrow/pull/7475#issuecomment-645710548


   https://issues.apache.org/jira/browse/ARROW-8500



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm closed pull request #7461:
URL: https://github.com/apache/arrow/pull/7461


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#issuecomment-645710004


   I gave up on trying to have e.g. a common "64-bit" kernel for 
Equals/NotEquals Int64/UInt64/Timestamp/etc. The sticking point is scalar 
unboxing. We might need to fashion a common internal base class for primitive 
types and give them a virtual method that returns their data as `const 
uint8_t*` that you can cast to whatever primitive type you want



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm opened a new pull request #7475: ARROW-8500: [C++] Add benchmark for using Filter on RecordBatch

2020-06-17 Thread GitBox


wesm opened a new pull request #7475:
URL: https://github.com/apache/arrow/pull/7475


   Since I changed Filter on RecordBatch to transform the filter to indices and 
use Take, I wanted to have a benchmark to compare the before/after performance 
so this can also be monitored over time. These benchmarks could use some 
refactoring but this is at least a starting point. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7475: ARROW-8500: [C++] Add benchmark for using Filter on RecordBatch

2020-06-17 Thread GitBox


wesm commented on pull request #7475:
URL: https://github.com/apache/arrow/pull/7475#issuecomment-645709018


   @ursabot benchmark --benchmark-filter=FilterRecordBatch 22f374102



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox


wesm commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-645698551


   I just merged my changes for the ASCII kernels making those work on sliced 
arrays



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7465: PARQUET-1877: [C++] Reconcile thrift limits

2020-06-17 Thread GitBox


wesm closed pull request #7465:
URL: https://github.com/apache/arrow/pull/7465


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7458: ARROW-9122: [C++] Properly handle sliced arrays in ascii_lower, ascii_upper kernels

2020-06-17 Thread GitBox


wesm closed pull request #7458:
URL: https://github.com/apache/arrow/pull/7458


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7458: ARROW-9122: [C++] Properly handle sliced arrays in ascii_lower, ascii_upper kernels

2020-06-17 Thread GitBox


wesm commented on pull request #7458:
URL: https://github.com/apache/arrow/pull/7458#issuecomment-645697268


   the test failure is an s390x flake



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7463: ARROW-9145: [C++] Implement BooleanArray::true_count and false_count, add Python bindings

2020-06-17 Thread GitBox


wesm closed pull request #7463:
URL: https://github.com/apache/arrow/pull/7463


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7463: ARROW-9145: [C++] Implement BooleanArray::true_count and false_count, add Python bindings

2020-06-17 Thread GitBox


wesm commented on pull request #7463:
URL: https://github.com/apache/arrow/pull/7463#issuecomment-645696772


   +1



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#issuecomment-645696688


   +1. I'm going to merge this to help avoid conflicts caused by the stuff I 
just renamed. I welcome further comments and I will work to address them in 
follow ups



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#discussion_r441902495



##
File path: cpp/src/arrow/compute/kernels/codegen_internal.h
##
@@ -485,18 +502,44 @@ struct ScalarUnaryNotNullStateful {
   struct ArrayExec> {
 static void Exec(const ThisType& functor, KernelContext* ctx, const 
ExecBatch& batch,
  Datum* out) {
-  typename TypeTraits::BuilderType builder;
-  VisitArrayValuesInline(*batch[0].array(), 
[&](util::optional v) {
-if (v.has_value()) {
-  KERNEL_RETURN_IF_ERROR(ctx, builder.Append(functor.op.Call(ctx, 
*v)));
-} else {
-  KERNEL_RETURN_IF_ERROR(ctx, builder.AppendNull());
+  // TODO: This code path is currently inadequately tested.
+
+  using offset_type = typename Type::offset_type;

Review comment:
   I'm reverting this change since no kernel uses it and adding a comment 
explaining that it is suboptimal. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles

2020-06-17 Thread GitBox


bkietz commented on a change in pull request #7156:
URL: https://github.com/apache/arrow/pull/7156#discussion_r441836943



##
File path: python/pyarrow/_dataset.pyx
##
@@ -43,6 +44,51 @@ def _forbid_instantiation(klass, subclasses_instead=True):
 raise TypeError(msg)
 
 
+cdef class FileSource:
+
+cdef:
+CFileSource wrapped
+
+def __cinit__(self, file, FileSystem filesystem=None):
+cdef:
+shared_ptr[CFileSystem] c_filesystem
+c_string c_path
+shared_ptr[CRandomAccessFile] c_file
+shared_ptr[CBuffer] c_buffer
+
+if isinstance(file, FileSource):

Review comment:
   Since we've decided not to export `FileSource` I think we can probably 
delete this class in a follow up (in favor of `cdef CFileSource 
_make_file_source(...)` or so)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#issuecomment-645623392


   I'm revamping the documentation about these codegen functions which I'm 
dubbing "Generator-Dispatchers" (GDs) for short. I'll add "Generate" to their 
name. Stay tuned



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7458: ARROW-9122: [C++] Properly handle sliced arrays in ascii_lower, ascii_upper kernels

2020-06-17 Thread GitBox


wesm commented on pull request #7458:
URL: https://github.com/apache/arrow/pull/7458#issuecomment-645619023


   +1, merge on green



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels

2020-06-17 Thread GitBox


wesm commented on pull request #7473:
URL: https://github.com/apache/arrow/pull/7473#issuecomment-645615268


   https://issues.apache.org/jira/browse/ARROW-9164



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs edited a comment on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


kszucs edited a comment on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645614037


   Quickly went through it and it LGTM. I can give it a more thorough review 
tomorrow.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


kszucs commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645614037


   Quickly went through it and LGTM. I can give it a more thorough review 
tomorrow.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


wesm commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645612814


   Very reasonable. Thanks @pitrou!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels

2020-06-17 Thread GitBox


kszucs commented on pull request #7473:
URL: https://github.com/apache/arrow/pull/7473#issuecomment-645612711


   I was thinking of the same, so I eventually deferred the docstrings which I 
can add in my other PR where we need to add `check_overflow` flag to these 
functions.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7472: ARROW-8892: [C++][CI] CI builds for MSVC do not build benchmarks

2020-06-17 Thread GitBox


wesm closed pull request #7472:
URL: https://github.com/apache/arrow/pull/7472


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels

2020-06-17 Thread GitBox


wesm closed pull request #7473:
URL: https://github.com/apache/arrow/pull/7473


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


kszucs commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645603358


   > This grows the manylinux wheel sizes by about 2-3 MB (from ~13MB to 
~15-16MB). It think it's reasonable.
   
   Yes, I expected more.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645600060


   +1. Thanks all for the comments



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm closed pull request #7442:
URL: https://github.com/apache/arrow/pull/7442


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

2020-06-17 Thread GitBox


fsaintjacques commented on pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#issuecomment-645595500


   @jorisvandenbossche I'm puzzled by the error. Could it be an upstream pandas 
change?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs edited a comment on pull request #7439: ARROW-4309: [Documentation] Add a docker-compose entry which builds the documentation with CUDA enabled

2020-06-17 Thread GitBox


kszucs edited a comment on pull request #7439:
URL: https://github.com/apache/arrow/pull/7439#issuecomment-645593482


   Please verify it with the following command:
   
   ```bash
   $ archery docker run --no-pull ubuntu-cuda-cpp
   $ archery docker run --no-pull ubuntu-cuda-python
   $ archery docker run --no-pull ubuntu-cuda-docs
   ```
   
   Note that it requires docker version >=19.03.
   I successfully tested all three images.
   
   @wesm @kou perhaps you have necessary devices.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7439: ARROW-4309: [Documentation] Add a docker-compose entry which builds the documentation with CUDA enabled

2020-06-17 Thread GitBox


kszucs commented on pull request #7439:
URL: https://github.com/apache/arrow/pull/7439#issuecomment-645593482


   Please verify it with the following command:
   
   ```bash
   $ archery docker run --no-pull ubuntu-cuda-cpp
   $ archery docker run --no-pull ubuntu-cuda-python
   $ archery docker run --no-pull ubuntu-cuda-docs
   ```
   
   Note that it requires docker version >=19.03.
   I successfully tested all three images.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ursabot commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


ursabot commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645588035


   [AMD64 Ubuntu 18.04 C++ Benchmark 
(#113048)](https://ci.ursalabs.org/#builders/73/builds/83) builder has been 
succeeded.
   
   Revision: 54bb83848d391477c5ded222fd4401acbe08c6c7
   
   ```diff
 ===  ===  ===  

 benchmarkbaseline contender
change
 ===  ===  ===  

 FilterStringFilterWithNulls/262144/9 395.928 MiB/sec  397.664 MiB/sec  
0.439%
 FilterInt64FilterWithNulls/262144/0  621.828 MiB/sec  613.884 MiB/sec  
-1.277%
 FilterStringFilterWithNulls/262144/10578.179 MiB/sec  577.449 MiB/sec  
-0.126%
 FilterFSLInt64FilterWithNulls/262144/14  4.068 GiB/sec4.018 GiB/sec
-1.247%
 FilterInt64FilterWithNulls/262144/13 604.515 MiB/sec  575.481 MiB/sec  
-4.803%
 FilterFSLInt64FilterNoNulls/262144/13350.875 MiB/sec  355.061 MiB/sec  
1.193%
 FilterStringFilterWithNulls/262144/0 441.188 MiB/sec  442.379 MiB/sec  
0.270%
 FilterInt64FilterWithNulls/262144/7  623.569 MiB/sec  594.423 MiB/sec  
-4.674%
 FilterStringFilterWithNulls/262144/1273.925 MiB/sec   73.930 MiB/sec   
0.007%
 FilterStringFilterNoNulls/262144/3   548.889 MiB/sec  548.269 MiB/sec  
-0.113%
 FilterInt64FilterNoNulls/262144/07.942 GiB/sec8.079 GiB/sec
1.727%
 FilterInt64FilterNoNulls/262144/63.827 GiB/sec3.725 GiB/sec
-2.665%
 FilterStringFilterWithNulls/262144/2 9.138 GiB/sec9.205 GiB/sec
0.726%
 FilterFSLInt64FilterWithNulls/262144/13  385.938 MiB/sec  370.599 MiB/sec  
-3.975%
 FilterInt64FilterWithNulls/262144/9  549.281 MiB/sec  542.112 MiB/sec  
-1.305%
 FilterInt64FilterWithNulls/262144/2  5.253 GiB/sec5.047 GiB/sec
-3.918%
 FilterFSLInt64FilterNoNulls/262144/5 5.778 GiB/sec5.676 GiB/sec
-1.761%
 FilterStringFilterNoNulls/262144/1   711.705 MiB/sec  697.941 MiB/sec  
-1.934%
 FilterStringFilterNoNulls/262144/0   560.111 MiB/sec  560.315 MiB/sec  
0.036%
 FilterStringFilterWithNulls/262144/5 8.773 GiB/sec8.976 GiB/sec
2.318%
 FilterInt64FilterWithNulls/262144/11 4.863 GiB/sec4.942 GiB/sec
1.631%
 FilterFSLInt64FilterWithNulls/262144/11  4.145 GiB/sec4.089 GiB/sec
-1.362%
 FilterInt64FilterNoNulls/262144/27.854 GiB/sec7.609 GiB/sec
-3.117%
 FilterStringFilterNoNulls/262144/11  9.751 GiB/sec9.565 GiB/sec
-1.904%
 FilterStringFilterNoNulls/262144/7   641.570 MiB/sec  650.710 MiB/sec  
1.425%
 FilterStringFilterWithNulls/262144/3 435.185 MiB/sec  436.932 MiB/sec  
0.401%
 FilterFSLInt64FilterNoNulls/262144/145.202 GiB/sec5.302 GiB/sec
1.915%
 FilterInt64FilterNoNulls/262144/4674.907 MiB/sec  654.585 MiB/sec  
-3.011%
 FilterInt64FilterNoNulls/262144/57.023 GiB/sec6.971 GiB/sec
-0.741%
 FilterInt64FilterWithNulls/262144/12 548.203 MiB/sec  542.909 MiB/sec  
-0.966%
 FilterFSLInt64FilterNoNulls/262144/10387.772 MiB/sec  390.564 MiB/sec  
0.720%
 FilterInt64FilterWithNulls/262144/8  4.951 GiB/sec5.094 GiB/sec
2.880%
 FilterStringFilterNoNulls/262144/13  90.750 MiB/sec   91.694 MiB/sec   
1.040%
 FilterFSLInt64FilterWithNulls/262144/12  230.292 MiB/sec  263.113 MiB/sec  
14.252%
 FilterStringFilterNoNulls/262144/12  70.772 MiB/sec   70.740 MiB/sec   
-0.044%
 FilterStringFilterWithNulls/262144/14927.254 MiB/sec  925.791 MiB/sec  
-0.158%
 FilterStringFilterNoNulls/262144/5   10.587 GiB/sec   10.322 GiB/sec   
-2.509%
 FilterFSLInt64FilterNoNulls/262144/3 551.473 MiB/sec  556.816 MiB/sec  
0.969%
 FilterInt64FilterNoNulls/262144/14   6.302 GiB/sec6.848 GiB/sec
8.656%
 FilterInt64FilterWithNulls/262144/14 4.804 GiB/sec4.945 GiB/sec
2.933%
 FilterStringFilterNoNulls/262144/14  1.257 GiB/sec1.247 GiB/sec
-0.814%
 FilterFSLInt64FilterNoNulls/262144/6 399.266 MiB/sec  402.455 MiB/sec  
0.799%
 FilterInt64FilterWithNulls/262144/5  5.037 GiB/sec4.954 GiB/sec
-1.645%
 FilterFSLInt64FilterNoNulls/262144/8 5.576 GiB/sec5.576 GiB/sec
-0.004%
 FilterFSLInt64FilterNoNulls/262144/7 462.231 MiB/sec  456.668 MiB/sec  
-1.203%
 FilterFSLInt64FilterNoNulls/262144/115.377 GiB/sec5.381 GiB/sec
0.082%
 FilterStringFilterNoNulls/262144/6   487.645 MiB/sec  487.464 MiB/sec  
-0.037%
 FilterStringFilterNoNulls/262144/4   687.214 MiB/sec  678.019 MiB/sec  
-1.338%
 FilterFSLInt64FilterWithNulls/262144/9   287.916 MiB/sec  285.805 MiB/sec  
-0.733%
 FilterInt64FilterNoNulls/262144/93.245 GiB/sec3.126 

[GitHub] [arrow] github-actions[bot] commented on pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

2020-06-17 Thread GitBox


github-actions[bot] commented on pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#issuecomment-645570471


   https://issues.apache.org/jira/browse/ARROW-8802



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645569875


   @ursabot benchmark --benchmark-filter=Filter 04006ff



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques opened a new pull request #7474: ARROW-8802: [C++][Dataset] Preserve Dataset's schema metadata on column projection

2020-06-17 Thread GitBox


fsaintjacques opened a new pull request #7474:
URL: https://github.com/apache/arrow/pull/7474


   Scanner does not preserve the original schema metadata when columns are 
projected.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


pitrou commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645562851


   This grows the manylinux wheel sizes by about 2-3 MB (from ~13MB to 
~15-16MB). It think it's reasonable.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


pitrou commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645560128


   It seems like everything is working fine here.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645556193


   So these "readability" improvements made performance worse so I'll revert 
them



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] xhochy commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-17 Thread GitBox


xhochy commented on pull request #7452:
URL: https://github.com/apache/arrow/pull/7452#issuecomment-645548108


   @ursabot build



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ursabot commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


ursabot commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645545062


   [AMD64 Ubuntu 18.04 C++ Benchmark 
(#112989)](https://ci.ursalabs.org/#builders/73/builds/82) builder has been 
succeeded.
   
   Revision: 21227cc7530e59a481f7e3c0aae8d351b4226e9d
   
   ```diff
 ===  ===   
 
 benchmarkbaseline contender
 change
 ===  ===   
 
   - FilterStringFilterNoNulls/262144/7   637.909 MiB/sec  572.355 MiB/sec  
 -10.276%
   - FilterStringFilterNoNulls/262144/8   10.897 GiB/sec   8.711 GiB/sec
 -20.057%
 FilterStringFilterNoNulls/262144/6   485.775 MiB/sec  476.410 MiB/sec  
 -1.928%
 FilterStringFilterWithNulls/262144/4 649.558 MiB/sec  677.796 MiB/sec  
 4.347%
 FilterInt64FilterNoNulls/262144/93.212 GiB/sec3.264 GiB/sec
 1.612%
   - FilterFSLInt64FilterNoNulls/262144/13351.877 MiB/sec  308.073 MiB/sec  
 -12.449%
   - FilterFSLInt64FilterNoNulls/262144/10389.471 MiB/sec  333.418 MiB/sec  
 -14.392%
   - FilterInt64FilterNoNulls/262144/4668.729 MiB/sec  625.199 MiB/sec  
 -6.509%
 FilterFSLInt64FilterWithNulls/262144/9   287.988 MiB/sec  276.495 MiB/sec  
 -3.991%
   - FilterStringFilterWithNulls/262144/2 9.441 GiB/sec8.793 GiB/sec
 -6.865%
 FilterStringFilterWithNulls/262144/1273.855 MiB/sec   82.463 MiB/sec   
 11.656%
   - FilterFSLInt64FilterNoNulls/262144/5 6.091 GiB/sec4.403 GiB/sec
 -27.714%
   - FilterFSLInt64FilterNoNulls/262144/3 550.519 MiB/sec  463.959 MiB/sec  
 -15.723%
 FilterInt64FilterNoNulls/262144/27.988 GiB/sec7.976 GiB/sec
 -0.147%
   - FilterStringFilterNoNulls/262144/4   700.795 MiB/sec  605.189 MiB/sec  
 -13.643%
   - FilterFSLInt64FilterWithNulls/262144/1   516.544 MiB/sec  460.521 MiB/sec  
 -10.846%
   - FilterStringFilterWithNulls/262144/8 8.877 GiB/sec8.364 GiB/sec
 -5.779%
   - FilterFSLInt64FilterWithNulls/262144/3   350.123 MiB/sec  329.103 MiB/sec  
 -6.004%
 FilterStringFilterWithNulls/262144/3 435.836 MiB/sec  494.167 MiB/sec  
 13.384%
 FilterInt64FilterNoNulls/262144/10   630.544 MiB/sec  628.104 MiB/sec  
 -0.387%
   - FilterStringFilterNoNulls/262144/5   11.014 GiB/sec   8.788 GiB/sec
 -20.216%
 FilterInt64FilterNoNulls/262144/34.263 GiB/sec4.181 GiB/sec
 -1.936%
 FilterInt64FilterWithNulls/262144/1  635.637 MiB/sec  615.015 MiB/sec  
 -3.244%
 FilterStringFilterWithNulls/262144/7 638.645 MiB/sec  678.465 MiB/sec  
 6.235%
   - FilterFSLInt64FilterNoNulls/262144/2 6.506 GiB/sec5.012 GiB/sec
 -22.975%
   - FilterFSLInt64FilterNoNulls/262144/0 729.854 MiB/sec  569.623 MiB/sec  
 -21.954%
 FilterInt64FilterNoNulls/262144/56.946 GiB/sec6.899 GiB/sec
 -0.674%
 FilterInt64FilterWithNulls/262144/12 545.763 MiB/sec  547.657 MiB/sec  
 0.347%
 FilterStringFilterNoNulls/262144/9   383.858 MiB/sec  377.178 MiB/sec  
 -1.740%
   - FilterFSLInt64FilterNoNulls/262144/8 5.825 GiB/sec4.702 GiB/sec
 -19.289%
 FilterInt64FilterNoNulls/262144/13   632.053 MiB/sec  633.157 MiB/sec  
 0.175%
 FilterInt64FilterNoNulls/262144/11.020 GiB/sec1.022 GiB/sec
 0.239%
   - FilterFSLInt64FilterNoNulls/262144/12242.197 MiB/sec  228.152 MiB/sec  
 -5.799%
 FilterInt64FilterWithNulls/262144/4  640.980 MiB/sec  614.192 MiB/sec  
 -4.179%
 FilterInt64FilterWithNulls/262144/8  4.967 GiB/sec5.071 GiB/sec
 2.102%
   - FilterFSLInt64FilterWithNulls/262144/0   396.373 MiB/sec  374.388 MiB/sec  
 -5.546%
 FilterInt64FilterWithNulls/262144/11 4.934 GiB/sec4.997 GiB/sec
 1.282%
   - FilterFSLInt64FilterNoNulls/262144/145.435 GiB/sec4.459 GiB/sec
 -17.946%
 FilterInt64FilterNoNulls/262144/12   3.255 GiB/sec3.185 GiB/sec
 -2.144%
 FilterStringFilterWithNulls/262144/1 638.704 MiB/sec  690.413 MiB/sec  
 8.096%
   - FilterStringFilterNoNulls/262144/2   11.411 GiB/sec   9.040 GiB/sec
 -20.778%
 FilterInt64FilterWithNulls/262144/6  582.753 MiB/sec  554.462 MiB/sec  
 -4.855%
 FilterStringFilterWithNulls/262144/10586.149 MiB/sec  616.404 MiB/sec  
 5.162%
 FilterInt64FilterNoNulls/262144/07.653 GiB/sec7.971 GiB/sec
 4.146%
 FilterInt64FilterWithNulls/262144/13 590.396 MiB/sec  607.816 MiB/sec  
 2.951%
   - FilterStringFilterNoNulls/262144/14  1.254 GiB/sec1011.778 MiB/sec 
 -21.233%
   - FilterFSLInt64FilterWithNulls/262144/4   474.573 MiB/sec  428.073 MiB/sec  
 -9.798%
 FilterInt64FilterWithNulls/262144/2  5.245 GiB/sec5.072 GiB/sec
 -3.310%
   - FilterStringFilterWithNulls/262144/118.381 GiB/sec7.793 GiB/sec
 -7.006%
 

[GitHub] [arrow] fsaintjacques commented on a change in pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles

2020-06-17 Thread GitBox


fsaintjacques commented on a change in pull request #7156:
URL: https://github.com/apache/arrow/pull/7156#discussion_r441740202



##
File path: python/pyarrow/_dataset.pyx
##
@@ -43,6 +44,51 @@ def _forbid_instantiation(klass, subclasses_instead=True):
 raise TypeError(msg)
 
 
+cdef class FileSource:
+
+cdef:
+CFileSource wrapped
+
+def __cinit__(self, file, FileSystem filesystem=None):
+cdef:
+shared_ptr[CFileSystem] c_filesystem
+c_string c_path
+shared_ptr[CRandomAccessFile] c_file
+shared_ptr[CBuffer] c_buffer
+
+if isinstance(file, FileSource):

Review comment:
   This is odd. Should you have `_ensure_file_source` like other methods? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox


kszucs commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-645539612


   Added `libutf8proc` dependency to the ursabot builders, same could be done 
for the docker-compose images. The tests are failing though.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-17 Thread GitBox


kszucs commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-645538795


   @ursabot build



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645526968


   @ursabot benchmark --benchmark-filter=Filter 04006ff



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


github-actions[bot] commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645526238


   Revision: f28701736f39939dc7eafb3db2e093612ed3c10c
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-328](https://github.com/ursa-labs/crossbow/branches/all?query=actions-328)
   
   |Task|Status|
   ||--|
   
|wheel-manylinux1-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux1-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux1-cp35m)|
   
|wheel-manylinux1-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux1-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux1-cp36m)|
   
|wheel-manylinux1-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux1-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux1-cp37m)|
   
|wheel-manylinux1-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux1-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux1-cp38)|
   
|wheel-manylinux2010-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux2010-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux2010-cp35m)|
   
|wheel-manylinux2010-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux2010-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux2010-cp36m)|
   
|wheel-manylinux2010-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux2010-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux2010-cp37m)|
   
|wheel-manylinux2010-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux2010-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux2010-cp38)|
   
|wheel-manylinux2014-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux2014-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux2014-cp35m)|
   
|wheel-manylinux2014-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux2014-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux2014-cp36m)|
   
|wheel-manylinux2014-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux2014-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux2014-cp37m)|
   
|wheel-manylinux2014-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-328-azure-wheel-manylinux2014-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-328-azure-wheel-manylinux2014-cp38)|
   
|wheel-osx-cp35m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-328-travis-wheel-osx-cp35m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp36m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-328-travis-wheel-osx-cp36m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp37m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-328-travis-wheel-osx-cp37m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp38|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-328-travis-wheel-osx-cp38.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-328-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-328-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   

[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


pitrou commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645525436


   @github-actions crossbow submit -g wheel



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645521577


   Something weird with the commit history, I'm not sure those benchmarks are 
right. I'll rebase things again and rerun



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ursabot commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


ursabot commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645518289


   [AMD64 Ubuntu 18.04 C++ Benchmark 
(#112952)](https://ci.ursalabs.org/#builders/73/builds/81) builder has been 
succeeded.
   
   Revision: f50b39e54c50e8a53606eda486c88e6ec51d7006
   
   ```diff
 ===  ===   
 
 benchmarkbaseline contender
 change
 ===  ===   
 
   - FilterFSLInt64FilterNoNulls/262144/145.457 GiB/sec4.398 GiB/sec
 -19.404%
 FilterStringFilterWithNulls/262144/4 642.405 MiB/sec  677.920 MiB/sec  
 5.528%
   - FilterFSLInt64FilterNoNulls/262144/7 463.992 MiB/sec  378.391 MiB/sec  
 -18.449%
 FilterFSLInt64FilterWithNulls/262144/6   333.996 MiB/sec  320.327 MiB/sec  
 -4.093%
   - FilterFSLInt64FilterWithNulls/262144/1   516.189 MiB/sec  459.926 MiB/sec  
 -10.900%
   - FilterStringFilterNoNulls/262144/4   681.504 MiB/sec  595.788 MiB/sec  
 -12.577%
   - FilterFSLInt64FilterNoNulls/262144/8 5.889 GiB/sec4.675 GiB/sec
 -20.610%
   - FilterInt64FilterWithNulls/262144/10 606.960 MiB/sec  547.973 MiB/sec  
 -9.718%
   - FilterInt64FilterNoNulls/262144/7638.264 MiB/sec  568.923 MiB/sec  
 -10.864%
 FilterStringFilterWithNulls/262144/6 431.474 MiB/sec  484.077 MiB/sec  
 12.191%
   - FilterStringFilterNoNulls/262144/14  1.245 GiB/sec1008.386 MiB/sec 
 -20.893%
 FilterFSLInt64FilterWithNulls/262144/11  4.239 GiB/sec4.029 GiB/sec
 -4.954%
   - FilterStringFilterNoNulls/262144/8   10.899 GiB/sec   8.494 GiB/sec
 -22.064%
   - FilterFSLInt64FilterNoNulls/262144/4 515.626 MiB/sec  406.426 MiB/sec  
 -21.178%
 FilterInt64FilterNoNulls/262144/63.697 GiB/sec3.525 GiB/sec
 -4.664%
 FilterInt64FilterNoNulls/262144/86.829 GiB/sec6.809 GiB/sec
 -0.301%
   - FilterFSLInt64FilterNoNulls/262144/2 6.453 GiB/sec4.950 GiB/sec
 -23.289%
   - FilterInt64FilterWithNulls/262144/13 606.984 MiB/sec  548.948 MiB/sec  
 -9.561%
   - FilterStringFilterNoNulls/262144/1   707.132 MiB/sec  609.027 MiB/sec  
 -13.874%
 FilterStringFilterWithNulls/262144/3 436.301 MiB/sec  488.825 MiB/sec  
 12.038%
 FilterStringFilterWithNulls/262144/1 616.105 MiB/sec  675.493 MiB/sec  
 9.639%
 FilterStringFilterNoNulls/262144/3   548.660 MiB/sec  533.539 MiB/sec  
 -2.756%
   - FilterFSLInt64FilterNoNulls/262144/9 268.363 MiB/sec  250.359 MiB/sec  
 -6.709%
   - FilterStringFilterNoNulls/262144/13  89.995 MiB/sec   76.326 MiB/sec   
 -15.189%
 FilterStringFilterWithNulls/262144/1271.366 MiB/sec   82.415 MiB/sec   
 15.483%
 FilterInt64FilterNoNulls/262144/93.209 GiB/sec3.114 GiB/sec
 -2.971%
 FilterFSLInt64FilterWithNulls/262144/9   288.819 MiB/sec  276.679 MiB/sec  
 -4.203%
 FilterStringFilterNoNulls/262144/12  66.141 MiB/sec   65.509 MiB/sec   
 -0.956%
   - FilterFSLInt64FilterWithNulls/262144/4   474.907 MiB/sec  429.013 MiB/sec  
 -9.664%
   - FilterInt64FilterWithNulls/262144/1  651.659 MiB/sec  556.258 MiB/sec  
 -14.640%
 FilterStringFilterWithNulls/262144/14911.019 MiB/sec  871.756 MiB/sec  
 -4.310%
   - FilterInt64FilterNoNulls/262144/4675.941 MiB/sec  569.448 MiB/sec  
 -15.755%
   - FilterFSLInt64FilterNoNulls/262144/13352.227 MiB/sec  307.638 MiB/sec  
 -12.659%
 FilterInt64FilterWithNulls/262144/5  5.129 GiB/sec4.921 GiB/sec
 -4.068%
   - FilterFSLInt64FilterWithNulls/262144/14  4.168 GiB/sec3.909 GiB/sec
 -6.200%
 FilterStringFilterWithNulls/262144/9 396.156 MiB/sec  442.591 MiB/sec  
 11.721%
   - FilterFSLInt64FilterNoNulls/262144/3 554.664 MiB/sec  464.787 MiB/sec  
 -16.204%
   - FilterStringFilterNoNulls/262144/2   11.394 GiB/sec   8.924 GiB/sec
 -21.683%
   - FilterStringFilterWithNulls/262144/8 8.856 GiB/sec8.075 GiB/sec
 -8.825%
   - FilterFSLInt64FilterNoNulls/262144/10389.368 MiB/sec  333.033 MiB/sec  
 -14.468%
   - FilterFSLInt64FilterNoNulls/262144/115.587 GiB/sec4.507 GiB/sec
 -19.338%
 FilterStringFilterWithNulls/262144/10580.314 MiB/sec  612.106 MiB/sec  
 5.478%
   - FilterFSLInt64FilterNoNulls/262144/5 6.032 GiB/sec4.717 GiB/sec
 -21.802%
   - FilterFSLInt64FilterNoNulls/262144/0 725.211 MiB/sec  565.535 MiB/sec  
 -22.018%
   - FilterInt64FilterNoNulls/262144/34.266 GiB/sec3.855 GiB/sec
 -9.641%
   - FilterInt64FilterWithNulls/262144/12 549.159 MiB/sec  499.761 MiB/sec  
 -8.995%
   - FilterInt64FilterWithNulls/262144/0  622.810 MiB/sec  497.075 MiB/sec  
 -20.188%
   - FilterInt64FilterNoNulls/262144/11.021 GiB/sec980.686 MiB/sec  
 -6.230%
   - FilterFSLInt64FilterWithNulls/262144/0   399.890 MiB/sec  375.677 

[GitHub] [arrow] github-actions[bot] commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


github-actions[bot] commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645507606


   Revision: 282de25836ec86c707c34340c5a0f4907392e178
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-327](https://github.com/ursa-labs/crossbow/branches/all?query=actions-327)
   
   |Task|Status|
   ||--|
   
|wheel-manylinux1-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux1-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux1-cp35m)|
   
|wheel-manylinux1-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux1-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux1-cp36m)|
   
|wheel-manylinux1-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux1-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux1-cp37m)|
   
|wheel-manylinux1-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux1-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux1-cp38)|
   
|wheel-manylinux2010-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux2010-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux2010-cp35m)|
   
|wheel-manylinux2010-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux2010-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux2010-cp36m)|
   
|wheel-manylinux2010-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux2010-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux2010-cp37m)|
   
|wheel-manylinux2010-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux2010-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux2010-cp38)|
   
|wheel-manylinux2014-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux2014-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux2014-cp35m)|
   
|wheel-manylinux2014-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux2014-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux2014-cp36m)|
   
|wheel-manylinux2014-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux2014-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux2014-cp37m)|
   
|wheel-manylinux2014-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-327-azure-wheel-manylinux2014-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-327-azure-wheel-manylinux2014-cp38)|
   
|wheel-osx-cp35m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-327-travis-wheel-osx-cp35m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp36m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-327-travis-wheel-osx-cp36m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp37m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-327-travis-wheel-osx-cp37m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp38|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-327-travis-wheel-osx-cp38.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-327-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-327-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   

[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


pitrou commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645502945


   @github-actions crossbow submit -g wheel



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7469: ARROW-8832: [Python] Provide better error message when S3/HDFS is not enabled in installation

2020-06-17 Thread GitBox


pitrou commented on a change in pull request #7469:
URL: https://github.com/apache/arrow/pull/7469#discussion_r441596426



##
File path: python/pyarrow/fs.py
##
@@ -35,14 +36,31 @@
 # For backward compatibility.
 FileStats = FileInfo
 
+_not_imported = []
+
 try:
 from pyarrow._hdfs import HadoopFileSystem  # noqa
 except ImportError:
-pass
+_not_imported.append("HadoopFileSystem")
 
 try:
 from pyarrow._s3fs import S3FileSystem, initialize_s3, finalize_s3  # noqa
 except ImportError:
-pass
+_not_imported.append("S3FileSystem")
 else:
 initialize_s3()
+
+
+if sys.version_info >= (3, 7):

Review comment:
   I don't think this is required. At worse `__getattr__` is a no-op :-)

##
File path: python/pyarrow/fs.py
##
@@ -35,14 +36,31 @@
 # For backward compatibility.
 FileStats = FileInfo
 
+_not_imported = []
+
 try:
 from pyarrow._hdfs import HadoopFileSystem  # noqa
 except ImportError:
-pass
+_not_imported.append("HadoopFileSystem")
 
 try:
 from pyarrow._s3fs import S3FileSystem, initialize_s3, finalize_s3  # noqa
 except ImportError:
-pass
+_not_imported.append("S3FileSystem")
 else:
 initialize_s3()
+
+
+if sys.version_info >= (3, 7):
+
+def __getattr__(name):
+

Review comment:
   Spurious empty line?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson closed pull request #7415: ARROW-7028: [R] Date roundtrip results in different R storage mode

2020-06-17 Thread GitBox


nealrichardson closed pull request #7415:
URL: https://github.com/apache/arrow/pull/7415


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645498297


   I think I improved some of the readability problems and addressed the other 
comments. I'd like to merge this soon once CI is creen



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm edited a comment on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645498297


   I think I improved some of the readability problems and addressed the other 
comments. I'd like to merge this soon once CI is green



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-645497918


   @ursabot benchmark --benchmark-filter=Filter c4f425768



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me closed pull request #7399: ARROW-9095: [Rust] Spec-compliant NullArray

2020-06-17 Thread GitBox


nevi-me closed pull request #7399:
URL: https://github.com/apache/arrow/pull/7399


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels

2020-06-17 Thread GitBox


github-actions[bot] commented on pull request #7473:
URL: https://github.com/apache/arrow/pull/7473#issuecomment-645490171


   https://issues.apache.org/jira/browse/ARROW-9162



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs opened a new pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels

2020-06-17 Thread GitBox


kszucs opened a new pull request #7473:
URL: https://github.com/apache/arrow/pull/7473


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7472: ARROW-8892: [C++][CI] CI builds for MSVC do not build benchmarks

2020-06-17 Thread GitBox


github-actions[bot] commented on pull request #7472:
URL: https://github.com/apache/arrow/pull/7472#issuecomment-645481620


   https://issues.apache.org/jira/browse/ARROW-8892



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441674808



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+  filter_is_valid, filter.offset, 
filter.length);
+if (null_selection == FilterOptions::EMIT_NULL) {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextOrNotWord();
+output_size += block.popcount;
+position += block.length;
+  }
+} else {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextAndWord();
+output_size += block.popcount;
+position += block.length;
+  }
+}
+  } else {
+// The filter has no nulls, so we plow through its data as fast as
+// possible.
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, 
filter.length);
+while (position < filter.length) {
+  BitBlockCount block = bit_counter.NextFourWords();
+  output_size += block.popcount;
+  position += block.length;
+}
+  }
+  return output_size;
+}
+
+template 
+Result> GetTakeIndicesImpl(
+const ArrayData& filter, FilterOptions::NullSelectionBehavior 
null_selection,
+MemoryPool* memory_pool) {
+  using T = typename IndexType::c_type;
+  typename TypeTraits::BuilderType builder(memory_pool);
+
+  const uint8_t* filter_data = filter.buffers[1]->data();
+  BitBlockCounter data_counter(filter_data, filter.offset, filter.length);
+
+  // The position relative to the start of the filter
+  T position = 0;
+
+  // The current position taking the filter offset into account
+  int64_t position_with_offset = filter.offset;
+  if (filter.GetNullCount() > 0) {
+// The filter has nulls, so we scan the validity bitmap and the filter data
+// bitmap together, branching on the null selection type.
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+
+// To count blocks whether filter_data[i] || !filter_is_valid[i]
+BinaryBitBlockCounter filter_counter(filter_data, filter.offset, 
filter_is_valid,
+ filter.offset, filter.length);
+if (null_selection == FilterOptions::DROP) {
+  while (position < filter.length) {
+BitBlockCount and_block = filter_counter.NextAndWord();
+RETURN_NOT_OK(builder.Reserve(and_block.popcount));
+if (and_block.AllSet()) {
+  // All the values are selected and non-null
+  for (int64_t i = 0; i < and_block.length; ++i) {
+builder.UnsafeAppend(position++);
+  }
+  position_with_offset += and_block.length;
+ 

[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441668163



##
File path: cpp/src/arrow/compute/api_vector.h
##
@@ -64,6 +67,24 @@ Result Filter(const Datum& values, const Datum& 
filter,
  const FilterOptions& options = FilterOptions::Defaults(),
  ExecContext* ctx = NULLPTR);
 
+namespace internal {
+
+// These internal functions are implemented in kernels/vector_selection.cc
+
+/// \brief Return the number of selected indices in the boolean filter
+ARROW_EXPORT
+int64_t GetFilterOutputSize(const ArrayData& filter,

Review comment:
   OK





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs opened a new pull request #7472: ARROW-8892: [C++][CI] CI builds for MSVC do not build benchmarks

2020-06-17 Thread GitBox


kszucs opened a new pull request #7472:
URL: https://github.com/apache/arrow/pull/7472


   Set `ARROW_BUILD_BENCHMARKS` a windows build which is triggered on pull 
requests.
   
   (also ordered the passed flags alphabetically)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441662165



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+  filter_is_valid, filter.offset, 
filter.length);
+if (null_selection == FilterOptions::EMIT_NULL) {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextOrNotWord();
+output_size += block.popcount;
+position += block.length;
+  }
+} else {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextAndWord();
+output_size += block.popcount;
+position += block.length;
+  }
+}
+  } else {
+// The filter has no nulls, so we plow through its data as fast as
+// possible.
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, 
filter.length);
+while (position < filter.length) {
+  BitBlockCount block = bit_counter.NextFourWords();
+  output_size += block.popcount;
+  position += block.length;
+}
+  }
+  return output_size;
+}
+
+template 
+Result> GetTakeIndicesImpl(
+const ArrayData& filter, FilterOptions::NullSelectionBehavior 
null_selection,
+MemoryPool* memory_pool) {
+  using T = typename IndexType::c_type;
+  typename TypeTraits::BuilderType builder(memory_pool);
+
+  const uint8_t* filter_data = filter.buffers[1]->data();
+  BitBlockCounter data_counter(filter_data, filter.offset, filter.length);
+
+  // The position relative to the start of the filter
+  T position = 0;
+
+  // The current position taking the filter offset into account
+  int64_t position_with_offset = filter.offset;
+  if (filter.GetNullCount() > 0) {
+// The filter has nulls, so we scan the validity bitmap and the filter data
+// bitmap together, branching on the null selection type.
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+
+// To count blocks whether filter_data[i] || !filter_is_valid[i]
+BinaryBitBlockCounter filter_counter(filter_data, filter.offset, 
filter_is_valid,
+ filter.offset, filter.length);
+if (null_selection == FilterOptions::DROP) {
+  while (position < filter.length) {
+BitBlockCount and_block = filter_counter.NextAndWord();
+RETURN_NOT_OK(builder.Reserve(and_block.popcount));
+if (and_block.AllSet()) {
+  // All the values are selected and non-null
+  for (int64_t i = 0; i < and_block.length; ++i) {
+builder.UnsafeAppend(position++);
+  }
+  position_with_offset += and_block.length;
+ 

[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441661392



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+  filter_is_valid, filter.offset, 
filter.length);
+if (null_selection == FilterOptions::EMIT_NULL) {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextOrNotWord();
+output_size += block.popcount;
+position += block.length;
+  }
+} else {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextAndWord();
+output_size += block.popcount;
+position += block.length;
+  }
+}
+  } else {
+// The filter has no nulls, so we plow through its data as fast as
+// possible.
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, 
filter.length);
+while (position < filter.length) {
+  BitBlockCount block = bit_counter.NextFourWords();
+  output_size += block.popcount;
+  position += block.length;
+}
+  }
+  return output_size;
+}
+
+template 
+Result> GetTakeIndicesImpl(
+const ArrayData& filter, FilterOptions::NullSelectionBehavior 
null_selection,
+MemoryPool* memory_pool) {
+  using T = typename IndexType::c_type;
+  typename TypeTraits::BuilderType builder(memory_pool);
+
+  const uint8_t* filter_data = filter.buffers[1]->data();
+  BitBlockCounter data_counter(filter_data, filter.offset, filter.length);
+
+  // The position relative to the start of the filter
+  T position = 0;
+
+  // The current position taking the filter offset into account
+  int64_t position_with_offset = filter.offset;
+  if (filter.GetNullCount() > 0) {
+// The filter has nulls, so we scan the validity bitmap and the filter data
+// bitmap together, branching on the null selection type.
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+
+// To count blocks whether filter_data[i] || !filter_is_valid[i]
+BinaryBitBlockCounter filter_counter(filter_data, filter.offset, 
filter_is_valid,
+ filter.offset, filter.length);
+if (null_selection == FilterOptions::DROP) {
+  while (position < filter.length) {
+BitBlockCount and_block = filter_counter.NextAndWord();
+RETURN_NOT_OK(builder.Reserve(and_block.popcount));
+if (and_block.AllSet()) {
+  // All the values are selected and non-null
+  for (int64_t i = 0; i < and_block.length; ++i) {
+builder.UnsafeAppend(position++);
+  }
+  position_with_offset += and_block.length;
+ 

[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441660992



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+  filter_is_valid, filter.offset, 
filter.length);
+if (null_selection == FilterOptions::EMIT_NULL) {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextOrNotWord();
+output_size += block.popcount;
+position += block.length;
+  }
+} else {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextAndWord();
+output_size += block.popcount;
+position += block.length;
+  }
+}
+  } else {
+// The filter has no nulls, so we plow through its data as fast as
+// possible.
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, 
filter.length);
+while (position < filter.length) {
+  BitBlockCount block = bit_counter.NextFourWords();
+  output_size += block.popcount;
+  position += block.length;
+}
+  }
+  return output_size;
+}
+
+template 
+Result> GetTakeIndicesImpl(
+const ArrayData& filter, FilterOptions::NullSelectionBehavior 
null_selection,
+MemoryPool* memory_pool) {
+  using T = typename IndexType::c_type;
+  typename TypeTraits::BuilderType builder(memory_pool);
+
+  const uint8_t* filter_data = filter.buffers[1]->data();
+  BitBlockCounter data_counter(filter_data, filter.offset, filter.length);
+
+  // The position relative to the start of the filter
+  T position = 0;
+
+  // The current position taking the filter offset into account
+  int64_t position_with_offset = filter.offset;
+  if (filter.GetNullCount() > 0) {
+// The filter has nulls, so we scan the validity bitmap and the filter data
+// bitmap together, branching on the null selection type.
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+
+// To count blocks whether filter_data[i] || !filter_is_valid[i]
+BinaryBitBlockCounter filter_counter(filter_data, filter.offset, 
filter_is_valid,
+ filter.offset, filter.length);
+if (null_selection == FilterOptions::DROP) {
+  while (position < filter.length) {
+BitBlockCount and_block = filter_counter.NextAndWord();
+RETURN_NOT_OK(builder.Reserve(and_block.popcount));
+if (and_block.AllSet()) {
+  // All the values are selected and non-null
+  for (int64_t i = 0; i < and_block.length; ++i) {
+builder.UnsafeAppend(position++);
+  }
+  position_with_offset += and_block.length;
+ 

[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441660523



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,

Review comment:
   Maybe. Let's leave this for follow up because I would need some 
benchmarks to be comfortable doing this





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441659681



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+  filter_is_valid, filter.offset, 
filter.length);
+if (null_selection == FilterOptions::EMIT_NULL) {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextOrNotWord();
+output_size += block.popcount;
+position += block.length;
+  }
+} else {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextAndWord();
+output_size += block.popcount;
+position += block.length;
+  }
+}
+  } else {
+// The filter has no nulls, so we plow through its data as fast as
+// possible.
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, 
filter.length);

Review comment:
   Changing to use CountSetBits, which should be faster





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


github-actions[bot] commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645466049


   https://issues.apache.org/jira/browse/ARROW-9109



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441658427



##
File path: cpp/src/arrow/testing/random.cc
##
@@ -84,7 +84,7 @@ std::shared_ptr RandomArrayGenerator::Boolean(int64_t 
size, double probab
 
   BufferVector buffers{2};
   // Need 2 distinct generators such that probabilities are not shared.
-  GenOpt value_gen(seed(), 0, 1, probability);
+  GenOpt value_gen(seed(), 0, 1, 1 - probability);

Review comment:
   Done, and renamed the parameters to be clear that it's the "true" 
probaility





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441656051



##
File path: cpp/src/arrow/compute/kernels/util_internal.h
##
@@ -0,0 +1,50 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+
+#include "arrow/buffer.h"
+
+namespace arrow {
+namespace compute {
+namespace internal {
+
+// An internal data structure for unpacking a primitive argument to pass to a
+// kernel implementation
+struct PrimitiveArg {
+  const uint8_t* is_valid;
+  const uint8_t* data;

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


github-actions[bot] commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645462522


   Revision: 0517d82791270284f4004ffe7812f08f2bb35235
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-326](https://github.com/ursa-labs/crossbow/branches/all?query=actions-326)
   
   |Task|Status|
   ||--|
   
|wheel-manylinux1-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux1-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux1-cp35m)|
   
|wheel-manylinux1-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux1-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux1-cp36m)|
   
|wheel-manylinux1-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux1-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux1-cp37m)|
   
|wheel-manylinux1-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux1-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux1-cp38)|
   
|wheel-manylinux2010-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2010-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2010-cp35m)|
   
|wheel-manylinux2010-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2010-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2010-cp36m)|
   
|wheel-manylinux2010-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2010-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2010-cp37m)|
   
|wheel-manylinux2010-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2010-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2010-cp38)|
   
|wheel-manylinux2014-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2014-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2014-cp35m)|
   
|wheel-manylinux2014-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2014-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2014-cp36m)|
   
|wheel-manylinux2014-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2014-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2014-cp37m)|
   
|wheel-manylinux2014-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2014-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2014-cp38)|
   
|wheel-osx-cp35m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-326-travis-wheel-osx-cp35m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp36m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-326-travis-wheel-osx-cp36m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp37m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-326-travis-wheel-osx-cp37m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp38|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-326-travis-wheel-osx-cp38.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-326-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-326-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   

[GitHub] [arrow] kszucs closed pull request #6512: ARROW-8430: [CI] Configure self-hosted runners for Github Actions

2020-06-17 Thread GitBox


kszucs closed pull request #6512:
URL: https://github.com/apache/arrow/pull/6512


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


pitrou commented on pull request #7471:
URL: https://github.com/apache/arrow/pull/7471#issuecomment-645461657


   @github-actions crossbow submit -g wheel



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou opened a new pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels

2020-06-17 Thread GitBox


pitrou opened a new pull request #7471:
URL: https://github.com/apache/arrow/pull/7471


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441643717



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+  filter_is_valid, filter.offset, 
filter.length);
+if (null_selection == FilterOptions::EMIT_NULL) {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextOrNotWord();
+output_size += block.popcount;
+position += block.length;
+  }
+} else {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextAndWord();
+output_size += block.popcount;
+position += block.length;
+  }
+}
+  } else {
+// The filter has no nulls, so we plow through its data as fast as
+// possible.
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, 
filter.length);
+while (position < filter.length) {
+  BitBlockCount block = bit_counter.NextFourWords();
+  output_size += block.popcount;
+  position += block.length;
+}
+  }
+  return output_size;
+}
+
+template 
+Result> GetTakeIndicesImpl(
+const ArrayData& filter, FilterOptions::NullSelectionBehavior 
null_selection,
+MemoryPool* memory_pool) {
+  using T = typename IndexType::c_type;
+  typename TypeTraits::BuilderType builder(memory_pool);
+
+  const uint8_t* filter_data = filter.buffers[1]->data();
+  BitBlockCounter data_counter(filter_data, filter.offset, filter.length);
+
+  // The position relative to the start of the filter
+  T position = 0;
+
+  // The current position taking the filter offset into account
+  int64_t position_with_offset = filter.offset;
+  if (filter.GetNullCount() > 0) {

Review comment:
   Someone can do it as follow up (and keeping an eye on the benchmarks to 
avoid perf regressions), it doesn't feel like a good use of my time when there 
are so many things to do for 1.0.0. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction

2020-06-17 Thread GitBox


fsaintjacques commented on a change in pull request #7468:
URL: https://github.com/apache/arrow/pull/7468#discussion_r441638604



##
File path: python/pyarrow/_dataset.pyx
##
@@ -407,42 +407,82 @@ cdef class UnionDataset(Dataset):
 
 
 cdef class FileSystemDataset(Dataset):
-"""A Dataset created from a set of files on a particular filesystem.
+"""A Dataset of file fragments.
+
+A FileSystemDataset is composed of one or more FileFragment.
 
 Parameters
 --
-paths_or_selector : Union[FileSelector, List[FileInfo]]
-List of files/directories to consume.
+fragments : list[Fragments]
+List of fragments to consume.
 schema : Schema
-The top-level schema of the DataDataset.
+The top-level schema of the Dataset.
 format : FileFormat
-File format to create fragments from, currently only
-ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
-filesystem : FileSystem
-The filesystem which files are from.
-partitions : List[Expression], optional
-Attach additional partition information for the file paths.
+File format of the fragments, currently only ParquetFileFormat,
+IpcFileFormat, and CsvFileFormat are supported.
 root_partition : Expression, optional
 The top-level partition of the DataDataset.
 """
 
 cdef:
 CFileSystemDataset* filesystem_dataset
 
-def __init__(self, paths_or_selector, schema=None, format=None,
- filesystem=None, partitions=None, root_partition=None):
+def __init__(self, fragments, Schema schema, FileFormat format,
+ root_partition=None):
 cdef:
-FileInfo info
-Expression expr
 FileFragment fragment
-vector[CFileInfo] c_file_infos
-vector[shared_ptr[CExpression]] c_partitions
-shared_ptr[CFileFragment] c_fragment
 vector[shared_ptr[CFileFragment]] c_fragments
 CResult[shared_ptr[CDataset]] result
 
 root_partition = root_partition or _true
+if not isinstance(root_partition, Expression):
+raise TypeError(
+"Argument 'root_partition' has incorrect type (expected "
+"Epression, got {0})".format(type(root_partition))
+)
+
+for fragment in fragments:
+c_fragments.push_back(
+static_pointer_cast[CFileFragment, CFragment](
+fragment.unwrap()))
+
+result = CFileSystemDataset.Make(
+pyarrow_unwrap_schema(schema),
+( root_partition).unwrap(),
+( format).unwrap(),
+c_fragments
+)
+self.init(GetResultValue(result))
+
+cdef void init(self, const shared_ptr[CDataset]& sp):
+Dataset.init(self, sp)
+self.filesystem_dataset =  sp.get()
+
+@classmethod
+def from_paths(cls, paths, schema=None, format=None,

Review comment:
   Schema shouldn't be optional, add a test if the user doesn't pass a 
schema to see the result.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory

2020-06-17 Thread GitBox


fsaintjacques commented on a change in pull request #7437:
URL: https://github.com/apache/arrow/pull/7437#discussion_r441633483



##
File path: python/pyarrow/tests/test_dataset.py
##
@@ -1522,19 +1522,20 @@ def _create_parquet_dataset_partitioned(root_path):
 @pytest.mark.parquet
 @pytest.mark.pandas
 def test_parquet_dataset_factory_partitioned(tempdir):
-# TODO support for specifying partitioning scheme
-
 root_path = tempdir / "test_parquet_dataset_factory_partitioned"
 metadata_path, table = _create_parquet_dataset_partitioned(root_path)
 
-dataset = ds.parquet_dataset(metadata_path)
-# TODO partition column not yet included
-# assert dataset.schema.equals(table.schema)
+partitioning = ds.partitioning(flavor="hive")
+dataset = ds.parquet_dataset(metadata_path,
+ partitioning=partitioning,
+ partition_base_dir=str(root_path))

Review comment:
   It is not needed anymore and fixed :)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques closed pull request #7438: ARROW-9105: [C++][Dataset][Python] Pass an explicit schema to split_by_row_groups

2020-06-17 Thread GitBox


fsaintjacques closed pull request #7438:
URL: https://github.com/apache/arrow/pull/7438


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


bkietz commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r440923999



##
File path: cpp/src/arrow/compute/api_vector.h
##
@@ -64,6 +67,24 @@ Result Filter(const Datum& values, const Datum& 
filter,
  const FilterOptions& options = FilterOptions::Defaults(),
  ExecContext* ctx = NULLPTR);
 
+namespace internal {
+
+// These internal functions are implemented in kernels/vector_selection.cc
+
+/// \brief Return the number of selected indices in the boolean filter
+ARROW_EXPORT
+int64_t GetFilterOutputSize(const ArrayData& filter,

Review comment:
   This should probably be extracted as a ScalarFunction named popcount or 
so (follow up)

##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+  filter_is_valid, filter.offset, 
filter.length);
+if (null_selection == FilterOptions::EMIT_NULL) {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextOrNotWord();
+output_size += block.popcount;
+position += block.length;
+  }
+} else {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextAndWord();
+output_size += block.popcount;
+position += block.length;
+  }
+}
+  } else {
+// The filter has no nulls, so we plow through its data as fast as
+// possible.
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, 
filter.length);

Review comment:
   If this is faster than `CountSetBits`, the latter should probably be 
rewritten with `BitBlockCounter`

##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include 

[GitHub] [arrow] wesm commented on pull request #7396: ARROW-9092: [C++][TRIAGE] Do not enable TestRoundFunctions when using LLVM 9 until gandiva-decimal-test is fixed

2020-06-17 Thread GitBox


wesm commented on pull request #7396:
URL: https://github.com/apache/arrow/pull/7396#issuecomment-645434797


   It occurs for me every time. My toolchain environment is somewhat complex 
(because I'm using some conda packages) so I'll see if I can provide a 
reproduction for you (it could take me a while to get back to you). FWIW I 
believe @kou encountered this problem also, probably in a more vanilla 
toolchain environment



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory

2020-06-17 Thread GitBox


fsaintjacques commented on a change in pull request #7437:
URL: https://github.com/apache/arrow/pull/7437#discussion_r441621088



##
File path: cpp/src/arrow/dataset/file_parquet.h
##
@@ -215,6 +215,34 @@ class ARROW_DS_EXPORT ParquetFileFragment : public 
FileFragment {
   friend class ParquetFileFormat;
 };
 
+struct ParquetFactoryOptions {
+  // Either an explicit Partitioning or a PartitioningFactory to discover one.
+  //
+  // If a factory is provided, it will be used to infer a schema for partition 
fields
+  // based on file and directory paths then construct a Partitioning. The 
default
+  // is a Partitioning which will yield no partition information.
+  //
+  // The (explicit or discovered) partitioning will be applied to discovered 
files
+  // and the resulting partition information embedded in the Dataset.
+  PartitioningOrFactory partitioning{Partitioning::Default()};
+
+  // For the purposes of applying the partitioning, paths will be stripped
+  // of the partition_base_dir. Files not matching the partition_base_dir
+  // prefix will be skipped for partition discovery. The ignored files will 
still
+  // be part of the Dataset, but will not have partition information.
+  //
+  // Example:
+  // partition_base_dir = "/dataset";
+  //
+  // - "/dataset/US/sales.csv" -> "US/sales.csv" will be given to the 
partitioning
+  //
+  // - "/home/john/late_sales.csv" -> Will be ignored for partition discovery.
+  //
+  // This is useful for partitioning which parses directory when ordering
+  // is important, e.g. DirectoryPartitioning.
+  std::string partition_base_dir;

Review comment:
   That's a good point. I'll follow what FileSystemFactory does with the 
selector base's path.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7287: ARROW-8771: [C++] Add boost/process library to build support

2020-06-17 Thread GitBox


nealrichardson commented on pull request #7287:
URL: https://github.com/apache/arrow/pull/7287#issuecomment-645432605


   That's odd. I wonder what's different about your build setup from the jobs 
we run on CI because I haven't seen that before. Do you think you could add a 
crossbow job that captures this build setup (is it just bundled boost and 
thrift, plus ARROW_FLIGHT=ON and tests on?)
   
   FWIW this boost include has been removed in Thrift 0.13: 
https://github.com/apache/thrift/commit/1f34504f43a7a409364d4114f180762bf2679e57#diff-73a92ed6f7bb65d6e8f29f74ae6c94bf
   
   So if/whenever we can [upgrade to 
0.13](https://issues.apache.org/jira/browse/ARROW-8049), this particular header 
won't ever be invoked. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441613340



##
File path: cpp/src/arrow/compute/kernels/util_internal.cc
##
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/util_internal.h"
+
+#include 
+
+#include "arrow/array/data.h"
+#include "arrow/type.h"
+#include "arrow/util/checked_cast.h"
+
+namespace arrow {
+
+using internal::checked_cast;
+
+namespace compute {
+namespace internal {
+
+const uint8_t* GetValidityBitmap(const ArrayData& data) {
+  const uint8_t* bitmap = nullptr;
+  if (data.buffers[0]) {
+bitmap = data.buffers[0]->data();
+  }
+  return bitmap;
+}
+
+int GetBitWidth(const DataType& type) {
+  return checked_cast(type).bit_width();
+}
+
+PrimitiveArg GetPrimitiveArg(const ArrayData& arr) {
+  PrimitiveArg arg;
+  arg.is_valid = GetValidityBitmap(arr);
+  arg.data = arr.buffers[1]->data();
+  arg.bit_width = GetBitWidth(*arr.type);
+  arg.offset = arr.offset;
+  arg.length = arr.length;
+  if (arg.bit_width > 1) {
+arg.data += arr.offset * arg.bit_width / 8;
+  }
+  arg.null_count = arr.GetNullCount();

Review comment:
   I think now that I've fixed the problem in 
https://github.com/apache/arrow/commit/37c9804784325502bf47b651252c39bdcf3e03a9 
I don't need to compute it here





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441613579



##
File path: cpp/src/arrow/compute/kernels/vector_selection_test.cc
##
@@ -0,0 +1,1637 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/compute/api.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_common.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/testing/util.h"
+
+namespace arrow {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+using util::string_view;
+
+namespace compute {
+
+// --
+// Some random data generation helpers
+
+template 
+std::shared_ptr RandomNumeric(int64_t length, double null_probability,
+ random::RandomArrayGenerator* rng) {
+  return rng->Numeric(length, 0, 127, null_probability);
+}
+
+std::shared_ptr RandomBoolean(int64_t length, double null_probability,
+ random::RandomArrayGenerator* rng) {
+  return rng->Boolean(length, 0.5, null_probability);
+}
+
+std::shared_ptr RandomString(int64_t length, double null_probability,
+random::RandomArrayGenerator* rng) {
+  return rng->String(length, 0, 32, null_probability);
+}
+
+std::shared_ptr RandomLargeString(int64_t length, double 
null_probability,
+ random::RandomArrayGenerator* rng) {
+  return rng->LargeString(length, 0, 32, null_probability);
+}
+
+std::shared_ptr RandomFixedSizeBinary(int64_t length, double 
null_probability,
+ random::RandomArrayGenerator* 
rng) {
+  const int32_t value_size = 16;
+  int64_t data_nbytes = length * value_size;
+  std::shared_ptr data = *AllocateBuffer(data_nbytes);
+  random_bytes(data_nbytes, /*seed=*/0, data->mutable_data());
+  auto validity = rng->Boolean(length, 1 - null_probability);
+
+  // Assemble the data for a FixedSizeBinaryArray
+  auto values_data = 
std::make_shared(fixed_size_binary(value_size), length);
+  values_data->buffers = {validity->data()->buffers[1], data};
+  return MakeArray(values_data);
+}
+
+// --
+
+TEST(GetTakeIndices, Basics) {
+  auto CheckCase = [&](const std::string& filter_json, const std::string& 
indices_json,
+   FilterOptions::NullSelectionBehavior null_selection,
+   const std::shared_ptr& indices_type = 
uint16()) {
+auto filter = ArrayFromJSON(boolean(), filter_json);
+auto expected_indices = ArrayFromJSON(indices_type, indices_json);
+ASSERT_OK_AND_ASSIGN(auto indices,
+ internal::GetTakeIndices(*filter->data(), 
null_selection));
+AssertArraysEqual(*expected_indices, *MakeArray(indices), 
/*verbose=*/true);
+  };
+
+  // Drop null cases
+  CheckCase("[]", "[]", FilterOptions::DROP);
+  CheckCase("[null]", "[]", FilterOptions::DROP);
+  CheckCase("[null, false, true, true, false, true]", "[2, 3, 5]", 
FilterOptions::DROP);
+
+  // Emit null cases
+  CheckCase("[]", "[]", FilterOptions::EMIT_NULL);
+  CheckCase("[null]", "[null]", FilterOptions::EMIT_NULL);
+  CheckCase("[null, false, true, true]", "[null, 2, 3]", 
FilterOptions::EMIT_NULL);
+}
+
+// TODO: Add slicing
+
+template 
+void CheckGetTakeIndicesCase(const Array& untyped_filter) {
+  const auto& filter = checked_cast(untyped_filter);
+  ASSERT_OK_AND_ASSIGN(std::shared_ptr drop_indices,
+   internal::GetTakeIndices(*filter.data(), 
FilterOptions::DROP));
+  // Verify DROP indices
+  {
+IndexArrayType indices(drop_indices);
+int64_t out_position = 0;
+for (int64_t i = 0; i < filter.length(); ++i) {
+  if (filter.IsValid(i)) {
+if (filter.Value(i)) {
+  ASSERT_EQ(indices.Value(out_position), i);
+  ++out_position;
+}
+  }
+}
+// Check that the end length agrees with the output of 

[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441612459



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+FilterOptions::NullSelectionBehavior 
null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+  filter_is_valid, filter.offset, 
filter.length);
+if (null_selection == FilterOptions::EMIT_NULL) {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextOrNotWord();
+output_size += block.popcount;
+position += block.length;
+  }
+} else {
+  while (position < filter.length) {
+BitBlockCount block = bit_counter.NextAndWord();
+output_size += block.popcount;
+position += block.length;
+  }
+}
+  } else {
+// The filter has no nulls, so we plow through its data as fast as
+// possible.
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, 
filter.length);
+while (position < filter.length) {
+  BitBlockCount block = bit_counter.NextFourWords();
+  output_size += block.popcount;
+  position += block.length;
+}
+  }
+  return output_size;
+}
+
+template 
+Result> GetTakeIndicesImpl(
+const ArrayData& filter, FilterOptions::NullSelectionBehavior 
null_selection,
+MemoryPool* memory_pool) {
+  using T = typename IndexType::c_type;
+  typename TypeTraits::BuilderType builder(memory_pool);
+
+  const uint8_t* filter_data = filter.buffers[1]->data();
+  BitBlockCounter data_counter(filter_data, filter.offset, filter.length);
+
+  // The position relative to the start of the filter
+  T position = 0;
+
+  // The current position taking the filter offset into account
+  int64_t position_with_offset = filter.offset;
+  if (filter.GetNullCount() > 0) {
+// The filter has nulls, so we scan the validity bitmap and the filter data
+// bitmap together, branching on the null selection type.
+const uint8_t* filter_is_valid = filter.buffers[0]->data();
+
+// To count blocks whether filter_data[i] || !filter_is_valid[i]
+BinaryBitBlockCounter filter_counter(filter_data, filter.offset, 
filter_is_valid,
+ filter.offset, filter.length);
+if (null_selection == FilterOptions::DROP) {
+  while (position < filter.length) {
+BitBlockCount and_block = filter_counter.NextAndWord();
+RETURN_NOT_OK(builder.Reserve(and_block.popcount));
+if (and_block.AllSet()) {
+  // All the values are selected and non-null
+  for (int64_t i = 0; i < and_block.length; ++i) {
+builder.UnsafeAppend(position++);
+  }
+  position_with_offset += and_block.length;
+ 

[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#discussion_r441609892



##
File path: cpp/src/arrow/compute/kernels/codegen_internal.h
##
@@ -807,11 +885,30 @@ ArrayKernelExec SignedInteger(detail::GetTypeId get_id) {
   }
 }
 
-// Generate a kernel given a templated functor for base binary types
+// Generate a kernel given a templated functor for base binary types. Generates
+// a single kernel for binary/string and large binary / large string. If your
+// kernel implementation needs access to the specific type at compile time,
+// please use BaseBinarySpecific.
 //
 // See "Numeric" above for description of the generator functor
 template  class Generator, typename Type0, typename... 
Args>
 ArrayKernelExec BaseBinary(detail::GetTypeId get_id) {

Review comment:
   I'll move these "generator-dispatchers" to a `generate::` namespace for 
clarity. I thought that `codegen::` was clear enough but I guess not. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size

2020-06-17 Thread GitBox


fsaintjacques commented on a change in pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#discussion_r441606010



##
File path: cpp/src/arrow/compute/kernels/vector_selection_test.cc
##
@@ -0,0 +1,1637 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/compute/api.h"
+#include "arrow/compute/kernels/test_util.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_common.h"
+#include "arrow/testing/gtest_util.h"
+#include "arrow/testing/random.h"
+#include "arrow/testing/util.h"
+
+namespace arrow {
+
+using internal::checked_cast;
+using internal::checked_pointer_cast;
+using util::string_view;
+
+namespace compute {
+
+// --
+// Some random data generation helpers
+
+template 
+std::shared_ptr RandomNumeric(int64_t length, double null_probability,
+ random::RandomArrayGenerator* rng) {
+  return rng->Numeric(length, 0, 127, null_probability);
+}
+
+std::shared_ptr RandomBoolean(int64_t length, double null_probability,
+ random::RandomArrayGenerator* rng) {
+  return rng->Boolean(length, 0.5, null_probability);
+}
+
+std::shared_ptr RandomString(int64_t length, double null_probability,
+random::RandomArrayGenerator* rng) {
+  return rng->String(length, 0, 32, null_probability);
+}
+
+std::shared_ptr RandomLargeString(int64_t length, double 
null_probability,
+ random::RandomArrayGenerator* rng) {
+  return rng->LargeString(length, 0, 32, null_probability);
+}
+
+std::shared_ptr RandomFixedSizeBinary(int64_t length, double 
null_probability,
+ random::RandomArrayGenerator* 
rng) {
+  const int32_t value_size = 16;
+  int64_t data_nbytes = length * value_size;
+  std::shared_ptr data = *AllocateBuffer(data_nbytes);
+  random_bytes(data_nbytes, /*seed=*/0, data->mutable_data());
+  auto validity = rng->Boolean(length, 1 - null_probability);
+
+  // Assemble the data for a FixedSizeBinaryArray
+  auto values_data = 
std::make_shared(fixed_size_binary(value_size), length);
+  values_data->buffers = {validity->data()->buffers[1], data};
+  return MakeArray(values_data);
+}
+
+// --
+
+TEST(GetTakeIndices, Basics) {
+  auto CheckCase = [&](const std::string& filter_json, const std::string& 
indices_json,
+   FilterOptions::NullSelectionBehavior null_selection,
+   const std::shared_ptr& indices_type = 
uint16()) {
+auto filter = ArrayFromJSON(boolean(), filter_json);
+auto expected_indices = ArrayFromJSON(indices_type, indices_json);
+ASSERT_OK_AND_ASSIGN(auto indices,
+ internal::GetTakeIndices(*filter->data(), 
null_selection));
+AssertArraysEqual(*expected_indices, *MakeArray(indices), 
/*verbose=*/true);
+  };
+
+  // Drop null cases
+  CheckCase("[]", "[]", FilterOptions::DROP);
+  CheckCase("[null]", "[]", FilterOptions::DROP);
+  CheckCase("[null, false, true, true, false, true]", "[2, 3, 5]", 
FilterOptions::DROP);
+
+  // Emit null cases
+  CheckCase("[]", "[]", FilterOptions::EMIT_NULL);
+  CheckCase("[null]", "[null]", FilterOptions::EMIT_NULL);
+  CheckCase("[null, false, true, true]", "[null, 2, 3]", 
FilterOptions::EMIT_NULL);
+}
+
+// TODO: Add slicing
+
+template 
+void CheckGetTakeIndicesCase(const Array& untyped_filter) {
+  const auto& filter = checked_cast(untyped_filter);
+  ASSERT_OK_AND_ASSIGN(std::shared_ptr drop_indices,
+   internal::GetTakeIndices(*filter.data(), 
FilterOptions::DROP));
+  // Verify DROP indices
+  {
+IndexArrayType indices(drop_indices);
+int64_t out_position = 0;
+for (int64_t i = 0; i < filter.length(); ++i) {
+  if (filter.IsValid(i)) {
+if (filter.Value(i)) {
+  ASSERT_EQ(indices.Value(out_position), i);
+  ++out_position;
+}
+  }
+}
+// Check that the end length agrees with the output of 

[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#discussion_r441606788



##
File path: cpp/src/arrow/compute/kernels/codegen_internal.h
##
@@ -485,18 +502,44 @@ struct ScalarUnaryNotNullStateful {
   struct ArrayExec> {
 static void Exec(const ThisType& functor, KernelContext* ctx, const 
ExecBatch& batch,
  Datum* out) {
-  typename TypeTraits::BuilderType builder;
-  VisitArrayValuesInline(*batch[0].array(), 
[&](util::optional v) {
-if (v.has_value()) {
-  KERNEL_RETURN_IF_ERROR(ctx, builder.Append(functor.op.Call(ctx, 
*v)));
-} else {
-  KERNEL_RETURN_IF_ERROR(ctx, builder.AppendNull());
+  // TODO: This code path is currently inadequately tested.
+
+  using offset_type = typename Type::offset_type;

Review comment:
   The validity bitmap is already precomputed (zero-copied, in fact) by the 
executor so using the builder naively here is computationally wasteful





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#discussion_r441606788



##
File path: cpp/src/arrow/compute/kernels/codegen_internal.h
##
@@ -485,18 +502,44 @@ struct ScalarUnaryNotNullStateful {
   struct ArrayExec> {
 static void Exec(const ThisType& functor, KernelContext* ctx, const 
ExecBatch& batch,
  Datum* out) {
-  typename TypeTraits::BuilderType builder;
-  VisitArrayValuesInline(*batch[0].array(), 
[&](util::optional v) {
-if (v.has_value()) {
-  KERNEL_RETURN_IF_ERROR(ctx, builder.Append(functor.op.Call(ctx, 
*v)));
-} else {
-  KERNEL_RETURN_IF_ERROR(ctx, builder.AppendNull());
+  // TODO: This code path is currently inadequately tested.
+
+  using offset_type = typename Type::offset_type;

Review comment:
   The validity bitmap is already precomputed (zero-copied, in fact) by the 
executor so using the builder natively here is computationally wasteful





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#discussion_r441605587



##
File path: cpp/src/arrow/compute/kernels/scalar_compare.cc
##
@@ -54,72 +56,106 @@ struct GreaterEqual {
   }
 };
 
-struct Less {
-  template 
-  static constexpr bool Call(KernelContext*, const T& left, const T& right) {
-return left < right;
-  }
-};
-
-struct LessEqual {
-  template 
-  static constexpr bool Call(KernelContext*, const T& left, const T& right) {
-return left <= right;
-  }
-};
+// Implement Greater, GreaterEqual by flipping arguments to Less, LessEqual
 
-template 
-void AddCompare(const std::shared_ptr& ty, ScalarFunction* func) {
-  ArrayKernelExec exec = codegen::ScalarBinaryEqualTypes::Exec;
-  DCHECK_OK(func->AddKernel({ty, ty}, boolean(), exec));
+template 

Review comment:
   Generator and Op are not the same thing in this nomenclature. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#discussion_r441605587



##
File path: cpp/src/arrow/compute/kernels/scalar_compare.cc
##
@@ -54,72 +56,106 @@ struct GreaterEqual {
   }
 };
 
-struct Less {
-  template 
-  static constexpr bool Call(KernelContext*, const T& left, const T& right) {
-return left < right;
-  }
-};
-
-struct LessEqual {
-  template 
-  static constexpr bool Call(KernelContext*, const T& left, const T& right) {
-return left <= right;
-  }
-};
+// Implement Greater, GreaterEqual by flipping arguments to Less, LessEqual
 
-template 
-void AddCompare(const std::shared_ptr& ty, ScalarFunction* func) {
-  ArrayKernelExec exec = codegen::ScalarBinaryEqualTypes::Exec;
-  DCHECK_OK(func->AddKernel({ty, ty}, boolean(), exec));
+template 

Review comment:
   Generator and Op are not the same thing in codegen_internal.h -- I agree 
with giving consistent names to isomorphic concepts, though. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#discussion_r441605115



##
File path: cpp/src/arrow/type.h
##
@@ -1251,6 +1256,7 @@ class ARROW_EXPORT TimestampType : public TemporalType, 
public ParametricType {
 
   static constexpr Type::type type_id = Type::TIMESTAMP;
   using c_type = int64_t;
+  using StorageType = Int64Type;

Review comment:
   OK





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators

2020-06-17 Thread GitBox


wesm commented on a change in pull request #7461:
URL: https://github.com/apache/arrow/pull/7461#discussion_r441604605



##
File path: cpp/src/arrow/scalar.h
##
@@ -237,19 +246,17 @@ struct ARROW_EXPORT FixedSizeBinaryScalar : public 
BinaryScalar {
   explicit FixedSizeBinaryScalar(std::shared_ptr type) : 
BinaryScalar(type) {}
 };
 
-template 
-struct ARROW_EXPORT TemporalScalar : public Scalar {
-  using Scalar::Scalar;
+template 
+struct ARROW_EXPORT TemporalScalar : internal::PrimitiveScalar {

Review comment:
   Int64Scalar and TimestampScalar need to have a common base type in order 
to both be unboxed by templated code that uses Int64Type in the kernel codegen. 
So `PrimitiveScalar` won't work. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




  1   2   >