[GitHub] [arrow] wesm commented on pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files

2020-06-29 Thread GitBox


wesm commented on pull request #7560:
URL: https://github.com/apache/arrow/pull/7560#issuecomment-651423838


   Will merge this if the build passes with the arrow-testing changes



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651456329


   @wesm Would it be better to work on benchmarking in another pull request?




[GitHub] [arrow] github-actions[bot] commented on pull request #7585: ARROW-3520: [C++] Add "list_flatten" vector kernel wrapper for Flatten method of ListArray types

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7585:
URL: https://github.com/apache/arrow/pull/7585#issuecomment-651460741


   https://issues.apache.org/jira/browse/ARROW-3520




[GitHub] [arrow] wesm commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


wesm commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651460507


   Hm not so fast. The macOS py35 failure seems legitimate
   
   https://travis-ci.org/github/ursa-labs/crossbow/builds/703242650#L10060




[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651466752


   @github-actions crossbow submit -g linux -g wheel -g conda




[GitHub] [arrow] github-actions[bot] commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651467252


   Revision: 821f30a834dab99cdc757100e51986384f0a391c
   
   Submitted crossbow builds: [ursa-labs/crossbow @ actions-367](https://github.com/ursa-labs/crossbow/branches/all?query=actions-367)
   
   |Task|Status|
   |----|------|
   |centos-6-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-6-amd64)|
   |centos-7-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-367-travis-centos-7-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |centos-7-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-7-amd64)|
   |centos-8-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-367-travis-centos-8-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   |centos-8-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-8-amd64)|
   |conda-clean|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-clean)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-clean)|
   |conda-linux-gcc-py36-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py36-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py36-cpu)|
   |conda-linux-gcc-py36-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py36-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py36-cuda)|
   |conda-linux-gcc-py37-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py37-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py37-cpu)|
   |conda-linux-gcc-py37-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py37-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py37-cuda)|
   |conda-linux-gcc-py38-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py38-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py38-cpu)|
   |conda-linux-gcc-py38-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py38-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py38-cuda)|
   |conda-osx-clang-py36|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py36)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-osx-clang-py36)|
   |conda-osx-clang-py37|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py37)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-osx-clang-py37)|
   |conda-osx-clang-py38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-osx-clang-py38)|
   |conda-win-vs2015-py36|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-win-vs2015-py36)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-win-vs2015-py36)|
   |conda-win-vs2015-py37|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-win-vs2015-py37)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-win-vs2015-py37)|
   

[GitHub] [arrow] liyafan82 commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


liyafan82 commented on pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#issuecomment-651474250


   @rymurr Thanks for your work. I found a few typos.
   Once those are fixed, I think it will be ready to merge.




[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r447363600



##
File path: java/memory/src/main/java/org/apache/arrow/memory/ArrowBuf.java
##
@@ -227,13 +207,25 @@ public ArrowBuf slice(long index, long length) {
 return newBuf;
   }
 
+  /**
+   * Make a nio byte buffer from this arrowbuf.
+   */
   public ByteBuffer nioBuffer() {
-return asNettyBuffer().nioBuffer();
+return nioBuffer(readerIndex, checkedCastToInt(readableBytes()));
   }
 
+
+  /**
+   *  Make an nio byte buffer from this ArrowBuf.

Review comment:
   nit: an -> a
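
   As context for the change above, the new `nioBuffer(index, length)` builds the
   NIO view directly from the reader index and readable length instead of going
   through Netty. A minimal stdlib sketch of those slicing semantics (illustrative
   only; `NioSliceSketch` and its method are not Arrow code):

```java
import java.nio.ByteBuffer;

public class NioSliceSketch {
    // Mimics nioBuffer(index, length): a zero-based view over
    // [index, index + length) of the storage, sharing its memory.
    static ByteBuffer nioBuffer(ByteBuffer storage, int index, int length) {
        ByteBuffer dup = storage.duplicate(); // independent position/limit
        dup.position(index);
        dup.limit(index + length);
        return dup.slice();                   // view starts at position 0
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {10, 20, 30, 40, 50});
        ByteBuffer view = nioBuffer(buf, 1, 3);
        System.out.println(view.remaining()); // 3
        System.out.println(view.get(0));      // 20
    }
}
```

   Because the view shares memory with the source buffer, writes through either
   are visible in both, which is the behavior callers of `nioBuffer` rely on.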






[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson edited a comment on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042


   I'm taking this over. Outstanding TODOs:
   
   - [x] Add tests
   - [x] Support record batches
   - [x] Support nested types (requires adapting the data structure and adding 
recursion)
   - [x] Test the print method
   - [x] Test/handle bad data in metadata$r; allow users to edit it manually?
   - [x] Don't keep attributes (e.g. factor levels) that we explicitly 
translate to Arrow already
   
   Top-level dataset metadata is deferred to 
https://issues.apache.org/jira/browse/ARROW-9271.




[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651464683


   Apparently the `-DBUILD_SHARED_LIBS=OFF` option for Brotli doesn't do 
anything. I'll add some code to scrub the shared libs from the manylinux images




[GitHub] [arrow] zeevm opened a new pull request #7586: Calculate page and column statistics

2020-06-29 Thread GitBox


zeevm opened a new pull request #7586:
URL: https://github.com/apache/arrow/pull/7586


   1. Calculate page and column statistics.
   2. Use pre-calculated statistics, when available, to speed up writing
data from other formats such as ORC.
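
   For readers following along, the statistics in question are running
   min/max/null-count aggregates kept per page or column chunk. A rough,
   self-contained sketch of the idea (the `IntStats` class is hypothetical, not
   the actual Parquet writer code):

```java
import java.util.Arrays;

public class StatsSketch {
    // Running statistics for one column chunk of nullable ints.
    static final class IntStats {
        long nullCount = 0;
        Integer min = null, max = null;

        void update(Integer v) {
            if (v == null) { nullCount++; return; }
            if (min == null || v < min) min = v;
            if (max == null || v > max) max = v;
        }
    }

    public static void main(String[] args) {
        IntStats stats = new IntStats();
        for (Integer v : Arrays.asList(3, null, 7, 1)) {
            stats.update(v);
        }
        // min=1, max=7, nullCount=1
        System.out.println(stats.min + " " + stats.max + " " + stats.nullCount);
    }
}
```

   When the input format (e.g. ORC) already carries such statistics, the writer
   can copy them instead of re-scanning every value, which is the speed-up
   described in point 2.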




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-29 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-651422684


   Thanks for the comments! I've got some things to deal with these days. I will
address them as soon as possible.




[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson edited a comment on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042


   I'm taking this over. Outstanding TODOs:
   
   - [x] Add tests
   - [x] Support record batches
   - [x] Support nested types (requires adapting the data structure and adding 
recursion)
   - [x] Test the print method
   - [ ] Test/handle bad data in metadata$r; allow users to edit it manually?
   - [x] Don't keep attributes (e.g. factor levels) that we explicitly 
translate to Arrow already
   
   Top-level dataset metadata is deferred to 
https://issues.apache.org/jira/browse/ARROW-9271.




[GitHub] [arrow] emkornfield commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-29 Thread GitBox


emkornfield commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r447396488



##
File path: docs/source/format/Columnar.rst
##
@@ -688,11 +687,10 @@ will have the following layout: ::
 ||---|
 | joemark| unspecified (padding) |
 
-Similar to structs, a particular child array may have a non-null slot
-even if the validity bitmap of the parent union array indicates the
-slot is null.  Additionally, a child array may have a non-null slot
-even if the types array indicates that a slot contains a different
-type at the index.
+Note that a child array may have a non-null slot even if the types array
+indicates that a slot contains a different type at the index, so the

Review comment:
   I have a little trouble parsing this language.  Could it be stated more 
as a positive?
   
   > Only the slot in the array corresponding to the type index is considered.  
All "unselected" values are ignored and could be any semantically correct array 
value.






[GitHub] [arrow] wesm opened a new pull request #7585: ARROW-3520: [C++] WIP Add vector kernel wrapper for Flatten method of ListArray types

2020-06-29 Thread GitBox


wesm opened a new pull request #7585:
URL: https://github.com/apache/arrow/pull/7585


   I'm testing a JIRA webhook; I'll close this PR and then reopen it when the
patch is done.




[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651466442


   Actually, that's crazy. I'm taking the same approach as ZSTD and adding a 
CMake toggle between shared and static Brotli (with default being shared)




[GitHub] [arrow] nealrichardson commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651484901


   Ok, this isn't necessarily pretty but I think it's done, or done enough for 
here. I'll add some more tests, probably some docs for the format, and poke 
around a bit more while doing the tickets for `haven::labelled` and `POSIXlt` 
types. @romainfrancois if you can glance at the changes I made and confirm that 
I haven't butchered things too badly, that would be great, thanks.




[GitHub] [arrow] emkornfield commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-29 Thread GitBox


emkornfield commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r447396655



##
File path: docs/source/format/Columnar.rst
##
@@ -566,33 +572,28 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 ::
 
 * Length: 4, Null count: 1
-* Validity bitmap buffer:
-  |Byte 0 (validity bitmap) | Bytes 1-63|
-  |-|---|
-  |1101 | 0 (padding)   |
-
 * Types buffer:
 
   |Byte 0   | Byte 1  | Byte 2   | Byte 3   | Bytes 4-63  |
   |-|-|--|--|-|
-  | 0   | unspecified | 0| 1| unspecified |
+  | 0   | 0   | 0| 1| unspecified |
 
 * Offset buffer:
 
   |Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
   |--|-||-|-|
-  | 0| unspecified | 1  | 0   | unspecified |
+  | 0| 1   | 2  | 0   | unspecified |
 
 * Children arrays:
   * Field-0 array (f: float):
 * Length: 2, nulls: 0

Review comment:
   should this be nulls: 1 now?
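
   For reference, the dense-union layout under discussion (a types buffer plus
   an offsets buffer indexing into per-type child arrays) can be decoded as in
   this sketch of the document's example values `[{f=1.2}, null, {f=3.4}, {i=5}]`.
   This is illustrative Java, not Arrow's implementation; the union's null is
   carried by a null slot in the float child, which is why the child's null count
   matters:

```java
public class DenseUnionSketch {
    // Decode a dense union: typeIds selects the child array,
    // offsets gives the index into that child.
    static Object[] decode(byte[] typeIds, int[] offsets,
                           Float[] floatChild, Integer[] intChild) {
        Object[] out = new Object[typeIds.length];
        for (int i = 0; i < typeIds.length; i++) {
            out[i] = typeIds[i] == 0 ? floatChild[offsets[i]] : intChild[offsets[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] typeIds = {0, 0, 0, 1};           // types buffer
        int[] offsets  = {0, 1, 2, 0};           // offsets buffer
        Float[] floatChild = {1.2f, null, 3.4f}; // field-0 array: length 3, one null
        Integer[] intChild = {5};                // field-1 array: length 1
        for (Object v : decode(typeIds, offsets, floatChild, intChild)) {
            System.out.println(v);               // 1.2, null, 3.4, 5
        }
    }
}
```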






[GitHub] [arrow] wesm closed pull request #7585: ARROW-3520: [C++] WIP Add vector kernel wrapper for Flatten method of ListArray types

2020-06-29 Thread GitBox


wesm closed pull request #7585:
URL: https://github.com/apache/arrow/pull/7585


   




[GitHub] [arrow] wesm commented on pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types

2020-06-29 Thread GitBox


wesm commented on pull request #7569:
URL: https://github.com/apache/arrow/pull/7569#issuecomment-651457980


   +1. This one is a bit dry, so I would rather reviewers reserve their time for
other PRs.




[GitHub] [arrow] wesm closed pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types

2020-06-29 Thread GitBox


wesm closed pull request #7569:
URL: https://github.com/apache/arrow/pull/7569


   




[GitHub] [arrow] wesm edited a comment on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


wesm edited a comment on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651457062


   @mrkn it's up to you, it's fine with me if you work on performance tuning 
(or at least measurement) in another PR or this one




[GitHub] [arrow] wesm commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


wesm commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651457062


   @mrkn it's up to you, it's fine with me if you work on performance tuning in 
another PR or this one




[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r447363293



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -17,33 +17,103 @@
 
 package org.apache.arrow.memory.rounding;
 
-import java.lang.reflect.Field;
-
-import org.apache.arrow.memory.NettyAllocationManager;
 import org.apache.arrow.memory.util.CommonUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import io.netty.util.internal.SystemPropertyUtil;
 
 /**
  * The default rounding policy. That is, if the requested size is within the 
chunk size,
  * the rounded size will be the next power of two. Otherwise, the rounded size
  * will be identical to the requested size.
  */
 public class DefaultRoundingPolicy implements RoundingPolicy {
-
+  private static final Logger logger = 
LoggerFactory.getLogger(DefaultRoundingPolicy.class);
   public final long chunkSize;
 
   /**
-   * The singleton instance.
+   * The variables here and the static block calculates the DEFAULT_CHUNK_SIZE.
+   *
+   * 
+   *   It was copied from {@link io.netty.buffer.PooledByteBufAllocator}.
+   * 
*/
-  public static final DefaultRoundingPolicy INSTANCE = new 
DefaultRoundingPolicy();
+  private static final int MIN_PAGE_SIZE = 4096;
+  private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 
1) / 2);
+  private static final long DEFAULT_CHUNK_SIZE;
 
-  private DefaultRoundingPolicy() {
+
+  static {
+int defaultPageSize = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192);
+Throwable pageSizeFallbackCause = null;
 try {
-  Field field = 
NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE");
-  field.setAccessible(true);
-  chunkSize = (Long) field.get(null);
-} catch (Exception e) {
-  throw new RuntimeException("Failed to get chunk size from allocation 
manager");
+  validateAndCalculatePageShifts(defaultPageSize);
+} catch (Throwable t) {
+  pageSizeFallbackCause = t;
+  defaultPageSize = 8192;
+}
+
+int defaultMaxOrder = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11);
+Throwable maxOrderFallbackCause = null;
+try {
+  validateAndCalculateChunkSize(defaultPageSize, defaultMaxOrder);
+} catch (Throwable t) {
+  maxOrderFallbackCause = t;
+  defaultMaxOrder = 11;
+}
+DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(defaultPageSize, 
defaultMaxOrder);
+if (logger.isDebugEnabled()) {
+  if (pageSizeFallbackCause == null) {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
defaultPageSize);
+  } else {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
defaultPageSize, pageSizeFallbackCause);
+  }

Review comment:
   I think we only need one parameter here, so it should be:
   
   logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
pageSizeFallbackCause);
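
   Separately from the logging nit, the rounding rule described in the class
   javadoc above (requests that fit within the chunk size round up to the next
   power of two; larger requests are left unchanged) can be sketched as follows.
   The `round` method is an illustrative stand-in, not the actual
   `DefaultRoundingPolicy`/`CommonUtil` code:

```java
public class RoundingSketch {
    // Requests at or below chunkSize round up to the next power of two;
    // larger requests are returned as-is.
    static long round(long requested, long chunkSize) {
        if (requested > chunkSize) {
            return requested;
        }
        long highest = Long.highestOneBit(requested);
        return highest == requested ? requested : highest << 1;
    }

    public static void main(String[] args) {
        long chunk = 16L * 1024 * 1024;              // 16 MiB (8192 << 11, the Netty-style default)
        System.out.println(round(1000, chunk));      // 1024
        System.out.println(round(1024, chunk));      // 1024 (already a power of two)
        System.out.println(round(chunk + 1, chunk)); // unchanged
    }
}
```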






[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r447363481



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -17,33 +17,107 @@
 
 package org.apache.arrow.memory.rounding;
 
-import java.lang.reflect.Field;
-
-import org.apache.arrow.memory.NettyAllocationManager;
 import org.apache.arrow.memory.util.CommonUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import io.netty.util.internal.SystemPropertyUtil;
 
 /**
  * The default rounding policy. That is, if the requested size is within the 
chunk size,
  * the rounded size will be the next power of two. Otherwise, the rounded size
  * will be identical to the requested size.
  */
 public class DefaultRoundingPolicy implements RoundingPolicy {
-
+  private static final Logger logger = 
LoggerFactory.getLogger(DefaultRoundingPolicy.class);
   public final long chunkSize;
 
   /**
-   * The singleton instance.
+   * The variables here and the static block calculates teh DEFAULT_CHUNK_SIZE.
+   *
+   * 
+   *   It was copied from {@link io.netty.buffer.PooledByteBufAllocator}.
+   * 
*/
-  public static final DefaultRoundingPolicy INSTANCE = new 
DefaultRoundingPolicy();
+  private static final int MIN_PAGE_SIZE = 4096;
+  private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 
1) / 2);
+  private static final long DEFAULT_CHUNK_SIZE;
+  private static final int DEFAULT_PAGE_SIZE;
+  private static final int DEFAULT_MAX_ORDER;
 
-  private DefaultRoundingPolicy() {
+
+  static {
+int defaultPageSize = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192);
+Throwable pageSizeFallbackCause = null;
 try {
-  Field field = 
NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE");
-  field.setAccessible(true);
-  chunkSize = (Long) field.get(null);
-} catch (Exception e) {
-  throw new RuntimeException("Failed to get chunk size from allocation 
manager");
+  validateAndCalculatePageShifts(defaultPageSize);
+} catch (Throwable t) {
+  pageSizeFallbackCause = t;
+  defaultPageSize = 8192;
+}
+DEFAULT_PAGE_SIZE = defaultPageSize;
+
+int defaultMaxOrder = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11);
+Throwable maxOrderFallbackCause = null;
+try {
+  validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, defaultMaxOrder);
+} catch (Throwable t) {
+  maxOrderFallbackCause = t;
+  defaultMaxOrder = 11;
+}
+DEFAULT_MAX_ORDER = defaultMaxOrder;
+DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, 
DEFAULT_MAX_ORDER);
+if (logger.isDebugEnabled()) {
+  if (pageSizeFallbackCause == null) {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
DEFAULT_PAGE_SIZE);
+  } else {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
DEFAULT_PAGE_SIZE, pageSizeFallbackCause);
+  }
+  if (maxOrderFallbackCause == null) {
+logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", 
DEFAULT_MAX_ORDER);
+  } else {
+logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", 
DEFAULT_MAX_ORDER, maxOrderFallbackCause);
+  }

Review comment:
   Also here we need only one parameter in the "else" statement.






[GitHub] [arrow] liyafan82 commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-29 Thread GitBox


liyafan82 commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-651492666


   In addition to the problem of the top-level validity buffer, I think there is
another problem to discuss: Java uses the ordinal of the minor type as the
type id, and this does not align with the C++ implementation/specification.
   
   This PR works around the problem by supporting two modes. However, the C++
and Java implementations are still not really aligned, IMO.
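
   The ordinal-versus-declared-id mismatch can be illustrated with plain Java
   enums (a hypothetical subset; `Minor` is not Arrow's actual `MinorType`, and
   the declared ids are made up):

```java
import java.util.Map;

public class TypeIdSketch {
    // Hypothetical subset of a minor-type enum.
    enum Minor { NULL, STRUCT, INT, FLOAT4 }

    public static void main(String[] args) {
        // Ordinal-based ids are implied by the declaration order of the enum:
        System.out.println(Minor.INT.ordinal());        // 2
        // The IPC format instead allows a schema to declare arbitrary type ids,
        // which is what another implementation may emit:
        Map<Minor, Byte> declaredIds = Map.of(Minor.INT, (byte) 5,
                                              Minor.FLOAT4, (byte) 7);
        System.out.println(declaredIds.get(Minor.INT)); // 5
    }
}
```

   A reader keyed on ordinals will misinterpret data written with declared ids,
   which is why the two modes in this PR exist.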




[GitHub] [arrow] liyafan82 edited a comment on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-29 Thread GitBox


liyafan82 edited a comment on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-651492666


   In addition to the problem of the top-level validity buffer, I think there is
another problem to discuss: Java uses the ordinal of the minor type as the
type id, and this does not align with the C++ implementation/specification.
   
   This PR works around the problem by supporting two modes. However, the C++
and Java implementations are still not really aligned, IMO. Do we have a plan to
mitigate this?




[GitHub] [arrow] emkornfield commented on pull request #6156: ARROW-7539: [Java] FieldVector getFieldBuffers API should not set reader/writer indices

2020-06-29 Thread GitBox


emkornfield commented on pull request #6156:
URL: https://github.com/apache/arrow/pull/6156#issuecomment-651509291


   Can we leave the old method in place and mark it as deprecated and remove in 
a later release?




[GitHub] [arrow] emkornfield commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-29 Thread GitBox


emkornfield commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-651515749


   @liyafan82 I think that is a good point.  If it supports both modes I think 
that is a reasonable compromise for now as long as @jacques-n is OK with it.  
But we can maybe discuss further if needed.




[GitHub] [arrow] wesm closed pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files

2020-06-29 Thread GitBox


wesm closed pull request #7560:
URL: https://github.com/apache/arrow/pull/7560


   







[GitHub] [arrow] wesm commented on pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files

2020-06-29 Thread GitBox


wesm commented on pull request #7560:
URL: https://github.com/apache/arrow/pull/7560#issuecomment-651458622


   Yahtzee







[GitHub] [arrow] mrkn edited a comment on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn edited a comment on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651458470


   @wesm OK. I'll continue working on benchmarking in this pull request. If I 
need more time to tune etc., I'll split the issue.







[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651458470


   OK. I'll continue working on benchmarking in this pull request. If I need 
more time to tune etc., I'll split the issue.







[GitHub] [arrow] github-actions[bot] commented on pull request #7586: Calculate page and column statistics

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-651495743


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   







[GitHub] [arrow] kszucs opened a new pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion

2020-06-29 Thread GitBox


kszucs opened a new pull request #7584:
URL: https://github.com/apache/arrow/pull/7584


   The original motivation for this patch was to reuse the same conversion 
path for both scalars and arrays.
   
   In my recent patch the scalars are converted from a single-element list to a 
single-element array, then copied out of it.
   
   In the long term we should convert them directly, perhaps with a more 
generic converter API; until then, this patch aims to reduce code complexity 
without introducing any regressions.
   
   I checked the produced binary size:
   
   ```console
   # BEFORE
   -rwxr-xr-x   1 kszucs  staff   2926832 Jun 29 23:07 
libarrow_python.100.0.0.dylib
   
   # AFTER
   -rwxr-xr-x   1 kszucs  staff   2869136 Jun 29 23:06 
libarrow_python.100.0.0.dylib
   ```
   
   ```
   convert_builtins.ConvertArrayToPyList.time_convert

   type            before     after
   ----------    --------  --------
   int32         13.2±0ms  13.2±0ms
   uint32        13.4±0ms  14.8±0ms
   int64         16.4±0ms  13.7±0ms
   uint64        13.0±0ms  17.4±0ms
   float32       13.5±0ms  16.0±0ms
   float64       16.3±0ms  14.7±0ms
   bool          12.0±0ms  12.0±0ms
   decimal       61.1±0ms  65.0±0ms
   binary        19.1±0ms  18.2±0ms
   binary10      17.9±0ms  17.4±0ms
   ascii         26.0±0ms  25.9±0ms
   unicode       81.4±0ms   111±0ms
   int64 list     145±0ms   152±0ms
   struct         229±0ms   212±0ms

   convert_builtins.ConvertPyListToArray.time_convert

   type                 before    after
   ------------------  -------  -------
   int32               129±0ms  135±0ms
   uint32              139±0ms  134±0ms
   int64               132±0ms  139±0ms
   uint64              135±0ms  132±0ms
   float32             149±0ms  131±0ms
   float64             143±0ms  141±0ms
   bool                126±0ms  130±0ms
   decimal             139±0ms  135±0ms
   binary              145±0ms  130±0ms
   binary10            136±0ms  130±0ms
   ascii               145±0ms  136±0ms
   unicode             148±0ms  149±0ms
   int64 list          174±0ms  176±0ms
   struct              164±0ms  177±0ms
   struct from tuples  165±0ms  169±0ms

   convert_builtins.InferPyListToArray.time_infer

   type          before    after
   ----------   -------  -------
   int64        157±0ms  158±0ms
   float64      155±0ms  153±0ms
   bool         155±0ms  154±0ms
   decimal      249±0ms  233±0ms
   binary       167±0ms  162±0ms
   ascii        181±0ms  154±0ms
   unicode      179±0ms  171±0ms
   int64 list   188±0ms  192±0ms
   struct       177±0ms  181±0ms
   ```
   
   







[GitHub] [arrow] github-actions[bot] commented on pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7584:
URL: https://github.com/apache/arrow/pull/7584#issuecomment-651404016


   https://issues.apache.org/jira/browse/ARROW-9272







[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


wesm commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651368104


   Indeed, toolchain incompatibilities only affect C++ code







[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447259489



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, 
const std::vector
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
const std::shared_ptr& index_value_type,
MemoryPool* pool)
   : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template 
   Status Convert() {
-using c_index_value_type = typename IndexValueType::c_type;
-
RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max()));
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;
+const int value_elsize =
+checked_cast(*tensor_.type()).bit_width() / 
CHAR_BIT;

Review comment:
   I agree with you. I also felt it would be better to extract this idiom as a 
function.
   `internal::GetByteWidth` looks good to me, but how about a member function 
of `FixedWidthType`?









[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447267538



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, 
const std::vector
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
const std::shared_ptr& index_value_type,
MemoryPool* pool)
   : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template 
   Status Convert() {
-using c_index_value_type = typename IndexValueType::c_type;
-
RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max()));
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;
+const int value_elsize =
+checked_cast(*tensor_.type()).bit_width() / 
CHAR_BIT;

Review comment:
   OK. I'll make `internal::GetByteWidth`, and let it return `-1` for 
non-`FixedWidthType` values.









[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson edited a comment on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042


   I'm taking this over. Outstanding TODOs:
   
   - [x] Add tests
   - [x] Support record batches
   - [x] Support nested types (requires adapting the data structure and adding 
recursion)
   - [x] Test the print method
   - [ ] Test/handle bad data in metadata$r; allow users to edit it manually?
   - [ ] Don't keep attributes (e.g. factor levels) that we explicitly 
translate to Arrow already
   
   Top-level dataset metadata is deferred to 
https://issues.apache.org/jira/browse/ARROW-9271.







[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651355763


   > > This means there also needs to be a PKGBUILD
   > 
   > Why? `libutf8proc` is installed.
   
   The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9. Most 
of our deps have to be compiled for both, and this is apparently one of those. 
That's what https://github.com/r-windows/rtools-backports is for.







[GitHub] [arrow] kou commented on a change in pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


kou commented on a change in pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#discussion_r447247927



##
File path: cpp/src/arrow/config.h
##
@@ -0,0 +1,47 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+
+#include "arrow/util/config.h"  // IWYU pragma: export
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+struct BuildInfo {
+  int major;
+  int minor;
+  int patch;

Review comment:
   Oh, sorry. One comment wasn't submitted...
   
   How about adding `version_` prefix because our `ARROW_VERSION_*` have 
`VERSION_` prefix?
   
   ```suggestion
 int version_major;
 int version_minor;
 int version_patch;
   ```









[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651366993


   > The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9
   
   What difference does it make? This is plain C.







[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651369436


   > > The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9
   > 
   > What difference does it make? This is plain C.
   
   :shrug: then I'll leave it to you to sort out, as this is beyond my 
knowledge. In the past, an undefined-symbols error plus a lib only compiled for 
rtools-packages (gcc 8) meant you needed to get it built with rtools-backports 
too. Maybe something's off with the lib that was built; IDK if anyone has 
verified that it works.
   







[GitHub] [arrow] wesm commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


wesm commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447264419



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, 
const std::vector
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
const std::shared_ptr& index_value_type,
MemoryPool* pool)
   : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template 
   Status Convert() {
-using c_index_value_type = typename IndexValueType::c_type;
-
RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max()));
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;
+const int value_elsize =
+checked_cast(*tensor_.type()).bit_width() / 
CHAR_BIT;

Review comment:
   As long as the signature of the function is `int(const DataType&)` it 
sounds good to me. I don't think it should be a non-internal API because it 
does not need to do error-handling. If it's a non-static member of 
FixedWidthType, then a `checked_cast` is still necessary which seems against 
the spirit of making the code simpler
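The helper being discussed (a free function with signature `int(const DataType&)` that does the checked cast and the `bit_width() / CHAR_BIT` division in one place) can be sketched language-neutrally. The class names below are an illustrative Python mock-up of the type hierarchy, not the real Arrow C++ API.

```python
CHAR_BIT = 8  # bits per byte, as in <climits>

class DataType:
    """Illustrative base type (stand-in for arrow::DataType)."""

class FixedWidthType(DataType):
    def __init__(self, bit_width):
        self.bit_width = bit_width

class Int32Type(FixedWidthType):
    def __init__(self):
        super().__init__(32)

def get_byte_width(data_type):
    # Mirrors the proposed internal::GetByteWidth(const DataType&):
    # callers guarantee the type is fixed-width, so no error handling is
    # needed -- just the checked cast plus bit_width / CHAR_BIT.
    assert isinstance(data_type, FixedWidthType)  # the "checked_cast"
    return data_type.bit_width // CHAR_BIT
```

Centralizing the cast here is what lets each converter drop its own `checked_cast(...).bit_width() / CHAR_BIT` boilerplate.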









[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447268862



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, 
const std::vector
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
const std::shared_ptr& index_value_type,
MemoryPool* pool)
   : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template 
   Status Convert() {
-using c_index_value_type = typename IndexValueType::c_type;
-
RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max()));
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;
+const int value_elsize =
+checked_cast(*tensor_.type()).bit_width() / 
CHAR_BIT;

Review comment:
   Oops, I don't need to return `-1` because `checked_cast` is used in this 
function.  Sorry.









[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-650967222


   @wesm Could you please review this?







[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-29 Thread GitBox


romainfrancois commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r446816192



##
File path: r/src/array_from_vector.cpp
##
@@ -918,6 +923,97 @@ class Time64Converter : public TimeConverter {
   }
 };
 
+template 
+class BinaryVectorConverter : public VectorConverter {
+ public:
+  ~BinaryVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+typed_builder_ = checked_cast(builder);
+return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list"));
+R_xlen_t n = XLENGTH(obj);
+
+// Reserve enough space before appending
+int64_t size = 0;
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP obj_i = VECTOR_ELT(obj, i);
+  if (!Rf_isNull(obj_i)) {
+ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP,
+Status::RError("Expecting a raw vector"));
+size += XLENGTH(obj_i);
+  }
+}
+RETURN_NOT_OK(typed_builder_->Reserve(size));
+
+// append
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP obj_i = VECTOR_ELT(obj, i);
+  if (Rf_isNull(obj_i)) {
+RETURN_NOT_OK(typed_builder_->AppendNull());
+  } else {
+RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i)));
+  }
+}
+return Status::OK();
+  }
+
+  Status GetResult(std::shared_ptr* result) {
+return typed_builder_->Finish(result);
+  }
+
+ private:
+  Builder* typed_builder_;
+};
+
+template 
+class StringVectorConverter : public VectorConverter {
+ public:
+  ~StringVectorConverter() {}
+
+  Status Init(ArrayBuilder* builder) {
+typed_builder_ = checked_cast(builder);
+return Status::OK();
+  }
+
+  Status Ingest(SEXP obj) {
+ARROW_RETURN_IF(TYPEOF(obj) != STRSXP,
+Status::RError("Expecting a character vector"));
+R_xlen_t n = XLENGTH(obj);
+
+// Reserve enough space before appending
+int64_t size = 0;
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP string_i = STRING_ELT(obj, i);
+  if (string_i != NA_STRING) {
+size += XLENGTH(string_i);
+  }
+}
+RETURN_NOT_OK(typed_builder_->Reserve(size));
+
+// append
+for (R_xlen_t i = 0; i < n; i++) {
+  SEXP string_i = STRING_ELT(obj, i);
+  if (string_i == NA_STRING) {
+RETURN_NOT_OK(typed_builder_->AppendNull());
+  } else {
+RETURN_NOT_OK(typed_builder_->Append(CHAR(string_i), 
XLENGTH(string_i)));

Review comment:
   I'll have a look once I'm back at this. Overall there seem to be two 
concurrent systems for no real reason, and I believe we should only keep the 
one powered by `VectorToArrayConverter`.
   
   I'm on my `dplyr` week now, but I'll still try to make some space for this. 









[GitHub] [arrow] github-actions[bot] commented on pull request #7579: ARROW-9242: [Java] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7579:
URL: https://github.com/apache/arrow/pull/7579#issuecomment-650985250


   https://issues.apache.org/jira/browse/ARROW-9242







[GitHub] [arrow] tianchen92 opened a new pull request #7579: ARROW-9242: [Java] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


tianchen92 opened a new pull request #7579:
URL: https://github.com/apache/arrow/pull/7579


   







[GitHub] [arrow] tianchen92 commented on a change in pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


tianchen92 commented on a change in pull request #7568:
URL: https://github.com/apache/arrow/pull/7568#discussion_r446812811



##
File path: cpp/src/arrow/ipc/metadata_internal.cc
##
@@ -257,6 +259,9 @@ Status ConcreteTypeFromFlatbuffer(flatbuf::Type type, const 
void* type_data,
   return Status::OK();
 case flatbuf::Type::Decimal: {
   auto dec_type = static_cast(type_data);
+  if (dec_type->bitWidth() != kDecimalBitWidth) {
+return Status::Invalid("Library only supports 128-bit decimal values");

Review comment:
   sorry, bitWidth should return 128 if no value is set. LGTM

##
File path: cpp/src/arrow/ipc/metadata_internal.cc
##
@@ -257,6 +259,9 @@ Status ConcreteTypeFromFlatbuffer(flatbuf::Type type, const 
void* type_data,
   return Status::OK();
 case flatbuf::Type::Decimal: {
   auto dec_type = static_cast(type_data);
+  if (dec_type->bitWidth() != kDecimalBitWidth) {
+return Status::Invalid("Library only supports 128-bit decimal values");

Review comment:
   should also handle the case when bitWidth == null?
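The forward-compatibility check being reviewed (reject decimals whose Flatbuffers `bitWidth` is anything other than 128, where 128 is also the field's default when unset) can be sketched as follows. `DECIMAL_BIT_WIDTH` and `check_decimal_bit_width` are illustrative names, not the real implementation.

```python
DECIMAL_BIT_WIDTH = 128  # the only decimal width supported here

def check_decimal_bit_width(bit_width=None):
    # Flatbuffers returns the schema default (128) when the field was
    # never written; this sketch models that by treating None as 128,
    # which also covers the "bitWidth == null" case raised above.
    width = DECIMAL_BIT_WIDTH if bit_width is None else bit_width
    if width != DECIMAL_BIT_WIDTH:
        raise ValueError("Library only supports 128-bit decimal values")
    return width
```

With the default folded in, an old writer that never set the field still round-trips, while a future 256-bit value fails fast with a clear error.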









[GitHub] [arrow] tianchen92 commented on pull request #7579: ARROW-9242: [Java] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


tianchen92 commented on pull request #7579:
URL: https://github.com/apache/arrow/pull/7579#issuecomment-650997334


   cc @wesm 







[GitHub] [arrow] rymurr commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


rymurr commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r446920091



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java
##
@@ -95,4 +143,26 @@ public static long getByteBufferAddress(ByteBuffer buf) {
 
   private MemoryUtil() {
   }
+
+  /**
+   * Create nio byte buffer.
+   */
+  public static ByteBuffer directBuffer(long address, int capacity) {
+if (DIRECT_BUFFER_CONSTRUCTOR != null) {
+  if (capacity < 0) {
+throw new IllegalArgumentException("Capacity is negative, has to be 
positive or 0");
+  }
+  try {
+return (ByteBuffer) DIRECT_BUFFER_CONSTRUCTOR.newInstance(address, 
capacity);
+  } catch (Throwable cause) {
+// Not expected to ever throw!
+if (cause instanceof Error) {
+  throw (Error) cause;

Review comment:
   None, only that it was how the Netty developers wrote it, and I thought I 
would follow their lead. However, I have simplified it now.









[GitHub] [arrow] rymurr commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


rymurr commented on pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#issuecomment-651078278


   Thanks @liyafan82 I have updated based on your comments and the integration 
tests have passed.







[GitHub] [arrow] rymurr commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


rymurr commented on pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#issuecomment-651081660


   Looks like an ongoing GitHub incident is causing build failures. I'll rebase 
and rebuild once GitHub is back to normal.







[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446943865



##
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##
@@ -68,6 +76,64 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
this->string_type(), "[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
 }
 
+TEST(TestStringKernels, Utf8Upper32bitGrowth) {

Review comment:
   Great, I was hoping such a system existed.









[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446964243



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -39,6 +158,121 @@ struct AsciiLength {
   }
 };
 
+template  class Derived>
+struct Utf8Transform {
+  using offset_type = typename Type::offset_type;
+  using DerivedClass = Derived;
+  using ArrayType = typename TypeTraits::ArrayType;
+
+  static offset_type Transform(const uint8_t* input, offset_type 
input_string_ncodeunits,
+   uint8_t* output) {
+uint8_t* dest = output;
+utf8_transform(input, input + input_string_ncodeunits, dest,
+   DerivedClass::TransformCodepoint);
+return (offset_type)(dest - output);
+  }
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+if (batch[0].kind() == Datum::ARRAY) {
+  std::call_once(flag_case_luts, []() {
+lut_upper_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+lut_lower_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+for (int i = 0; i <= MAX_CODEPOINT_LUT; i++) {
+  lut_upper_codepoint.push_back(utf8proc_toupper(i));
+  lut_lower_codepoint.push_back(utf8proc_tolower(i));
+}
+  });
+  const ArrayData& input = *batch[0].array();
+  ArrayType input_boxed(batch[0].array());
+  ArrayData* output = out->mutable_array();
+
+  offset_type const* input_string_offsets = 
input.GetValues(1);
+  utf8proc_uint8_t const* input_str =

Review comment:
   That's the offset, I don't see a similar method for the data, let me 
know if I'm wrong.









[GitHub] [arrow] pitrou opened a new pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


pitrou opened a new pull request #7580:
URL: https://github.com/apache/arrow/pull/7580


   The AWS SDK in the manylinux packages uses a custom-compiled libcurl that is configured for the certificates of the build system.  However, there's no standard location for CA certificates on Linux, so we must instead configure the filesystem layer at runtime with Python's own TLS configuration values (which are assumed to be correct, otherwise Python itself couldn't connect to TLS hosts).
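The runtime TLS values mentioned here are available from Python's standard library; a small sketch of the lookup (the exact wiring into the filesystem layer is not shown in this thread):

```python
import ssl

# Where Python believes the system CA certificates live. A native
# extension can read these values at runtime instead of trusting the
# paths its bundled libcurl was configured with at build time.
paths = ssl.get_default_verify_paths()
print(paths.cafile)  # e.g. /etc/ssl/certs/ca-certificates.crt, or None
print(paths.capath)  # e.g. /etc/ssl/certs, or None
```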







[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


pitrou commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651139396


   @github-actions crossbow submit -g wheel







[GitHub] [arrow] github-actions[bot] commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651144126


   https://issues.apache.org/jira/browse/ARROW-9261







[GitHub] [arrow] pitrou commented on a change in pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7571:
URL: https://github.com/apache/arrow/pull/7571#discussion_r447002338



##
File path: cpp/src/arrow/ipc/options.h
##
@@ -66,6 +67,10 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// like compression
   bool use_threads = true;
 
+  /// \brief Format version to use for IPC messages and their
+  /// metadata. Presently using V4 version (readable by v0.8.0 and later).
+  MetadataVersion metadata_version = MetadataVersion::V4;

Review comment:
   I find it a bit weird to expose this as an option. Will we be able to write data compatible with other metadata versions?









[GitHub] [arrow] wesm commented on a change in pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata

2020-06-29 Thread GitBox


wesm commented on a change in pull request #7571:
URL: https://github.com/apache/arrow/pull/7571#discussion_r447002951



##
File path: cpp/src/arrow/ipc/options.h
##
@@ -66,6 +67,10 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// like compression
   bool use_threads = true;
 
+  /// \brief Format version to use for IPC messages and their
+  /// metadata. Presently using V4 version (readable by v0.8.0 and later).
+  MetadataVersion metadata_version = MetadataVersion::V4;

Review comment:
   Yes. This is what I've been saying in the V4/V5 MetadataVersion 
discussion e-mail









[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446952433



##
File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
##
@@ -41,7 +42,9 @@ static void UnaryStringBenchmark(benchmark::State& state, const std::string& fun
 ABORT_NOT_OK(CallFunction(func_name, {values}));
   }
   state.SetItemsProcessed(state.iterations() * array_length);
-  state.SetBytesProcessed(state.iterations() * values->data()->buffers[2]->size());
+  state.SetBytesProcessed(state.iterations() *
+  ((touches_offsets ? values->data()->buffers[1]->size() : 0) +

Review comment:
   fair point, indeed makes them more comparable.
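The accounting being agreed on above amounts to counting the offsets buffer alongside the data buffer; a hypothetical sketch (names are illustrative, not the benchmark's API):

```python
# Sketch of the bytes-processed accounting from the diff above: a 32-bit
# string array carries 4-byte offsets (n + 1 of them) plus the raw data
# buffer, and the benchmark multiplies that by the iteration count.
def bytes_processed(iterations, n_strings, data_nbytes, touches_offsets=True):
    offsets_nbytes = 4 * (n_strings + 1) if touches_offsets else 0
    return iterations * (offsets_nbytes + data_nbytes)

print(bytes_processed(10, 3, 12))  # 280 = 10 * (16 + 12)
```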









[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446952269



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -39,6 +158,121 @@ struct AsciiLength {
   }
 };
 
+template <typename Type, template <typename> class Derived>
+struct Utf8Transform {
+  using offset_type = typename Type::offset_type;
+  using DerivedClass = Derived<Type>;
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+
+  static offset_type Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+   uint8_t* output) {
+uint8_t* dest = output;
+utf8_transform(input, input + input_string_ncodeunits, dest,
+   DerivedClass::TransformCodepoint);
+return (offset_type)(dest - output);
+  }
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+if (batch[0].kind() == Datum::ARRAY) {
+  std::call_once(flag_case_luts, []() {
+lut_upper_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+lut_lower_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+for (int i = 0; i <= MAX_CODEPOINT_LUT; i++) {
+  lut_upper_codepoint.push_back(utf8proc_toupper(i));
+  lut_lower_codepoint.push_back(utf8proc_tolower(i));
+}
+  });
+  const ArrayData& input = *batch[0].array();
+  ArrayType input_boxed(batch[0].array());
+  ArrayData* output = out->mutable_array();
+
+  offset_type const* input_string_offsets = input.GetValues<offset_type>(1);
+  utf8proc_uint8_t const* input_str =
+  input.buffers[2]->data() + input_boxed.value_offset(0);
+  offset_type input_ncodeunits = input_boxed.total_values_length();
+  offset_type input_nstrings = (offset_type)input.length;
+
+  // Section 5.18 of the Unicode spec claims that the number of codepoints for case
+  // mapping can grow by a factor of 3. This means grow by a factor of 3 in bytes.
+  // However, since we don't support all casings (SpecialCasing.txt) the growth
+  // is actually only at max 3/2 (as covered by the unittest).
+  // Note that rounding down the 3/2 is ok, since only codepoints encoded by
+  // two code units (even) can grow to 3 code units.
+
+  int64_t output_ncodeunits_max = ((int64_t)input_ncodeunits) * 3 / 2;
+  if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+    ctx->SetStatus(Status::CapacityError(
+        "Result might not fit in a 32bit utf8 array, convert to large_utf8"));
+return;
+  }
+
+  KERNEL_RETURN_IF_ERROR(
+      ctx, ctx->Allocate(output_ncodeunits_max).Value(&output->buffers[2]));
+  // We could reuse the buffer if it is all ascii, benchmarking showed 
this not to
+  // matter
+  // output->buffers[1] = input.buffers[1];
+  KERNEL_RETURN_IF_ERROR(ctx,
+                         ctx->Allocate((input_nstrings + 1) * sizeof(offset_type))
+                             .Value(&output->buffers[1]));
+  utf8proc_uint8_t* output_str = output->buffers[2]->mutable_data();
+  offset_type* output_string_offsets = output->GetMutableValues<offset_type>(1);
+  offset_type output_ncodeunits = 0;
+
+  offset_type output_string_offset = 0;
+  *output_string_offsets = output_string_offset;
+  offset_type input_string_first_offset = input_string_offsets[0];
+  for (int64_t i = 0; i < input_nstrings; i++) {
+offset_type input_string_offset =
+input_string_offsets[i] - input_string_first_offset;
+offset_type input_string_end =
+input_string_offsets[i + 1] - input_string_first_offset;
+    offset_type input_string_ncodeunits = input_string_end - input_string_offset;
+offset_type encoded_nbytes = DerivedClass::Transform(
+input_str + input_string_offset, input_string_ncodeunits,
+output_str + output_ncodeunits);
+output_ncodeunits += encoded_nbytes;
+output_string_offsets[i + 1] = output_ncodeunits;
+  }
+
+  // trim the codepoint buffer, since we allocated too much
+  KERNEL_RETURN_IF_ERROR(
+      ctx,
+      output->buffers[2]->CopySlice(0, output_ncodeunits).Value(&output->buffers[2]));
+} else {
+  const auto& input = checked_cast(*batch[0].scalar());
+  auto result = checked_pointer_cast(MakeNullScalar(out->type()));
+  if (input.is_valid) {
+result->is_valid = true;
+offset_type data_nbytes = (offset_type)input.value->size();
+// See note above in the Array version explaining the 3 / 2
+    KERNEL_RETURN_IF_ERROR(ctx,
+                           ctx->Allocate(data_nbytes * 3 / 2).Value(&result->value));
+offset_type encoded_nbytes = DerivedClass::Transform(
+input.value->data(), data_nbytes, result->value->mutable_data());
+    KERNEL_RETURN_IF_ERROR(
+        ctx, result->value->CopySlice(0, encoded_nbytes).Value(&result->value));
+  }
+  out->value = result;
+}
+  }
+};
+
+template 
+struct 

[GitHub] [arrow] github-actions[bot] commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651141065


   Revision: b0dd66f17e94fbf61da0f1f616ccd5e627a79145
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-365](https://github.com/ursa-labs/crossbow/branches/all?query=actions-365)
   
   |Task|Status|
   ||--|
   
|wheel-manylinux1-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux1-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux1-cp35m)|
   
|wheel-manylinux1-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux1-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux1-cp36m)|
   
|wheel-manylinux1-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux1-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux1-cp37m)|
   
|wheel-manylinux1-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux1-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux1-cp38)|
   
|wheel-manylinux2010-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux2010-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux2010-cp35m)|
   
|wheel-manylinux2010-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux2010-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux2010-cp36m)|
   
|wheel-manylinux2010-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux2010-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux2010-cp37m)|
   
|wheel-manylinux2010-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux2010-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux2010-cp38)|
   
|wheel-manylinux2014-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux2014-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux2014-cp35m)|
   
|wheel-manylinux2014-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux2014-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux2014-cp36m)|
   
|wheel-manylinux2014-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux2014-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux2014-cp37m)|
   
|wheel-manylinux2014-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-365-azure-wheel-manylinux2014-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-365-azure-wheel-manylinux2014-cp38)|
   
|wheel-osx-cp35m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-365-travis-wheel-osx-cp35m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp36m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-365-travis-wheel-osx-cp36m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp37m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-365-travis-wheel-osx-cp37m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp38|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-365-travis-wheel-osx-cp38.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-365-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-365-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   

[GitHub] [arrow] wesm commented on a change in pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


wesm commented on a change in pull request #7568:
URL: https://github.com/apache/arrow/pull/7568#discussion_r446993778



##
File path: cpp/src/arrow/ipc/metadata_internal.cc
##
@@ -257,6 +259,9 @@ Status ConcreteTypeFromFlatbuffer(flatbuf::Type type, const void* type_data,
   return Status::OK();
 case flatbuf::Type::Decimal: {
   auto dec_type = static_cast<const flatbuf::Decimal*>(type_data);
+  if (dec_type->bitWidth() != kDecimalBitWidth) {
+return Status::Invalid("Library only supports 128-bit decimal values");

Review comment:
   The Flatbuffers library takes care of this -- `bitWidth()` always 
returns a value and if the serialized field is not present, the default 128 is 
returned. 
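The default-on-absence behaviour described here is generic Flatbuffers semantics; a toy sketch (plain Python, not the Flatbuffers API) of the reader side:

```python
# Toy model of Flatbuffers scalar defaults: a field equal to its schema
# default is omitted from the wire, and the accessor fills it back in.
def bit_width(table):
    return table.get("bitWidth", 128)  # 128 is the schema default

old_message = {}                  # written before the field existed
new_message = {"bitWidth": 256}   # explicitly non-default

print(bit_width(old_message))  # 128
print(bit_width(new_message))  # 256
```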









[GitHub] [arrow] pitrou commented on a change in pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7571:
URL: https://github.com/apache/arrow/pull/7571#discussion_r447003525



##
File path: cpp/src/arrow/ipc/options.h
##
@@ -66,6 +67,10 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// like compression
   bool use_threads = true;
 
+  /// \brief Format version to use for IPC messages and their
+  /// metadata. Presently using V4 version (readable by v0.8.0 and later).
+  MetadataVersion metadata_version = MetadataVersion::V4;

Review comment:
   Oops, I had misunderstood that.









[GitHub] [arrow] rymurr commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


rymurr commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r446920388



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java
##
@@ -78,6 +77,55 @@ public Object run() {
   Field addressField = java.nio.Buffer.class.getDeclaredField("address");
   addressField.setAccessible(true);
   BYTE_BUFFER_ADDRESS_OFFSET = UNSAFE.objectFieldOffset(addressField);
+
+  Constructor<?> directBufferConstructor;
+  long address = -1;
+  final ByteBuffer direct = ByteBuffer.allocateDirect(1);
+  try {
+
+final Object maybeDirectBufferConstructor =
+        AccessController.doPrivileged(new PrivilegedAction<Object>() {
+  @Override
+  public Object run() {
+try {
+  final Constructor<?> constructor =
+      direct.getClass().getDeclaredConstructor(long.class, int.class);
+  constructor.setAccessible(true);
+  logger.debug("Constructor for direct buffer found and made accessible");
+  return constructor;
+} catch (NoSuchMethodException e) {
+  logger.debug("Cannot get constructor for direct buffer allocation", e);
+  return e;
+} catch (SecurityException e) {
+  logger.debug("Cannot get constructor for direct buffer allocation", e);
+  return e;
+}
+  }
+});
+
+if (maybeDirectBufferConstructor instanceof Constructor) {
+  address = UNSAFE.allocateMemory(1);
+  // try to use the constructor now
+  try {
+    ((Constructor<?>) maybeDirectBufferConstructor).newInstance(address, 1);
+    directBufferConstructor = (Constructor<?>) maybeDirectBufferConstructor;
+logger.debug("direct buffer constructor: available");
+  } catch (InstantiationException | IllegalAccessException | InvocationTargetException e) {
+    logger.debug("unable to instantiate a direct buffer via constructor", e);

Review comment:
   fixed









[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446934057



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -30,6 +31,124 @@ namespace internal {
 
 namespace {
 
+// lookup tables
+std::vector<uint32_t> lut_upper_codepoint;
+std::vector<uint32_t> lut_lower_codepoint;
+std::once_flag flag_case_luts;
+
+constexpr uint32_t REPLACEMENT_CHAR =
+    '?';  // the proper replacement char would be the 0xFFFD codepoint, but that can
+          // increase string length by a factor of 3
+constexpr int MAX_CODEPOINT_LUT = 0xFFFF;  // up to this codepoint is in a lookup table
+
+static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {
+  if (codepoint < 0x80) {
+*str++ = codepoint;
+  } else if (codepoint < 0x800) {
+*str++ = 0xC0 + (codepoint >> 6);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x10000) {
+*str++ = 0xE0 + (codepoint >> 12);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x200000) {
+*str++ = 0xF0 + (codepoint >> 18);
+*str++ = 0x80 + ((codepoint >> 12) & 0x3F);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else {
+*str++ = codepoint;
+  }
+}
+
+static inline bool utf8_is_continuation(const uint8_t codeunit) {
+  return (codeunit & 0xC0) == 0x80;  // upper two bits should be 10
+}
+
+static inline uint32_t utf8_decode(const uint8_t*& str, int64_t& length) {
+  if (*str < 0x80) {  //
+length -= 1;
+return *str++;
+  } else if (*str < 0xC0) {  // invalid non-ascii char
+length -= 1;
+str++;
+return REPLACEMENT_CHAR;

Review comment:
   Would you mind we keep it as is for now, and maybe discuss this 
separately? Happy to take this to the mailing list.









[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446988586



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -39,6 +158,121 @@ struct AsciiLength {
   }
 };
 
+template <typename Type, template <typename> class Derived>
+struct Utf8Transform {
+  using offset_type = typename Type::offset_type;
+  using DerivedClass = Derived<Type>;
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+
+  static offset_type Transform(const uint8_t* input, offset_type input_string_ncodeunits,
+   uint8_t* output) {
+uint8_t* dest = output;
+utf8_transform(input, input + input_string_ncodeunits, dest,
+   DerivedClass::TransformCodepoint);
+return (offset_type)(dest - output);
+  }
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+if (batch[0].kind() == Datum::ARRAY) {
+  std::call_once(flag_case_luts, []() {
+lut_upper_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+lut_lower_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+for (int i = 0; i <= MAX_CODEPOINT_LUT; i++) {
+  lut_upper_codepoint.push_back(utf8proc_toupper(i));
+  lut_lower_codepoint.push_back(utf8proc_tolower(i));
+}
+  });
+  const ArrayData& input = *batch[0].array();
+  ArrayType input_boxed(batch[0].array());
+  ArrayData* output = out->mutable_array();
+
+  offset_type const* input_string_offsets = input.GetValues<offset_type>(1);
+  utf8proc_uint8_t const* input_str =
+  input.buffers[2]->data() + input_boxed.value_offset(0);
+  offset_type input_ncodeunits = input_boxed.total_values_length();
+  offset_type input_nstrings = (offset_type)input.length;
+
+  // Section 5.18 of the Unicode spec claims that the number of codepoints for case
+  // mapping can grow by a factor of 3. This means grow by a factor of 3 in bytes.
+  // However, since we don't support all casings (SpecialCasing.txt) the growth
+  // is actually only at max 3/2 (as covered by the unittest).
+  // Note that rounding down the 3/2 is ok, since only codepoints encoded by
+  // two code units (even) can grow to 3 code units.
+
+  int64_t output_ncodeunits_max = ((int64_t)input_ncodeunits) * 3 / 2;
+  if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+    ctx->SetStatus(Status::CapacityError(
+        "Result might not fit in a 32bit utf8 array, convert to large_utf8"));
+return;
+  }
+
+  KERNEL_RETURN_IF_ERROR(
+      ctx, ctx->Allocate(output_ncodeunits_max).Value(&output->buffers[2]));
+  // We could reuse the buffer if it is all ascii, benchmarking showed 
this not to
+  // matter
+  // output->buffers[1] = input.buffers[1];
+  KERNEL_RETURN_IF_ERROR(ctx,
+                         ctx->Allocate((input_nstrings + 1) * sizeof(offset_type))
+                             .Value(&output->buffers[1]));
+  utf8proc_uint8_t* output_str = output->buffers[2]->mutable_data();
+  offset_type* output_string_offsets = output->GetMutableValues<offset_type>(1);
+  offset_type output_ncodeunits = 0;
+
+  offset_type output_string_offset = 0;
+  *output_string_offsets = output_string_offset;
+  offset_type input_string_first_offset = input_string_offsets[0];
+  for (int64_t i = 0; i < input_nstrings; i++) {
+offset_type input_string_offset =
+input_string_offsets[i] - input_string_first_offset;
+offset_type input_string_end =
+input_string_offsets[i + 1] - input_string_first_offset;
+    offset_type input_string_ncodeunits = input_string_end - input_string_offset;

Review comment:
   Using GetValue now, much cleaner  
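As an aside on the 3/2 bound discussed in this diff, it can be checked from Python, whose str.upper() applies the same simple mappings. The pair U+0250 -> U+2C6F is an assumed worst-case example (special casings like ß -> SS are excluded by the kernel):

```python
# A 2-byte codepoint whose uppercase is a 3-byte codepoint: the 3/2
# worst case for UTF-8 growth under simple (non-special) case mapping.
lower = "\u0250"       # LATIN SMALL LETTER TURNED A
upper = lower.upper()  # U+2C6F LATIN CAPITAL LETTER TURNED A

print(len(lower.encode("utf-8")))  # 2
print(len(upper.encode("utf-8")))  # 3
```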









[GitHub] [arrow] wesm commented on pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


wesm commented on pull request #7568:
URL: https://github.com/apache/arrow/pull/7568#issuecomment-651142403


   +1







[GitHub] [arrow] wesm commented on a change in pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


wesm commented on a change in pull request #7568:
URL: https://github.com/apache/arrow/pull/7568#discussion_r446994593



##
File path: cpp/src/arrow/ipc/metadata_internal.cc
##
@@ -257,6 +259,9 @@ Status ConcreteTypeFromFlatbuffer(flatbuf::Type type, const void* type_data,
   return Status::OK();
 case flatbuf::Type::Decimal: {
   auto dec_type = static_cast<const flatbuf::Decimal*>(type_data);
+  if (dec_type->bitWidth() != kDecimalBitWidth) {
+return Status::Invalid("Library only supports 128-bit decimal values");

Review comment:
   Here's the generated binding -- so if `VT_BITWIDTH` is not present in 
the table then it returns 128
   
   ```c++
 int32_t bitWidth() const {
    return GetField<int32_t>(VT_BITWIDTH, 128);
 }
   ```









[GitHub] [arrow] pitrou commented on pull request #7575: ARROW-8671: [C++][FOLLOWUP] Fix ASAN/UBSAN bug found with IPC fuzz testing files

2020-06-29 Thread GitBox


pitrou commented on pull request #7575:
URL: https://github.com/apache/arrow/pull/7575#issuecomment-651147499


   Did you add the regression file to the testing repository? (in 
`data/arrow-ipc-file` or `data/arrow-ipc-stream`, depending on the fuzzer which 
found it)







[GitHub] [arrow] wesm edited a comment on pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types

2020-06-29 Thread GitBox


wesm edited a comment on pull request #7569:
URL: https://github.com/apache/arrow/pull/7569#issuecomment-650803803


   Benchmarks on gcc-8
   
   ```
   $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 
--benchmark-filter=FilterString
benchmark baselinecontender 
 change %  
counters
   13 FilterStringFilterNoNulls/1048576/01.242 GiB/sec6.766 GiB/sec 
  444.926{'iterations': 890, 'data null%': 0.0, 'mask null%': 0.0, 
'select%': 99.9}
   28 FilterStringFilterNoNulls/1048576/31.216 GiB/sec5.055 GiB/sec 
  315.816{'iterations': 877, 'data null%': 0.1, 'mask null%': 0.0, 
'select%': 99.9}
   8  FilterStringFilterNoNulls/1048576/61.087 GiB/sec2.512 GiB/sec 
  130.990{'iterations': 771, 'data null%': 1.0, 'mask null%': 0.0, 
'select%': 99.9}
   4 FilterStringFilterNoNulls/1048576/12  227.501 MiB/sec  500.272 MiB/sec 
  119.899   {'iterations': 160, 'data null%': 90.0, 'mask null%': 0.0, 
'select%': 99.9}
   23 FilterStringFilterNoNulls/1048576/9  902.994 MiB/sec1.362 GiB/sec 
   54.476   {'iterations': 622, 'data null%': 10.0, 'mask null%': 0.0, 
'select%': 99.9}
   24  FilterStringFilterWithNulls/1048576/12  235.786 MiB/sec  356.124 MiB/sec 
   51.037   {'iterations': 165, 'data null%': 90.0, 'mask null%': 5.0, 
'select%': 99.9}
   5FilterStringFilterWithNulls/1048576/3  996.664 MiB/sec1.361 GiB/sec 
   39.858{'iterations': 687, 'data null%': 0.1, 'mask null%': 5.0, 
'select%': 99.9}
   22   FilterStringFilterWithNulls/1048576/6  988.599 MiB/sec1.331 GiB/sec 
   37.918{'iterations': 685, 'data null%': 1.0, 'mask null%': 5.0, 
'select%': 99.9}
   15   FilterStringFilterWithNulls/1048576/9  913.433 MiB/sec1.205 GiB/sec 
   35.073   {'iterations': 634, 'data null%': 10.0, 'mask null%': 5.0, 
'select%': 99.9}
   7 FilterStringFilterNoNulls/1048576/13  219.235 MiB/sec  292.800 MiB/sec 
   33.555   {'iterations': 155, 'data null%': 90.0, 'mask null%': 0.0, 
'select%': 50.0}
   14FilterStringFilterNoNulls/1048576/101.169 GiB/sec1.559 GiB/sec 
   33.370   {'iterations': 838, 'data null%': 10.0, 'mask null%': 0.0, 
'select%': 50.0}
   25   FilterStringFilterWithNulls/1048576/01.062 GiB/sec1.366 GiB/sec 
   28.690{'iterations': 761, 'data null%': 0.0, 'mask null%': 5.0, 
'select%': 99.9}
   2  FilterStringFilterNoNulls/1048576/11.398 GiB/sec1.763 GiB/sec 
   26.088   {'iterations': 1007, 'data null%': 0.0, 'mask null%': 0.0, 
'select%': 50.0}
   11   FilterStringFilterWithNulls/1048576/41.233 GiB/sec1.521 GiB/sec 
    id  benchmark                                baseline         contender        change %  counters
                                                                                    23.336  {'iterations': 894, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 50.0}
    29  FilterStringFilterWithNulls/1048576/7    1.240 GiB/sec    1.519 GiB/sec      22.426  {'iterations': 881, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 50.0}
     9  FilterStringFilterWithNulls/1048576/10   1.177 GiB/sec    1.370 GiB/sec      16.380  {'iterations': 753, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 50.0}
    26  FilterStringFilterWithNulls/1048576/13   215.181 MiB/sec  249.411 MiB/sec    15.907  {'iterations': 151, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 50.0}
     3  FilterStringFilterNoNulls/1048576/7      1.373 GiB/sec    1.580 GiB/sec      15.117  {'iterations': 983, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 50.0}
    10  FilterStringFilterNoNulls/1048576/4      1.427 GiB/sec    1.607 GiB/sec      12.602  {'iterations': 996, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 50.0}
    17  FilterStringFilterWithNulls/1048576/1    1.364 GiB/sec    1.532 GiB/sec      12.343  {'iterations': 950, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 50.0}
    20  FilterStringFilterNoNulls/1048576/8      19.656 GiB/sec   20.549 GiB/sec      4.540  {'iterations': 14209, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 1.0}
     6  FilterStringFilterWithNulls/1048576/11   12.449 GiB/sec   12.500 GiB/sec      0.407  {'iterations': 8994, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 1.0}
     0  FilterStringFilterNoNulls/1048576/5      20.884 GiB/sec   20.824 GiB/sec     -0.287  {'iterations': 14923, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 1.0}
    19  FilterStringFilterNoNulls/1048576/11     18.171 GiB/sec   17.865 GiB/sec     -1.687  {'iterations': 12875, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 1.0}
    21  FilterStringFilterWithNulls/1048576/14   1.551 GiB/sec    1.519 GiB/sec      -2.057  {'iterations': 1090, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 1.0}
    16  FilterStringFilterNoNulls/1048576/2      21.518 GiB/sec   21.028 GiB/sec     -2.281  {'iterations': 15479, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 1.0}
    12  FilterStringFilterNoNulls/1048576/14     2.367 GiB/sec    2.295 GiB/sec      -3.071  {'iterations': 1690, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 1.0}


[GitHub] [arrow] wesm commented on pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata

2020-06-29 Thread GitBox


wesm commented on pull request #7571:
URL: https://github.com/apache/arrow/pull/7571#issuecomment-651146389


   @pitrou this is already merged (by accident actually, mistyped the PR number 
on the command line and went too fast), but let me know if you see anything 
concerning from a fuzz perspective or otherwise. I fixed one fuzz issue already 
in 
https://github.com/apache/arrow/commit/76c3e4a6d30e279fa5707f7cc14e8aacf00f08a3



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446938183



##
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##
@@ -68,6 +76,64 @@ TYPED_TEST(TestStringKernels, AsciiLower) {
this->string_type(), "[\"aaazzæÆ&\", null, \"\", \"bbb\"]");
 }
 
+TEST(TestStringKernels, Utf8Upper32bitGrowth) {
+  std::string str(0x, 'a');
+  arrow::StringBuilder builder;
+  // 0x7fff * 0x is the max a 32 bit string array can hold
+  // since the utf8_upper kernel can grow it by 3/2, the max we should accept is
+  // 0x7fff * 0x * 2/3 = 0x * 0x, so this should give us a CapacityError
+  for (int64_t i = 0; i < 0x5556; i++) {
+ASSERT_OK(builder.Append(str));
+  }
+  std::shared_ptr array;
+  arrow::Status st = builder.Finish();
+  const FunctionOptions* options = nullptr;
+  EXPECT_RAISES_WITH_MESSAGE_THAT(CapacityError,
+  testing::HasSubstr("Result might not fit"),
+  CallFunction("utf8_upper", {array}, 
options));
+}
+
+TYPED_TEST(TestStringKernels, Utf8Upper) {
+  this->CheckUnary("utf8_upper", "[\"aAazZæÆ&\", null, \"\", \"b\"]", 
this->string_type(),
+   "[\"AAAZZÆÆ&\", null, \"\", \"B\"]");
+
+  // test varying encoding lengths and thus changing indices/offsets
+  this->CheckUnary("utf8_upper", "[\"ɑɽⱤoW\", null, \"ıI\", \"b\"]", 
this->string_type(),
+   "[\"ⱭⱤⱤOW\", null, \"II\", \"B\"]");
+
+  // ῦ to Υ͂ not supported
+  // this->CheckUnary("utf8_upper", "[\"ῦɐɜʞȿ\"]", this->string_type(),
+  // "[\"Υ͂ⱯꞫꞰⱾ\"]");
+
+  // test maximum buffer growth
+  this->CheckUnary("utf8_upper", "[\"\"]", this->string_type(), 
"[\"\"]");
+
+  // Test replacing invalid data by ? (ὖ == \xe1\xbd\x96)
+  this->CheckUnary("utf8_upper", "[\"ɑa\xFFɑ\", \"ɽ\xe1\xbdɽaa\"]", 
this->string_type(),
+   "[\"ⱭA?Ɑ\", \"Ɽ?ⱤAA\"]");
+}
+
+TYPED_TEST(TestStringKernels, Utf8Lower) {
+  this->CheckUnary("utf8_lower", "[\"aAazZæÆ&\", null, \"\", \"b\"]", 
this->string_type(),
+   "[\"aaazzææ&\", null, \"\", \"b\"]");
+
+  // test varying encoding lengths and thus changing indices/offsets
+  this->CheckUnary("utf8_lower", "[\"ⱭɽⱤoW\", null, \"ıI\", \"B\"]", 
this->string_type(),
+   "[\"ɑɽɽow\", null, \"ıi\", \"b\"]");
+
+  // ῦ to Υ͂ is not supported, but in principle the reverse is, but it would 
need
+  // normalization
+  // this->CheckUnary("utf8_lower", "[\"Υ͂ⱯꞫꞰⱾ\"]", this->string_type(),
+  // "[\"ῦɐɜʞȿ\"]");
+
+  // test maximum buffer growth
+  this->CheckUnary("utf8_lower", "[\"\"]", this->string_type(), 
"[\"\"]");
+
+  // Test replacing invalid data by ? (ὖ == \xe1\xbd\x96)

Review comment:
   I added a truncated encoded byte string of that; I'll clarify it in the 
comment.









[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446946562



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -30,6 +31,124 @@ namespace internal {
 
 namespace {
 
+// lookup tables
+std::vector lut_upper_codepoint;
+std::vector lut_lower_codepoint;
+std::once_flag flag_case_luts;
+
+constexpr uint32_t REPLACEMENT_CHAR =
+'?';  // the proper replacement char would be the 0xFFFD codepoint, but 
that can
+  // increase string length by a factor of 3
+constexpr int MAX_CODEPOINT_LUT = 0x;  // up to this codepoint is in a 
lookup table
+
+static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {
+  if (codepoint < 0x80) {
+*str++ = codepoint;
+  } else if (codepoint < 0x800) {
+*str++ = 0xC0 + (codepoint >> 6);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x1) {
+*str++ = 0xE0 + (codepoint >> 12);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x20) {
+*str++ = 0xF0 + (codepoint >> 18);
+*str++ = 0x80 + ((codepoint >> 12) & 0x3F);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else {
+*str++ = codepoint;
+  }
+}
+
+static inline bool utf8_is_continuation(const uint8_t codeunit) {
+  return (codeunit & 0xC0) == 0x80;  // upper two bits should be 10
+}
+
+static inline uint32_t utf8_decode(const uint8_t*& str, int64_t& length) {
+  if (*str < 0x80) {  //
+length -= 1;
+return *str++;
+  } else if (*str < 0xC0) {  // invalid non-ascii char
+length -= 1;
+str++;
+return REPLACEMENT_CHAR;

Review comment:
   Well, I'm not sure why it would make sense to ignore invalid input data 
here, unless not ignoring it has a significant cost (which sounds unlikely).









[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446958717



##
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##
@@ -81,5 +147,40 @@ TYPED_TEST(TestStringKernels, 
StrptimeDoesNotProvideDefaultOptions) {
   ASSERT_RAISES(Invalid, CallFunction("strptime", {input}));
 }
 
+TEST(TestStringKernels, UnicodeLibraryAssumptions) {
+  uint8_t output[4];
+  for (utf8proc_int32_t codepoint = 0x100; codepoint < 0x11; codepoint++) {
+utf8proc_ssize_t encoded_nbytes = utf8proc_encode_char(codepoint, output);
+utf8proc_int32_t codepoint_upper = utf8proc_toupper(codepoint);
+utf8proc_ssize_t encoded_nbytes_upper = 
utf8proc_encode_char(codepoint_upper, output);
+if (encoded_nbytes == 2) {
+  EXPECT_LE(encoded_nbytes_upper, 3)
+  << "Expected the upper case codepoint for a 2 byte encoded codepoint 
to be "
+ "encoded in maximum 3 bytes, not "
+  << encoded_nbytes_upper;
+}
+if (encoded_nbytes == 3) {

Review comment:
   This code tests the assumption that the byte length growth for case 
mapping is at most 3/2, which is what we use to pre-allocate the buffers. 
Unicode codepoints encoded in 4 bytes map to codepoints of 4 bytes or fewer 
when upper cased or lower cased, so we don't need to check them here. We can 
also skip that check for `encoded_nbytes=3`, since it will not matter. I'll 
add a comment explaining why we do the test.









[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446990168



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -30,6 +31,124 @@ namespace internal {
 
 namespace {
 
+// lookup tables
+std::vector lut_upper_codepoint;
+std::vector lut_lower_codepoint;
+std::once_flag flag_case_luts;
+
+constexpr uint32_t REPLACEMENT_CHAR =
+'?';  // the proper replacement char would be the 0xFFFD codepoint, but 
that can
+  // increase string length by a factor of 3
+constexpr int MAX_CODEPOINT_LUT = 0x;  // up to this codepoint is in a 
lookup table
+
+static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {
+  if (codepoint < 0x80) {
+*str++ = codepoint;
+  } else if (codepoint < 0x800) {
+*str++ = 0xC0 + (codepoint >> 6);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x1) {
+*str++ = 0xE0 + (codepoint >> 12);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x20) {
+*str++ = 0xF0 + (codepoint >> 18);
+*str++ = 0x80 + ((codepoint >> 12) & 0x3F);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else {
+*str++ = codepoint;

Review comment:
   Agree, this is not useful. Now also replacing that with '?'. In principle 
it should never happen; it can only happen when there is an internal bug 
(e.g. uppercase gives the wrong codepoint), but this way it's more 
informative.









[GitHub] [arrow] wesm commented on pull request #7575: ARROW-8671: [C++][FOLLOWUP] Fix ASAN/UBSAN bug found with IPC fuzz testing files

2020-06-29 Thread GitBox


wesm commented on pull request #7575:
URL: https://github.com/apache/arrow/pull/7575#issuecomment-651148456


   @pitrou the issue was caught by one of the existing fuzz files -- I had 
accidentally merged the PR without seeing the ASAN failure 







[GitHub] [arrow] rymurr commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


rymurr commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r446921213



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -17,33 +17,107 @@
 
 package org.apache.arrow.memory.rounding;
 
-import java.lang.reflect.Field;
-
-import org.apache.arrow.memory.NettyAllocationManager;
 import org.apache.arrow.memory.util.CommonUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import io.netty.util.internal.SystemPropertyUtil;
 
 /**
  * The default rounding policy. That is, if the requested size is within the 
chunk size,
  * the rounded size will be the next power of two. Otherwise, the rounded size
  * will be identical to the requested size.
  */
 public class DefaultRoundingPolicy implements RoundingPolicy {
-
+  private static final Logger logger = 
LoggerFactory.getLogger(DefaultRoundingPolicy.class);
   public final long chunkSize;
 
   /**
-   * The singleton instance.
+   * The variables here and the static block calculate the DEFAULT_CHUNK_SIZE.
+   *
+   * 
+   *   It was copied from {@link io.netty.buffer.PooledByteBufAllocator}.
+   * 
*/
-  public static final DefaultRoundingPolicy INSTANCE = new 
DefaultRoundingPolicy();
+  private static final int MIN_PAGE_SIZE = 4096;
+  private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 
1) / 2);
+  private static final long DEFAULT_CHUNK_SIZE;
+  private static final int DEFAULT_PAGE_SIZE;
+  private static final int DEFAULT_MAX_ORDER;
 
-  private DefaultRoundingPolicy() {
+
+  static {
+int defaultPageSize = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192);
+Throwable pageSizeFallbackCause = null;
 try {
-  Field field = 
NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE");
-  field.setAccessible(true);
-  chunkSize = (Long) field.get(null);
-} catch (Exception e) {
-  throw new RuntimeException("Failed to get chunk size from allocation 
manager");
+  validateAndCalculatePageShifts(defaultPageSize);
+} catch (Throwable t) {
+  pageSizeFallbackCause = t;
+  defaultPageSize = 8192;
+}
+DEFAULT_PAGE_SIZE = defaultPageSize;
+
+int defaultMaxOrder = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11);
+Throwable maxOrderFallbackCause = null;
+try {
+  validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, defaultMaxOrder);
+} catch (Throwable t) {
+  maxOrderFallbackCause = t;
+  defaultMaxOrder = 11;
+}
+DEFAULT_MAX_ORDER = defaultMaxOrder;
+DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, 
DEFAULT_MAX_ORDER);
+if (logger.isDebugEnabled()) {
+  if (pageSizeFallbackCause == null) {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
DEFAULT_PAGE_SIZE);
+  } else {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
DEFAULT_PAGE_SIZE, pageSizeFallbackCause);
+  }
+  if (maxOrderFallbackCause == null) {
+logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", 
DEFAULT_MAX_ORDER);
+  } else {
+logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", 
DEFAULT_MAX_ORDER, maxOrderFallbackCause);
+  }

Review comment:
   done

##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -17,33 +17,107 @@
 
 package org.apache.arrow.memory.rounding;
 
-import java.lang.reflect.Field;
-
-import org.apache.arrow.memory.NettyAllocationManager;
 import org.apache.arrow.memory.util.CommonUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import io.netty.util.internal.SystemPropertyUtil;
 
 /**
  * The default rounding policy. That is, if the requested size is within the 
chunk size,
  * the rounded size will be the next power of two. Otherwise, the rounded size
  * will be identical to the requested size.
  */
 public class DefaultRoundingPolicy implements RoundingPolicy {
-
+  private static final Logger logger = 
LoggerFactory.getLogger(DefaultRoundingPolicy.class);
   public final long chunkSize;
 
   /**
-   * The singleton instance.
+   * The variables here and the static block calculate the DEFAULT_CHUNK_SIZE.
+   *
+   * 
+   *   It was copied from {@link io.netty.buffer.PooledByteBufAllocator}.
+   * 
*/
-  public static final DefaultRoundingPolicy INSTANCE = new 
DefaultRoundingPolicy();
+  private static final int MIN_PAGE_SIZE = 4096;
+  private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 
1) / 2);
+  private static final long DEFAULT_CHUNK_SIZE;
+  private static final int DEFAULT_PAGE_SIZE;
+  private static final int DEFAULT_MAX_ORDER;
 
-  private DefaultRoundingPolicy() {
+
+  static {
+int defaultPageSize = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192);
+Throwable pageSizeFallbackCause = null;
 try 

[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446934218



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -30,6 +31,124 @@ namespace internal {
 
 namespace {
 
+// lookup tables
+std::vector lut_upper_codepoint;
+std::vector lut_lower_codepoint;
+std::once_flag flag_case_luts;
+
+constexpr uint32_t REPLACEMENT_CHAR =
+'?';  // the proper replacement char would be the 0xFFFD codepoint, but 
that can
+  // increase string length by a factor of 3
+constexpr int MAX_CODEPOINT_LUT = 0x;  // up to this codepoint is in a 
lookup table
+
+static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {

Review comment:
   Great, didn't see that, will move it.









[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r446934815



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -39,6 +158,121 @@ struct AsciiLength {
   }
 };
 
+template  class Derived>
+struct Utf8Transform {
+  using offset_type = typename Type::offset_type;
+  using DerivedClass = Derived;
+  using ArrayType = typename TypeTraits::ArrayType;
+
+  static offset_type Transform(const uint8_t* input, offset_type 
input_string_ncodeunits,
+   uint8_t* output) {
+uint8_t* dest = output;
+utf8_transform(input, input + input_string_ncodeunits, dest,
+   DerivedClass::TransformCodepoint);
+return (offset_type)(dest - output);
+  }
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+if (batch[0].kind() == Datum::ARRAY) {
+  std::call_once(flag_case_luts, []() {
+lut_upper_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+lut_lower_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+for (int i = 0; i <= MAX_CODEPOINT_LUT; i++) {
+  lut_upper_codepoint.push_back(utf8proc_toupper(i));
+  lut_lower_codepoint.push_back(utf8proc_tolower(i));
+}
+  });
+  const ArrayData& input = *batch[0].array();
+  ArrayType input_boxed(batch[0].array());
+  ArrayData* output = out->mutable_array();
+
+  offset_type const* input_string_offsets = 
input.GetValues(1);
+  utf8proc_uint8_t const* input_str =
+  input.buffers[2]->data() + input_boxed.value_offset(0);
+  offset_type input_ncodeunits = input_boxed.total_values_length();
+  offset_type input_nstrings = (offset_type)input.length;
+
+  // Section 5.18 of the Unicode spec claims that the number of codepoints for
+  // case mapping can grow by a factor of 3, which means growth by a factor of
+  // 3 in bytes. However, since we don't support all casings (SpecialCasing.txt)
+  // the growth is actually at most 3/2 (as covered by the unit test).
+  // Note that rounding down the 3/2 is ok, since only codepoints encoded 
by
+  // two code units (even) can grow to 3 code units.
+
+  int64_t output_ncodeunits_max = ((int64_t)input_ncodeunits) * 3 / 2;
+  if (output_ncodeunits_max > std::numeric_limits::max()) {
+ctx->SetStatus(Status::CapacityError(
+"Result might not fit in a 32bit utf8 array, convert to 
large_utf8"));
+return;
+  }
+
+  KERNEL_RETURN_IF_ERROR(
+  ctx, 
ctx->Allocate(output_ncodeunits_max).Value(>buffers[2]));
+  // We could reuse the buffer if it is all ascii, benchmarking showed 
this not to
+  // matter
+  // output->buffers[1] = input.buffers[1];
+  KERNEL_RETURN_IF_ERROR(ctx,
+ ctx->Allocate((input_nstrings + 1) * 
sizeof(offset_type))
+ .Value(>buffers[1]));
+  utf8proc_uint8_t* output_str = output->buffers[2]->mutable_data();
+  offset_type* output_string_offsets = 
output->GetMutableValues(1);
+  offset_type output_ncodeunits = 0;
+
+  offset_type output_string_offset = 0;
+  *output_string_offsets = output_string_offset;
+  offset_type input_string_first_offset = input_string_offsets[0];
+  for (int64_t i = 0; i < input_nstrings; i++) {
+offset_type input_string_offset =
+input_string_offsets[i] - input_string_first_offset;
+offset_type input_string_end =
+input_string_offsets[i + 1] - input_string_first_offset;
+offset_type input_string_ncodeunits = input_string_end - 
input_string_offset;
+offset_type encoded_nbytes = DerivedClass::Transform(
+input_str + input_string_offset, input_string_ncodeunits,
+output_str + output_ncodeunits);
+output_ncodeunits += encoded_nbytes;
+output_string_offsets[i + 1] = output_ncodeunits;
+  }
+
+  // trim the codepoint buffer, since we allocated too much
+  KERNEL_RETURN_IF_ERROR(
+  ctx,
+  output->buffers[2]->CopySlice(0, 
output_ncodeunits).Value(>buffers[2]));
+} else {
+  const auto& input = checked_cast(*batch[0].scalar());
+  auto result = 
checked_pointer_cast(MakeNullScalar(out->type()));
+  if (input.is_valid) {
+result->is_valid = true;
+offset_type data_nbytes = (offset_type)input.value->size();
+// See note above in the Array version explaining the 3 / 2
+KERNEL_RETURN_IF_ERROR(ctx,
+   ctx->Allocate(data_nbytes * 3 / 
2).Value(>value));
+offset_type encoded_nbytes = DerivedClass::Transform(
+input.value->data(), data_nbytes, result->value->mutable_data());
+KERNEL_RETURN_IF_ERROR(
+ctx, result->value->CopySlice(0, 
encoded_nbytes).Value(>value));
+  }
+  out->value = result;
+}
+  }
+};
+
+template 
+struct 

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-29 Thread GitBox


wjones1 commented on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-651140872


   Actually @jorisvandenbossche, I agree we should probably just add in the 
batch_size argument (with a sensible default) to those other methods. Took me a 
while to understand what you were getting at, but by providing those parameters 
we avoid the issue where `iter_batches()` calls could mess with the underlying 
parameters when calling the other methods.
   
   I will implement those soon. I would especially welcome input on what the 
docstring should be for the `batch_size` parameter for those other methods 
(what is the user-facing impact of changing it?) and what the default value 
should be.







[GitHub] [arrow] wesm closed pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


wesm closed pull request #7568:
URL: https://github.com/apache/arrow/pull/7568


   







[GitHub] [arrow] wesm commented on a change in pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata

2020-06-29 Thread GitBox


wesm commented on a change in pull request #7571:
URL: https://github.com/apache/arrow/pull/7571#discussion_r447004437



##
File path: cpp/src/arrow/ipc/options.h
##
@@ -66,6 +67,10 @@ struct ARROW_EXPORT IpcWriteOptions {
   /// like compression
   bool use_threads = true;
 
+  /// \brief Format version to use for IPC messages and their
+  /// metadata. Presently using V4 version (readable by v0.8.0 and later).
+  MetadataVersion metadata_version = MetadataVersion::V4;

Review comment:
   see https://issues.apache.org/jira/browse/ARROW-9265









[GitHub] [arrow] xhochy commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


xhochy commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651162033


   > The manylinux import test 
[fails](https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=14018=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=3367cfe7-6aa5-5a88-c888-e6814d74f6eb=248)
 after removing `libz.so`, because it is necessary for some standard library 
modules.
   > @xhochy @kszucs Any idea how to circumvent this?
   
   Oh, where is this removal happening? Do we do that in our scripts or is 
this `manylinux1`?







[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651165626


   @pitrou many thanks for the review. I've implemented all your suggestions 
except:
* Raising an error on invalid utf8 data (see comment)
* Having a benchmark run on non-ascii codepoints (I think we want to do 
this separately from this PR, but it's an important point).
   
   Btw, I wasn't aware of the utf8 code already in Arrow. The existing 
decoder based on http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. 
Very interesting work, but switching to it unfortunately led to a performance 
regression (~50 -> 30 M/s), which actually surprised me. Maybe worth comparing 
again once we have a benchmark with non-ascii codepoints.
   
   @wesm I hope this is ready to go  







[GitHub] [arrow] rjzamora commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-29 Thread GitBox


rjzamora commented on pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#issuecomment-651177576


   Thanks for the great work here @bkietz !
   
   This is wonderful - Dask uses the min/max statistics to calculate 
`divisions`, so this functionality is definitely necessary.
   
   
*A note on other (less-critical, but useful) statistics*:
   Dask also uses the `"total_byte_size"` statistics (for the full row-group, 
not each column) to aggregate partitions before reading in any data.  There is 
also a plan to use the `"num-rows”` statistics when the user executes 
`len(ddf)` (to avoid loading any data).   **How difficult would it be to 
add/expose these additional row-group statistics?**  Again, this is much less 
of a “blocker” for initial integration with Dask, but are likely things we will 
want to add in eventually.  cc @jorisvandenbossche 







[GitHub] [arrow] wesm closed pull request #7579: ARROW-9242: [Java] Add forward compatibility check for Decimal bit width

2020-06-29 Thread GitBox


wesm closed pull request #7579:
URL: https://github.com/apache/arrow/pull/7579


   







[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651186700


   If you're installing Brotli in any of the packaging setups, there may be a 
scenario where there is both the shared AND static library -- in that case 
there would be an issue. We need to see why the manylinux* builds have 
`libbrotli*.so` in them -- if for some reason Brotli's build is insubordinate 
and churlish we might have to add a flag to do a hard switch between 
shared/static like we do with ZSTD 







[GitHub] [arrow] kszucs closed pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location

2020-06-29 Thread GitBox


kszucs closed pull request #7376:
URL: https://github.com/apache/arrow/pull/7376


   







[GitHub] [arrow] nealrichardson commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651182567


   @wesm how can I help/what should I look for?







[GitHub] [arrow] nealrichardson commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651188772


   IIUC we're ok:
   
   * Windows: no brotli: 
https://github.com/apache/arrow/blob/master/ci/scripts/PKGBUILD
   * macOS: no brotli: 
https://github.com/apache/arrow/blob/master/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb
   * Linux: maybe brotli, but it's all static build anyway: 
https://github.com/apache/arrow/blob/master/r/inst/build_arrow_static.sh







[GitHub] [arrow] jba commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location

2020-06-29 Thread GitBox


jba commented on pull request #7376:
URL: https://github.com/apache/arrow/pull/7376#issuecomment-651164902


   There's no feasible way to test this, unfortunately. I verified both my 
suggested changes by forking the repo, making the edits, and running an 
internal tool to verify that they worked. The external alternative would be to 
`go get` the package at the latest release to inform the Go module proxy, wait 
a few minutes, then check for it on pkg.go.dev. That would be a complicated 
test to automate, and would almost certainly be flaky, with little benefit.







[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


pitrou commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651159316


   The manylinux import test [fails](https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=14018=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=3367cfe7-6aa5-5a88-c888-e6814d74f6eb=248) after removing `libz.so`, because it is necessary for some standard library modules.
   @xhochy @kszucs Any idea how to circumvent this?







[GitHub] [arrow] kszucs commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location

2020-06-29 Thread GitBox


kszucs commented on pull request #7376:
URL: https://github.com/apache/arrow/pull/7376#issuecomment-651170232


   Thanks @jba for the update. Merging it then.







[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


pitrou commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651181336


   @xhochy In https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/manylinux-test.sh#L33. I'm sure there's a better way to do that (e.g. spawn a minimal Python docker image).







[GitHub] [arrow] wesm commented on pull request #7559: ARROW-9247: [Python] Expose total_values_length functions on BinaryArray, LargeBinaryArray

2020-06-29 Thread GitBox


wesm commented on pull request #7559:
URL: https://github.com/apache/arrow/pull/7559#issuecomment-651187351


   cc @pitrou or @jorisvandenbossche for review







[GitHub] [arrow] wesm commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


wesm commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447030010



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector<int64_t>& coord, const std::vector<int64_t>& shape)
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor<TYPE>;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
+                           const std::shared_ptr<DataType>& index_value_type,
+                           MemoryPool* pool)
+      : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template <typename IndexValueType>
   Status Convert() {
-    using c_index_value_type = typename IndexValueType::c_type;
-
-    RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits<c_index_value_type>::max()));
+    RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape()));
+
+    const int index_elsize =
+        checked_cast<const FixedWidthType&>(*index_value_type_).bit_width() / CHAR_BIT;
+    const int value_elsize =
+        checked_cast<const FixedWidthType&>(*tensor_.type()).bit_width() / CHAR_BIT;

Review comment:
   Could be helper function, maybe we can put an `internal::GetByteWidth` 
function in `arrow/type.h` that does this? 
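   A minimal, self-contained sketch of the helper being suggested here. It does not use the real Arrow headers; the `FixedWidthType` stand-in below mocks just enough of `bit_width()` to show the "bit width to byte width" computation being factored out, and the name `GetByteWidth` follows the suggestion above (all names are illustrative, not Arrow's actual API):

   ```cpp
   #include <cassert>
   #include <climits>

   // Stand-in for arrow::FixedWidthType: any type exposing a bit_width() accessor.
   struct FixedWidthType {
     int bits;
     int bit_width() const { return bits; }
   };

   // Sketch of the proposed helper: convert a fixed-width type's bit width to
   // its byte width, replacing the repeated `bit_width() / CHAR_BIT` pattern.
   inline int GetByteWidth(const FixedWidthType& type) {
     return type.bit_width() / CHAR_BIT;
   }

   int main() {
     const FixedWidthType int32_like{32};
     const FixedWidthType int64_like{64};
     assert(GetByteWidth(int32_like) == 4);  // 32 bits -> 4 bytes
     assert(GetByteWidth(int64_like) == 8);  // 64 bits -> 8 bytes
     return 0;
   }
   ```

   With such a helper, both converter files could call one function instead of repeating the `checked_cast`/`CHAR_BIT` expression.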

##
File path: cpp/src/arrow/tensor/csx_converter.cc
##
@@ -0,0 +1,245 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/tensor/converter.h"
+
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/buffer.h"
+#include "arrow/status.h"
+#include "arrow/type.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/visitor_inline.h"
+
+namespace arrow {
+
+class MemoryPool;
+
+namespace internal {
+namespace {
+
+// --
+// SparseTensorConverter for SparseCSRIndex
+
+class SparseCSXMatrixConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
+
+ public:
+  SparseCSXMatrixConverter(SparseMatrixCompressedAxis axis, const Tensor& tensor,
+                           const std::shared_ptr<DataType>& index_value_type,
+                           MemoryPool* pool)
+      : axis_(axis), tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
+
+  Status Convert() {
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+    const int index_elsize =
+        checked_cast<const FixedWidthType&>(*index_value_type_).bit_width() / CHAR_BIT;

Review comment:
   Can you factor this "get byte width" operation into a helper function? 

##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -148,84 +161,130 @@ class SparseCSFTensorConverter {
 return Status::OK();
   }
 
-#define CALL_TYPE_SPECIFIC_CONVERT(TYPE_CLASS) \
-  case TYPE_CLASS##Type::type_id:              \
-    return Convert<TYPE_CLASS##Type>();
-
-  Status Convert() {
-switch (index_value_type_->id()) {
-  ARROW_GENERATE_FOR_ALL_INTEGER_TYPES(CALL_TYPE_SPECIFIC_CONVERT);
-  // LCOV_EXCL_START: The following invalid causes program failure.
-  default:
-return Status::TypeError("Unsupported SparseTensor index value type");
-// LCOV_EXCL_STOP
-}
-  }
-
-#undef CALL_TYPE_SPECIFIC_CONVERT
-
   std::shared_ptr<SparseCSFIndex> sparse_index;
   std::shared_ptr<Buffer> data;
 
  private:
-  const NumericTensorType& tensor_;
+  const Tensor& tensor_;
   const std::shared_ptr<DataType>& index_value_type_;
   MemoryPool* pool_;
+};
 
-  template <typename c_value_type>
-  inline Status CheckMaximumValue(const c_value_type type_max) const {
-    auto max_dimension =
-        *std::max_element(tensor_.shape().begin(), tensor_.shape().end());
-    if (static_cast<int64_t>(type_max) < max_dimension) {
-      // LCOV_EXCL_START: The following invalid causes program failure.
-      return Status::Invalid("The bit width of the index value type is too small");
-      // LCOV_EXCL_STOP
-    }
-    return Status::OK();
-return Status::OK();
+class 
