[GitHub] [arrow] emkornfield commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-29 Thread GitBox


emkornfield commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r447396655



##
File path: docs/source/format/Columnar.rst
##
@@ -566,33 +572,28 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 ::
 
 * Length: 4, Null count: 1
-* Validity bitmap buffer:
-  |Byte 0 (validity bitmap) | Bytes 1-63|
-  |-|---|
-  |1101 | 0 (padding)   |
-
 * Types buffer:
 
   |Byte 0   | Byte 1  | Byte 2   | Byte 3   | Bytes 4-63  |
   |-|-|--|--|-|
-  | 0   | unspecified | 0| 1| unspecified |
+  | 0   | 0   | 0| 1| unspecified |
 
 * Offset buffer:
 
   |Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
   |--|-||-|-|
-  | 0| unspecified | 1  | 0   | unspecified |
+  | 0| 1   | 2  | 0   | unspecified |
 
 * Children arrays:
   * Field-0 array (f: float):
 * Length: 2, nulls: 0

Review comment:
   should this be nulls: 1 now?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-29 Thread GitBox


emkornfield commented on a change in pull request #7535:
URL: https://github.com/apache/arrow/pull/7535#discussion_r447396488



##
File path: docs/source/format/Columnar.rst
##
@@ -688,11 +687,10 @@ will have the following layout: ::
 ||---|
 | joemark| unspecified (padding) |
 
-Similar to structs, a particular child array may have a non-null slot
-even if the validity bitmap of the parent union array indicates the
-slot is null.  Additionally, a child array may have a non-null slot
-even if the types array indicates that a slot contains a different
-type at the index.
+Note that a child array may have a non-null slot even if the types array
+indicates that a slot contains a different type at the index, so the

Review comment:
   I have a little trouble parsing this language.  Could it be stated more 
as a positive?
   
   > Only the slot in the array corresponding to the type index is considered.  
All "unselected" values are ignored and could be any semantically correct array 
value.
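   
   For illustration, a minimal sketch (plain Java with made-up data, not Arrow's 
API) of that rule applied to a dense union: only the child slot selected by the 
type id is read, and every other child slot is ignored.
   
   ```java
   // Hypothetical dense union with values [1.2f, 5, 3.4f, 7].
   // Child 0 holds floats, child 1 holds ints; the data below is made up.
   public class DenseUnionSketch {
     public static void main(String[] args) {
       byte[] typeIds = {0, 1, 0, 1};       // which child each slot selects
       int[] offsets = {0, 0, 1, 1};        // index into the selected child
       float[] floatChild = {1.2f, 3.4f};   // only the selected slots matter
       int[] intChild = {5, 7};             // unselected slots could hold anything
   
       for (int i = 0; i < typeIds.length; i++) {
         // Only the slot in the child selected by typeIds[i] is considered.
         Object value = (typeIds[i] == 0) ? floatChild[offsets[i]] : intChild[offsets[i]];
         System.out.println("slot " + i + " = " + value);
       }
     }
   }
   ```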





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-29 Thread GitBox


emkornfield commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-651515749


   @liyafan82 I think that is a good point.  If it supports both modes I think 
that is a reasonable compromise for now as long as @jacques-n is OK with it.  
But we can maybe discuss further if needed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #6156: ARROW-7539: [Java] FieldVector getFieldBuffers API should not set reader/writer indices

2020-06-29 Thread GitBox


emkornfield commented on pull request #6156:
URL: https://github.com/apache/arrow/pull/6156#issuecomment-651509291


   Can we leave the old method in place, mark it as deprecated, and remove it in 
a later release?
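   
   A rough sketch of that pattern (hypothetical class and method names, not the 
actual FieldVector API): the old entry point stays, delegates to the new one, and 
is flagged for removal in a later release.
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   
   public class BufferHolder {
     /**
      * Old behaviour kept for compatibility; callers should migrate to
      * {@link #getBuffers()}. Scheduled for removal in a later release.
      */
     @Deprecated
     public List<String> getFieldBuffers() {
       return getBuffers();
     }
   
     /** New entry point (hypothetical name) without the old side effects. */
     public List<String> getBuffers() {
       return Arrays.asList("validity", "data");
     }
   }
   ```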



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7586: Calculate page and column statistics

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7586:
URL: https://github.com/apache/arrow/pull/7586#issuecomment-651495743


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename the pull request title to follow this format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 edited a comment on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-29 Thread GitBox


liyafan82 edited a comment on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-651492666


   In addition to the problem of the top-level validity buffer, I think there is 
another problem to discuss: Java uses the ordinal of the minor type as the 
type id, and this is not aligned with the C++ implementation/specification.
   
   This PR works around the problem by supporting two modes. However, the C++ 
and Java implementations are not really aligned, IMO. Do we have a plan to 
mitigate this?
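   
   As a purely conceptual illustration (plain Java, no Arrow classes), deriving 
type ids from enum ordinals ties them to the enum layout rather than to the ids 
declared on the union type, which is where the two implementations can disagree:
   
   ```java
   import java.util.Arrays;
   
   public class TypeIdSketch {
     // Stand-in for a subset of Java's MinorType enum (made-up ordering).
     enum MinorType { NULL, INT, BIGINT, FLOAT4 }
   
     public static void main(String[] args) {
       // Ids a producer may declare on the union type; the spec allows any
       // ids, in any order, each mapping to one child of the union.
       int[] declaredTypeIds = {2, 7};
   
       // Id that would be written if the enum ordinal is used instead.
       int ordinalBasedId = MinorType.FLOAT4.ordinal();  // 3
   
       System.out.println("declared ids:     " + Arrays.toString(declaredTypeIds));
       System.out.println("ordinal-based id: " + ordinalBasedId);
       // A reader that trusts the declared ids will not find id 3 among them.
     }
   }
   ```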



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-29 Thread GitBox


liyafan82 commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-651492666


   In addition to the problem of the top-level validity buffer, I think there is 
another problem to discuss: Java uses the ordinal of the minor type as the 
type id, and this is not aligned with the C++ implementation/specification.
   
   This PR works around the problem by supporting two modes. However, the C++ 
and Java implementations are not really aligned, IMO. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zeevm opened a new pull request #7586: Calculate page and column statistics

2020-06-29 Thread GitBox


zeevm opened a new pull request #7586:
URL: https://github.com/apache/arrow/pull/7586


   1. Calculate page and column statistics.
   2. Use pre-calculated statistics when available to speed up writing 
data from other formats such as ORC.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651484901


   OK, this isn't necessarily pretty, but I think it's done, or at least done enough 
for here. I'll add some more tests, probably some docs for the format, and poke 
around a bit more while doing the tickets for `haven::labelled` and `POSIXlt` 
types. @romainfrancois, if you can glance at the changes I made and confirm that 
I haven't butchered things too badly, that would be great, thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson edited a comment on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042


   I'm taking this over. Outstanding TODOs:
   
   - [x] Add tests
   - [x] Support record batches
   - [x] Support nested types (requires adapting the data structure and adding 
recursion)
   - [x] Test the print method
   - [x] Test/handle bad data in metadata$r; allow users to edit it manually?
   - [x] Don't keep attributes (e.g. factor levels) that we explicitly 
translate to Arrow already
   
   Top-level dataset metadata is deferred to 
https://issues.apache.org/jira/browse/ARROW-9271.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


liyafan82 commented on pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#issuecomment-651474250


   @rymurr Thanks for your work. Just a few typos; 
   once those are fixed, I think it will be ready to merge. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r447363600



##
File path: java/memory/src/main/java/org/apache/arrow/memory/ArrowBuf.java
##
@@ -227,13 +207,25 @@ public ArrowBuf slice(long index, long length) {
 return newBuf;
   }
 
+  /**
+   * Make a nio byte buffer from this arrowbuf.
+   */
   public ByteBuffer nioBuffer() {
-return asNettyBuffer().nioBuffer();
+return nioBuffer(readerIndex, checkedCastToInt(readableBytes()));
   }
 
+
+  /**
+   *  Make an nio byte buffer from this ArrowBuf.

Review comment:
   nit: an -> a





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r447363481



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -17,33 +17,107 @@
 
 package org.apache.arrow.memory.rounding;
 
-import java.lang.reflect.Field;
-
-import org.apache.arrow.memory.NettyAllocationManager;
 import org.apache.arrow.memory.util.CommonUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import io.netty.util.internal.SystemPropertyUtil;
 
 /**
  * The default rounding policy. That is, if the requested size is within the 
chunk size,
  * the rounded size will be the next power of two. Otherwise, the rounded size
  * will be identical to the requested size.
  */
 public class DefaultRoundingPolicy implements RoundingPolicy {
-
+  private static final Logger logger = 
LoggerFactory.getLogger(DefaultRoundingPolicy.class);
   public final long chunkSize;
 
   /**
-   * The singleton instance.
+   * The variables here and the static block calculates teh DEFAULT_CHUNK_SIZE.
+   *
+   * 
+   *   It was copied from {@link io.netty.buffer.PooledByteBufAllocator}.
+   * 
*/
-  public static final DefaultRoundingPolicy INSTANCE = new 
DefaultRoundingPolicy();
+  private static final int MIN_PAGE_SIZE = 4096;
+  private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 
1) / 2);
+  private static final long DEFAULT_CHUNK_SIZE;
+  private static final int DEFAULT_PAGE_SIZE;
+  private static final int DEFAULT_MAX_ORDER;
 
-  private DefaultRoundingPolicy() {
+
+  static {
+int defaultPageSize = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192);
+Throwable pageSizeFallbackCause = null;
 try {
-  Field field = 
NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE");
-  field.setAccessible(true);
-  chunkSize = (Long) field.get(null);
-} catch (Exception e) {
-  throw new RuntimeException("Failed to get chunk size from allocation 
manager");
+  validateAndCalculatePageShifts(defaultPageSize);
+} catch (Throwable t) {
+  pageSizeFallbackCause = t;
+  defaultPageSize = 8192;
+}
+DEFAULT_PAGE_SIZE = defaultPageSize;
+
+int defaultMaxOrder = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11);
+Throwable maxOrderFallbackCause = null;
+try {
+  validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, defaultMaxOrder);
+} catch (Throwable t) {
+  maxOrderFallbackCause = t;
+  defaultMaxOrder = 11;
+}
+DEFAULT_MAX_ORDER = defaultMaxOrder;
+DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, 
DEFAULT_MAX_ORDER);
+if (logger.isDebugEnabled()) {
+  if (pageSizeFallbackCause == null) {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
DEFAULT_PAGE_SIZE);
+  } else {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
DEFAULT_PAGE_SIZE, pageSizeFallbackCause);
+  }
+  if (maxOrderFallbackCause == null) {
+logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", 
DEFAULT_MAX_ORDER);
+  } else {
+logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", 
DEFAULT_MAX_ORDER, maxOrderFallbackCause);
+  }

Review comment:
   Also, here we only need one parameter in the `else` branch.
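   
   A small self-contained sketch of the single-parameter call being suggested 
(SLF4J parameterized logging; the property name mirrors the diff above, but the 
snippet is illustrative only):
   
   ```java
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;
   
   public class LoggingSketch {
     private static final Logger logger = LoggerFactory.getLogger(LoggingSketch.class);
   
     public static void main(String[] args) {
       Throwable maxOrderFallbackCause = new IllegalArgumentException("bad maxOrder");
       // Only one argument is passed after the format string, per the suggestion.
       logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", maxOrderFallbackCause);
     }
   }
   ```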





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory

2020-06-29 Thread GitBox


liyafan82 commented on a change in pull request #7347:
URL: https://github.com/apache/arrow/pull/7347#discussion_r447363293



##
File path: 
java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java
##
@@ -17,33 +17,103 @@
 
 package org.apache.arrow.memory.rounding;
 
-import java.lang.reflect.Field;
-
-import org.apache.arrow.memory.NettyAllocationManager;
 import org.apache.arrow.memory.util.CommonUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import io.netty.util.internal.SystemPropertyUtil;
 
 /**
  * The default rounding policy. That is, if the requested size is within the 
chunk size,
  * the rounded size will be the next power of two. Otherwise, the rounded size
  * will be identical to the requested size.
  */
 public class DefaultRoundingPolicy implements RoundingPolicy {
-
+  private static final Logger logger = 
LoggerFactory.getLogger(DefaultRoundingPolicy.class);
   public final long chunkSize;
 
   /**
-   * The singleton instance.
+   * The variables here and the static block calculates the DEFAULT_CHUNK_SIZE.
+   *
+   * 
+   *   It was copied from {@link io.netty.buffer.PooledByteBufAllocator}.
+   * 
*/
-  public static final DefaultRoundingPolicy INSTANCE = new 
DefaultRoundingPolicy();
+  private static final int MIN_PAGE_SIZE = 4096;
+  private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 
1) / 2);
+  private static final long DEFAULT_CHUNK_SIZE;
 
-  private DefaultRoundingPolicy() {
+
+  static {
+int defaultPageSize = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192);
+Throwable pageSizeFallbackCause = null;
 try {
-  Field field = 
NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE");
-  field.setAccessible(true);
-  chunkSize = (Long) field.get(null);
-} catch (Exception e) {
-  throw new RuntimeException("Failed to get chunk size from allocation 
manager");
+  validateAndCalculatePageShifts(defaultPageSize);
+} catch (Throwable t) {
+  pageSizeFallbackCause = t;
+  defaultPageSize = 8192;
+}
+
+int defaultMaxOrder = 
SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11);
+Throwable maxOrderFallbackCause = null;
+try {
+  validateAndCalculateChunkSize(defaultPageSize, defaultMaxOrder);
+} catch (Throwable t) {
+  maxOrderFallbackCause = t;
+  defaultMaxOrder = 11;
+}
+DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(defaultPageSize, 
defaultMaxOrder);
+if (logger.isDebugEnabled()) {
+  if (pageSizeFallbackCause == null) {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
defaultPageSize);
+  } else {
+logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
defaultPageSize, pageSizeFallbackCause);
+  }

Review comment:
   I think we only need one parameter here, so it should be:
   
   logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", 
pageSizeFallbackCause);





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651467252


   Revision: 821f30a834dab99cdc757100e51986384f0a391c
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-367](https://github.com/ursa-labs/crossbow/branches/all?query=actions-367)
   
   |Task|CI|
   |----|--|
   |centos-6-amd64|GitHub Actions|
   |centos-7-aarch64|TravisCI|
   |centos-7-amd64|GitHub Actions|
   |centos-8-aarch64|TravisCI|
   |centos-8-amd64|GitHub Actions|
   |conda-clean|Azure|
   |conda-linux-gcc-py36-cpu|Azure|
   |conda-linux-gcc-py36-cuda|Azure|
   |conda-linux-gcc-py37-cpu|Azure|
   |conda-linux-gcc-py37-cuda|Azure|
   |conda-linux-gcc-py38-cpu|Azure|
   |conda-linux-gcc-py38-cuda|Azure|
   |conda-osx-clang-py36|Azure|
   |conda-osx-clang-py37|Azure|
   |conda-osx-clang-py38|Azure|
   |conda-win-vs2015-py36|Azure|
   |conda-win-vs2015-py37|Azure|

[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651466752


   @github-actions crossbow submit -g linux -g wheel -g conda



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651466442


   Actually, that's crazy. I'm taking the same approach as ZSTD and adding a 
CMake toggle between shared and static Brotli (with the default being shared).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7585: ARROW-3520: [C++] Add "list_flatten" vector kernel wrapper for Flatten method of ListArray types

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7585:
URL: https://github.com/apache/arrow/pull/7585#issuecomment-651460741


   https://issues.apache.org/jira/browse/ARROW-3520



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


wesm commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651460507


   Hm not so fast. The macOS py35 failure seems legitimate
   
   https://travis-ci.org/github/ursa-labs/crossbow/builds/703242650#L10060



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files

2020-06-29 Thread GitBox


wesm closed pull request #7560:
URL: https://github.com/apache/arrow/pull/7560


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files

2020-06-29 Thread GitBox


wesm commented on pull request #7560:
URL: https://github.com/apache/arrow/pull/7560#issuecomment-651458622


   Yahtzee



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn edited a comment on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn edited a comment on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651458470


   @wesm OK. I'll continue working on benchmarking in this pull request. If I 
need more time to tune etc., I'll split it into a separate issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651458470


   OK. I'll continue working on benchmarking in this pull request. If I need 
more time to tune etc., I'll split it into a separate issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types

2020-06-29 Thread GitBox


wesm closed pull request #7569:
URL: https://github.com/apache/arrow/pull/7569


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types

2020-06-29 Thread GitBox


wesm commented on pull request #7569:
URL: https://github.com/apache/arrow/pull/7569#issuecomment-651457980


   +1. This is a bit dry, so I would rather reviewers reserve their time for other 
PRs.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


wesm edited a comment on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651457062


   @mrkn it's up to you, it's fine with me if you work on performance tuning 
(or at least measurement) in another PR or this one



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


wesm commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651457062


   @mrkn it's up to you, it's fine with me if you work on performance tuning in 
another PR or this one



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#issuecomment-651456329


   @wesm Is it better to work on benchmarking in another pull request?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson edited a comment on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042


   I'm taking this over. Outstanding TODOs:
   
   - [x] Add tests
   - [x] Support record batches
   - [x] Support nested types (requires adapting the data structure and adding 
recursion)
   - [x] Test the print method
   - [ ] Test/handle bad data in metadata$r; allow users to edit it manually?
   - [x] Don't keep attributes (e.g. factor levels) that we explicitly 
translate to Arrow already
   
   Top-level dataset metadata is deferred to 
https://issues.apache.org/jira/browse/ARROW-9271.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7585: ARROW-3520: [C++] WIP Add vector kernel wrapper for Flatten method of ListArray types

2020-06-29 Thread GitBox


wesm closed pull request #7585:
URL: https://github.com/apache/arrow/pull/7585


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm opened a new pull request #7585: ARROW-3520: [C++] WIP Add vector kernel wrapper for Flatten method of ListArray types

2020-06-29 Thread GitBox


wesm opened a new pull request #7585:
URL: https://github.com/apache/arrow/pull/7585


   I'm testing a JIRA webhook; I'll close this PR and then reopen it when the 
patch is done.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files

2020-06-29 Thread GitBox


wesm commented on pull request #7560:
URL: https://github.com/apache/arrow/pull/7560#issuecomment-651423838


   Will merge this if the build passes with the arrow-testing changes



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++

2020-06-29 Thread GitBox


zhztheplayer commented on pull request #7030:
URL: https://github.com/apache/arrow/pull/7030#issuecomment-651422684


   Thanks for the comments! I've got some things to deal with these days. I'll 
address them as soon as possible.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7584:
URL: https://github.com/apache/arrow/pull/7584#issuecomment-651404016


   https://issues.apache.org/jira/browse/ARROW-9272



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs opened a new pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion

2020-06-29 Thread GitBox


kszucs opened a new pull request #7584:
URL: https://github.com/apache/arrow/pull/7584


   The original motivation for this patch was to reuse the same conversion 
path for both scalars and arrays. 
   
   In my recent patch the scalars are converted from a single-element list to a 
single-element array and then copied out of it.
   
   In the long term we should convert them directly, perhaps with a more 
generic converter API; until then, this patch aims to reduce code complexity 
without introducing any regressions.
   
   I checked the produced binary size:
   
   ```console
   # BEFORE
   -rwxr-xr-x   1 kszucs  staff   2926832 Jun 29 23:07 
libarrow_python.100.0.0.dylib
   
   # AFTER
   -rwxr-xr-x   1 kszucs  staff   2869136 Jun 29 23:06 
libarrow_python.100.0.0.dylib
   ```
   
   ```
   convert_builtins.ConvertArrayToPyList.time_convert
   ==========  ========  ========
   type        before    after
   ----------  --------  --------
   int32       13.2±0ms  13.2±0ms
   uint32      13.4±0ms  14.8±0ms
   int64       16.4±0ms  13.7±0ms
   uint64      13.0±0ms  17.4±0ms
   float32     13.5±0ms  16.0±0ms
   float64     16.3±0ms  14.7±0ms
   bool        12.0±0ms  12.0±0ms
   decimal     61.1±0ms  65.0±0ms
   binary      19.1±0ms  18.2±0ms
   binary10    17.9±0ms  17.4±0ms
   ascii       26.0±0ms  25.9±0ms
   unicode     81.4±0ms  111±0ms
   int64 list  145±0ms   152±0ms
   struct      229±0ms   212±0ms
   ==========  ========  ========
   
   convert_builtins.ConvertPyListToArray.time_convert
   ==================  ========  ========
   type                before    after
   ------------------  --------  --------
   int32               129±0ms   135±0ms
   uint32              139±0ms   134±0ms
   int64               132±0ms   139±0ms
   uint64              135±0ms   132±0ms
   float32             149±0ms   131±0ms
   float64             143±0ms   141±0ms
   bool                126±0ms   130±0ms
   decimal             139±0ms   135±0ms
   binary              145±0ms   130±0ms
   binary10            136±0ms   130±0ms
   ascii               145±0ms   136±0ms
   unicode             148±0ms   149±0ms
   int64 list          174±0ms   176±0ms
   struct              164±0ms   177±0ms
   struct from tuples  165±0ms   169±0ms
   ==================  ========  ========
   
   convert_builtins.InferPyListToArray.time_infer
   ==========  ========  ========
   type        before    after
   ----------  --------  --------
   int64       157±0ms   158±0ms
   float64     155±0ms   153±0ms
   bool        155±0ms   154±0ms
   decimal     249±0ms   233±0ms
   binary      167±0ms   162±0ms
   ascii       181±0ms   154±0ms
   unicode     179±0ms   171±0ms
   int64 list  188±0ms   192±0ms
   struct      177±0ms   181±0ms
   ==========  ========  ========
   ```
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447268862



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, 
const std::vector
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
const std::shared_ptr& index_value_type,
MemoryPool* pool)
   : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template 
   Status Convert() {
-using c_index_value_type = typename IndexValueType::c_type;
-
RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max()));
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;
+const int value_elsize =
+checked_cast(*tensor_.type()).bit_width() / 
CHAR_BIT;

Review comment:
   Oops, I don't need to return `-1` because `checked_cast` is used in this 
function.  Sorry.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


mrkn commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447267538



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, 
const std::vector
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
const std::shared_ptr& index_value_type,
MemoryPool* pool)
   : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template 
   Status Convert() {
-using c_index_value_type = typename IndexValueType::c_type;
-
RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max()));
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;
+const int value_elsize =
+checked_cast(*tensor_.type()).bit_width() / 
CHAR_BIT;

Review comment:
   OK. I'll make `internal::GetByteWidth`, and let it return `-1` for 
non-`FixedWidthType` values.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


wesm commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447264419



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, 
const std::vector
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
const std::shared_ptr& index_value_type,
MemoryPool* pool)
   : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template 
   Status Convert() {
-using c_index_value_type = typename IndexValueType::c_type;
-
RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max()));
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;
+const int value_elsize =
+checked_cast(*tensor_.type()).bit_width() / 
CHAR_BIT;

Review comment:
   As long as the signature of the function is `int(const DataType&)`, it 
sounds good to me. I don't think it should be a non-internal API because it 
does not need to do error handling. If it's a non-static member of 
`FixedWidthType`, then a `checked_cast` is still necessary, which seems against 
the spirit of making the code simpler.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651369436


   > > The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9
   > 
   > What difference does it make? This is plain C.
   
   :shrug: Then I'll leave it to you to sort out, as this is beyond my 
knowledge. In the past, an undefined-symbols error plus a library compiled only 
for rtools-packages (gcc 8) meant you needed to get it built with 
rtools-backports too. Maybe something's off with the lib that was built; IDK if 
anyone has verified that it works.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


wesm commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651368104


   Indeed, toolchain incompatibilities only affect C++ code



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651366993


   > The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9
   
   What difference does it make? This is plain C.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on a change in pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


kou commented on a change in pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#discussion_r447247927



##
File path: cpp/src/arrow/config.h
##
@@ -0,0 +1,47 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include 
+
+#include "arrow/util/config.h"  // IWYU pragma: export
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+struct BuildInfo {
+  int major;
+  int minor;
+  int patch;

Review comment:
   Oh, sorry. One comment wasn't submitted...
   
   How about adding a `version_` prefix, since our `ARROW_VERSION_*` macros have a 
`VERSION_` prefix?
   
   ```suggestion
 int version_major;
 int version_minor;
 int version_patch;
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651355763


   > > This means there also needs to be a PKGBUILD
   > 
   > Why? `libutf8proc` is installed.
   
   The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9. Most 
of our deps have to be compiled for both, and this is apparently one of those. 
That's what https://github.com/r-windows/rtools-backports is for.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651353338


   > This means there also needs to be a PKGBUILD 
   
   Why? `libutf8proc` is installed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


pitrou commented on pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#issuecomment-651352568


   > It would be also nice to store the enabled features.
   
   Agreed, but that can be done in a separate PR.
   
   > How about adding int BuildInfo::version for ARROW_VERSION too?
   
   Good idea.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


pitrou commented on pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#issuecomment-651352858


   Also, I'll let others add `-DARROW_PACKAGE_KIND=...` in other places.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata

2020-06-29 Thread GitBox


wesm commented on pull request #7571:
URL: https://github.com/apache/arrow/pull/7571#issuecomment-651351599


   I'll close this for now. Please leave any review comments and I can address 
them later



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou merged pull request #7583: [Doc][C++] Follow docker-compose service name change for lint

2020-06-29 Thread GitBox


kou merged pull request #7583:
URL: https://github.com/apache/arrow/pull/7583


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata

2020-06-29 Thread GitBox


wesm closed pull request #7571:
URL: https://github.com/apache/arrow/pull/7571


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7576: ARROW-9263: [C++] Promote compute aggregate benchmark size to 1M.

2020-06-29 Thread GitBox


wesm closed pull request #7576:
URL: https://github.com/apache/arrow/pull/7576


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7576: ARROW-9263: [C++] Promote compute aggregate benchmark size to 1M.

2020-06-29 Thread GitBox


wesm commented on pull request #7576:
URL: https://github.com/apache/arrow/pull/7576#issuecomment-651350872


   +1



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7576: ARROW-9263: [C++] Promote compute aggregate benchmark size to 1M.

2020-06-29 Thread GitBox


wesm commented on pull request #7576:
URL: https://github.com/apache/arrow/pull/7576#issuecomment-651350708


   I think both the 1/1000 and 1/1 cases have something interesting to show 
perf wise, but in any case using 1M as the length in this benchmark seems OK. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson edited a comment on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042


   I'm taking this over. Outstanding TODOs:
   
   - [x] Add tests
   - [x] Support record batches
   - [x] Support nested types (requires adapting the data structure and adding 
recursion)
   - [ ] Test the print method
   - [ ] Test/handle bad data in metadata$r; allow users to edit it manually?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651342350


   > @xhochy Could you help on the utf8proc issue on RTools 3.5?
   > See here: 
https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169
   
   This means there also needs to be a PKGBUILD submitted to 
`r-windows/rtools-backports` for the old toolchain.
   
   > 
   > It seems that `UTF8PROC_STATIC` would need to be defined when building 
Arrow. But it's not set by `Findutf8proc.cmake`.
   > Also, `libutf8proc.pc.in` added in 
[r-windows/rtools-packages#124](https://github.com/r-windows/rtools-packages/pull/124)
 doesn't set it either.
   
   Just a reminder that nothing in the R bindings touches these new functions, 
so turning off utf8proc in the C++ build is also an option for now.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


kszucs commented on pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#issuecomment-651339349


   It would be also nice to store the enabled features.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651338264


   @xhochy Could you help on the utf8proc issue on RTools 3.5?
   See here: 
https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169
   
   It seems that `UTF8PROC_STATIC` would need to be defined when building 
Arrow. But it's not set by `Findutf8proc.cmake`.
   Also, `libutf8proc.pc.in` added in 
https://github.com/r-windows/rtools-packages/pull/124 doesn't set it either.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651322087


   I just concluded the same :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651316656


   I pushed a commit that raises an error on invalid UTF8. It does not seem to 
make the benchmarks slower.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651289874


   @pitrou  your size commit made the benchmark go from `52->60 M/s`  
   
   > Yes, too. The main point of this state-machine-based decoder is that it's 
branchless, and so it will perform roughly as well on non-Ascii data with 
unpredictable branching. On pure Ascii data, a branch-based decoder may be 
faster since the branches will always be predicted right.
   
   Yes, it would be interesting to see how the two methods deal with a 25/25/25/25% 
mix of 1-, 2-, 3- and 4-byte encoded codepoints, versus, say, a few % non-ASCII.
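   
   Not from this PR — just a minimal, self-contained sketch of how such a mixed 
benchmark input could be generated (the helper name and the exact codepoint ranges 
are made up for illustration; the real benchmark setup may differ):
   
   ```cpp
   // Hypothetical helper: builds a UTF-8 string whose codepoints are drawn
   // ~25/25/25/25% from the 1-, 2-, 3- and 4-byte encoded ranges.
   #include <cstdint>
   #include <random>
   #include <string>
   #include <utility>
   
   std::string MakeMixedUtf8(size_t ncodepoints, uint32_t seed = 42) {
     const std::pair<uint32_t, uint32_t> ranges[4] = {
         {0x20, 0x7E},        // 1 byte: printable ASCII
         {0xC0, 0x7FF},       // 2 bytes: Latin-1 supplement and friends
         {0x800, 0xFFFF},     // 3 bytes: rest of the BMP (surrogates replaced below)
         {0x10000, 0x10FFFF}  // 4 bytes: supplementary planes
     };
     std::mt19937 rng(seed);
     std::string out;
     for (size_t i = 0; i < ncodepoints; ++i) {
       const auto& r = ranges[i % 4];  // alternate, so ~25% of each encoded length
       uint32_t cp = std::uniform_int_distribution<uint32_t>(r.first, r.second)(rng);
       if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xE000;  // avoid the surrogate range
       // Standard UTF-8 encoding of a single codepoint.
       if (cp < 0x80) {
         out += static_cast<char>(cp);
       } else if (cp < 0x800) {
         out += static_cast<char>(0xC0 | (cp >> 6));
         out += static_cast<char>(0x80 | (cp & 0x3F));
       } else if (cp < 0x10000) {
         out += static_cast<char>(0xE0 | (cp >> 12));
         out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
         out += static_cast<char>(0x80 | (cp & 0x3F));
       } else {
         out += static_cast<char>(0xF0 | (cp >> 18));
         out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
         out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
         out += static_cast<char>(0x80 | (cp & 0x3F));
       }
     }
     return out;
   }
   ```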



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r447171303



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -133,23 +134,23 @@ struct Utf8Transform {
 output_string_offsets[i + 1] = output_ncodeunits;
   }
 
-  // trim the codepoint buffer, since we allocated too much
+  // Trim the codepoint buffer, since we allocated too much
   KERNEL_RETURN_IF_ERROR(
-  ctx,
-  output->buffers[2]->CopySlice(0, 
output_ncodeunits).Value(&output->buffers[2]));
+  ctx, values_buffer->Resize(output_ncodeunits, 
/*shrink_to_fit=*/true));

Review comment:
   :-)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r447170380



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -133,23 +134,23 @@ struct Utf8Transform {
 output_string_offsets[i + 1] = output_ncodeunits;
   }
 
-  // trim the codepoint buffer, since we allocated too much
+  // Trim the codepoint buffer, since we allocated too much
   KERNEL_RETURN_IF_ERROR(
-  ctx,
-  output->buffers[2]->CopySlice(0, 
output_ncodeunits).Value(&output->buffers[2]));
+  ctx, values_buffer->Resize(output_ncodeunits, 
/*shrink_to_fit=*/true));

Review comment:
   Nice way to make code more readable.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou edited a comment on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou edited a comment on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651282959


   Main point remaining is whether we raise an error on invalid UTF8 input. I 
see no reason not to (an Arrow string array has to be valid UTF8 as per the 
spec, just like a Python unicode string cannot contain characters outside of 
the unicode code range).
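   
   For reference, a minimal sketch of what that check could look like, assuming the 
existing helpers `arrow::util::InitializeUTF8()` and `arrow::util::ValidateUTF8()` 
from `arrow/util/utf8.h` (names to be confirmed against the actual header — this is 
not code from the PR):
   
   ```cpp
   // Sketch only: reject invalid UTF-8 input up front.
   #include <cstdint>
   
   #include "arrow/status.h"
   #include "arrow/util/utf8.h"
   
   arrow::Status CheckInputIsUtf8(const uint8_t* data, int64_t nbytes) {
     // Fills the internal decoding tables; safe to call more than once.
     arrow::util::InitializeUTF8();
     if (!arrow::util::ValidateUTF8(data, nbytes)) {
       return arrow::Status::Invalid("Invalid UTF8 sequence in input");
     }
     return arrow::Status::OK();
   }
   ```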



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651282959


   Main point remaining is whether we raise an error on invalid UTF8 input. I 
see no reason not to (an Arrow string array has to be valid UTF8 as per the 
spec, just like a Python unicode string cannot contain characters outside of 
the unicode code range).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651282415


   > Having a benchmark run on non-ASCII codepoints (I think we want to do this 
separately from this PR, but it's an important point).
   
   Yes, I think we can defer that to a separate PR.
   
   > The existing decoder based on 
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. Very interesting 
work, but unfortunately led to a performance regression (~50->30 M/s), which 
I'm surprised about actually. Maybe worth comparing again when we have a 
benchmark with non-ascii codepoints.
   
   Yes, too. The main point of this state-machine-based decoder is that it's 
branchless, and so it will perform roughly as well on non-Ascii data with 
unpredictable branching. On pure Ascii data, a branch-based decoder may be 
faster since the branches will always be predicted right.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r447161925



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -39,6 +158,121 @@ struct AsciiLength {
   }
 };
 
+template <typename Type, template <typename> class Derived>
+struct Utf8Transform {
+  using offset_type = typename Type::offset_type;
+  using DerivedClass = Derived<Type>;
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+
+  static offset_type Transform(const uint8_t* input, offset_type 
input_string_ncodeunits,
+   uint8_t* output) {
+uint8_t* dest = output;
+utf8_transform(input, input + input_string_ncodeunits, dest,
+   DerivedClass::TransformCodepoint);
+return (offset_type)(dest - output);
+  }
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+if (batch[0].kind() == Datum::ARRAY) {
+  std::call_once(flag_case_luts, []() {
+lut_upper_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+lut_lower_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
+for (int i = 0; i <= MAX_CODEPOINT_LUT; i++) {
+  lut_upper_codepoint.push_back(utf8proc_toupper(i));
+  lut_lower_codepoint.push_back(utf8proc_tolower(i));
+}
+  });
+  const ArrayData& input = *batch[0].array();
+  ArrayType input_boxed(batch[0].array());
+  ArrayData* output = out->mutable_array();
+
+  offset_type const* input_string_offsets = 
input.GetValues<offset_type>(1);
+  utf8proc_uint8_t const* input_str =
+  input.buffers[2]->data() + input_boxed.value_offset(0);
+  offset_type input_ncodeunits = input_boxed.total_values_length();
+  offset_type input_nstrings = (offset_type)input.length;
+
+  // Section 5.18 of the Unicode spec claim that the number of codepoints 
for case
+  // mapping can grow by a factor of 3. This means grow by a factor of 3 
in bytes
+  // However, since we don't support all casings (SpecialCasing.txt) the 
growth
+  // is actually only at max 3/2 (as covered by the unittest).
+  // Note that rounding down the 3/2 is ok, since only codepoints encoded 
by
+  // two code units (even) can grow to 3 code units.
+
+  int64_t output_ncodeunits_max = ((int64_t)input_ncodeunits) * 3 / 2;
+  if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
+ctx->SetStatus(Status::CapacityError(
+"Result might not fit in a 32bit utf8 array, convert to 
large_utf8"));
+return;
+  }
+
+  KERNEL_RETURN_IF_ERROR(
+  ctx, 
ctx->Allocate(output_ncodeunits_max).Value(&output->buffers[2]));
+  // We could reuse the buffer if it is all ascii, benchmarking showed 
this not to
+  // matter
+  // output->buffers[1] = input.buffers[1];
+  KERNEL_RETURN_IF_ERROR(ctx,
+ ctx->Allocate((input_nstrings + 1) * 
sizeof(offset_type))
+ .Value(&output->buffers[1]));
+  utf8proc_uint8_t* output_str = output->buffers[2]->mutable_data();
+  offset_type* output_string_offsets = 
output->GetMutableValues<offset_type>(1);
+  offset_type output_ncodeunits = 0;
+
+  offset_type output_string_offset = 0;
+  *output_string_offsets = output_string_offset;
+  offset_type input_string_first_offset = input_string_offsets[0];
+  for (int64_t i = 0; i < input_nstrings; i++) {
+offset_type input_string_offset =
+input_string_offsets[i] - input_string_first_offset;
+offset_type input_string_end =
+input_string_offsets[i + 1] - input_string_first_offset;
+offset_type input_string_ncodeunits = input_string_end - 
input_string_offset;
+offset_type encoded_nbytes = DerivedClass::Transform(
+input_str + input_string_offset, input_string_ncodeunits,
+output_str + output_ncodeunits);
+output_ncodeunits += encoded_nbytes;
+output_string_offsets[i + 1] = output_ncodeunits;
+  }
+
+  // trim the codepoint buffer, since we allocated too much
+  KERNEL_RETURN_IF_ERROR(
+  ctx,
+  output->buffers[2]->CopySlice(0, 
output_ncodeunits).Value(&output->buffers[2]));

Review comment:
   Ok, I just pushed a change which resizes the buffer (hopefully 
in-place). This makes the benchmarks a bit faster.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r447161391



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -30,6 +31,124 @@ namespace internal {
 
 namespace {
 
+// lookup tables
+std::vector lut_upper_codepoint;
+std::vector lut_lower_codepoint;
+std::once_flag flag_case_luts;
+
+constexpr uint32_t REPLACEMENT_CHAR =
+'?';  // the proper replacement char would be the 0xFFFD codepoint, but 
that can
+  // increase string length by a factor of 3
+constexpr int MAX_CODEPOINT_LUT = 0x;  // up to this codepoint is in a 
lookup table
+
+static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {
+  if (codepoint < 0x80) {
+*str++ = codepoint;
+  } else if (codepoint < 0x800) {
+*str++ = 0xC0 + (codepoint >> 6);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x1) {
+*str++ = 0xE0 + (codepoint >> 12);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x20) {
+*str++ = 0xF0 + (codepoint >> 18);
+*str++ = 0x80 + ((codepoint >> 12) & 0x3F);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else {
+*str++ = codepoint;
+  }
+}
+
+static inline bool utf8_is_continuation(const uint8_t codeunit) {
+  return (codeunit & 0xC0) == 0x80;  // upper two bits should be 10
+}
+
+static inline uint32_t utf8_decode(const uint8_t*& str, int64_t& length) {
+  if (*str < 0x80) {  //
+length -= 1;
+return *str++;
+  } else if (*str < 0xC0) {  // invalid non-ascii char
+length -= 1;
+str++;
+return REPLACEMENT_CHAR;

Review comment:
   There's no reason a priori to be forgiving on invalid data. If there's a 
use case to be tolerant, we may add an option. But by default we should error 
out on invalid input, IMHO.
   
   cc @wesm  for opinions





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r447161391



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -30,6 +31,124 @@ namespace internal {
 
 namespace {
 
+// lookup tables
+std::vector lut_upper_codepoint;
+std::vector lut_lower_codepoint;
+std::once_flag flag_case_luts;
+
+constexpr uint32_t REPLACEMENT_CHAR =
+'?';  // the proper replacement char would be the 0xFFFD codepoint, but 
that can
+  // increase string length by a factor of 3
+constexpr int MAX_CODEPOINT_LUT = 0x;  // up to this codepoint is in a 
lookup table
+
+static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {
+  if (codepoint < 0x80) {
+*str++ = codepoint;
+  } else if (codepoint < 0x800) {
+*str++ = 0xC0 + (codepoint >> 6);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x1) {
+*str++ = 0xE0 + (codepoint >> 12);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else if (codepoint < 0x20) {
+*str++ = 0xF0 + (codepoint >> 18);
+*str++ = 0x80 + ((codepoint >> 12) & 0x3F);
+*str++ = 0x80 + ((codepoint >> 6) & 0x3F);
+*str++ = 0x80 + (codepoint & 0x3F);
+  } else {
+*str++ = codepoint;
+  }
+}
+
+static inline bool utf8_is_continuation(const uint8_t codeunit) {
+  return (codeunit & 0xC0) == 0x80;  // upper two bits should be 10
+}
+
+static inline uint32_t utf8_decode(const uint8_t*& str, int64_t& length) {
+  if (*str < 0x80) {  //
+length -= 1;
+return *str++;
+  } else if (*str < 0xC0) {  // invalid non-ascii char
+length -= 1;
+str++;
+return REPLACEMENT_CHAR;

Review comment:
   There's no reason a priori to be forgiving on invalid data. If there's a 
use case to be tolerant, we may add an option. But by default we should error 
out on invalid input.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sbinet closed pull request #7483: ARROW-9174: [Go] Fix table panic on 386

2020-06-29 Thread GitBox


sbinet closed pull request #7483:
URL: https://github.com/apache/arrow/pull/7483


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r447154836



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -15,13 +15,15 @@
 // specific language governing permissions and limitations
 // under the License.
 
+#include 

Review comment:
   Np, I'd rather do it correctly.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r447155149



##
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##
@@ -39,6 +73,103 @@ struct AsciiLength {
   }
 };
 
+template <typename Type, typename Derived>
+struct Utf8Transform {
+  using offset_type = typename Type::offset_type;
+  using ArrayType = typename TypeTraits<Type>::ArrayType;
+
+  static offset_type Transform(const uint8_t* input, offset_type 
input_string_ncodeunits,
+   uint8_t* output) {
+uint8_t* output_end = arrow::util::Utf8Transform(
+input, input + input_string_ncodeunits, output, 
Derived::TransformCodepoint);
+return static_cast<offset_type>(output_end - output);
+  }
+
+  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
+if (batch[0].kind() == Datum::ARRAY) {
+  EnsureLookupTablesFilled();
+  const ArrayData& input = *batch[0].array();
+  ArrayType input_boxed(batch[0].array());
+  ArrayData* output = out->mutable_array();
+
+  offset_type input_ncodeunits = input_boxed.total_values_length();
+  offset_type input_nstrings = static_cast<offset_type>(input.length);
+
+  // Section 5.18 of the Unicode spec claim that the number of codepoints 
for case
+  // mapping can grow by a factor of 3. This means grow by a factor of 3 
in bytes
+  // However, since we don't support all casings (SpecialCasing.txt) the 
growth
+  // is actually only at max 3/2 (as covered by the unittest).
+  // Note that rounding down the 3/2 is ok, since only codepoints encoded 
by
+  // two code units (even) can grow to 3 code units.
+
+  int64_t output_ncodeunits_max = ((int64_t)input_ncodeunits) * 3 / 2;

Review comment:
   Old habit, doing my best :)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#discussion_r447143530



##
File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc
##
@@ -81,5 +147,40 @@ TYPED_TEST(TestStringKernels, 
StrptimeDoesNotProvideDefaultOptions) {
   ASSERT_RAISES(Invalid, CallFunction("strptime", {input}));
 }
 
+TEST(TestStringKernels, UnicodeLibraryAssumptions) {
+  uint8_t output[4];
+  for (utf8proc_int32_t codepoint = 0x100; codepoint < 0x11; codepoint++) {
+utf8proc_ssize_t encoded_nbytes = utf8proc_encode_char(codepoint, output);
+utf8proc_int32_t codepoint_upper = utf8proc_toupper(codepoint);
+utf8proc_ssize_t encoded_nbytes_upper = 
utf8proc_encode_char(codepoint_upper, output);
+if (encoded_nbytes == 2) {
+  EXPECT_LE(encoded_nbytes_upper, 3)
+  << "Expected the upper case codepoint for a 2 byte encoded codepoint 
to be "
+ "encoded in maximum 3 bytes, not "
+  << encoded_nbytes_upper;
+}
+if (encoded_nbytes == 3) {

Review comment:
   Ok, thank you.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7524:
URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042


   I'm taking this over. Outstanding TODOs:
   
   - [x] Add tests
   - [x] Support record batches
   - [ ] Support nested types (requires adapting the data structure and adding 
recursion)
   - [ ] Test the print method
   - [ ] Test/handle bad data in metadata$r; allow users to edit it manually?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou closed pull request #7559: ARROW-9247: [Python] Expose total_values_length functions on BinaryArray, LargeBinaryArray

2020-06-29 Thread GitBox


pitrou closed pull request #7559:
URL: https://github.com/apache/arrow/pull/7559


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651261802


   Revision: 989cd4023a59159b44f69a6d5f530acc815a2407
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-366](https://github.com/ursa-labs/crossbow/branches/all?query=actions-366)
   
   |Task|Status|
   |----|------|
   
|wheel-manylinux1-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux1-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux1-cp35m)|
   
|wheel-manylinux1-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux1-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux1-cp36m)|
   
|wheel-manylinux1-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux1-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux1-cp37m)|
   
|wheel-manylinux1-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux1-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux1-cp38)|
   
|wheel-manylinux2010-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2010-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2010-cp35m)|
   
|wheel-manylinux2010-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2010-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2010-cp36m)|
   
|wheel-manylinux2010-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2010-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2010-cp37m)|
   
|wheel-manylinux2010-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2010-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2010-cp38)|
   
|wheel-manylinux2014-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2014-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2014-cp35m)|
   
|wheel-manylinux2014-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2014-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2014-cp36m)|
   
|wheel-manylinux2014-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2014-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2014-cp37m)|
   
|wheel-manylinux2014-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2014-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2014-cp38)|
   
|wheel-osx-cp35m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-366-travis-wheel-osx-cp35m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp36m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-366-travis-wheel-osx-cp36m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp37m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-366-travis-wheel-osx-cp37m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-osx-cp38|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-366-travis-wheel-osx-cp38.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
   
|wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-366-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   
|wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-366-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
   

[GitHub] [arrow] github-actions[bot] commented on pull request #7583: [Doc][C++] docker compose lint -> ubuntu-link

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7583:
URL: https://github.com/apache/arrow/pull/7583#issuecomment-651261350


   
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651261388


   The R Windows builds will fail until either utf8proc is not required by 
default (https://issues.apache.org/jira/browse/ARROW-9220) or libutf8proc 
is added as a dependency to 
https://github.com/apache/arrow/blob/master/ci/scripts/PKGBUILD (@xhochy has 
done the necessary work to get it built for Rtools in 
https://github.com/r-windows/rtools-packages/pull/124). 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels opened a new pull request #7583: [Doc][C++] docker compose lint -> ubuntu-link

2020-06-29 Thread GitBox


maartenbreddels opened a new pull request #7583:
URL: https://github.com/apache/arrow/pull/7583


   I guess the name changed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651257793


   We still have 2 failures, one might need a restart (travis / no output), the 
other is still a linker error:
   ```
   
C:/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe:
 
../windows/arrow-0.17.1.9000/lib/i386/libarrow.a(scalar_string.cc.obj):(.text+0x45df):
 undefined reference to `utf8proc_tolower'
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files

2020-06-29 Thread GitBox


pitrou commented on a change in pull request #7560:
URL: https://github.com/apache/arrow/pull/7560#discussion_r447134548



##
File path: ci/scripts/integration_arrow.sh
##
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+set -ex
+
+arrow_dir=${1}
+source_dir=${1}/cpp
+build_dir=${2}/cpp
+gold_dir_0_14_1=${1}/testing/data/arrow-ipc-stream/integration/0.14.1
+
+pip install -e /arrow/dev/archery

Review comment:
   Use `$source_dir` here?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


pitrou commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651256468


   @github-actions crossbow submit -g wheel



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kylebrandt commented on pull request #7483: ARROW-9174: [Go] Fix table panic on 386

2020-06-29 Thread GitBox


kylebrandt commented on pull request #7483:
URL: https://github.com/apache/arrow/pull/7483#issuecomment-651255643


   Hi @sbinet, I'm new to contributing here (and I see your name all over the Go 
code :-) ). Is there anything I need to do on my end for this to get merged?
   
   Thank you for all your work on this project.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


pitrou commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651252764


   Amusingly, even a minimal Debian or Ubuntu Docker image has `liblz4` and 
`liblzma`. I think we can simply change the script not to remove the zlib.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou removed a comment on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


pitrou removed a comment on pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#issuecomment-651252311


   Amusingly, even a minimal Debian or Ubuntu Docker image has `liblz4` and 
`liblzma`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


pitrou commented on pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#issuecomment-651252311


   Amusingly, even a minimal Debian or Ubuntu Docker image has `liblz4` and 
`liblzma`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7582: ARROW-8190: [FlightRPC][C++] Expose IPC options

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7582:
URL: https://github.com/apache/arrow/pull/7582#issuecomment-651251968


   https://issues.apache.org/jira/browse/ARROW-8190



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#issuecomment-651248175


   https://issues.apache.org/jira/browse/ARROW-6521



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] lidavidm opened a new pull request #7582: ARROW-8190: [FlightRPC][C++] Expose IPC options

2020-06-29 Thread GitBox


lidavidm opened a new pull request #7582:
URL: https://github.com/apache/arrow/pull/7582


   - Python is not covered as I'm not sure how best to expose these structs to 
Python.
   - Java is not covered as it doesn't use IpcOption at all currently; I'd 
rather hold off and see how the metadata change is implemented before trying to 
bind it. (The legacy/modern metadata format doesn't come into play since Flight 
uses Protobuf and not an IPC message per se.)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7576: ARROW-9263: [C++] Promote compute aggregate benchmark size to 1M.

2020-06-29 Thread GitBox


pitrou commented on pull request #7576:
URL: https://github.com/apache/arrow/pull/7576#issuecomment-651243430


   Wouldn't it be more realistic to simply use 0.1% instead of 0.01%?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou opened a new pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


pitrou opened a new pull request #7581:
URL: https://github.com/apache/arrow/pull/7581


   Also add build options and preprocessor constants to represent
   git identification and package kind (e.g. "manylinux1").
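   
   For illustration, roughly the shape of API this could end up exposing — the 
header, struct and field names below are assumptions for the sake of the example, 
not necessarily what this PR implements:
   
   ```cpp
   // Hypothetical consumer of the runtime build info added by this PR.
   #include <iostream>
   
   #include "arrow/config.h"
   
   int main() {
     const auto& info = arrow::GetBuildInfo();  // assumed entry point
     std::cout << "Arrow " << info.version_string << " (git " << info.git_id
               << ", package kind '" << info.package_kind << "')" << std::endl;
     return 0;
   }
   ```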



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info

2020-06-29 Thread GitBox


pitrou commented on pull request #7581:
URL: https://github.com/apache/arrow/pull/7581#issuecomment-651242409


   @kou and @xhochy your advice would be welcome.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651188772


   IIUC we're ok:
   
   * Windows: no brotli: 
https://github.com/apache/arrow/blob/master/ci/scripts/PKGBUILD
   * macOS: no brotli: 
https://github.com/apache/arrow/blob/master/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb
   * Linux: maybe brotli, but it's all static build anyway: 
https://github.com/apache/arrow/blob/master/r/inst/build_arrow_static.sh



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7559: ARROW-9247: [Python] Expose total_values_length functions on BinaryArray, LargeBinaryArray

2020-06-29 Thread GitBox


wesm commented on pull request #7559:
URL: https://github.com/apache/arrow/pull/7559#issuecomment-651187351


   cc @pitrou or @jorisvandenbossche for review



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


wesm commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651186700


   If you're installing Brotli in any of the packaging setups, there may be a 
scenario where there is both the shared AND static library -- in that case 
there would be an issue. We need to see why the manylinux* builds have 
`libbrotli*.so` in them -- if for some reason Brotli's build is insubordinate 
and churlish we might have to add a flag to do a hard switch between 
shared/static like we do with ZSTD.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available

2020-06-29 Thread GitBox


nealrichardson commented on pull request #7556:
URL: https://github.com/apache/arrow/pull/7556#issuecomment-651182567


   @wesm how can I help/what should I look for?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux

2020-06-29 Thread GitBox


pitrou commented on pull request #7580:
URL: https://github.com/apache/arrow/pull/7580#issuecomment-651181336


   @xhochy In 
https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/manylinux-test.sh#L33
 . I'm sure there's a better way to do that (e.g. spawn a minimal Python docker 
image).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-29 Thread GitBox


wesm commented on a change in pull request #7539:
URL: https://github.com/apache/arrow/pull/7539#discussion_r447030010



##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, 
const std::vector
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
 
-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
const std::shared_ptr& index_value_type,
MemoryPool* pool)
   : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
 
-  template 
   Status Convert() {
-using c_index_value_type = typename IndexValueType::c_type;
-
RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max()));
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;
+const int value_elsize =
+checked_cast(*tensor_.type()).bit_width() / 
CHAR_BIT;

Review comment:
   Could be helper function, maybe we can put an `internal::GetByteWidth` 
function in `arrow/type.h` that does this? 

##
File path: cpp/src/arrow/tensor/csx_converter.cc
##
@@ -0,0 +1,245 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/tensor/converter.h"
+
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/buffer.h"
+#include "arrow/status.h"
+#include "arrow/type.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/visitor_inline.h"
+
+namespace arrow {
+
+class MemoryPool;
+
+namespace internal {
+namespace {
+
+// --
+// SparseTensorConverter for SparseCSRIndex
+
+class SparseCSXMatrixConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
+
+ public:
+  SparseCSXMatrixConverter(SparseMatrixCompressedAxis axis, const Tensor& 
tensor,
+   const std::shared_ptr& index_value_type,
+   MemoryPool* pool)
+  : axis_(axis), tensor_(tensor), index_value_type_(index_value_type), 
pool_(pool) {}
+
+  Status Convert() {
+RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, 
tensor_.shape()));
+
+const int index_elsize =
+checked_cast(*index_value_type_).bit_width() / 
CHAR_BIT;

Review comment:
   Can you factor this "get byte width" operation into a helper function? 

##
File path: cpp/src/arrow/tensor/csf_converter.cc
##
@@ -148,84 +161,130 @@ class SparseCSFTensorConverter {
 return Status::OK();
   }
 
-#define CALL_TYPE_SPECIFIC_CONVERT(TYPE_CLASS) \
-  case TYPE_CLASS##Type::type_id:  \
-return Convert();
-
-  Status Convert() {
-switch (index_value_type_->id()) {
-  ARROW_GENERATE_FOR_ALL_INTEGER_TYPES(CALL_TYPE_SPECIFIC_CONVERT);
-  // LCOV_EXCL_START: The following invalid causes program failure.
-  default:
-return Status::TypeError("Unsupported SparseTensor index value type");
-// LCOV_EXCL_STOP
-}
-  }
-
-#undef CALL_TYPE_SPECIFIC_CONVERT
-
   std::shared_ptr sparse_index;
   std::shared_ptr data;
 
  private:
-  const NumericTensorType& tensor_;
+  const Tensor& tensor_;
   const std::shared_ptr& index_value_type_;
   MemoryPool* pool_;
+};
 
-  template 
-  inline Status CheckMaximumValue(const c_value_type type_max) const {
-auto max_dimension =
-*std::max_element(tensor_.shape().begin(), tensor_.shape().end());
-if (static_cast(type_max) < max_dimension) {
-  // LCOV_EXCL_START: The following invalid causes program failure.
-  return Status::Invalid("The bit width of the index value type is too 
small");
-  // LCOV_EXCL_STOP
-}
-return Status::OK();
+class 

[GitHub] [arrow] rjzamora commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-29 Thread GitBox


rjzamora commented on pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#issuecomment-651177576


   Thanks for the great work here @bkietz !
   
   This is wonderful - Dask uses the min/max statistics to calculate 
`divisions`, so this functionality is definitely necessary.
   
   
*A note on other (less-critical, but useful) statistics*:
   Dask also uses the `"total_byte_size"` statistics (for the full row-group, 
not each column) to aggregate partitions before reading in any data.  There is 
also a plan to use the `"num-rows"` statistics when the user executes 
`len(ddf)` (to avoid loading any data).   **How difficult would it be to 
add/expose these additional row-group statistics?**  Again, this is much less 
of a "blocker" for initial integration with Dask, but these are things we will 
likely want to add eventually.  cc @jorisvandenbossche 
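   
   For context, both values are already present in the Parquet footer; here is a 
minimal sketch of reading them with the lower-level Parquet C++ metadata API, 
independent of the Dataset/RowGroupInfo API this PR touches (the file name is made 
up for illustration):
   
   ```cpp
   // Sketch only: per-row-group num_rows / total_byte_size straight from the footer.
   #include <iostream>
   
   #include "parquet/file_reader.h"
   #include "parquet/metadata.h"
   
   int main() {
     auto reader = parquet::ParquetFileReader::OpenFile("example.parquet");
     auto file_metadata = reader->metadata();
     for (int i = 0; i < file_metadata->num_row_groups(); ++i) {
       auto rg = file_metadata->RowGroup(i);
       std::cout << "row group " << i << ": " << rg->num_rows() << " rows, "
                 << rg->total_byte_size() << " uncompressed bytes" << std::endl;
     }
     return 0;
   }
   ```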



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs closed pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location

2020-06-29 Thread GitBox


kszucs closed pull request #7376:
URL: https://github.com/apache/arrow/pull/7376


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location

2020-06-29 Thread GitBox


kszucs commented on pull request #7376:
URL: https://github.com/apache/arrow/pull/7376#issuecomment-651170232


   Thanks @jba for the update. Merging it then.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-29 Thread GitBox


maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651165626


   @pitrou many thanks for the review. I've implemented all your suggestions 
except:
* Raising an error on invalid utf8 data (see comment)
* Having a benchmark run on non-ASCII codepoints (I think we want to do 
this separately from this PR, but it's an important point).
   
   Btw, I wasn't aware of the existing utf8 code in Arrow. The existing 
decoder based on http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. 
Very interesting work, but unfortunately led to a performance regression 
(~50->30 M/s), which I'm surprised about actually. Maybe worth comparing again 
when we have a benchmark with non-ascii codepoints.
   
   @wesm I hope this is ready to go  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



