[GitHub] [arrow] wesm commented on pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files
wesm commented on pull request #7560: URL: https://github.com/apache/arrow/pull/7560#issuecomment-651423838 Will merge this if the build passes with the arrow-testing changes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651456329 @wesm Would it be better to work on benchmarking in another pull request?
[GitHub] [arrow] github-actions[bot] commented on pull request #7585: ARROW-3520: [C++] Add "list_flatten" vector kernel wrapper for Flatten method of ListArray types
github-actions[bot] commented on pull request #7585: URL: https://github.com/apache/arrow/pull/7585#issuecomment-651460741 https://issues.apache.org/jira/browse/ARROW-3520
[GitHub] [arrow] wesm commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
wesm commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651460507 Hm, not so fast. The macOS py35 failure seems legitimate: https://travis-ci.org/github/ursa-labs/crossbow/builds/703242650#L10060
[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
wesm commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651466752 @github-actions crossbow submit -g linux -g wheel -g conda
[GitHub] [arrow] github-actions[bot] commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
github-actions[bot] commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651467252 Revision: 821f30a834dab99cdc757100e51986384f0a391c Submitted crossbow builds: [ursa-labs/crossbow @ actions-367](https://github.com/ursa-labs/crossbow/branches/all?query=actions-367)

|Task|Status|
|----|------|
|centos-6-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-6-amd64)|
|centos-7-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-367-travis-centos-7-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|centos-7-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-7-amd64)|
|centos-8-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-367-travis-centos-8-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|centos-8-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-8-amd64)|
|conda-clean|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-clean)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-clean)|
|conda-linux-gcc-py36-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py36-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-linux-gcc-py36-cpu)|
|conda-linux-gcc-py36-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py36-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-linux-gcc-py36-cuda)|
|conda-linux-gcc-py37-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py37-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-linux-gcc-py37-cpu)|
|conda-linux-gcc-py37-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py37-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-linux-gcc-py37-cuda)|
|conda-linux-gcc-py38-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py38-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-linux-gcc-py38-cpu)|
|conda-linux-gcc-py38-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py38-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-linux-gcc-py38-cuda)|
|conda-osx-clang-py36|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py36)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-osx-clang-py36)|
|conda-osx-clang-py37|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py37)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-osx-clang-py37)|
|conda-osx-clang-py38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-osx-clang-py38)|
|conda-win-vs2015-py36|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-win-vs2015-py36)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-win-vs2015-py36)|
|conda-win-vs2015-py37|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-win-vs2015-py37)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1&branchName=actions-367-azure-conda-win-vs2015-py37)|
[GitHub] [arrow] liyafan82 commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on pull request #7347: URL: https://github.com/apache/arrow/pull/7347#issuecomment-651474250 @rymurr Thanks for your work. Aside from a few typos, I think it's ready to merge.
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r447363600

## File path: java/memory/src/main/java/org/apache/arrow/memory/ArrowBuf.java

@@ -227,13 +207,25 @@
   public ArrowBuf slice(long index, long length) {
     return newBuf;
   }

+  /**
+   * Make a nio byte buffer from this arrowbuf.
+   */
   public ByteBuffer nioBuffer() {
-    return asNettyBuffer().nioBuffer();
+    return nioBuffer(readerIndex, checkedCastToInt(readableBytes()));
   }
+
+  /**
+   * Make an nio byte buffer from this ArrowBuf.

Review comment: nit: an -> a
[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson edited a comment on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042 I'm taking this over. Outstanding TODOs:

- [x] Add tests
- [x] Support record batches
- [x] Support nested types (requires adapting the data structure and adding recursion)
- [x] Test the print method
- [x] Test/handle bad data in metadata$r; allow users to edit it manually?
- [x] Don't keep attributes (e.g. factor levels) that we explicitly translate to Arrow already

Top-level dataset metadata is deferred to https://issues.apache.org/jira/browse/ARROW-9271.
[GitHub] [arrow] zeevm opened a new pull request #7586: Calculate page and column statistics
zeevm opened a new pull request #7586: URL: https://github.com/apache/arrow/pull/7586

1. Calculate page and column statistics.
2. Use pre-calculated statistics, when available, to speed up writing data from other formats like ORC.
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-651422684 Thanks for the comments! I have some things to deal with these days; will address them as soon as possible.
[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson edited a comment on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042 I'm taking this over. Outstanding TODOs:

- [x] Add tests
- [x] Support record batches
- [x] Support nested types (requires adapting the data structure and adding recursion)
- [x] Test the print method
- [ ] Test/handle bad data in metadata$r; allow users to edit it manually?
- [x] Don't keep attributes (e.g. factor levels) that we explicitly translate to Arrow already

Top-level dataset metadata is deferred to https://issues.apache.org/jira/browse/ARROW-9271.
[GitHub] [arrow] emkornfield commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types
emkornfield commented on a change in pull request #7535: URL: https://github.com/apache/arrow/pull/7535#discussion_r447396488

## File path: docs/source/format/Columnar.rst

@@ -688,11 +687,10 @@
 will have the following layout: ::

   ||---|
   | joemark| unspecified (padding) |

-Similar to structs, a particular child array may have a non-null slot
-even if the validity bitmap of the parent union array indicates the
-slot is null. Additionally, a child array may have a non-null slot
-even if the types array indicates that a slot contains a different
-type at the index.
+Note that a child array may have a non-null slot even if the types array
+indicates that a slot contains a different type at the index, so the

Review comment: I have a little trouble parsing this language. Could it be stated more as a positive?

> Only the slot in the array corresponding to the type index is considered. All "unselected" values are ignored and could be any semantically correct array value.
[GitHub] [arrow] wesm opened a new pull request #7585: ARROW-3520: [C++] WIP Add vector kernel wrapper for Flatten method of ListArray types
wesm opened a new pull request #7585: URL: https://github.com/apache/arrow/pull/7585 I'm testing a JIRA webhook; I'll close this PR and then reopen it when the patch is done.
[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
wesm commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651466442 Actually, that's crazy. I'm taking the same approach as ZSTD and adding a CMake toggle between shared and static Brotli (with the default being shared).
[GitHub] [arrow] nealrichardson commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson commented on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-651484901 Ok, this isn't necessarily pretty but I think it's done, or done enough for here. I'll add some more tests, probably some docs for the format, and poke around a bit more while doing the tickets for `haven::labelled` and `POSIXlt` types. @romainfrancois if you can glance at the changes I made and confirm that I haven't butchered things too badly, that would be great, thanks.
[GitHub] [arrow] emkornfield commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types
emkornfield commented on a change in pull request #7535: URL: https://github.com/apache/arrow/pull/7535#discussion_r447396655

## File path: docs/source/format/Columnar.rst

@@ -566,33 +572,28 @@
 having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]`` ::

 * Length: 4, Null count: 1
-* Validity bitmap buffer:
-
-  |Byte 0 (validity bitmap) | Bytes 1-63  |
-  |-------------------------|-------------|
-  |1101                     | 0 (padding) |
-
 * Types buffer:

   |Byte 0 | Byte 1      | Byte 2 | Byte 3 | Bytes 4-63  |
   |-------|-------------|--------|--------|-------------|
-  | 0     | unspecified | 0      | 1      | unspecified |
+  | 0     | 0           | 0      | 1      | unspecified |

 * Offset buffer:

   |Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
   |----------|-------------|------------|-------------|-------------|
-  | 0        | unspecified | 1          | 0           | unspecified |
+  | 0        | 1           | 2          | 0           | unspecified |

 * Children arrays:
   * Field-0 array (f: float):
     * Length: 2, nulls: 0

Review comment: should this be nulls: 1 now?
[GitHub] [arrow] wesm closed pull request #7585: ARROW-3520: [C++] WIP Add vector kernel wrapper for Flatten method of ListArray types
wesm closed pull request #7585: URL: https://github.com/apache/arrow/pull/7585
[GitHub] [arrow] wesm commented on pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types
wesm commented on pull request #7569: URL: https://github.com/apache/arrow/pull/7569#issuecomment-651457980 +1. This is a bit dry, so I'd rather reviewers reserve their time for other PRs.
[GitHub] [arrow] wesm closed pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types
wesm closed pull request #7569: URL: https://github.com/apache/arrow/pull/7569
[GitHub] [arrow] wesm edited a comment on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
wesm edited a comment on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651457062 @mrkn It's up to you; it's fine with me if you work on performance tuning (or at least measurement) in another PR or this one.
[GitHub] [arrow] wesm commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
wesm commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651457062 @mrkn It's up to you; it's fine with me if you work on performance tuning in another PR or this one.
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r447363293

## File path: java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java

@@ -17,33 +17,103 @@
 package org.apache.arrow.memory.rounding;

-import java.lang.reflect.Field;
-
-import org.apache.arrow.memory.NettyAllocationManager;
 import org.apache.arrow.memory.util.CommonUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import io.netty.util.internal.SystemPropertyUtil;

 /**
  * The default rounding policy. That is, if the requested size is within the chunk size,
  * the rounded size will be the next power of two. Otherwise, the rounded size
  * will be identical to the requested size.
  */
 public class DefaultRoundingPolicy implements RoundingPolicy {
-
+  private static final Logger logger = LoggerFactory.getLogger(DefaultRoundingPolicy.class);
   public final long chunkSize;

   /**
-   * The singleton instance.
+   * The variables here and the static block calculates the DEFAULT_CHUNK_SIZE.
+   *
+   * It was copied from {@link io.netty.buffer.PooledByteBufAllocator}.
+   */
-  public static final DefaultRoundingPolicy INSTANCE = new DefaultRoundingPolicy();
+  private static final int MIN_PAGE_SIZE = 4096;
+  private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 1) / 2);
+  private static final long DEFAULT_CHUNK_SIZE;

-  private DefaultRoundingPolicy() {
+  static {
+    int defaultPageSize = SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192);
+    Throwable pageSizeFallbackCause = null;
     try {
-      Field field = NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE");
-      field.setAccessible(true);
-      chunkSize = (Long) field.get(null);
-    } catch (Exception e) {
-      throw new RuntimeException("Failed to get chunk size from allocation manager");
+      validateAndCalculatePageShifts(defaultPageSize);
+    } catch (Throwable t) {
+      pageSizeFallbackCause = t;
+      defaultPageSize = 8192;
+    }
+
+    int defaultMaxOrder = SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11);
+    Throwable maxOrderFallbackCause = null;
+    try {
+      validateAndCalculateChunkSize(defaultPageSize, defaultMaxOrder);
+    } catch (Throwable t) {
+      maxOrderFallbackCause = t;
+      defaultMaxOrder = 11;
+    }
+    DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(defaultPageSize, defaultMaxOrder);
+    if (logger.isDebugEnabled()) {
+      if (pageSizeFallbackCause == null) {
+        logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", defaultPageSize);
+      } else {
+        logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", defaultPageSize, pageSizeFallbackCause);
+      }

Review comment: I think we only need one parameter here? So it should be

logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", pageSizeFallbackCause);
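The rounding rule described in the javadoc above (requests within the chunk size are rounded up to the next power of two; larger requests are left unchanged) can be sketched as follows. The 16 MiB default chunk size is an assumption derived from the defaults visible in the diff (8192-byte page size, max order 11, chunk size = pageSize << maxOrder, as in Netty's PooledByteBufAllocator); this is an illustration, not the actual Java code:

```python
# Sketch of the DefaultRoundingPolicy rounding rule (illustrative, not the Java source).
def rounded_size(requested: int, chunk_size: int) -> int:
    if requested > chunk_size:
        return requested  # beyond the chunk size: no rounding
    # next power of two >= requested (powers of two map to themselves)
    return 1 << max(requested - 1, 0).bit_length()

# Assumed default: chunk size = pageSize << maxOrder = 8192 << 11 = 16 MiB
DEFAULT_CHUNK_SIZE = 8192 << 11

print(rounded_size(4097, DEFAULT_CHUNK_SIZE))  # 8192
```

So a 4097-byte request is rounded to 8 KiB, while a request above 16 MiB is allocated exactly as asked.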
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r447363481

## File path: java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java

@@ -17,33 +17,107 @@
 package org.apache.arrow.memory.rounding;

-import java.lang.reflect.Field;
-
-import org.apache.arrow.memory.NettyAllocationManager;
 import org.apache.arrow.memory.util.CommonUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import io.netty.util.internal.SystemPropertyUtil;

 /**
  * The default rounding policy. That is, if the requested size is within the chunk size,
  * the rounded size will be the next power of two. Otherwise, the rounded size
  * will be identical to the requested size.
  */
 public class DefaultRoundingPolicy implements RoundingPolicy {
-
+  private static final Logger logger = LoggerFactory.getLogger(DefaultRoundingPolicy.class);
   public final long chunkSize;

   /**
-   * The singleton instance.
+   * The variables here and the static block calculates teh DEFAULT_CHUNK_SIZE.
+   *
+   * It was copied from {@link io.netty.buffer.PooledByteBufAllocator}.
+   */
-  public static final DefaultRoundingPolicy INSTANCE = new DefaultRoundingPolicy();
+  private static final int MIN_PAGE_SIZE = 4096;
+  private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 1) / 2);
+  private static final long DEFAULT_CHUNK_SIZE;
+  private static final int DEFAULT_PAGE_SIZE;
+  private static final int DEFAULT_MAX_ORDER;

-  private DefaultRoundingPolicy() {
+  static {
+    int defaultPageSize = SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192);
+    Throwable pageSizeFallbackCause = null;
     try {
-      Field field = NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE");
-      field.setAccessible(true);
-      chunkSize = (Long) field.get(null);
-    } catch (Exception e) {
-      throw new RuntimeException("Failed to get chunk size from allocation manager");
+      validateAndCalculatePageShifts(defaultPageSize);
+    } catch (Throwable t) {
+      pageSizeFallbackCause = t;
+      defaultPageSize = 8192;
+    }
+    DEFAULT_PAGE_SIZE = defaultPageSize;
+
+    int defaultMaxOrder = SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11);
+    Throwable maxOrderFallbackCause = null;
+    try {
+      validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, defaultMaxOrder);
+    } catch (Throwable t) {
+      maxOrderFallbackCause = t;
+      defaultMaxOrder = 11;
+    }
+    DEFAULT_MAX_ORDER = defaultMaxOrder;
+    DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, DEFAULT_MAX_ORDER);
+    if (logger.isDebugEnabled()) {
+      if (pageSizeFallbackCause == null) {
+        logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", DEFAULT_PAGE_SIZE);
+      } else {
+        logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", DEFAULT_PAGE_SIZE, pageSizeFallbackCause);
+      }
+      if (maxOrderFallbackCause == null) {
+        logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", DEFAULT_MAX_ORDER);
+      } else {
+        logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", DEFAULT_MAX_ORDER, maxOrderFallbackCause);
+      }

Review comment: Here too, we only need one parameter in the "else" statement.
[GitHub] [arrow] liyafan82 commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working
liyafan82 commented on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-651492666 In addition to the problem of the top-level validity buffer, I think there is another problem to discuss: Java uses the ordinal of the minor type as the type id, which does not align with the C++ implementation/specification. This PR works around the problem by supporting two modes, but the C++ and Java implementations are still not really aligned, IMO.
[GitHub] [arrow] liyafan82 edited a comment on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working
liyafan82 edited a comment on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-651492666 In addition to the problem of the top-level validity buffer, I think there is another problem to discuss: Java uses the ordinal of the minor type as the type id, which does not align with the C++ implementation/specification. This PR works around the problem by supporting two modes, but the C++ and Java implementations are still not really aligned, IMO. Do we have a plan to mitigate this?
[GitHub] [arrow] emkornfield commented on pull request #6156: ARROW-7539: [Java] FieldVector getFieldBuffers API should not set reader/writer indices
emkornfield commented on pull request #6156: URL: https://github.com/apache/arrow/pull/6156#issuecomment-651509291 Can we leave the old method in place, mark it as deprecated, and remove it in a later release?
[GitHub] [arrow] emkornfield commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working
emkornfield commented on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-651515749 @liyafan82 I think that is a good point. If it supports both modes I think that is a reasonable compromise for now as long as @jacques-n is OK with it. But we can maybe discuss further if needed.
[GitHub] [arrow] wesm closed pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files
wesm closed pull request #7560: URL: https://github.com/apache/arrow/pull/7560
[GitHub] [arrow] wesm commented on pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files
wesm commented on pull request #7560: URL: https://github.com/apache/arrow/pull/7560#issuecomment-651458622 Yahtzee
[GitHub] [arrow] mrkn edited a comment on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn edited a comment on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651458470 @wesm OK. I'll continue to work on benchmarking in this pull request. If I need more time for tuning etc., I'll split the issue.
[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651458470 OK. I'll continue to work on benchmarking in this pull request. If I need more time for tuning etc., I'll split the issue.
[GitHub] [arrow] github-actions[bot] commented on pull request #7586: Calculate page and column statistics
github-actions[bot] commented on pull request #7586: URL: https://github.com/apache/arrow/pull/7586#issuecomment-651495743 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:
* [Other pull requests](https://github.com/apache/arrow/pulls/)
* [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] kszucs opened a new pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion
kszucs opened a new pull request #7584: URL: https://github.com/apache/arrow/pull/7584 The original motivation for this patch was to reuse the same conversion path for both scalars and arrays. In my recent patch the scalars are converted from a single-element list to a single-element array and then copied out of it. In the long term we should convert them directly, perhaps with a more generic converter API; until then, this patch aims to reduce code complexity without introducing any regressions. I checked the produced binary size:

```console
# BEFORE
-rwxr-xr-x  1 kszucs  staff  2926832 Jun 29 23:07 libarrow_python.100.0.0.dylib
# AFTER
-rwxr-xr-x  1 kszucs  staff  2869136 Jun 29 23:06 libarrow_python.100.0.0.dylib
```

```
convert_builtins.ConvertArrayToPyList.time_convert
type                before    after
------------------  --------  --------
int32               13.2±0ms  13.2±0ms
uint32              13.4±0ms  14.8±0ms
int64               16.4±0ms  13.7±0ms
uint64              13.0±0ms  17.4±0ms
float32             13.5±0ms  16.0±0ms
float64             16.3±0ms  14.7±0ms
bool                12.0±0ms  12.0±0ms
decimal             61.1±0ms  65.0±0ms
binary              19.1±0ms  18.2±0ms
binary10            17.9±0ms  17.4±0ms
ascii               26.0±0ms  25.9±0ms
unicode             81.4±0ms  111±0ms
int64 list          145±0ms   152±0ms
struct              229±0ms   212±0ms

convert_builtins.ConvertPyListToArray.time_convert
type                before    after
------------------  --------  --------
int32               129±0ms   135±0ms
uint32              139±0ms   134±0ms
int64               132±0ms   139±0ms
uint64              135±0ms   132±0ms
float32             149±0ms   131±0ms
float64             143±0ms   141±0ms
bool                126±0ms   130±0ms
decimal             139±0ms   135±0ms
binary              145±0ms   130±0ms
binary10            136±0ms   130±0ms
ascii               145±0ms   136±0ms
unicode             148±0ms   149±0ms
int64 list          174±0ms   176±0ms
struct              164±0ms   177±0ms
struct from tuples  165±0ms   169±0ms

convert_builtins.InferPyListToArray.time_infer
type                before    after
------------------  --------  --------
int64               157±0ms   158±0ms
float64             155±0ms   153±0ms
bool                155±0ms   154±0ms
decimal             249±0ms   233±0ms
binary              167±0ms   162±0ms
ascii               181±0ms   154±0ms
unicode             179±0ms   171±0ms
int64 list          188±0ms   192±0ms
struct              177±0ms   181±0ms
```
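The single-element-list detour described above can be sketched in plain Python (illustrative only; `convert_sequence` and `convert_scalar` are hypothetical names, not the pyarrow API):

```python
def convert_sequence(values):
    # the one shared conversion path: always operates on a sequence
    return [int(v) for v in values]

def convert_scalar(value):
    # current workaround: wrap the scalar in a one-element list, run the
    # sequence path, then copy the single converted element back out
    return convert_sequence([value])[0]

print(convert_scalar("42"))  # prints: 42
```

A more generic converter API, as suggested above, would let the scalar path skip the temporary one-element array entirely.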
[GitHub] [arrow] github-actions[bot] commented on pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion
github-actions[bot] commented on pull request #7584: URL: https://github.com/apache/arrow/pull/7584#issuecomment-651404016 https://issues.apache.org/jira/browse/ARROW-9272
[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
wesm commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651368104 Indeed, toolchain incompatibilities only affect C++ code
[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on a change in pull request #7539: URL: https://github.com/apache/arrow/pull/7539#discussion_r447267538 ## File path: cpp/src/arrow/tensor/csf_converter.cc ## @@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, const std::vector -class SparseCSFTensorConverter { - public: - using NumericTensorType = NumericTensor; - using value_type = typename NumericTensorType::value_type; +class SparseCSFTensorConverter : private SparseTensorConverterMixin { + using SparseTensorConverterMixin::AssignIndex; + using SparseTensorConverterMixin::IsNonZero; - SparseCSFTensorConverter(const NumericTensorType& tensor, + public: + SparseCSFTensorConverter(const Tensor& tensor, const std::shared_ptr& index_value_type, MemoryPool* pool) : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {} - template Status Convert() { -using c_index_value_type = typename IndexValueType::c_type; - RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max())); +RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape())); + +const int index_elsize = +checked_cast(*index_value_type_).bit_width() / CHAR_BIT; +const int value_elsize = +checked_cast(*tensor_.type()).bit_width() / CHAR_BIT; Review comment: OK. I'll make `internal::GetByteWidth`, and let it return `-1` for non-`FixedWidthType` values.
[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
nealrichardson commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651355763 > > This means there also needs to be a PKGBUILD > > Why? `libutf8proc` is installed. The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9. Most of our deps have to be compiled for both, and this is apparently one of those. That's what https://github.com/r-windows/rtools-backports is for.
[GitHub] [arrow] kou commented on a change in pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
kou commented on a change in pull request #7581: URL: https://github.com/apache/arrow/pull/7581#discussion_r447247927 ## File path: cpp/src/arrow/config.h ## @@ -0,0 +1,47 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include + +#include "arrow/util/config.h" // IWYU pragma: export +#include "arrow/util/visibility.h" + +namespace arrow { + +struct BuildInfo { + int major; + int minor; + int patch; Review comment: Oh, sorry. One comment wasn't submitted... How about adding a `version_` prefix, because our `ARROW_VERSION_*` macros have the `VERSION_` prefix?

```suggestion
  int version_major;
  int version_minor;
  int version_patch;
```
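The suggested naming can be mirrored in a small sketch (a hypothetical Python analogue of the C++ `BuildInfo` struct under discussion, not Arrow's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BuildInfo:
    # the 'version_' prefix mirrors the ARROW_VERSION_* macro naming
    version_major: int
    version_minor: int
    version_patch: int

    @property
    def version_string(self) -> str:
        return f"{self.version_major}.{self.version_minor}.{self.version_patch}"

info = BuildInfo(version_major=1, version_minor=0, version_patch=0)
print(info.version_string)  # prints: 1.0.0
```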
[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651366993 > The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9 What difference does it make? This is plain C.
[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
nealrichardson commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651369436 > > The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9 > > What difference does it make? This is plain C. :shrug: then I'll leave it to you to sort out as this is beyond my knowledge. In the past, undefined symbols error + only compiled for rtools-packages (gcc8) = you need to get it built with rtools-backports too. Maybe something's off with the lib that was built, IDK if anyone has verified that it works.
[GitHub] [arrow] wesm commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
wesm commented on a change in pull request #7539: URL: https://github.com/apache/arrow/pull/7539#discussion_r447264419 ## File path: cpp/src/arrow/tensor/csf_converter.cc ## @@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, const std::vector -class SparseCSFTensorConverter { - public: - using NumericTensorType = NumericTensor; - using value_type = typename NumericTensorType::value_type; +class SparseCSFTensorConverter : private SparseTensorConverterMixin { + using SparseTensorConverterMixin::AssignIndex; + using SparseTensorConverterMixin::IsNonZero; - SparseCSFTensorConverter(const NumericTensorType& tensor, + public: + SparseCSFTensorConverter(const Tensor& tensor, const std::shared_ptr& index_value_type, MemoryPool* pool) : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {} - template Status Convert() { -using c_index_value_type = typename IndexValueType::c_type; - RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max())); +RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape())); + +const int index_elsize = +checked_cast(*index_value_type_).bit_width() / CHAR_BIT; +const int value_elsize = +checked_cast(*tensor_.type()).bit_width() / CHAR_BIT; Review comment: As long as the signature of the function is `int(const DataType&)` it sounds good to me. I don't think it should be a non-internal API because it does not need to do error-handling. If it's a non-static member of FixedWidthType, then a `checked_cast` is still necessary which seems against the spirit of making the code simpler
[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on a change in pull request #7539: URL: https://github.com/apache/arrow/pull/7539#discussion_r447268862 ## File path: cpp/src/arrow/tensor/csf_converter.cc ## @@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, const std::vector -class SparseCSFTensorConverter { - public: - using NumericTensorType = NumericTensor; - using value_type = typename NumericTensorType::value_type; +class SparseCSFTensorConverter : private SparseTensorConverterMixin { + using SparseTensorConverterMixin::AssignIndex; + using SparseTensorConverterMixin::IsNonZero; - SparseCSFTensorConverter(const NumericTensorType& tensor, + public: + SparseCSFTensorConverter(const Tensor& tensor, const std::shared_ptr& index_value_type, MemoryPool* pool) : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {} - template Status Convert() { -using c_index_value_type = typename IndexValueType::c_type; - RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max())); +RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape())); + +const int index_elsize = +checked_cast(*index_value_type_).bit_width() / CHAR_BIT; +const int value_elsize = +checked_cast(*tensor_.type()).bit_width() / CHAR_BIT; Review comment: Oops, I don't need to return `-1` because `checked_cast` is used in this function. Sorry.
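The helper discussed in this thread boils down to a single division; a minimal sketch of the idea (names assumed from the thread, not the final Arrow signature):

```python
CHAR_BIT = 8  # bits per byte, as in <climits>

def get_byte_width(bit_width: int) -> int:
    # sketch of the proposed internal::GetByteWidth helper: for a
    # fixed-width type, byte width is simply bit_width / CHAR_BIT;
    # the checked_cast validation happens before this point, so no
    # error handling is needed here
    return bit_width // CHAR_BIT

print(get_byte_width(64))   # prints: 8
print(get_byte_width(128))  # prints: 16
```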
[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-650967222 @wesm Could you please review this?
[GitHub] [arrow] romainfrancois commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector
romainfrancois commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r446816192 ## File path: r/src/array_from_vector.cpp ## @@ -918,6 +923,97 @@ class Time64Converter : public TimeConverter { } }; +template +class BinaryVectorConverter : public VectorConverter { + public: + ~BinaryVectorConverter() {} + + Status Init(ArrayBuilder* builder) { +typed_builder_ = checked_cast(builder); +return Status::OK(); + } + + Status Ingest(SEXP obj) { +ARROW_RETURN_IF(TYPEOF(obj) != VECSXP, Status::RError("Expecting a list")); +R_xlen_t n = XLENGTH(obj); + +// Reserve enough space before appending +int64_t size = 0; +for (R_xlen_t i = 0; i < n; i++) { + SEXP obj_i = VECTOR_ELT(obj, i); + if (!Rf_isNull(obj_i)) { +ARROW_RETURN_IF(TYPEOF(obj_i) != RAWSXP, +Status::RError("Expecting a raw vector")); +size += XLENGTH(obj_i); + } +} +RETURN_NOT_OK(typed_builder_->Reserve(size)); + +// append +for (R_xlen_t i = 0; i < n; i++) { + SEXP obj_i = VECTOR_ELT(obj, i); + if (Rf_isNull(obj_i)) { +RETURN_NOT_OK(typed_builder_->AppendNull()); + } else { +RETURN_NOT_OK(typed_builder_->Append(RAW(obj_i), XLENGTH(obj_i))); + } +} +return Status::OK(); + } + + Status GetResult(std::shared_ptr* result) { +return typed_builder_->Finish(result); + } + + private: + Builder* typed_builder_; +}; + +template +class StringVectorConverter : public VectorConverter { + public: + ~StringVectorConverter() {} + + Status Init(ArrayBuilder* builder) { +typed_builder_ = checked_cast(builder); +return Status::OK(); + } + + Status Ingest(SEXP obj) { +ARROW_RETURN_IF(TYPEOF(obj) != STRSXP, +Status::RError("Expecting a character vector")); +R_xlen_t n = XLENGTH(obj); + +// Reserve enough space before appending +int64_t size = 0; +for (R_xlen_t i = 0; i < n; i++) { + SEXP string_i = STRING_ELT(obj, i); + if (string_i != NA_STRING) { +size += XLENGTH(string_i); + } +} +RETURN_NOT_OK(typed_builder_->Reserve(size)); + +// append +for (R_xlen_t i = 0; i < n; i++) { + SEXP string_i = STRING_ELT(obj, i); + if (string_i == NA_STRING) { +RETURN_NOT_OK(typed_builder_->AppendNull()); + } else { +RETURN_NOT_OK(typed_builder_->Append(CHAR(string_i), XLENGTH(string_i))); Review comment: I'll have a look once back at this. Overall there seem to be two concurrent systems for no real reason, and I believe we should only keep the one powered by `VectorToArrayConverter`. I'm on my `dplyr` week now, but I'll still try to make some space for this.
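The converters above follow a two-pass pattern: first validate the inputs and compute the total byte size, then reserve once and append. A rough Python sketch of the same idea (hypothetical helper, not the R bindings):

```python
def build_binary_column(values):
    # pass 1: validate and size, so a real builder could Reserve() exactly once
    total = 0
    for v in values:
        if v is not None:
            if not isinstance(v, (bytes, bytearray)):
                raise TypeError("Expecting a raw vector")
            total += len(v)

    # pass 2: append values; a real builder would have reserved `total` bytes
    data = bytearray()
    offsets = [0]
    for v in values:
        if v is not None:
            data += v
        offsets.append(len(data))  # a null repeats the previous offset
    return bytes(data), offsets

print(build_binary_column([b"ab", None, b"c"]))  # prints: (b'abc', [0, 2, 2, 3])
```

Reserving up front avoids repeated reallocations while the values are appended, which is the point of the `Reserve(size)` call in the reviewed code.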
[GitHub] [arrow] github-actions[bot] commented on pull request #7579: ARROW-9242: [Java] Add forward compatibility check for Decimal bit width
github-actions[bot] commented on pull request #7579: URL: https://github.com/apache/arrow/pull/7579#issuecomment-650985250 https://issues.apache.org/jira/browse/ARROW-9242
[GitHub] [arrow] tianchen92 opened a new pull request #7579: ARROW-9242: [Java] Add forward compatibility check for Decimal bit width
tianchen92 opened a new pull request #7579: URL: https://github.com/apache/arrow/pull/7579
[GitHub] [arrow] tianchen92 commented on a change in pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width
tianchen92 commented on a change in pull request #7568: URL: https://github.com/apache/arrow/pull/7568#discussion_r446812811 ## File path: cpp/src/arrow/ipc/metadata_internal.cc ## @@ -257,6 +259,9 @@ Status ConcreteTypeFromFlatbuffer(flatbuf::Type type, const void* type_data, return Status::OK(); case flatbuf::Type::Decimal: { auto dec_type = static_cast(type_data); + if (dec_type->bitWidth() != kDecimalBitWidth) { +return Status::Invalid("Library only supports 128-bit decimal values"); Review comment: sorry, bitWidth should return 128 if no value is set. LGTM ## File path: cpp/src/arrow/ipc/metadata_internal.cc ## @@ -257,6 +259,9 @@ Status ConcreteTypeFromFlatbuffer(flatbuf::Type type, const void* type_data, return Status::OK(); case flatbuf::Type::Decimal: { auto dec_type = static_cast(type_data); + if (dec_type->bitWidth() != kDecimalBitWidth) { +return Status::Invalid("Library only supports 128-bit decimal values"); Review comment: should also handle the case when bitWidth == null?
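The forward-compatibility check being reviewed can be sketched as follows (a hypothetical Python mirror of the C++ logic; as noted in the thread, the Flatbuffers field defaults to 128 when it was never set, which is why the missing-value case is safe):

```python
K_DECIMAL_BIT_WIDTH = 128  # the only decimal width this library version supports

def check_decimal_bit_width(bit_width):
    # Flatbuffers returns the schema default (128) when the field is absent,
    # so treat a missing value as 128 rather than rejecting it
    if bit_width is None:
        bit_width = 128
    if bit_width != K_DECIMAL_BIT_WIDTH:
        raise ValueError("Library only supports 128-bit decimal values")
    return bit_width
```

A future 256-bit decimal would trip this check in an old reader with a clear error instead of silently misreading the data.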
[GitHub] [arrow] tianchen92 commented on pull request #7579: ARROW-9242: [Java] Add forward compatibility check for Decimal bit width
tianchen92 commented on pull request #7579: URL: https://github.com/apache/arrow/pull/7579#issuecomment-650997334 cc @wesm
[GitHub] [arrow] rymurr commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
rymurr commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r446920091 ## File path: java/memory/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java ## @@ -95,4 +143,26 @@ public static long getByteBufferAddress(ByteBuffer buf) { private MemoryUtil() { } + + /** + * Create nio byte buffer. + */ + public static ByteBuffer directBuffer(long address, int capacity) { +if (DIRECT_BUFFER_CONSTRUCTOR != null) { + if (capacity < 0) { +throw new IllegalArgumentException("Capacity is negative, has to be positive or 0"); + } + try { +return (ByteBuffer) DIRECT_BUFFER_CONSTRUCTOR.newInstance(address, capacity); + } catch (Throwable cause) { +// Not expected to ever throw! +if (cause instanceof Error) { + throw (Error) cause; Review comment: None, only that it was how the netty developers wrote it. I thought I would copy their lead. However, I have simplified it now.
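A Python analogue of the `directBuffer` helper above, viewing raw memory at a given address as a buffer via `ctypes` (illustrative only; the Java code instead invokes a reflected `DirectByteBuffer` constructor):

```python
import ctypes

def direct_buffer(address, capacity):
    # mirror the precondition from MemoryUtil.directBuffer
    if capacity < 0:
        raise ValueError("Capacity is negative, has to be positive or 0")
    # view `capacity` bytes of raw memory starting at `address`
    return (ctypes.c_char * capacity).from_address(address)

backing = ctypes.create_string_buffer(b"hello")  # owns the memory
view = direct_buffer(ctypes.addressof(backing), 5)
print(view.raw)  # prints: b'hello'
```

As in the Java version, the view does not own the memory, so the caller must keep the backing allocation alive for the lifetime of the view.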
[GitHub] [arrow] rymurr commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
rymurr commented on pull request #7347: URL: https://github.com/apache/arrow/pull/7347#issuecomment-651078278 Thanks @liyafan82, I have updated based on your comments and the integration tests have passed.
[GitHub] [arrow] rymurr commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
rymurr commented on pull request #7347: URL: https://github.com/apache/arrow/pull/7347#issuecomment-651081660 Looks like an ongoing GitHub incident is causing build failures. Will rebase and rebuild once GitHub is back to normal.
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446943865 ## File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc ## @@ -68,6 +76,64 @@ TYPED_TEST(TestStringKernels, AsciiLower) { this->string_type(), "[\"aaazzæÆ&\", null, \"\", \"bbb\"]"); } +TEST(TestStringKernels, Utf8Upper32bitGrowth) { Review comment: Great, I was hoping such a system existed.
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446964243 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -39,6 +158,121 @@ struct AsciiLength { } }; +template class Derived> +struct Utf8Transform { + using offset_type = typename Type::offset_type; + using DerivedClass = Derived; + using ArrayType = typename TypeTraits::ArrayType; + + static offset_type Transform(const uint8_t* input, offset_type input_string_ncodeunits, + uint8_t* output) { +uint8_t* dest = output; +utf8_transform(input, input + input_string_ncodeunits, dest, + DerivedClass::TransformCodepoint); +return (offset_type)(dest - output); + } + + static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) { +if (batch[0].kind() == Datum::ARRAY) { + std::call_once(flag_case_luts, []() { +lut_upper_codepoint.reserve(MAX_CODEPOINT_LUT + 1); +lut_lower_codepoint.reserve(MAX_CODEPOINT_LUT + 1); +for (int i = 0; i <= MAX_CODEPOINT_LUT; i++) { + lut_upper_codepoint.push_back(utf8proc_toupper(i)); + lut_lower_codepoint.push_back(utf8proc_tolower(i)); +} + }); + const ArrayData& input = *batch[0].array(); + ArrayType input_boxed(batch[0].array()); + ArrayData* output = out->mutable_array(); + + offset_type const* input_string_offsets = input.GetValues(1); + utf8proc_uint8_t const* input_str = Review comment: That's the offset, I don't see a similar method for the data, let me know if I'm wrong.
[GitHub] [arrow] pitrou opened a new pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
pitrou opened a new pull request #7580: URL: https://github.com/apache/arrow/pull/7580 The AWS SDK in the manylinux packages uses a custom-compiled libcurl that is configured for the certificates of the build system. However, there is no standard location for CA certificates on Linux, so we must instead configure the filesystem layer at runtime with Python's own TLS configuration values (which are assumed to be correct, otherwise Python itself couldn't connect to TLS hosts).
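Python's own TLS configuration mentioned above is available from the standard library: `ssl.get_default_verify_paths()` reports the CA bundle locations Python itself trusts, which is the kind of value that can be handed down to the C++ S3 layer at runtime (a sketch of the idea, not the actual patch):

```python
import ssl

# Python's own idea of where trusted CA certificates live; on manylinux
# these values describe the *runtime* system, unlike the paths baked into
# a libcurl compiled on the build machine
paths = ssl.get_default_verify_paths()
print(paths.cafile or paths.openssl_cafile)
print(paths.capath or paths.openssl_capath)
```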
[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
pitrou commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651139396 @github-actions crossbow submit -g wheel
[GitHub] [arrow] github-actions[bot] commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
github-actions[bot] commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651144126 https://issues.apache.org/jira/browse/ARROW-9261
[GitHub] [arrow] wesm commented on a change in pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata
wesm commented on a change in pull request #7571: URL: https://github.com/apache/arrow/pull/7571#discussion_r447002951 ## File path: cpp/src/arrow/ipc/options.h ## @@ -66,6 +67,10 @@ struct ARROW_EXPORT IpcWriteOptions { /// like compression bool use_threads = true; + /// \brief Format version to use for IPC messages and their + /// metadata. Presently using V4 version (readable by v0.8.0 and later). + MetadataVersion metadata_version = MetadataVersion::V4; Review comment: Yes. This is what I've been saying in the V4/V5 MetadataVersion discussion e-mail
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446952433 ## File path: cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc ## @@ -41,7 +42,9 @@ static void UnaryStringBenchmark(benchmark::State& state, const std::string& fun ABORT_NOT_OK(CallFunction(func_name, {values})); } state.SetItemsProcessed(state.iterations() * array_length); - state.SetBytesProcessed(state.iterations() * values->data()->buffers[2]->size()); + state.SetBytesProcessed(state.iterations() * + ((touches_offsets ? values->data()->buffers[1]->size() : 0) + Review comment: fair point, indeed makes them more comparable.
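The accounting change above counts the offsets buffer (`buffers[1]`, one int32 per string plus one) in addition to the UTF-8 data buffer (`buffers[2]`). A small sketch of that arithmetic (hypothetical helper, not the benchmark code):

```python
def bytes_processed(strings, touches_offsets=True):
    # buffers[2]: the concatenated UTF-8 bytes of all strings
    data_bytes = sum(len(s.encode("utf-8")) for s in strings)
    # buffers[1]: int32 offsets, one per string plus a trailing one
    offset_bytes = (len(strings) + 1) * 4
    return (offset_bytes if touches_offsets else 0) + data_bytes

print(bytes_processed(["ab", "c"]))  # prints: 15  (3 data bytes + 12 offset bytes)
```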
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446952269 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -39,6 +158,121 @@ struct AsciiLength { } };

```c++
template <typename Type, template <typename> class Derived>
struct Utf8Transform {
  using offset_type = typename Type::offset_type;
  using DerivedClass = Derived<Type>;
  using ArrayType = typename TypeTraits<Type>::ArrayType;

  static offset_type Transform(const uint8_t* input, offset_type input_string_ncodeunits,
                               uint8_t* output) {
    uint8_t* dest = output;
    utf8_transform(input, input + input_string_ncodeunits, dest,
                   DerivedClass::TransformCodepoint);
    return (offset_type)(dest - output);
  }

  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
    if (batch[0].kind() == Datum::ARRAY) {
      std::call_once(flag_case_luts, []() {
        lut_upper_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
        lut_lower_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
        for (int i = 0; i <= MAX_CODEPOINT_LUT; i++) {
          lut_upper_codepoint.push_back(utf8proc_toupper(i));
          lut_lower_codepoint.push_back(utf8proc_tolower(i));
        }
      });
      const ArrayData& input = *batch[0].array();
      ArrayType input_boxed(batch[0].array());
      ArrayData* output = out->mutable_array();

      offset_type const* input_string_offsets = input.GetValues<offset_type>(1);
      utf8proc_uint8_t const* input_str =
          input.buffers[2]->data() + input_boxed.value_offset(0);
      offset_type input_ncodeunits = input_boxed.total_values_length();
      offset_type input_nstrings = (offset_type)input.length;

      // Section 5.18 of the Unicode spec claims that the number of codepoints for case
      // mapping can grow by a factor of 3, which means a growth by a factor of 3 in
      // bytes. However, since we don't support all casings (SpecialCasing.txt), the
      // growth is actually at most 3/2 (as covered by the unit test).
      // Note that rounding down the 3/2 is ok, since only codepoints encoded by
      // two code units (even) can grow to 3 code units.

      int64_t output_ncodeunits_max = ((int64_t)input_ncodeunits) * 3 / 2;
      if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
        ctx->SetStatus(Status::CapacityError(
            "Result might not fit in a 32bit utf8 array, convert to large_utf8"));
        return;
      }

      KERNEL_RETURN_IF_ERROR(
          ctx, ctx->Allocate(output_ncodeunits_max).Value(&output->buffers[2]));
      // We could reuse the buffer if it is all ascii, benchmarking showed this not to
      // matter
      // output->buffers[1] = input.buffers[1];
      KERNEL_RETURN_IF_ERROR(ctx,
                             ctx->Allocate((input_nstrings + 1) * sizeof(offset_type))
                                 .Value(&output->buffers[1]));
      utf8proc_uint8_t* output_str = output->buffers[2]->mutable_data();
      offset_type* output_string_offsets = output->GetMutableValues<offset_type>(1);
      offset_type output_ncodeunits = 0;

      offset_type output_string_offset = 0;
      *output_string_offsets = output_string_offset;
      offset_type input_string_first_offset = input_string_offsets[0];
      for (int64_t i = 0; i < input_nstrings; i++) {
        offset_type input_string_offset =
            input_string_offsets[i] - input_string_first_offset;
        offset_type input_string_end =
            input_string_offsets[i + 1] - input_string_first_offset;
        offset_type input_string_ncodeunits = input_string_end - input_string_offset;
        offset_type encoded_nbytes = DerivedClass::Transform(
            input_str + input_string_offset, input_string_ncodeunits,
            output_str + output_ncodeunits);
        output_ncodeunits += encoded_nbytes;
        output_string_offsets[i + 1] = output_ncodeunits;
      }

      // trim the codepoint buffer, since we allocated too much
      KERNEL_RETURN_IF_ERROR(
          ctx,
          output->buffers[2]->CopySlice(0, output_ncodeunits).Value(&output->buffers[2]));
    } else {
      const auto& input = checked_cast<const BaseBinaryScalar&>(*batch[0].scalar());
      auto result = checked_pointer_cast<BaseBinaryScalar>(MakeNullScalar(out->type()));
      if (input.is_valid) {
        result->is_valid = true;
        offset_type data_nbytes = (offset_type)input.value->size();
        // See note above in the Array version explaining the 3 / 2
        KERNEL_RETURN_IF_ERROR(
            ctx, ctx->Allocate(data_nbytes * 3 / 2).Value(&result->value));
        offset_type encoded_nbytes = DerivedClass::Transform(
            input.value->data(), data_nbytes, result->value->mutable_data());
        KERNEL_RETURN_IF_ERROR(
            ctx, result->value->CopySlice(0, encoded_nbytes).Value(&result->value));
      }
      out->value = result;
    }
  }
};
```
[GitHub] [arrow] github-actions[bot] commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
github-actions[bot] commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651141065 Revision: b0dd66f17e94fbf61da0f1f616ccd5e627a79145 Submitted crossbow builds: [ursa-labs/crossbow @ actions-365](https://github.com/ursa-labs/crossbow/branches/all?query=actions-365)

|Task|Status|
|--|--|
|wheel-manylinux1-cp35m|Azure|
|wheel-manylinux1-cp36m|Azure|
|wheel-manylinux1-cp37m|Azure|
|wheel-manylinux1-cp38|Azure|
|wheel-manylinux2010-cp35m|Azure|
|wheel-manylinux2010-cp36m|Azure|
|wheel-manylinux2010-cp37m|Azure|
|wheel-manylinux2010-cp38|Azure|
|wheel-manylinux2014-cp35m|Azure|
|wheel-manylinux2014-cp36m|Azure|
|wheel-manylinux2014-cp37m|Azure|
|wheel-manylinux2014-cp38|Azure|
|wheel-osx-cp35m|TravisCI|
|wheel-osx-cp36m|TravisCI|
|wheel-osx-cp37m|TravisCI|
|wheel-osx-cp38|TravisCI|
|wheel-win-cp35m|Appveyor|
|wheel-win-cp36m|Appveyor|
[GitHub] [arrow] wesm commented on a change in pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width
wesm commented on a change in pull request #7568: URL: https://github.com/apache/arrow/pull/7568#discussion_r446993778 ## File path: cpp/src/arrow/ipc/metadata_internal.cc ## @@ -257,6 +259,9 @@ Status ConcreteTypeFromFlatbuffer(flatbuf::Type type, const void* type_data,

```c++
      return Status::OK();
    case flatbuf::Type::Decimal: {
      auto dec_type = static_cast<const flatbuf::Decimal*>(type_data);
      if (dec_type->bitWidth() != kDecimalBitWidth) {
        return Status::Invalid("Library only supports 128-bit decimal values");
      }
```

Review comment: The Flatbuffers library takes care of this -- `bitWidth()` always returns a value and if the serialized field is not present, the default 128 is returned. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata
pitrou commented on a change in pull request #7571: URL: https://github.com/apache/arrow/pull/7571#discussion_r447003525 ## File path: cpp/src/arrow/ipc/options.h ## @@ -66,6 +67,10 @@ struct ARROW_EXPORT IpcWriteOptions {

```c++
  /// like compression
  bool use_threads = true;

  /// \brief Format version to use for IPC messages and their
  /// metadata. Presently using V4 version (readable by v0.8.0 and later).
  MetadataVersion metadata_version = MetadataVersion::V4;
```

Review comment: Oops, I had misunderstood that.
[GitHub] [arrow] rymurr commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
rymurr commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r446920388 ## File path: java/memory/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java ## @@ -78,6 +77,55 @@

```java
      public Object run() {
        Field addressField = java.nio.Buffer.class.getDeclaredField("address");
        addressField.setAccessible(true);
        BYTE_BUFFER_ADDRESS_OFFSET = UNSAFE.objectFieldOffset(addressField);

        Constructor<?> directBufferConstructor;
        long address = -1;
        final ByteBuffer direct = ByteBuffer.allocateDirect(1);
        try {
          final Object maybeDirectBufferConstructor =
              AccessController.doPrivileged(new PrivilegedAction<Object>() {
                @Override
                public Object run() {
                  try {
                    final Constructor<?> constructor =
                        direct.getClass().getDeclaredConstructor(long.class, int.class);
                    constructor.setAccessible(true);
                    logger.debug("Constructor for direct buffer found and made accessible");
                    return constructor;
                  } catch (NoSuchMethodException e) {
                    logger.debug("Cannot get constructor for direct buffer allocation", e);
                    return e;
                  } catch (SecurityException e) {
                    logger.debug("Cannot get constructor for direct buffer allocation", e);
                    return e;
                  }
                }
              });

          if (maybeDirectBufferConstructor instanceof Constructor) {
            address = UNSAFE.allocateMemory(1);
            // try to use the constructor now
            try {
              ((Constructor<?>) maybeDirectBufferConstructor).newInstance(address, 1);
              directBufferConstructor = (Constructor<?>) maybeDirectBufferConstructor;
              logger.debug("direct buffer constructor: available");
            } catch (InstantiationException | IllegalAccessException | InvocationTargetException e) {
              logger.debug("unable to instantiate a direct buffer via constructor", e);
```

Review comment: fixed
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446934057 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -30,6 +31,124 @@ namespace internal { namespace {

```c++
// lookup tables
std::vector<uint32_t> lut_upper_codepoint;
std::vector<uint32_t> lut_lower_codepoint;
std::once_flag flag_case_luts;

constexpr uint32_t REPLACEMENT_CHAR =
    '?';  // the proper replacement char would be the 0xFFFD codepoint, but that can
          // increase string length by a factor of 3
constexpr int MAX_CODEPOINT_LUT = 0xffff;  // up to this codepoint is in a lookup table

static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) {
  if (codepoint < 0x80) {
    *str++ = codepoint;
  } else if (codepoint < 0x800) {
    *str++ = 0xC0 + (codepoint >> 6);
    *str++ = 0x80 + (codepoint & 0x3F);
  } else if (codepoint < 0x10000) {
    *str++ = 0xE0 + (codepoint >> 12);
    *str++ = 0x80 + ((codepoint >> 6) & 0x3F);
    *str++ = 0x80 + (codepoint & 0x3F);
  } else if (codepoint < 0x200000) {
    *str++ = 0xF0 + (codepoint >> 18);
    *str++ = 0x80 + ((codepoint >> 12) & 0x3F);
    *str++ = 0x80 + ((codepoint >> 6) & 0x3F);
    *str++ = 0x80 + (codepoint & 0x3F);
  } else {
    *str++ = codepoint;
  }
}

static inline bool utf8_is_continuation(const uint8_t codeunit) {
  return (codeunit & 0xC0) == 0x80;  // upper two bits should be 10
}

static inline uint32_t utf8_decode(const uint8_t*& str, int64_t& length) {
  if (*str < 0x80) {  // ascii
    length -= 1;
    return *str++;
  } else if (*str < 0xC0) {  // invalid non-ascii char
    length -= 1;
    str++;
    return REPLACEMENT_CHAR;
```

Review comment: Would you mind if we keep it as is for now, and maybe discuss this separately? Happy to take this to the mailing list.
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446988586 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -39,6 +158,121 @@ [quoted diff hunk omitted; same `Utf8Transform` excerpt as the earlier comment on this file, ending at the `input_string_ncodeunits` computation in the offsets loop] Review comment: Using GetValue now, much cleaner
[GitHub] [arrow] wesm commented on pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width
wesm commented on pull request #7568: URL: https://github.com/apache/arrow/pull/7568#issuecomment-651142403 +1
[GitHub] [arrow] wesm commented on a change in pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width
wesm commented on a change in pull request #7568: URL: https://github.com/apache/arrow/pull/7568#discussion_r446994593 ## File path: cpp/src/arrow/ipc/metadata_internal.cc ## @@ -257,6 +259,9 @@ [quoted diff hunk omitted; same Decimal bit-width excerpt as the earlier comment on this file] Review comment: Here's the generated binding -- so if `VT_BITWIDTH` is not present in the table then it returns 128

```c++
int32_t bitWidth() const { return GetField<int32_t>(VT_BITWIDTH, 128); }
```
[GitHub] [arrow] pitrou commented on pull request #7575: ARROW-8671: [C++][FOLLOWUP] Fix ASAN/UBSAN bug found with IPC fuzz testing files
pitrou commented on pull request #7575: URL: https://github.com/apache/arrow/pull/7575#issuecomment-651147499 Did you add the regression file to the testing repository? (in `data/arrow-ipc-file` or `data/arrow-ipc-stream`, depending on the fuzzer which found it)
[GitHub] [arrow] wesm edited a comment on pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types
wesm edited a comment on pull request #7569: URL: https://github.com/apache/arrow/pull/7569#issuecomment-650803803 Benchmarks on gcc-8

```
$ archery benchmark diff --cc=gcc-8 --cxx=g++-8 --benchmark-filter=FilterString

    benchmark                                baseline         contender        change %  counters
13  FilterStringFilterNoNulls/1048576/0      1.242 GiB/sec    6.766 GiB/sec     444.926  {'iterations': 890, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 99.9}
28  FilterStringFilterNoNulls/1048576/3      1.216 GiB/sec    5.055 GiB/sec     315.816  {'iterations': 877, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 99.9}
8   FilterStringFilterNoNulls/1048576/6      1.087 GiB/sec    2.512 GiB/sec     130.990  {'iterations': 771, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 99.9}
4   FilterStringFilterNoNulls/1048576/12     227.501 MiB/sec  500.272 MiB/sec   119.899  {'iterations': 160, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 99.9}
23  FilterStringFilterNoNulls/1048576/9      902.994 MiB/sec  1.362 GiB/sec      54.476  {'iterations': 622, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 99.9}
24  FilterStringFilterWithNulls/1048576/12   235.786 MiB/sec  356.124 MiB/sec    51.037  {'iterations': 165, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 99.9}
5   FilterStringFilterWithNulls/1048576/3    996.664 MiB/sec  1.361 GiB/sec      39.858  {'iterations': 687, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 99.9}
22  FilterStringFilterWithNulls/1048576/6    988.599 MiB/sec  1.331 GiB/sec      37.918  {'iterations': 685, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 99.9}
15  FilterStringFilterWithNulls/1048576/9    913.433 MiB/sec  1.205 GiB/sec      35.073  {'iterations': 634, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 99.9}
7   FilterStringFilterNoNulls/1048576/13     219.235 MiB/sec  292.800 MiB/sec    33.555  {'iterations': 155, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 50.0}
14  FilterStringFilterNoNulls/1048576/10     1.169 GiB/sec    1.559 GiB/sec      33.370  {'iterations': 838, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 50.0}
25  FilterStringFilterWithNulls/1048576/0    1.062 GiB/sec    1.366 GiB/sec      28.690  {'iterations': 761, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 99.9}
2   FilterStringFilterNoNulls/1048576/1      1.398 GiB/sec    1.763 GiB/sec      26.088  {'iterations': 1007, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 50.0}
11  FilterStringFilterWithNulls/1048576/4    1.233 GiB/sec    1.521 GiB/sec      23.336  {'iterations': 894, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 50.0}
29  FilterStringFilterWithNulls/1048576/7    1.240 GiB/sec    1.519 GiB/sec      22.426  {'iterations': 881, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 50.0}
9   FilterStringFilterWithNulls/1048576/10   1.177 GiB/sec    1.370 GiB/sec      16.380  {'iterations': 753, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 50.0}
26  FilterStringFilterWithNulls/1048576/13   215.181 MiB/sec  249.411 MiB/sec    15.907  {'iterations': 151, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 50.0}
3   FilterStringFilterNoNulls/1048576/7      1.373 GiB/sec    1.580 GiB/sec      15.117  {'iterations': 983, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 50.0}
10  FilterStringFilterNoNulls/1048576/4      1.427 GiB/sec    1.607 GiB/sec      12.602  {'iterations': 996, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 50.0}
17  FilterStringFilterWithNulls/1048576/1    1.364 GiB/sec    1.532 GiB/sec      12.343  {'iterations': 950, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 50.0}
20  FilterStringFilterNoNulls/1048576/8      19.656 GiB/sec   20.549 GiB/sec      4.540  {'iterations': 14209, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 1.0}
6   FilterStringFilterWithNulls/1048576/11   12.449 GiB/sec   12.500 GiB/sec      0.407  {'iterations': 8994, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 1.0}
0   FilterStringFilterNoNulls/1048576/5      20.884 GiB/sec   20.824 GiB/sec     -0.287  {'iterations': 14923, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 1.0}
19  FilterStringFilterNoNulls/1048576/11     18.171 GiB/sec   17.865 GiB/sec     -1.687  {'iterations': 12875, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 1.0}
21  FilterStringFilterWithNulls/1048576/14   1.551 GiB/sec    1.519 GiB/sec      -2.057  {'iterations': 1090, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 1.0}
16  FilterStringFilterNoNulls/1048576/2      21.518 GiB/sec   21.028 GiB/sec     -2.281  {'iterations': 15479, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 1.0}
12  FilterStringFilterNoNulls/1048576/14     2.367 GiB/sec    2.295 GiB/sec     -3.071  {'iterations': 1690, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 1.0}
```
[GitHub] [arrow] wesm commented on pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata
wesm commented on pull request #7571: URL: https://github.com/apache/arrow/pull/7571#issuecomment-651146389 @pitrou this is already merged (by accident actually, mistyped the PR number on the command line and went too fast), but let me know if you see anything concerning from a fuzz perspective or otherwise. I fixed one fuzz issue already in https://github.com/apache/arrow/commit/76c3e4a6d30e279fa5707f7cc14e8aacf00f08a3
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446938183 ## File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc ## @@ -68,6 +76,64 @@ TYPED_TEST(TestStringKernels, AsciiLower) { this->string_type(), "[\"aaazzæÆ&\", null, \"\", \"bbb\"]"); }

```c++
TEST(TestStringKernels, Utf8Upper32bitGrowth) {
  std::string str(0x, 'a');
  arrow::StringBuilder builder;
  // 0x7fff * 0x is the max a 32 bit string array can hold
  // since the utf8_upper kernel can grow it by 3/2, the max we should accept is
  // 0x7fff * 0x * 2/3 = 0x * 0x, so this should give us a CapacityError
  for (int64_t i = 0; i < 0x5556; i++) {
    ASSERT_OK(builder.Append(str));
  }
  std::shared_ptr<Array> array;
  arrow::Status st = builder.Finish(&array);
  const FunctionOptions* options = nullptr;
  EXPECT_RAISES_WITH_MESSAGE_THAT(CapacityError,
                                  testing::HasSubstr("Result might not fit"),
                                  CallFunction("utf8_upper", {array}, options));
}

TYPED_TEST(TestStringKernels, Utf8Upper) {
  this->CheckUnary("utf8_upper", "[\"aAazZæÆ&\", null, \"\", \"b\"]", this->string_type(),
                   "[\"AAAZZÆÆ&\", null, \"\", \"B\"]");

  // test varying encoding lengths and thus changing indices/offsets
  this->CheckUnary("utf8_upper", "[\"ɑɽⱤoW\", null, \"ıI\", \"b\"]", this->string_type(),
                   "[\"ⱭⱤⱤOW\", null, \"II\", \"B\"]");

  // ῦ to Υ͂ not supported
  // this->CheckUnary("utf8_upper", "[\"ῦɐɜʞȿ\"]", this->string_type(),
  //                  "[\"Υ͂ⱯꞫꞰⱾ\"]");

  // test maximum buffer growth
  this->CheckUnary("utf8_upper", "[\"\"]", this->string_type(), "[\"\"]");

  // Test replacing invalid data by ? (ὖ == \xe1\xbd\x96)
  this->CheckUnary("utf8_upper", "[\"ɑa\xFFɑ\", \"ɽ\xe1\xbdɽaa\"]", this->string_type(),
                   "[\"ⱭA?Ɑ\", \"Ɽ?ⱤAA\"]");
}

TYPED_TEST(TestStringKernels, Utf8Lower) {
  this->CheckUnary("utf8_lower", "[\"aAazZæÆ&\", null, \"\", \"b\"]", this->string_type(),
                   "[\"aaazzææ&\", null, \"\", \"b\"]");

  // test varying encoding lengths and thus changing indices/offsets
  this->CheckUnary("utf8_lower", "[\"ⱭɽⱤoW\", null, \"ıI\", \"B\"]", this->string_type(),
                   "[\"ɑɽɽow\", null, \"ıi\", \"b\"]");

  // ῦ to Υ͂ is not supported, but in principle the reverse is, but it would need
  // normalization
  // this->CheckUnary("utf8_lower", "[\"Υ͂ⱯꞫꞰⱾ\"]", this->string_type(),
  //                  "[\"ῦɐɜʞȿ\"]");

  // test maximum buffer growth
  this->CheckUnary("utf8_lower", "[\"\"]", this->string_type(), "[\"\"]");

  // Test replacing invalid data by ? (ὖ == \xe1\xbd\x96)
```

Review comment: I added a truncated encoded byte string of that, I'll clarify it in the comment.
[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446946562 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -30,6 +31,124 @@ [quoted diff hunk omitted; same lookup-table and `utf8_decode` excerpt as the earlier comment on this file, ending at the `REPLACEMENT_CHAR` return] Review comment: Well, I'm not sure why it would make sense to ignore invalid input data here, unless not ignoring it has a significant cost (which sounds unlikely).
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446958717 ## File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc ## @@ -81,5 +147,40 @@ TYPED_TEST(TestStringKernels, StrptimeDoesNotProvideDefaultOptions) { ASSERT_RAISES(Invalid, CallFunction("strptime", {input})); }

```c++
TEST(TestStringKernels, UnicodeLibraryAssumptions) {
  uint8_t output[4];
  for (utf8proc_int32_t codepoint = 0x100; codepoint < 0x110000; codepoint++) {
    utf8proc_ssize_t encoded_nbytes = utf8proc_encode_char(codepoint, output);
    utf8proc_int32_t codepoint_upper = utf8proc_toupper(codepoint);
    utf8proc_ssize_t encoded_nbytes_upper = utf8proc_encode_char(codepoint_upper, output);
    if (encoded_nbytes == 2) {
      EXPECT_LE(encoded_nbytes_upper, 3)
          << "Expected the upper case codepoint for a 2 byte encoded codepoint to be "
             "encoded in maximum 3 bytes, not "
          << encoded_nbytes_upper;
    }
    if (encoded_nbytes == 3) {
```

Review comment: This code tests the assumption that the byte length growth for case mapping can be at most 3/2, which is what we use for pre-allocation of the buffers. Unicode codepoints encoded in 4 bytes can only be 4 bytes or fewer when upper cased or lower cased, so we don't need to check them here. We can also skip the test for the `encoded_nbytes == 3` case, since it will not matter. I'll add a comment explaining why we do the test.
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446990168 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -30,6 +31,124 @@ [quoted diff hunk omitted; same `utf8_encode` excerpt as the earlier comment on this file, review thread on the final `else` branch] Review comment: Agree, this is not useful. Now also replacing that by '?'. In principle it should not happen (it can only happen when there is an internal bug, e.g. uppercase gives the wrong codepoint), but now it's more informative.
[GitHub] [arrow] wesm commented on pull request #7575: ARROW-8671: [C++][FOLLOWUP] Fix ASAN/UBSAN bug found with IPC fuzz testing files
wesm commented on pull request #7575: URL: https://github.com/apache/arrow/pull/7575#issuecomment-651148456 @pitrou the issue was caught by one of the existing fuzz files -- I had accidentally merged the PR without seeing the ASAN failure
[GitHub] [arrow] rymurr commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
rymurr commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r446921213 ## File path: java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java ## @@ -17,33 +17,107 @@ package org.apache.arrow.memory.rounding; -import java.lang.reflect.Field; - -import org.apache.arrow.memory.NettyAllocationManager; import org.apache.arrow.memory.util.CommonUtil; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import io.netty.util.internal.SystemPropertyUtil; /** * The default rounding policy. That is, if the requested size is within the chunk size, * the rounded size will be the next power of two. Otherwise, the rounded size * will be identical to the requested size. */ public class DefaultRoundingPolicy implements RoundingPolicy { - + private static final Logger logger = LoggerFactory.getLogger(DefaultRoundingPolicy.class); public final long chunkSize; /** - * The singleton instance. + * The variables here and the static block calculates teh DEFAULT_CHUNK_SIZE. + * + * + * It was copied from {@link io.netty.buffer.PooledByteBufAllocator}. 
+ * */ - public static final DefaultRoundingPolicy INSTANCE = new DefaultRoundingPolicy(); + private static final int MIN_PAGE_SIZE = 4096; + private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 1) / 2); + private static final long DEFAULT_CHUNK_SIZE; + private static final int DEFAULT_PAGE_SIZE; + private static final int DEFAULT_MAX_ORDER; - private DefaultRoundingPolicy() { + + static { +int defaultPageSize = SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192); +Throwable pageSizeFallbackCause = null; try { - Field field = NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE"); - field.setAccessible(true); - chunkSize = (Long) field.get(null); -} catch (Exception e) { - throw new RuntimeException("Failed to get chunk size from allocation manager"); + validateAndCalculatePageShifts(defaultPageSize); +} catch (Throwable t) { + pageSizeFallbackCause = t; + defaultPageSize = 8192; +} +DEFAULT_PAGE_SIZE = defaultPageSize; + +int defaultMaxOrder = SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11); +Throwable maxOrderFallbackCause = null; +try { + validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, defaultMaxOrder); +} catch (Throwable t) { + maxOrderFallbackCause = t; + defaultMaxOrder = 11; +} +DEFAULT_MAX_ORDER = defaultMaxOrder; +DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, DEFAULT_MAX_ORDER); +if (logger.isDebugEnabled()) { + if (pageSizeFallbackCause == null) { +logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", DEFAULT_PAGE_SIZE); + } else { +logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", DEFAULT_PAGE_SIZE, pageSizeFallbackCause); + } + if (maxOrderFallbackCause == null) { +logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", DEFAULT_MAX_ORDER); + } else { +logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", DEFAULT_MAX_ORDER, maxOrderFallbackCause); + } Review comment: done
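The static block above derives DEFAULT_CHUNK_SIZE as pageSize shifted left by maxOrder (8192 << 11, i.e. 16 MiB with the defaults), falling back to the default values when the system properties fail validation. The policy itself rounds a request up to the next power of two while it fits within the chunk size, and passes larger requests through unchanged. A hedged Python sketch of that rounding rule (names are illustrative, not the Java API):

```python
# Defaults mirroring the static block above.
DEFAULT_PAGE_SIZE = 8192
DEFAULT_MAX_ORDER = 11
DEFAULT_CHUNK_SIZE = DEFAULT_PAGE_SIZE << DEFAULT_MAX_ORDER  # 8192 << 11 = 16 MiB

def round_allocation(requested: int, chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    """Round up to the next power of two while within the chunk size;
    larger requests pass through unchanged."""
    if requested > chunk_size:
        return requested
    if requested <= 1:
        return requested
    # next power of two >= requested
    return 1 << (requested - 1).bit_length()
```

Rounding small requests to powers of two reduces fragmentation in the pooled allocator; beyond the chunk size the allocation bypasses the pool, so rounding would only waste memory.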
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r446934218 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## +static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) { Review comment: Great, didn't see that, will move it.
[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-651140872 Actually @jorisvandenbossche, I agree we should probably just add in the batch_size argument (with a sensible default) to those other methods. Took me a while to understand what you were getting at, but by providing those parameters we avoid the issue where `iter_batches()` calls could mess with the underlying parameters when calling the other methods. I will implement those soon. I would especially welcome input on what the docstring should be for the `batch_size` parameter for those other methods (what is the user-facing impact of changing it?) and what the default value should be.
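The `batch_size` under discussion controls how many rows each yielded record batch holds: smaller values bound peak memory, larger values amortize per-batch overhead. A rough stand-in for how an `iter_batches()`-style generator slices a row range (plain Python, not the actual pyarrow implementation; the 65536 default is illustrative):

```python
def iter_batches(num_rows: int, batch_size: int = 65536):
    """Yield (start_row, row_count) pairs covering num_rows in batch_size chunks."""
    for start in range(0, num_rows, batch_size):
        # the final batch may hold fewer than batch_size rows
        yield start, min(batch_size, num_rows - start)
```

The user-facing impact of `batch_size` is exactly this slicing: every batch except possibly the last has `batch_size` rows, and the batches together cover the file once.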
[GitHub] [arrow] wesm closed pull request #7568: ARROW-9241: [C++] Add forward compatibility check for Decimal bit width
wesm closed pull request #7568: URL: https://github.com/apache/arrow/pull/7568
[GitHub] [arrow] wesm commented on a change in pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata
wesm commented on a change in pull request #7571: URL: https://github.com/apache/arrow/pull/7571#discussion_r447004437 ## File path: cpp/src/arrow/ipc/options.h ## @@ -66,6 +67,10 @@ struct ARROW_EXPORT IpcWriteOptions { /// like compression bool use_threads = true; + /// \brief Format version to use for IPC messages and their + /// metadata. Presently using V4 version (readable by v0.8.0 and later). + MetadataVersion metadata_version = MetadataVersion::V4; Review comment: see https://issues.apache.org/jira/browse/ARROW-9265
[GitHub] [arrow] xhochy commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
xhochy commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651162033 > The manylinux import test [fails](https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=14018=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=3367cfe7-6aa5-5a88-c888-e6814d74f6eb=248) after removing `libz.so`, because it is necessary for some standard library modules. > @xhochy @kszucs Any idea how to circumvent this? Oh, where is this removal happening? Do we do that in our scripts or is this `manylinux1`?
[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651165626 @pitrou many thanks for the review. I've implemented all your suggestions except: * Raising an error on invalid utf8 data (see comment) * Having a benchmark run on non-ascii codepoints (I think we want to do this separately from this PR, but it's an important point). Btw, I wasn't aware of existing utf8 code already in Arrow. The existing decoder based on http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. Very interesting work, but unfortunately it led to a performance regression (~50->30 M/s), which I'm surprised about actually. Maybe worth comparing again when we have a benchmark with non-ascii codepoints. @wesm I hope this is ready to go
[GitHub] [arrow] rjzamora commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values
rjzamora commented on pull request #7546: URL: https://github.com/apache/arrow/pull/7546#issuecomment-651177576 Thanks for the great work here @bkietz ! This is wonderful - Dask uses the min/max statistics to calculate `divisions`, so this functionality is definitely necessary. *A note on other (less-critical, but useful) statistics*: Dask also uses the `"total_byte_size"` statistics (for the full row-group, not each column) to aggregate partitions before reading in any data. There is also a plan to use the `"num-rows"` statistics when the user executes `len(ddf)` (to avoid loading any data). **How difficult would it be to add/expose these additional row-group statistics?** Again, this is much less of a "blocker" for initial integration with Dask, but these are likely things we will want to add in eventually. cc @jorisvandenbossche
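The Dask uses described above need only row-group metadata, never the data itself: per-column min/max statistics give sorted divisions, total_byte_size drives partition aggregation, and summing num_rows answers `len(ddf)` without a read. A small sketch over plain dicts standing in for row-group metadata (the field names mirror the Parquet statistics discussed; the helpers themselves are hypothetical, not the Dask or pyarrow API):

```python
# Hypothetical metadata for two row groups, min/max taken on the index column.
row_groups = [
    {"num_rows": 1000, "total_byte_size": 8_000, "min": 0,    "max": 999},
    {"num_rows": 1000, "total_byte_size": 8_500, "min": 1000, "max": 1999},
]

def dataset_len(row_groups):
    # len(ddf) from metadata alone -- no column data is loaded
    return sum(rg["num_rows"] for rg in row_groups)

def divisions(row_groups):
    # Dask-style divisions: the start of each row group plus the final max
    divs = [rg["min"] for rg in row_groups]
    divs.append(row_groups[-1]["max"])
    return divs
```

Because everything here comes from the footer metadata, exposing these extra statistics on RowGroupInfo would let Dask plan partitions before touching any data pages.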
[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
wesm commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651186700 If you're installing Brotli in any of the packaging setups, there may be a scenario where there is both the shared AND static library -- in that case there would be an issue. We need to see why the manylinux* builds have `libbrotli*.so` in them -- if for some reason Brotli's build is insubordinate and churlish we might have to add a flag to do a hard switch between shared/static like we do with ZSTD
[GitHub] [arrow] kszucs closed pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location
kszucs closed pull request #7376: URL: https://github.com/apache/arrow/pull/7376
[GitHub] [arrow] nealrichardson commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
nealrichardson commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651182567 @wesm how can I help/what should I look for?
[GitHub] [arrow] nealrichardson commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
nealrichardson commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651188772 IIUC we're ok: * Windows: no brotli: https://github.com/apache/arrow/blob/master/ci/scripts/PKGBUILD * macOS: no brotli: https://github.com/apache/arrow/blob/master/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb * Linux: maybe brotli, but it's all static build anyway: https://github.com/apache/arrow/blob/master/r/inst/build_arrow_static.sh
[GitHub] [arrow] jba commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location
jba commented on pull request #7376: URL: https://github.com/apache/arrow/pull/7376#issuecomment-651164902 There's no feasible way to test this, unfortunately. I verified both my suggested changes by forking the repo, making the edits, and running an internal tool to verify that they worked. The external alternative would be to `go get` the package at the latest release to inform the Go module proxy, wait a few minutes, then check for it on pkg.go.dev. That would be a complicated test to automate, and would almost certainly be flaky, with little benefit.
[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
pitrou commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651159316 The manylinux import test [fails](https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=14018=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=3367cfe7-6aa5-5a88-c888-e6814d74f6eb=248) after removing `libz.so`, because it is necessary for some standard library modules. @xhochy @kszucs Any idea how to circumvent this?
[GitHub] [arrow] kszucs commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location
kszucs commented on pull request #7376: URL: https://github.com/apache/arrow/pull/7376#issuecomment-651170232 Thanks @jba for the update. Merging it then.
[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
pitrou commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651181336 @xhochy In https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/manylinux-test.sh#L33 . I'm sure there's a better way to do that (e.g. spawn a minimal Python docker image).
[GitHub] [arrow] wesm commented on pull request #7559: ARROW-9247: [Python] Expose total_values_length functions on BinaryArray, LargeBinaryArray
wesm commented on pull request #7559: URL: https://github.com/apache/arrow/pull/7559#issuecomment-651187351 cc @pitrou or @jorisvandenbossche for review
[GitHub] [arrow] wesm commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
wesm commented on a change in pull request #7539: URL: https://github.com/apache/arrow/pull/7539#discussion_r447030010 ## File path: cpp/src/arrow/tensor/csf_converter.cc ## @@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector& coord, const std::vector -class SparseCSFTensorConverter { - public: - using NumericTensorType = NumericTensor; - using value_type = typename NumericTensorType::value_type; +class SparseCSFTensorConverter : private SparseTensorConverterMixin { + using SparseTensorConverterMixin::AssignIndex; + using SparseTensorConverterMixin::IsNonZero; - SparseCSFTensorConverter(const NumericTensorType& tensor, + public: + SparseCSFTensorConverter(const Tensor& tensor, const std::shared_ptr& index_value_type, MemoryPool* pool) : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {} - template Status Convert() { -using c_index_value_type = typename IndexValueType::c_type; - RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits::max())); +RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape())); + +const int index_elsize = +checked_cast(*index_value_type_).bit_width() / CHAR_BIT; +const int value_elsize = +checked_cast(*tensor_.type()).bit_width() / CHAR_BIT; Review comment: Could be helper function, maybe we can put an `internal::GetByteWidth` function in `arrow/type.h` that does this? ## File path: cpp/src/arrow/tensor/csx_converter.cc ## @@ -0,0 +1,245 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/tensor/converter.h" + +#include +#include +#include +#include + +#include "arrow/buffer.h" +#include "arrow/status.h" +#include "arrow/type.h" +#include "arrow/util/checked_cast.h" +#include "arrow/visitor_inline.h" + +namespace arrow { + +class MemoryPool; + +namespace internal { +namespace { + +// -- +// SparseTensorConverter for SparseCSRIndex + +class SparseCSXMatrixConverter : private SparseTensorConverterMixin { + using SparseTensorConverterMixin::AssignIndex; + using SparseTensorConverterMixin::IsNonZero; + + public: + SparseCSXMatrixConverter(SparseMatrixCompressedAxis axis, const Tensor& tensor, + const std::shared_ptr& index_value_type, + MemoryPool* pool) + : axis_(axis), tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {} + + Status Convert() { +RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape())); + +const int index_elsize = +checked_cast(*index_value_type_).bit_width() / CHAR_BIT; Review comment: Can you factor this "get byte width" operation into a helper function? ## File path: cpp/src/arrow/tensor/csf_converter.cc ## @@ -148,84 +161,130 @@ class SparseCSFTensorConverter { return Status::OK(); } -#define CALL_TYPE_SPECIFIC_CONVERT(TYPE_CLASS) \ - case TYPE_CLASS##Type::type_id: \ -return Convert(); - - Status Convert() { -switch (index_value_type_->id()) { - ARROW_GENERATE_FOR_ALL_INTEGER_TYPES(CALL_TYPE_SPECIFIC_CONVERT); - // LCOV_EXCL_START: The following invalid causes program failure. 
- default: -return Status::TypeError("Unsupported SparseTensor index value type"); -// LCOV_EXCL_STOP -} - } - -#undef CALL_TYPE_SPECIFIC_CONVERT - std::shared_ptr sparse_index; std::shared_ptr data; private: - const NumericTensorType& tensor_; + const Tensor& tensor_; const std::shared_ptr& index_value_type_; MemoryPool* pool_; +}; - template - inline Status CheckMaximumValue(const c_value_type type_max) const { -auto max_dimension = -*std::max_element(tensor_.shape().begin(), tensor_.shape().end()); -if (static_cast(type_max) < max_dimension) { - // LCOV_EXCL_START: The following invalid causes program failure. - return Status::Invalid("The bit width of the index value type is too small"); - // LCOV_EXCL_STOP -} -return Status::OK(); +class
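Both review comments above ask to factor the repeated `bit_width() / CHAR_BIT` expression into a shared `internal::GetByteWidth` helper; the converter also has to verify that the largest tensor dimension fits in the chosen index type (the CheckMaximumValue logic in the diff). A Python sketch of those two checks (the function names are illustrative, not the Arrow C++ API):

```python
CHAR_BIT = 8

def get_byte_width(bit_width: int) -> int:
    # the repeated bit_width() / CHAR_BIT expression, factored into one helper
    return bit_width // CHAR_BIT

def check_index_maximum_value(index_bit_width: int, shape) -> None:
    # a signed integer index of N bits holds values up to 2**(N-1) - 1;
    # every dimension of the tensor shape must be representable in it
    type_max = 2 ** (index_bit_width - 1) - 1
    if max(shape) > type_max:
        raise ValueError("The bit width of the index value type is too small")
```

Centralizing the byte-width computation is what lets the converter work on type-erased `Tensor` values instead of being templated on the numeric tensor type, which is the code-size reduction this PR is after.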
[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
pitrou commented on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651242409 @kou and @xhochy your advice would be welcome.
[GitHub] [arrow] pitrou opened a new pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
pitrou opened a new pull request #7581: URL: https://github.com/apache/arrow/pull/7581 Also add build options and preprocessor constants to represent git identification and package kind (e.g. "manylinux1").
[GitHub] [arrow] lidavidm opened a new pull request #7582: ARROW-8190: [FlightRPC][C++] Expose IPC options
lidavidm opened a new pull request #7582: URL: https://github.com/apache/arrow/pull/7582 - Python is not covered as I'm not sure how best to expose these structs to Python. - Java is not covered as it doesn't use IpcOption at all currently; I'd rather hold off and see how the metadata change is implemented before trying to bind it. (The legacy/modern metadata format doesn't come into play since Flight uses Protobuf and not an IPC message per se.)
[GitHub] [arrow] github-actions[bot] commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
github-actions[bot] commented on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651248175 https://issues.apache.org/jira/browse/ARROW-6521
[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651257793 We still have 2 failures, one might need a restart (travis / no output), the other is still a linker error: ``` C:/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe: ../windows/arrow-0.17.1.9000/lib/i386/libarrow.a(scalar_string.cc.obj):(.text+0x45df): undefined reference to `utf8proc_tolower' ```
[GitHub] [arrow] pitrou commented on a change in pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files
pitrou commented on a change in pull request #7560: URL: https://github.com/apache/arrow/pull/7560#discussion_r447134548 ## File path: ci/scripts/integration_arrow.sh ## @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +set -ex + +arrow_dir=${1} +source_dir=${1}/cpp +build_dir=${2}/cpp +gold_dir_0_14_1=${1}/testing/data/arrow-ipc-stream/integration/0.14.1 + +pip install -e /arrow/dev/archery Review comment: Use `$source_dir` here?