[GitHub] [arrow] mrkn commented on a change in pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox
mrkn commented on a change in pull request #7477: URL: https://github.com/apache/arrow/pull/7477#discussion_r443990095 ## File path: cpp/src/arrow/sparse_tensor_test.cc ## @@ -49,7 +49,10 @@ static inline void AssertCOOIndex(const std::shared_ptr& sidx, const int } } -TE

[GitHub] [arrow] mrkn commented on a change in pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox
mrkn commented on a change in pull request #7477: URL: https://github.com/apache/arrow/pull/7477#discussion_r443990095 ## File path: cpp/src/arrow/sparse_tensor_test.cc ## @@ -49,7 +49,10 @@ static inline void AssertCOOIndex(const std::shared_ptr& sidx, const int } } -TE

[GitHub] [arrow] mrkn commented on a change in pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox
mrkn commented on a change in pull request #7477: URL: https://github.com/apache/arrow/pull/7477#discussion_r443982716 ## File path: python/pyarrow/tensor.pxi ## @@ -339,6 +350,15 @@ shape: {0.shape}""".format(self) def non_zero_length(self): return self.stp.non_z

[GitHub] [arrow] cyb70289 commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
cyb70289 commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647914768 > I'm refactoring to nix util::optional. I'm too tired to finish it tonight so I'll work on it tomorrow morning. If the perf regression isn't gone I'll rewrite the sort kernels.

[GitHub] [arrow] emkornfield commented on pull request #6402: ARROW-7831: [Java] do not allocate a new offset buffer if the slice starts at 0 since the relative offset pointer would be unchanged

2020-06-22 Thread GitBox
emkornfield commented on pull request #6402: URL: https://github.com/apache/arrow/pull/6402#issuecomment-647907608 @jacques-n are you happy with the changes? This is an automated message from the Apache Git Service. To respon

[GitHub] [arrow] emkornfield commented on a change in pull request #7321: ARROW-8985: [Format][DONOTMERGE] RFC Proposed Decimal::byteWidth field for forward compatibility

2020-06-22 Thread GitBox
emkornfield commented on a change in pull request #7321: URL: https://github.com/apache/arrow/pull/7321#discussion_r443958283 ## File path: format/Schema.fbs ## @@ -134,11 +134,20 @@ table FixedSizeBinary { table Bool { } +/// Exact decimal value represented as an integer v

[GitHub] [arrow] emkornfield commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-22 Thread GitBox
emkornfield commented on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-647903916 It would be great if @jacques-n could confirm he is OK with these changes. I believe on a previous PR the extra indirection was a concern. --

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647888437 FWIW the performance issue seems to be more pronounced on gcc than clang, here is the benchmark comparison on my machine with clang-8 ```

[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647888437 FWIW the performance issue seems to be more pronounced on gcc than clang, here is the benchmark comparison on my machine with clang-8 ```

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647887135 I'm refactoring to nix util::optional. I'm too tired to finish it tonight so I'll work on it tomorrow morning. --

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647887135 I'm refactoring to nix util::optional. I'm too tired to finish it tonight so I'll work on it tomorrow morning. If the perf regression isn't gone I'll rewrite the sort kernels.

[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647887135 I'm refactoring to nix util::optional. This is an automated message from the Apache Git Service. To respond to the m

[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647886669 Also, I don't really understand the use of `util::optional` in these templates. The user should pass separate lambdas for the not-null and null cases ---

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647884747 Sorting seems too important to leave it to these relatively complex templates (for example, just after determining that a value is not null, the `optional` value is checked wh

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647884747 Sorting seems too important to leave it to these relatively complex templates, I would suggest implementing the counting sort without using `VisitArrayDataInline`. I'm happy t

[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647884747 Sorting seems too important to leave it to these relatively complex templates, I would suggest implementing the counting sort without using `VisitArrayDataInline` -

[GitHub] [arrow] cyb70289 commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
cyb70289 commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647884195 I see big performance drop from some counting sort cases, also tested on my local machine. Should be related to these visitor code: https://github.com/apache/arrow/blob/master/cp

[GitHub] [arrow] wesm commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
wesm commented on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647883524 I improved the output to show the `state.counters` stuff ``` benchmark baselinecontender change % count

[GitHub] [arrow] ursabot commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
ursabot commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647883517 [AMD64 Ubuntu 18.04 C++ Benchmark (#114347)](https://ci.ursalabs.org/#builders/73/builds/89) builder has been succeeded. Revision: dbd166df749e73cbf7c1ec0c6cfa5837280aa32d

[GitHub] [arrow] cyb70289 commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
cyb70289 commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647878260 @ursabot benchmark --suite-filter=arrow-compute-vector-sort-benchmark This is an automated message from the Apach

[GitHub] [arrow] mrkn commented on a change in pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox
mrkn commented on a change in pull request #7477: URL: https://github.com/apache/arrow/pull/7477#discussion_r443928439 ## File path: python/pyarrow/tensor.pxi ## @@ -199,7 +202,13 @@ shape: {0.shape}""".format(self) for x in dim_names: c_dim_names.

[GitHub] [arrow] mrkn commented on a change in pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox
mrkn commented on a change in pull request #7477: URL: https://github.com/apache/arrow/pull/7477#discussion_r443922882 ## File path: python/pyarrow/tensor.pxi ## @@ -199,7 +202,13 @@ shape: {0.shape}""".format(self) for x in dim_names: c_dim_names.

[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox
mrkn commented on pull request #7477: URL: https://github.com/apache/arrow/pull/7477#issuecomment-647860707 > I don't understand the PR's motivation. You say that: > > > To support the integration with scipy.sparse.coo_matrix, it is necessary to add a flag in SparseCOOIndex > > bu

[GitHub] [arrow] wesm edited a comment on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647860226 @kou utf8proc should only be used in a small number of compilation units, so what do you think about just using `set_source_files_properties(... PROPERTIES COMPILE_DEFINITIONS

[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox
wesm commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647860226 @kou utf8proc should only be used in a small number of compilation units, so what do you think about just using `set_target_properties(... PROPERTIES COMPILE_DEFINITIONS UTF8PROC_STA

[GitHub] [arrow] wesm commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-22 Thread GitBox
wesm commented on a change in pull request #7520: URL: https://github.com/apache/arrow/pull/7520#discussion_r443914793 ## File path: docs/source/developers/contributing.rst ## @@ -17,53 +17,73 @@ .. _contributing: -*** -Contribution Guidelines -

[GitHub] [arrow] github-actions[bot] commented on pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7522: URL: https://github.com/apache/arrow/pull/7522#issuecomment-647857111 https://issues.apache.org/jira/browse/ARROW-8801 This is an automated message from the Apache Git Serv

[GitHub] [arrow] wesm commented on a change in pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-22 Thread GitBox
wesm commented on a change in pull request #7522: URL: https://github.com/apache/arrow/pull/7522#discussion_r443913968 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1269,13 +1283,18 @@ class DatetimeTZWriter : public DatetimeNanoWriter { : DatetimeNanoWrit

[GitHub] [arrow] wesm opened a new pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-22 Thread GitBox
wesm opened a new pull request #7522: URL: https://github.com/apache/arrow/pull/7522 An array object was failing to be decref'd on the DatetimeTZ conversion path. The code is slightly complicated by the different calling reference ownership semantics of the Array/ChunkedArray conversion pa

[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian

2020-06-22 Thread GitBox
kiszk commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r443912778 ## File path: cpp/src/arrow/ipc/read_write_test.cc ## @@ -409,6 +414,32 @@ class IpcTestFixture : public io::MemoryMapFixture, public ExtensionTypesMixin {

[GitHub] [arrow] kou closed pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox
kou closed pull request #7509: URL: https://github.com/apache/arrow/pull/7509 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

[GitHub] [arrow] kou commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox
kou commented on pull request #7509: URL: https://github.com/apache/arrow/pull/7509#issuecomment-647843737 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and us

[GitHub] [arrow] github-actions[bot] commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7509: URL: https://github.com/apache/arrow/pull/7509#issuecomment-647836070 Revision: cb16452e499fb61e6ec231fed4c2ddc7fe88d11b Submitted crossbow builds: [ursa-labs/crossbow @ actions-360](https://github.com/ursa-labs/crossbow/branches/a

[GitHub] [arrow] kou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox
kou commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647834815 The change doesn't add `UTF8PROC_STATIC` definition... This is an automated message from the Apache Git Service. To re

[GitHub] [arrow] kou commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-22 Thread GitBox
kou commented on pull request #7452: URL: https://github.com/apache/arrow/pull/7452#issuecomment-647834442 TODO: * Need the `UTF8PROC_STATIC` definition with bundled utf8proc * Need to add include path for `utf8proc.h` -

[GitHub] [arrow] kou commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox
kou commented on pull request #7509: URL: https://github.com/apache/arrow/pull/7509#issuecomment-647833733 @github-actions crossbow submit debian-buster-arm64 This is an automated message from the Apache Git Service. To respo

[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647825256 FWIW on the "gcc/clang perf discussion", clang also shows performance benefits and limited downside ``` benchmark baselinecontender change

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647825256 FWIW on the "gcc/clang perf discussion", clang-8 also shows performance benefits and limited downside ``` benchmark baselinecontende

[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647823397 Also on the binary size, these changes add about 75KB to libarrow.so. My guess is the difference is mostly coming from code inlining for the all-null case (which wasn't split out bef

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647821487 Here's a benchmark run with gcc-8 ``` --- BenchmarkTime

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647821487 Here's a benchmark run with gcc-8 ``` --- BenchmarkTime CPU Iteration

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647821487 Here's a benchmark run with gcc-8 ``` --- BenchmarkTime CPU Iteration

[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647821487 Here's a benchmark run with gcc-8 ``` --- BenchmarkTime CPU Iterations --

[GitHub] [arrow] github-actions[bot] commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7521: URL: https://github.com/apache/arrow/pull/7521#issuecomment-647820320 https://issues.apache.org/jira/browse/ARROW-9210 This is an automated message from the Apache Git Serv

[GitHub] [arrow] wesm opened a new pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox
wesm opened a new pull request #7521: URL: https://github.com/apache/arrow/pull/7521 This significantly speeds up processing of mostly-not-null or mostly-null data, while having almost no overhead for the other scenarios where you rarely have a word-sized run of all-not-null or all-null-da

[GitHub] [arrow] kszucs edited a comment on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
kszucs edited a comment on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647798092 Using pandas is not a problem, but the results cannot be improved much other than sorting the table. This i

[GitHub] [arrow] kszucs commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
kszucs commented on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647798092 Using pandas is not a problem, but other than sorting the results we cannot really improve the look and feel. Thi

[GitHub] [arrow] github-actions[bot] commented on pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7520: URL: https://github.com/apache/arrow/pull/7520#issuecomment-647796781 https://issues.apache.org/jira/browse/ARROW-9189 This is an automated message from the Apache Git Serv

[GitHub] [arrow] nealrichardson opened a new pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-22 Thread GitBox
nealrichardson opened a new pull request #7520: URL: https://github.com/apache/arrow/pull/7520 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abov

[GitHub] [arrow] kou closed pull request #7508: ARROW-9202: [GLib] Add GArrowDatum

2020-06-22 Thread GitBox
kou closed pull request #7508: URL: https://github.com/apache/arrow/pull/7508 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

[GitHub] [arrow] github-actions[bot] commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7509: URL: https://github.com/apache/arrow/pull/7509#issuecomment-647794119 Revision: 56a5e1ce92e541cfaac827d5334cbe8981b1b74b Submitted crossbow builds: [ursa-labs/crossbow @ actions-359](https://github.com/ursa-labs/crossbow/branches/a

[GitHub] [arrow] kou commented on pull request #7508: ARROW-9202: [GLib] Add GArrowDatum

2020-06-22 Thread GitBox
kou commented on pull request #7508: URL: https://github.com/apache/arrow/pull/7508#issuecomment-647793770 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and us

[GitHub] [arrow] kou commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox
kou commented on pull request #7509: URL: https://github.com/apache/arrow/pull/7509#issuecomment-647793370 @github-actions crossbow submit debian-buster-arm64 This is an automated message from the Apache Git Service. To respo

[GitHub] [arrow] kou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox
kou commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647791287 Ah, we need the `UTF8PROC_STATIC` definition. Does this work? ```diff diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.

[GitHub] [arrow] alippai commented on pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-22 Thread GitBox
alippai commented on pull request #7517: URL: https://github.com/apache/arrow/pull/7517#issuecomment-647789155 Does "for testing and benchmarking" mean that it's not optimal to store arrow/parquet files on Minio for production workloads? ---

[GitHub] [arrow] kou commented on a change in pull request #7507: ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian

2020-06-22 Thread GitBox
kou commented on a change in pull request #7507: URL: https://github.com/apache/arrow/pull/7507#discussion_r443832986 ## File path: cpp/src/arrow/ipc/read_write_test.cc ## @@ -409,6 +414,32 @@ class IpcTestFixture : public io::MemoryMapFixture, public ExtensionTypesMixin {

[GitHub] [arrow] wesm edited a comment on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647758158 I’m sort of -1 on using anything but pandas for data munging and data presentation in our tooling. It’s not a very large dependency and has everything we need. FWIW, the curre

[GitHub] [arrow] wesm commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-22 Thread GitBox
wesm commented on pull request #7452: URL: https://github.com/apache/arrow/pull/7452#issuecomment-647763853 @xhochy I'm fine with merging this -- it's a non-mandatory dependency right (sorry have not reviewed the patch yet and the PR description does not say)?

[GitHub] [arrow] github-actions[bot] commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7509: URL: https://github.com/apache/arrow/pull/7509#issuecomment-647759816 Revision: 77bc06d2ff96b41dd1e37838601f92fd0e95e7e2 Submitted crossbow builds: [ursa-labs/crossbow @ actions-358](https://github.com/ursa-labs/crossbow/branches/a

[GitHub] [arrow] kou commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox
kou commented on pull request #7509: URL: https://github.com/apache/arrow/pull/7509#issuecomment-647758878 @github-actions crossbow submit debian-buster-arm64 This is an automated message from the Apache Git Service. To respo

[GitHub] [arrow] wesm commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
wesm commented on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647758158 I’m sort of -1 on using anything but pandas for data munging and data presentation in our tooling. It’s not a very large dependency and has everything we need. ---

[GitHub] [arrow] github-actions[bot] commented on pull request #7519: ARROW-9153: [C++][Python] Refactor scalar bindings [WIP]

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-647754455 https://issues.apache.org/jira/browse/ARROW-9153 This is an automated message from the Apache Git Serv

[GitHub] [arrow] kszucs opened a new pull request #7519: ARROW-9153: [C++][Python] Refactor scalar bindings [WIP]

2020-06-22 Thread GitBox
kszucs opened a new pull request #7519: URL: https://github.com/apache/arrow/pull/7519 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647711867 @xhochy I see at least an `undefined reference to `_imp__utf8proc_toupper'` in the MinGW builds, hope that helps.

[GitHub] [arrow] github-actions[bot] commented on pull request #7518: ARROW-9138: [Docs][Format] Make sure format version is hard coded in the docs

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7518: URL: https://github.com/apache/arrow/pull/7518#issuecomment-647699885 https://issues.apache.org/jira/browse/ARROW-9138 This is an automated message from the Apache Git Serv

[GitHub] [arrow] github-actions[bot] commented on pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7517: URL: https://github.com/apache/arrow/pull/7517#issuecomment-647699887 https://issues.apache.org/jira/browse/ARROW-1682 This is an automated message from the Apache Git Serv

[GitHub] [arrow] fsaintjacques commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
fsaintjacques commented on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647698721 ursabot uses `tabulate` which I think is smaller dependencies. This is an automated message from the Apache

[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-22 Thread GitBox
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-647699240 Here's with clang, even less of an issue there it seems ``` benchmark baselinecontender change % regression 32 BM_R

[GitHub] [arrow] wesm closed pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox
wesm closed pull request #7506: URL: https://github.com/apache/arrow/pull/7506 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the s

[GitHub] [arrow] wesm commented on pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox
wesm commented on pull request #7506: URL: https://github.com/apache/arrow/pull/7506#issuecomment-647696604 +1. Appveyor build https://ci.appveyor.com/project/wesm/arrow/builds/33669506 This is an automated message from the A

[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-22 Thread GitBox
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-647694302 Here's what I see with gcc-8: ``` benchmark baseline contender change % regression 1 BM_ReadColumn/1/1 1.336

[GitHub] [arrow] pitrou closed pull request #7504: ARROW-9193: [C++] Avoid spurious intermediate string copy in ToDateHolder

2020-06-22 Thread GitBox
pitrou closed pull request #7504: URL: https://github.com/apache/arrow/pull/7504 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [arrow] nealrichardson opened a new pull request #7518: ARROW-9138: [Docs][Format] Make sure format version is hard coded in the docs

2020-06-22 Thread GitBox
nealrichardson opened a new pull request #7518: URL: https://github.com/apache/arrow/pull/7518 Since after 1.0, the library versions will continue to increase but the format will remain at 1.0 until it is modified, we need to version the specification docs separately from the library versi

[GitHub] [arrow] fsaintjacques opened a new pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-22 Thread GitBox
fsaintjacques opened a new pull request #7517: URL: https://github.com/apache/arrow/pull/7517 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [arrow] projjal commented on pull request #7504: ARROW-9193: [C++] Avoid spurious intermediate string copy in ToDateHolder

2020-06-22 Thread GitBox
projjal commented on pull request #7504: URL: https://github.com/apache/arrow/pull/7504#issuecomment-647691097 > Any reason not to use a `util::string_view`? Since the toDate function here takes a char array and just passes it onto the parseTimestamp method, using a util::string_view

[GitHub] [arrow] nealrichardson commented on pull request #7503: ARROW-4429: [Doc] Add Git conventions to contributing guidelines

2020-06-22 Thread GitBox
nealrichardson commented on pull request #7503: URL: https://github.com/apache/arrow/pull/7503#issuecomment-647691008 Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on

[GitHub] [arrow] nealrichardson closed pull request #7503: ARROW-4429: [Doc] Add Git conventions to contributing guidelines

2020-06-22 Thread GitBox
nealrichardson closed pull request #7503: URL: https://github.com/apache/arrow/pull/7503 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to g

[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-22 Thread GitBox
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-647690792 I rebased. I'm going to look at the benchmarks locally and then probably merge this so further performance work (including comparing with BitBlockCounter) can be pursued as follow up

[GitHub] [arrow] xhochy commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-22 Thread GitBox
xhochy commented on pull request #7452: URL: https://github.com/apache/arrow/pull/7452#issuecomment-647688900 @maartenbreddels Feel free to rebase again, this patch is ready for a merge. @wesm Should we merge this as-is or is delay it until @maartenbreddels branch is ready? ---

[GitHub] [arrow] nealrichardson commented on a change in pull request #7435: ARROW-8779: [R] Implement conversion to List

2020-06-22 Thread GitBox
nealrichardson commented on a change in pull request #7435: URL: https://github.com/apache/arrow/pull/7435#discussion_r443735874 ## File path: r/src/array_from_vector.cpp ## @@ -201,6 +202,67 @@ struct VectorToArrayConverter { return Status::OK(); } + template + ar

[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-22 Thread GitBox
nealrichardson commented on a change in pull request #7514: URL: https://github.com/apache/arrow/pull/7514#discussion_r443733083 ## File path: r/src/array_from_vector.cpp ## @@ -1067,12 +1110,22 @@ std::shared_ptr InferArrowTypeFromVector(SEXP x) { if (Rf_inherits(x, "data.

[GitHub] [arrow] kszucs edited a comment on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
kszucs edited a comment on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647677894 > @kszucs can you assist me with adapting ursabot for these changes? Sure. > I think we can use pandas's `DataFrame.to_html` to create a colorized table for Git

[GitHub] [arrow] kszucs commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
kszucs commented on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647677894 > @kszucs can you assist me with adapting ursabot for these changes? Sure. > I think we can use pandas's `DataFrame.to_html` to create a colorized table for GitHub, too https

[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-22 Thread GitBox
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-647674585 @emkornfield I'm sorry that I've been neglecting this PR. I will try to rebase this and investigate the perf questions a little bit -

[GitHub] [arrow] pitrou commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
pitrou commented on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647673667 Just a small question: why are `m` and `b` used for millions and billions, respectively? (I would probably expect `M` and `G`)

[GitHub] [arrow] pitrou commented on a change in pull request #7477: ARROW-4221: Add canonical flag in COO sparse index

2020-06-22 Thread GitBox
pitrou commented on a change in pull request #7477: URL: https://github.com/apache/arrow/pull/7477#discussion_r443707259 ## File path: python/pyarrow/tensor.pxi ## @@ -339,6 +350,15 @@ shape: {0.shape}""".format(self) def non_zero_length(self): return self.stp.non

[GitHub] [arrow] wesm commented on pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox
wesm commented on pull request #7506: URL: https://github.com/apache/arrow/pull/7506#issuecomment-647649670 For curiosity, here are the same benchmarks on my laptop with gcc-8 (using the new output formatting from ARROW-9201) ``` benchmark

[GitHub] [arrow] wesm commented on a change in pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox
wesm commented on a change in pull request #7506: URL: https://github.com/apache/arrow/pull/7506#discussion_r443696884 ## File path: cpp/src/arrow/util/int_util.cc ## @@ -461,75 +472,434 @@ Status IndexBoundsCheckImpl(const ArrayData& indices, uint64_t upper_limit) { if (in

[GitHub] [arrow] wesm commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
wesm commented on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647641007 @kszucs can you assist me with adapting ursabot for these changes? I think we can use pandas's `DataFrame.to_html` to create a colorized table for GitHub, too https://pandas.pydata.o

[GitHub] [arrow] github-actions[bot] commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7516: URL: https://github.com/apache/arrow/pull/7516#issuecomment-647641059 https://issues.apache.org/jira/browse/ARROW-9201 This is an automated message from the Apache Git Serv

[GitHub] [arrow] wesm opened a new pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox
wesm opened a new pull request #7516: URL: https://github.com/apache/arrow/pull/7516 This uses pandas to generate a sorted text table when using `archery benchmark diff`. Example: https://github.com/apache/arrow/pull/7506#issuecomment-647633470 There's some other incidental ch

[GitHub] [arrow] wesm commented on a change in pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox
wesm commented on a change in pull request #7506: URL: https://github.com/apache/arrow/pull/7506#discussion_r443689349 ## File path: cpp/src/arrow/util/int_util.cc ## @@ -461,75 +472,434 @@ Status IndexBoundsCheckImpl(const ArrayData& indices, uint64_t upper_limit) { if (in

[GitHub] [arrow] pitrou commented on a change in pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox
pitrou commented on a change in pull request #7506: URL: https://github.com/apache/arrow/pull/7506#discussion_r443664270 ## File path: cpp/src/arrow/util/int_util.cc ## @@ -461,75 +472,434 @@ Status IndexBoundsCheckImpl(const ArrayData& indices, uint64_t upper_limit) { if (

[GitHub] [arrow] wesm edited a comment on pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox
wesm edited a comment on pull request #7506: URL: https://github.com/apache/arrow/pull/7506#issuecomment-647633470 Here's the benchmark comparison with clang-8 ``` $ archery benchmark diff --cc=clang-8 --cxx=clang++-8 2db48b4 653817301 --benchmark-filter=Cast

[GitHub] [arrow] pitrou commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-22 Thread GitBox
pitrou commented on pull request #7263: URL: https://github.com/apache/arrow/pull/7263#issuecomment-647633067 Well, can't you just use `BufferReader` and `BufferOutputStream`? Am I missing something? This is an automated mes

[GitHub] [arrow] wesm commented on pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox
wesm commented on pull request #7506: URL: https://github.com/apache/arrow/pull/7506#issuecomment-647633470 Here's the benchmark comparison with clang-8 ``` $ archery benchmark diff --cc=clang-8 --cxx=clang++-8 89844a100 653817301 --benchmark-filter=Cast

[GitHub] [arrow] alexbaden commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-22 Thread GitBox
alexbaden commented on pull request #7263: URL: https://github.com/apache/arrow/pull/7263#issuecomment-647628896 I'd like to have an object similar to `RecordBatchStreamReader` that could read the schema, IPC handles, and dictionary memo from a single buffer, and a corresponding writer of

[GitHub] [arrow] pitrou commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-22 Thread GitBox
pitrou commented on pull request #7263: URL: https://github.com/apache/arrow/pull/7263#issuecomment-647627751 @alexbaden I'm not sure I understand what you have in mind. Could you elaborate? This is an automated message from

[GitHub] [arrow] alexbaden commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-22 Thread GitBox
alexbaden commented on pull request #7263: URL: https://github.com/apache/arrow/pull/7263#issuecomment-647623057 I'd be interested in trying to add it into the record batch stream reader -- the idea being you could send both CPU data and GPU data pointers in a single serialized object with

[GitHub] [arrow] github-actions[bot] commented on pull request #7515: ARROW-2801: [Python] Add split_row_group keyword to ParquetDataset / document split_by_row_group

2020-06-22 Thread GitBox
github-actions[bot] commented on pull request #7515: URL: https://github.com/apache/arrow/pull/7515#issuecomment-647616782 https://issues.apache.org/jira/browse/ARROW-2801 This is an automated message from the Apache Git Serv

  1   2   >