[GitHub] [arrow] bkietz closed pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox
bkietz closed pull request #7546: URL: https://github.com/apache/arrow/pull/7546 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [arrow] wesm commented on a change in pull request #6213: ARROW-7592: [C++] Fix crashes on corrupt IPC input

2020-06-25 Thread GitBox
wesm commented on a change in pull request #6213: URL: https://github.com/apache/arrow/pull/6213#discussion_r445934957 ## File path: cpp/src/arrow/type.cc ## @@ -501,20 +501,35 @@ Status Decimal128Type::Make(int32_t precision, int32_t scale, //

[GitHub] [arrow] wesm commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox
wesm commented on pull request #7531: URL: https://github.com/apache/arrow/pull/7531#issuecomment-649911719 Okay, thanks. I think we can leave it as is

[GitHub] [arrow] jianxind commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox
jianxind commented on pull request #7531: URL: https://github.com/apache/arrow/pull/7531#issuecomment-649874368 > +1, merging this. If removing the BitmapReader helps perf it can be done in a follow up PR Thanks. BitmapReader does help for the 1% and 50% null probability case, the

[GitHub] [arrow] kszucs commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location

2020-06-25 Thread GitBox
kszucs commented on pull request #7376: URL: https://github.com/apache/arrow/pull/7376#issuecomment-649859049 Yes. We can go on either route, but it is still unclear to me whether we can test it automatically (I'm unfamiliar with go packaging). Based on the information @jba

[GitHub] [arrow] wesm closed pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox
wesm closed pull request #7531: URL: https://github.com/apache/arrow/pull/7531

[GitHub] [arrow] wesm commented on a change in pull request #7502: [DISCUSS] Proposed Feature enum for forward compatibility.

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7502: URL: https://github.com/apache/arrow/pull/7502#discussion_r445874451 ## File path: format/Schema.fbs ## @@ -33,6 +33,35 @@ enum MetadataVersion:short { V4, } +/// Represents Arrow Features that might not have full support

[GitHub] [arrow] wesm commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox
wesm commented on pull request #7531: URL: https://github.com/apache/arrow/pull/7531#issuecomment-649849609 +1, merging this. If removing the BitmapReader helps perf it can be done in a follow up PR

[GitHub] [arrow] wesm commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7535: URL: https://github.com/apache/arrow/pull/7535#discussion_r445873149 ## File path: docs/source/format/Columnar.rst ## @@ -566,11 +572,6 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]`` :: * Length: 4, Null

[GitHub] [arrow] wesm commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7535: URL: https://github.com/apache/arrow/pull/7535#discussion_r445873223 ## File path: docs/source/format/Columnar.rst ## @@ -586,13 +587,13 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]`` * Children arrays: *

[GitHub] [arrow] emkornfield commented on a change in pull request #6213: ARROW-7592: [C++] Fix crashes on corrupt IPC input

2020-06-25 Thread GitBox
emkornfield commented on a change in pull request #6213: URL: https://github.com/apache/arrow/pull/6213#discussion_r445866678 ## File path: cpp/src/arrow/type.cc ## @@ -501,20 +501,35 @@ Status Decimal128Type::Make(int32_t precision, int32_t scale, //

[GitHub] [arrow] kkraus14 commented on a change in pull request #6213: ARROW-7592: [C++] Fix crashes on corrupt IPC input

2020-06-25 Thread GitBox
kkraus14 commented on a change in pull request #6213: URL: https://github.com/apache/arrow/pull/6213#discussion_r445863087 ## File path: cpp/src/arrow/type.cc ## @@ -501,20 +501,35 @@ Status Decimal128Type::Make(int32_t precision, int32_t scale, //

[GitHub] [arrow] bkietz commented on a change in pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox
bkietz commented on a change in pull request #7546: URL: https://github.com/apache/arrow/pull/7546#discussion_r445811655 ## File path: python/pyarrow/_dataset.pyx ## @@ -845,6 +845,28 @@ cdef class RowGroupInfo: def num_rows(self): return self.info.num_rows() +

[GitHub] [arrow] github-actions[bot] commented on pull request #7550: ARROW-9219: [R] coerce_timestamps in Parquet write options does not work

2020-06-25 Thread GitBox
github-actions[bot] commented on pull request #7550: URL: https://github.com/apache/arrow/pull/7550#issuecomment-649754779 https://issues.apache.org/jira/browse/ARROW-9219

[GitHub] [arrow] nealrichardson opened a new pull request #7550: ARROW-9219: [R] coerce_timestamps in Parquet write options does not work

2020-06-25 Thread GitBox
nealrichardson opened a new pull request #7550: URL: https://github.com/apache/arrow/pull/7550 In addition to fixing the bug (a quick fix), I also spent some time deleting unnecessary bindings for parquet writer builder methods. There's more that can be done, which I think would shave

[GitHub] [arrow] github-actions[bot] commented on pull request #7549: ARROW-9230: [FlightRPC][Python] pass through all options in flight.connect

2020-06-25 Thread GitBox
github-actions[bot] commented on pull request #7549: URL: https://github.com/apache/arrow/pull/7549#issuecomment-649740871 https://issues.apache.org/jira/browse/ARROW-9230

[GitHub] [arrow] lidavidm opened a new pull request #7549: ARROW-9230: [FlightRPC][Python] pass through all options in flight.connect

2020-06-25 Thread GitBox
lidavidm opened a new pull request #7549: URL: https://github.com/apache/arrow/pull/7549 Instead of individually listing options, just use kwargs - especially as options are in the docstring anyways.

[GitHub] [arrow] emkornfield commented on a change in pull request #7502: [DISCUSS] Proposed Feature enum for forward compatibility.

2020-06-25 Thread GitBox
emkornfield commented on a change in pull request #7502: URL: https://github.com/apache/arrow/pull/7502#discussion_r445699442 ## File path: format/Schema.fbs ## @@ -33,6 +33,35 @@ enum MetadataVersion:short { V4, } +/// Represents Arrow Features that might not have full

[GitHub] [arrow] github-actions[bot] commented on pull request #7548: WIP - Data frame memory management - see: "Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-06-25 Thread GitBox
github-actions[bot] commented on pull request #7548: URL: https://github.com/apache/arrow/pull/7548#issuecomment-64960 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then

[GitHub] [arrow] raduteo opened a new pull request #7548: WIP - Data frame memory management - see: "Two proposals for expanding arrow Table API (virtual arrays and random access))

2020-06-25 Thread GitBox
raduteo opened a new pull request #7548: URL: https://github.com/apache/arrow/pull/7548 Follow up on the "Two proposals for expanding arrow Table API (virtual arrays and random access)" thread I have laid out a number of components illustrating how I see an arrow DataFrame

[GitHub] [arrow] fsaintjacques closed pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO filesystem dataset documentation

2020-06-25 Thread GitBox
fsaintjacques closed pull request #7517: URL: https://github.com/apache/arrow/pull/7517

[GitHub] [arrow] pitrou commented on pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-06-25 Thread GitBox
pitrou commented on pull request #7544: URL: https://github.com/apache/arrow/pull/7544#issuecomment-649673482 @liyafan82 I'm assuming this is work-in-progress? I can give it a quick review anyway.

[GitHub] [arrow] wesm closed pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox
wesm closed pull request #7537: URL: https://github.com/apache/arrow/pull/7537

[GitHub] [arrow] pitrou commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox
pitrou commented on a change in pull request #7547: URL: https://github.com/apache/arrow/pull/7547#discussion_r445673261 ## File path: cpp/src/arrow/dataset/discovery.cc ## @@ -156,27 +177,26 @@ Result> FileSystemDatasetFactory::Make( ARROW_ASSIGN_OR_RAISE(auto files,

[GitHub] [arrow] bkietz commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox
bkietz commented on a change in pull request #7547: URL: https://github.com/apache/arrow/pull/7547#discussion_r445672523 ## File path: cpp/src/arrow/dataset/file_parquet.cc ## @@ -571,6 +571,8 @@ static inline Result FileFromRowGroup( } } +// TODO Is it

[GitHub] [arrow] bkietz commented on a change in pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox
bkietz commented on a change in pull request #7547: URL: https://github.com/apache/arrow/pull/7547#discussion_r445663612 ## File path: cpp/src/arrow/dataset/file_parquet.cc ## @@ -571,6 +571,8 @@ static inline Result FileFromRowGroup( } } +// TODO Is it

[GitHub] [arrow] pitrou commented on pull request #7541: ARROW-9224: [Dev][Archery] Copy local repo on clone failure

2020-06-25 Thread GitBox
pitrou commented on pull request #7541: URL: https://github.com/apache/arrow/pull/7541#issuecomment-649653387 > What about simply replacing `git clone --local` with `git clone --shared`? +1 for me.

[GitHub] [arrow] jorisvandenbossche commented on pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

2020-06-25 Thread GitBox
jorisvandenbossche commented on pull request #7523: URL: https://github.com/apache/arrow/pull/7523#issuecomment-649644745 @rjzamora indeed something like that. I am not sure that you need to keep track of the path as well, unless maybe to have it working with existing functions to

[GitHub] [arrow] jorisvandenbossche closed pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

2020-06-25 Thread GitBox
jorisvandenbossche closed pull request #7523: URL: https://github.com/apache/arrow/pull/7523

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox
jorisvandenbossche commented on a change in pull request #7546: URL: https://github.com/apache/arrow/pull/7546#discussion_r445661743 ## File path: python/pyarrow/_dataset.pyx ## @@ -845,6 +845,28 @@ cdef class RowGroupInfo: def num_rows(self): return

[GitHub] [arrow] pitrou commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox
pitrou commented on pull request #7547: URL: https://github.com/apache/arrow/pull/7547#issuecomment-649636318 Note to self: instead of basing the default `OpenInputStream(const FileInfo&)` on `OpenInputStream(const std::string&)`, it would be more logical to do the reverse, so that no

[GitHub] [arrow] github-actions[bot] commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox
github-actions[bot] commented on pull request #7547: URL: https://github.com/apache/arrow/pull/7547#issuecomment-649636224 https://issues.apache.org/jira/browse/ARROW-8950

[GitHub] [arrow] github-actions[bot] commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox
github-actions[bot] commented on pull request #7546: URL: https://github.com/apache/arrow/pull/7546#issuecomment-649636223 https://issues.apache.org/jira/browse/ARROW-8733

[GitHub] [arrow] github-actions[bot] commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox
github-actions[bot] commented on pull request #7545: URL: https://github.com/apache/arrow/pull/7545#issuecomment-649636228 https://issues.apache.org/jira/browse/ARROW-9139

[GitHub] [arrow] pitrou commented on pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox
pitrou commented on pull request #7547: URL: https://github.com/apache/arrow/pull/7547#issuecomment-649634658 @bkietz Would like your input on the dataset changes.

[GitHub] [arrow] pitrou opened a new pull request #7547: ARROW-8950: [C++] Avoid HEAD when possible in S3 filesystem

2020-06-25 Thread GitBox
pitrou opened a new pull request #7547: URL: https://github.com/apache/arrow/pull/7547 Add FileSystem::OpenInput{Stream,File} overrides that accept a FileInfo parameter. This can be used to optimize file opening when the file size and existence are already known. Concretely, avoids
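
The idea in the description above (skip the metadata lookup when the caller already holds the file's size and existence) can be illustrated with a small toy model. This is only a sketch in Python, not Arrow's C++ code: `CountingStore`, its `head` and `open_input` methods, and the `FileInfo` dataclass here are hypothetical names invented for illustration, and the "HEAD request" is just a counter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    # Hypothetical stand-in for metadata that a listing call already returns.
    path: str
    size: int


class CountingStore:
    """Toy object store that counts metadata (HEAD-like) lookups."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.head_requests = 0

    def list(self):
        # A single listing yields path and size for every object.
        return [FileInfo(p, len(b)) for p, b in sorted(self._objects.items())]

    def head(self, path):
        # Simulates the extra round trip needed when only a path is known.
        self.head_requests += 1
        return FileInfo(path, len(self._objects[path]))

    def open_input(self, target):
        # A path needs a metadata lookup first; a FileInfo does not.
        info = target if isinstance(target, FileInfo) else self.head(target)
        return self._objects[info.path][: info.size]


def demo():
    store = CountingStore({"a.parquet": b"abc", "b.parquet": b"defgh"})
    by_path = [store.open_input(info.path) for info in store.list()]
    heads_by_path = store.head_requests
    store.head_requests = 0
    by_info = [store.open_input(info) for info in store.list()]
    return by_path == by_info, heads_by_path, store.head_requests
```

In this toy model, opening by path costs one lookup per file, while opening from the listing's `FileInfo` objects costs none and returns the same bytes.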

[GitHub] [arrow] bkietz opened a new pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

2020-06-25 Thread GitBox
bkietz opened a new pull request #7546: URL: https://github.com/apache/arrow/pull/7546 ```python stats = parquet_fragment.row_groups[0].statistics assert stats == { 'normal_column': {'min': 1, 'max': 2}, 'all_null_column': {'min': None, 'max': None},

[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox
jorisvandenbossche commented on pull request #7545: URL: https://github.com/apache/arrow/pull/7545#issuecomment-649631989 Still some work: Need to add tests for the different filesystems that can be passed. There are still some skipped tests: * `ARROW:schema` is not yet

[GitHub] [arrow] jorisvandenbossche opened a new pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox
jorisvandenbossche opened a new pull request #7545: URL: https://github.com/apache/arrow/pull/7545

[GitHub] [arrow] liyafan82 opened a new pull request #7544: ARROW-7285: [C++] ensure C++ implementation meets clarified dictionary spec

2020-06-25 Thread GitBox
liyafan82 opened a new pull request #7544: URL: https://github.com/apache/arrow/pull/7544 https://issues.apache.org/jira/browse/ARROW-7285

[GitHub] [arrow] wesm commented on pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox
wesm commented on pull request #7537: URL: https://github.com/apache/arrow/pull/7537#issuecomment-649612663 +1, will merge this once build is green

[GitHub] [arrow] fsaintjacques commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox
fsaintjacques commented on pull request #7536: URL: https://github.com/apache/arrow/pull/7536#issuecomment-649607214 I'm also of the opinion that we should stick with int32_t. That's what parquet uses for dict column, that's what R uses for factor columns, that's what we use by default in

[GitHub] [arrow] bkietz commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox
bkietz commented on a change in pull request #7534: URL: https://github.com/apache/arrow/pull/7534#discussion_r445625710 ## File path: cpp/src/parquet/arrow/reader.cc ## @@ -338,22 +348,39 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader { // TODO

[GitHub] [arrow] wesm closed pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox
wesm closed pull request #7542: URL: https://github.com/apache/arrow/pull/7542

[GitHub] [arrow] wesm commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox
wesm commented on pull request #7542: URL: https://github.com/apache/arrow/pull/7542#issuecomment-649601444 Here's my benchmarks on i9-9960X ``` $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 cyb70289/sort master --suite-filter=vector-sort

[GitHub] [arrow] fsaintjacques commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox
fsaintjacques commented on pull request #7526: URL: https://github.com/apache/arrow/pull/7526#issuecomment-649600834 The test is failing on windows. https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/33736283/job/mshkl837y3b5v5u6

[GitHub] [arrow] fsaintjacques commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox
fsaintjacques commented on pull request #7526: URL: https://github.com/apache/arrow/pull/7526#issuecomment-649600179 Correct, we don't support the file changing underneath. For now we can preach YAGNI on this as it is a rare use case. If this is required, the consumer can create a custom

[GitHub] [arrow] cyb70289 commented on pull request #7541: ARROW-9224: [Dev][Archery] Copy local repo on clone failure

2020-06-25 Thread GitBox
cyb70289 commented on pull request #7541: URL: https://github.com/apache/arrow/pull/7541#issuecomment-649595034 > Can you try using `git clone --shared` instead? It should avoid the copy. `--shared` works okay per my test. What about simply replacing `git clone --local` with `git

[GitHub] [arrow] fsaintjacques commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox
fsaintjacques commented on a change in pull request #7534: URL: https://github.com/apache/arrow/pull/7534#discussion_r445605965 ## File path: cpp/src/parquet/arrow/reader.cc ## @@ -338,22 +348,39 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader { //

[GitHub] [arrow] kszucs commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox
kszucs commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-649590873 > Really nice! > > Could we add a `is_valid` attribute to the python scalar as well? Now the only way to check for a null value is to do `.as_py() is None` ?

[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r445603562 ## File path: python/pyarrow/util.py ## @@ -41,6 +41,24 @@ def wrapper(*args, **kwargs): return wrapper +def _deprecate_class(old_name, new_class,

[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r445604627 ## File path: python/pyarrow/tests/test_scalars.py ## @@ -17,426 +17,395 @@ import datetime import pytest -import unittest import numpy as np

[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-649585804 +1, thanks for fixing this @emkornfield!

[GitHub] [arrow] wesm closed pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox
wesm closed pull request #7143: URL: https://github.com/apache/arrow/pull/7143

[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-25 Thread GitBox
wesm commented on pull request #7143: URL: https://github.com/apache/arrow/pull/7143#issuecomment-649585468 FWIW gcc benchmarks (sse4.2) on my machine (ubuntu 18.04 on i9-9960X) ``` $ archery benchmark diff --cc=gcc-8 --cxx=g++-8 emkornfield/ARROW-8504 master

[GitHub] [arrow] kszucs commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox
kszucs commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r445603562 ## File path: python/pyarrow/util.py ## @@ -41,6 +41,24 @@ def wrapper(*args, **kwargs): return wrapper +def _deprecate_class(old_name, new_class,

[GitHub] [arrow] wesm commented on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox
wesm commented on issue #1771: URL: https://github.com/apache/arrow/issues/1771#issuecomment-649581184 Please direct questions to u...@arrow.apache.org or d...@arrow.apache.org. Thank you

[GitHub] [arrow] wesm commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox
wesm commented on pull request #7456: URL: https://github.com/apache/arrow/pull/7456#issuecomment-649580251 Looks good here. Merging

[GitHub] [arrow] wesm closed pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox
wesm closed pull request #7456: URL: https://github.com/apache/arrow/pull/7456

[GitHub] [arrow] wesm commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox
wesm commented on pull request #7534: URL: https://github.com/apache/arrow/pull/7534#issuecomment-649579503 That crash is ARROW-8999. I'm fairly confident it's a real bug given that is happens about 1-5% of the time

[GitHub] [arrow] wesm edited a comment on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox
wesm edited a comment on pull request #7534: URL: https://github.com/apache/arrow/pull/7534#issuecomment-649579503 That crash is ARROW-8999. I'm fairly confident it's a real bug given that it happens about 1-5% of the time

[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7537: URL: https://github.com/apache/arrow/pull/7537#discussion_r445594104 ## File path: cpp/src/arrow/python/helpers.cc ## @@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) { return PyFloat_Check(obj) &&

[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7537: URL: https://github.com/apache/arrow/pull/7537#discussion_r445593841 ## File path: cpp/src/arrow/python/helpers.cc ## @@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) { return PyFloat_Check(obj) &&

[GitHub] [arrow] wesm commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7537: URL: https://github.com/apache/arrow/pull/7537#discussion_r445593243 ## File path: cpp/src/arrow/python/python_to_arrow.cc ## @@ -1171,6 +1171,12 @@ Status GetConverterFlat(const std::shared_ptr& type, bool strict_conve

[GitHub] [arrow] wesm commented on issue #7540: Why conversion from DoubleArray with nulls to numpy needs to copy?

2020-06-25 Thread GitBox
wesm commented on issue #7540: URL: https://github.com/apache/arrow/issues/7540#issuecomment-649571447 This came up in ARROW-3263 for R. If you'd like you can open a corresponding Python issue. On the face of it, this seems rather difficult to me to do safely

[GitHub] [arrow] wesm closed issue #7540: Why conversion from DoubleArray with nulls to numpy needs to copy?

2020-06-25 Thread GitBox
wesm closed issue #7540: URL: https://github.com/apache/arrow/issues/7540

[GitHub] [arrow] wesm commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox
wesm commented on pull request #7542: URL: https://github.com/apache/arrow/pull/7542#issuecomment-649563024 I'm interested to see if we could add a `BitmapReader::Advance` method so that we could employ a hybrid approach to use both BitBlockCounter and BitmapReader to get the best of both

[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox
bkietz commented on a change in pull request #7526: URL: https://github.com/apache/arrow/pull/7526#discussion_r445583091 ## File path: cpp/src/arrow/util/mutex.h ## @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

[GitHub] [arrow] fsaintjacques commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox
fsaintjacques commented on a change in pull request #7526: URL: https://github.com/apache/arrow/pull/7526#discussion_r445096637 ## File path: cpp/src/arrow/util/mutex.h ## @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox
cyb70289 commented on pull request #7542: URL: https://github.com/apache/arrow/pull/7542#issuecomment-649561938 About `null_percent=0` case, the benchmark result I attached shows about 19% improvement. Looks reasonable.

[GitHub] [arrow] wesm commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module

2020-06-25 Thread GitBox
wesm commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-649561148 This removes already more than 2MB of code from libarrow.so on Linux: great. I'll keep an eye on this

[GitHub] [arrow] rladeira edited a comment on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox
rladeira edited a comment on issue #1771: URL: https://github.com/apache/arrow/issues/1771#issuecomment-649554677 I found the same issue here, using pyarrow version '0.17.1'. I could not select columns from a multi index dataframe saved as a parquet file. Is there some way to accomplish

[GitHub] [arrow] rladeira commented on issue #1771: pyarrow-- reading selected columns from multiindex parquet file

2020-06-25 Thread GitBox
rladeira commented on issue #1771: URL: https://github.com/apache/arrow/issues/1771#issuecomment-649554677 I found the same issue here, using pyarrow version '0.17.1'. I could not select columns from a multi index dataframe saved as a parquet file. Is there some way to accomplish this?

[GitHub] [arrow] cyb70289 commented on a change in pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox
cyb70289 commented on a change in pull request #7542: URL: https://github.com/apache/arrow/pull/7542#discussion_r445575078 ## File path: cpp/src/arrow/compute/kernels/vector_sort.cc ## @@ -88,6 +88,28 @@ struct PartitionIndices { } }; +template +inline void

[GitHub] [arrow] wesm commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r445574693 ## File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc ## @@ -68,6 +76,64 @@ TYPED_TEST(TestStringKernels, AsciiLower) {

[GitHub] [arrow] wesm commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7531: URL: https://github.com/apache/arrow/pull/7531#discussion_r445569869 ## File path: cpp/src/arrow/util/spaced.h ## @@ -0,0 +1,200 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

[GitHub] [arrow] github-actions[bot] commented on pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox
github-actions[bot] commented on pull request #7519: URL: https://github.com/apache/arrow/pull/7519#issuecomment-649551591 https://issues.apache.org/jira/browse/ARROW-9017

[GitHub] [arrow] wesm commented on a change in pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox
wesm commented on a change in pull request #7531: URL: https://github.com/apache/arrow/pull/7531#discussion_r445569869 ## File path: cpp/src/arrow/util/spaced.h ## @@ -0,0 +1,200 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7519: ARROW-9017: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox
jorisvandenbossche commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r445569265 ## File path: python/pyarrow/scalar.pxi ## @@ -16,1198 +16,704 @@ # under the License. -_NULL = NA = None - - cdef class Scalar: """

[GitHub] [arrow] wesm commented on pull request #7531: ARROW-9216: [C++] Use BitBlockCounter for plain spaced encoding/decoding

2020-06-25 Thread GitBox
wesm commented on pull request #7531: URL: https://github.com/apache/arrow/pull/7531#issuecomment-649548830 @jianxind I think the bot is broken right now because of the changes I recently made in ARROW-9201. @kszucs is going to update it

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7519: ARROW-9153: [C++][Python] Refactor scalar bindings

2020-06-25 Thread GitBox
jorisvandenbossche commented on a change in pull request #7519: URL: https://github.com/apache/arrow/pull/7519#discussion_r445523294 ## File path: python/pyarrow/_dataset.pyx ## @@ -216,22 +216,18 @@ cdef class Expression: @staticmethod def _scalar(value):

[GitHub] [arrow] github-actions[bot] commented on pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-06-25 Thread GitBox
github-actions[bot] commented on pull request #7543: URL: https://github.com/apache/arrow/pull/7543#issuecomment-649534666 https://issues.apache.org/jira/browse/ARROW-9221 This is an automated message from the Apache Git

[GitHub] [arrow] lidavidm opened a new pull request #7543: ARROW-9221: [Java] account for big-endian buffers in ArrowBuf.setBytes

2020-06-25 Thread GitBox
lidavidm opened a new pull request #7543: URL: https://github.com/apache/arrow/pull/7543 `ArrowBuf.setBytes` has an override that uses an 8-byte-at-a-time copy loop if the byte buffer does not provide an array and is not direct. Unfortunately, this means it'll mangle data when the byte

[GitHub] [arrow] pitrou commented on pull request #7456: ARROW-9106: [Python] Allow specifying CSV file encoding

2020-06-25 Thread GitBox
pitrou commented on pull request #7456: URL: https://github.com/apache/arrow/pull/7456#issuecomment-649522260 Ok, I added some tests for error propagation. I'm going to merge if CI stays green. This is an automated message

[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox
bkietz commented on pull request #7536: URL: https://github.com/apache/arrow/pull/7536#issuecomment-649514981 @jorisvandenbossche okay, I'll extend the key value `Partitioning`s to maintain dictionaries of all unique values of a field.

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox
jorisvandenbossche commented on a change in pull request #7526: URL: https://github.com/apache/arrow/pull/7526#discussion_r445516148 ## File path: cpp/src/arrow/util/mutex.h ## @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more

[GitHub] [arrow] bkietz commented on pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox
bkietz commented on pull request #7526: URL: https://github.com/apache/arrow/pull/7526#issuecomment-649506352 > I suppose we don't / don't want to support files being changed after the dataset object has been constructed anyways? No, we don't support this at all. IMHO if fragments

[GitHub] [arrow] bkietz commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox
bkietz commented on a change in pull request #7526: URL: https://github.com/apache/arrow/pull/7526#discussion_r445512267 ## File path: cpp/src/arrow/util/mutex.h ## @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

[GitHub] [arrow] jorisvandenbossche commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox
jorisvandenbossche commented on pull request #7534: URL: https://github.com/apache/arrow/pull/7534#issuecomment-649501921 Hmm, it failed on the last commit as well, but I restarted that one. And so now appears to be green indeed ..

[GitHub] [arrow] bkietz commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox
bkietz commented on pull request #7534: URL: https://github.com/apache/arrow/pull/7534#issuecomment-649500754 @jorisvandenbossche that build comes from the first of two `suggestion` commits and it doesn't seem to have crashed with both commits in place. Maybe it was ephemeral?

[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

2020-06-25 Thread GitBox
jorisvandenbossche commented on pull request #7536: URL: https://github.com/apache/arrow/pull/7536#issuecomment-649500017 Currently for the ParquetDataset, it also simply uses int32 for the indices. Now, there is a more fundamental issue I had not thought of: the actual dictionary

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7537: ARROW-842: [Python] Recognize pandas.NaT as null when converting object arrays with from_pandas=True

2020-06-25 Thread GitBox
jorisvandenbossche commented on a change in pull request #7537: URL: https://github.com/apache/arrow/pull/7537#discussion_r445495780 ## File path: cpp/src/arrow/python/helpers.cc ## @@ -254,14 +255,45 @@ bool PyFloat_IsNaN(PyObject* obj) { return PyFloat_Check(obj) &&

[GitHub] [arrow] jorisvandenbossche commented on pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox
jorisvandenbossche commented on pull request #7534: URL: https://github.com/apache/arrow/pull/7534#issuecomment-649482676 The python dataset tests are crashing on Mac: https://github.com/apache/arrow/runs/806974457 This is

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7526: ARROW-9146: [C++][Dataset] Lazily store fragment physical schema

2020-06-25 Thread GitBox
jorisvandenbossche commented on a change in pull request #7526: URL: https://github.com/apache/arrow/pull/7526#discussion_r445482607 ## File path: cpp/src/arrow/util/mutex.h ## @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more

[GitHub] [arrow] bkietz commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

2020-06-25 Thread GitBox
bkietz commented on a change in pull request #7534: URL: https://github.com/apache/arrow/pull/7534#discussion_r445470685 ## File path: cpp/src/parquet/arrow/reader.cc ## @@ -338,22 +348,37 @@ class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader { // TODO

[GitHub] [arrow] cyb70289 commented on pull request #7542: ARROW-9225: [C++][Compute] Speed up counting sort

2020-06-25 Thread GitBox
cyb70289 commented on pull request #7542: URL: https://github.com/apache/arrow/pull/7542#issuecomment-649457378 > I see no significant performance difference here on an AMD Ryzen CPU. > The weird thing with these performance numbers is that the `null_percent=0` case should probably be

[GitHub] [arrow] pitrou commented on a change in pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-25 Thread GitBox
pitrou commented on a change in pull request #7517: URL: https://github.com/apache/arrow/pull/7517#discussion_r445450752 ## File path: docs/source/python/dataset.rst ## @@ -325,6 +325,22 @@ The currently available classes are :class:`~pyarrow.fs.S3FileSystem` and details.

[GitHub] [arrow] pitrou commented on a change in pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-25 Thread GitBox
pitrou commented on a change in pull request #7517: URL: https://github.com/apache/arrow/pull/7517#discussion_r445450482 ## File path: docs/source/python/dataset.rst ## @@ -325,6 +325,22 @@ The currently available classes are :class:`~pyarrow.fs.S3FileSystem` and details.

[GitHub] [arrow] pitrou commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox
pitrou commented on a change in pull request #7535: URL: https://github.com/apache/arrow/pull/7535#discussion_r445449521 ## File path: docs/source/format/Columnar.rst ## @@ -586,13 +587,13 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]`` * Children arrays:

[GitHub] [arrow] pitrou commented on a change in pull request #7535: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types

2020-06-25 Thread GitBox
pitrou commented on a change in pull request #7535: URL: https://github.com/apache/arrow/pull/7535#discussion_r445449396 ## File path: docs/source/format/Columnar.rst ## @@ -566,11 +572,6 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]`` :: * Length: 4, Null
