[jira] [Updated] (ARROW-3896) [MATLAB] Decouple MATLAB-Arrow conversion logic from Feather file specific logic
[ https://issues.apache.org/jira/browse/ARROW-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3896: Fix Version/s: (was: 0.14.0) > [MATLAB] Decouple MATLAB-Arrow conversion logic from Feather file specific > logic > > > Key: ARROW-3896 > URL: https://issues.apache.org/jira/browse/ARROW-3896 > Project: Apache Arrow > Issue Type: Improvement > Components: MATLAB >Reporter: Kevin Gurney >Assignee: Kevin Gurney >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > Currently, the logic for converting between a MATLAB mxArray and various > Arrow data structures (arrow::Table, arrow::Array, etc.) is tightly coupled > and fairly tangled up with the logic specific to handling Feather files. It > would be helpful to factor out these conversions into a more generic > "mlarrow" conversion layer component so that it can be reused in the future > for use cases other than Feather support. Furthermore, this would be helpful > to enforce a cleaner separation of concerns. > It would be nice to start off with this refactoring work up front before > adding support for more datatypes to the MATLAB featherread/featherwrite > functions, so that we can start off with a clean base upon which to expand > moving forward. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3919) [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize
[ https://issues.apache.org/jira/browse/ARROW-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3919: Fix Version/s: (was: 0.14.0) > [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize > - > > Key: ARROW-3919 > URL: https://issues.apache.org/jira/browse/ARROW-3919 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > see https://github.com/modin-project/modin/issues/266 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3873) [C++] Build shared libraries consistently with -fvisibility=hidden
[ https://issues.apache.org/jira/browse/ARROW-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3873: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Build shared libraries consistently with -fvisibility=hidden > -- > > Key: ARROW-3873 > URL: https://issues.apache.org/jira/browse/ARROW-3873 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/2437 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3901) [Python] Make Schema hashable
[ https://issues.apache.org/jira/browse/ARROW-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3901: Fix Version/s: (was: 0.14.0) > [Python] Make Schema hashable > - > > Key: ARROW-3901 > URL: https://issues.apache.org/jira/browse/ARROW-3901 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > > Currently pa.Schema is not hashable, however all of its components are > hashable -- This message was sent by Atlassian JIRA (v7.6.3#76005)
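Until pa.Schema implements __hash__, one workaround is to key lookups off a stable serialized form of the schema. This is only an illustrative sketch, not the change proposed in ARROW-3901; it assumes that two identically constructed schemas serialize to identical bytes, which holds for simple schemas but is an assumption.

{code}
import pyarrow as pa

schema = pa.schema([('id', pa.int64()), ('name', pa.string())])

# Workaround sketch: use the serialized schema bytes as a hashable dict key
# until pa.Schema grows a proper __hash__ implementation.
key = schema.serialize().to_pybytes()
cache = {key: 'metadata for this schema'}

same_schema = pa.schema([('id', pa.int64()), ('name', pa.string())])
assert same_schema.serialize().to_pybytes() in cache
{code}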
[jira] [Updated] (ARROW-4022) [C++] RFC: promote Datum variant out of compute namespace
[ https://issues.apache.org/jira/browse/ARROW-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4022: Fix Version/s: (was: 0.14.0) > [C++] RFC: promote Datum variant out of compute namespace > - > > Key: ARROW-4022 > URL: https://issues.apache.org/jira/browse/ARROW-4022 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > In working on ARROW-3762, I've found it's useful to be able to have functions > return either {{Array}} or {{ChunkedArray}}. We might consider promoting the > {{arrow::compute::Datum}} variant out of {{arrow/compute/kernel.h}} so it can > be used in other places where it's helpful -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4001) [Python] Create Parquet Schema in python
[ https://issues.apache.org/jira/browse/ARROW-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4001: Fix Version/s: (was: 0.14.0) > [Python] Create Parquet Schema in python > > > Key: ARROW-4001 > URL: https://issues.apache.org/jira/browse/ARROW-4001 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.9.0 >Reporter: David Stauffer >Priority: Major > Labels: parquet > > Enable the creation of a Parquet schema in python. For functions like > pyarrow.parquet.ParquetDataset, a schema must be a Parquet schema. See: > https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4046) [Python/CI] Run nightly large memory tests
[ https://issues.apache.org/jira/browse/ARROW-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4046: Fix Version/s: (was: 0.14.0) > [Python/CI] Run nightly large memory tests > -- > > Key: ARROW-4046 > URL: https://issues.apache.org/jira/browse/ARROW-4046 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: nightly > > See comment https://github.com/apache/arrow/pull/3171#issuecomment-447156646 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4046) [Python/CI] Run nightly large memory tests
[ https://issues.apache.org/jira/browse/ARROW-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4046: Labels: nightly (was: ) > [Python/CI] Run nightly large memory tests > -- > > Key: ARROW-4046 > URL: https://issues.apache.org/jira/browse/ARROW-4046 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: nightly > Fix For: 0.14.0 > > > See comment https://github.com/apache/arrow/pull/3171#issuecomment-447156646 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5455) [Rust] Build broken by 2019-05-30 Rust nightly
Wes McKinney created ARROW-5455: --- Summary: [Rust] Build broken by 2019-05-30 Rust nightly Key: ARROW-5455 URL: https://issues.apache.org/jira/browse/ARROW-5455 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Wes McKinney Fix For: 0.14.0 See example failed build: https://travis-ci.org/apache/arrow/jobs/539477452 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5453) [C++] Just-released cmake-format 0.5.2 breaks the build
[ https://issues.apache.org/jira/browse/ARROW-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5453. - Resolution: Fixed Issue resolved by pull request 4423 [https://github.com/apache/arrow/pull/4423] > [C++] Just-released cmake-format 0.5.2 breaks the build > --- > > Key: ARROW-5453 > URL: https://issues.apache.org/jira/browse/ARROW-5453 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 20m > Remaining Estimate: 0h > > It seems we should always pin the cmake-format version until the developers > stop changing the formatting algorithm -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4631) [C++] Implement serial version of sort computational kernel
[ https://issues.apache.org/jira/browse/ARROW-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4631: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Implement serial version of sort computational kernel > --- > > Key: ARROW-4631 > URL: https://issues.apache.org/jira/browse/ARROW-4631 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.13.0 >Reporter: Areg Melik-Adamyan >Assignee: Areg Melik-Adamyan >Priority: Major > Labels: analytics > Fix For: 0.15.0 > > > Implement serial version of sort computational kernel. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4591) [Rust] Add explicit SIMD vectorization for aggregation ops in "array_ops"
[ https://issues.apache.org/jira/browse/ARROW-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4591: Fix Version/s: (was: 0.14.0) > [Rust] Add explicit SIMD vectorization for aggregation ops in "array_ops" > - > > Key: ARROW-4591 > URL: https://issues.apache.org/jira/browse/ARROW-4591 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4575) [Python] Add Python Flight implementation to integration testing
[ https://issues.apache.org/jira/browse/ARROW-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4575: Fix Version/s: (was: 0.14.0) > [Python] Add Python Flight implementation to integration testing > > > Key: ARROW-4575 > URL: https://issues.apache.org/jira/browse/ARROW-4575 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Integration, Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: flight > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4567) [C++] Convert Scalar values to Array values with length 1
[ https://issues.apache.org/jira/browse/ARROW-4567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852594#comment-16852594 ] Wes McKinney commented on ARROW-4567: - cc [~fsaintjacques] > [C++] Convert Scalar values to Array values with length 1 > - > > Key: ARROW-4567 > URL: https://issues.apache.org/jira/browse/ARROW-4567 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > A common approach to performing operations on both scalar and array values is > to treat a Scalar as an array of length 1. For example, we cannot currently > use our Cast kernels to cast a Scalar. It would be senseless to create > separate kernel implementations specialized for a single value, and much > easier to promote a scalar to an Array, execute the kernel, then unbox the > result back into a Scalar -- This message was sent by Atlassian JIRA (v7.6.3#76005)
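The promote/execute/unbox pattern described above can already be mimicked from Python: wrap the scalar in a length-1 array, run the kernel (here a cast), and unbox the single result. This is only a sketch of the idea, not the C++ API the issue asks for.

{code}
import pyarrow as pa

def cast_scalar(value, from_type, to_type):
    # Promote the scalar to a length-1 Array so the existing cast kernel applies,
    # then unbox the single element back into a scalar.
    arr = pa.array([value], type=from_type)
    return arr.cast(to_type)[0]

print(cast_scalar(7, pa.int32(), pa.float64()))  # -> 7.0
{code}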
[jira] [Created] (ARROW-5457) [GLib][Plasma] Environment variable name for test is wrong
Kouhei Sutou created ARROW-5457: --- Summary: [GLib][Plasma] Environment variable name for test is wrong Key: ARROW-5457 URL: https://issues.apache.org/jira/browse/ARROW-5457 Project: Apache Arrow Issue Type: Bug Components: GLib Affects Versions: 0.13.0 Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5457) [GLib][Plasma] Environment variable name for test is wrong
[ https://issues.apache.org/jira/browse/ARROW-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5457: -- Labels: pull-request-available (was: ) > [GLib][Plasma] Environment variable name for test is wrong > -- > > Key: ARROW-5457 > URL: https://issues.apache.org/jira/browse/ARROW-5457 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Affects Versions: 0.13.0 >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-750) [Format] Add LargeBinary and LargeString types
[ https://issues.apache.org/jira/browse/ARROW-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-750: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [Format] Add LargeBinary and LargeString types > -- > > Key: ARROW-750 > URL: https://issues.apache.org/jira/browse/ARROW-750 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > These are string and binary types that use 64-bit offsets. Java will not need > to implement these types for the time being, but they are needed when > representing very large datasets in C++ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3840) [C++] Run fuzzer tests with docker-compose
[ https://issues.apache.org/jira/browse/ARROW-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3840: Fix Version/s: (was: 0.14.0) > [C++] Run fuzzer tests with docker-compose > -- > > Key: ARROW-3840 > URL: https://issues.apache.org/jira/browse/ARROW-3840 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > These are not being run regularly right now -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3419) [C++] Run include-what-you-use checks as nightly build
[ https://issues.apache.org/jira/browse/ARROW-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3419: Fix Version/s: (was: 0.14.0) > [C++] Run include-what-you-use checks as nightly build > -- > > Key: ARROW-3419 > URL: https://issues.apache.org/jira/browse/ARROW-3419 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > As part of linting (and running linter checks in a separate Travis entry), we > should also run include-what-you-use on changed files so that we can force > include cleanliness -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3410) [C++] Streaming CSV reader interface for memory-constrained environments
[ https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3410: Fix Version/s: (was: 0.14.0) > [C++] Streaming CSV reader interface for memory-constrained environments > - > > Key: ARROW-3410 > URL: https://issues.apache.org/jira/browse/ARROW-3410 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > CSV reads are currently all-or-nothing. If the results of parsing a CSV file > do not fit into memory, this can be a problem. I propose to define a > streaming {{RecordBatchReader}} interface so that the record batches produced > by reading can be written out immediately to a stream on disk, to be memory > mapped later -- This message was sent by Atlassian JIRA (v7.6.3#76005)
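For reference, later pyarrow releases did grow a streaming reader along these lines (pyarrow.csv.open_csv). The sketch below assumes such a release (it is not available in the 0.14 timeframe discussed here) and uses placeholder file names; it writes each incoming batch straight to an IPC file so the data never has to be fully materialized.

{code}
import pyarrow as pa
from pyarrow import csv, ipc

# Assumes a pyarrow version that ships csv.open_csv (added after this ticket).
reader = csv.open_csv('big_input.csv')   # streaming reader producing record batches

with ipc.new_file('big_input.arrow', reader.schema) as writer:
    for batch in reader:                 # batches arrive incrementally
        writer.write_batch(batch)        # spill each batch to disk immediately
{code}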
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3408: Labels: datasets (was: ) > [C++] Add option to CSV reader to dictionary encode individual columns or all > string / binary columns > - > > Key: ARROW-3408 > URL: https://issues.apache.org/jira/browse/ARROW-3408 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: datasets > Fix For: 0.14.0 > > > For many datasets, dictionary encoding everything can result in drastically > lower memory usage and subsequently better performance in doing analytics > One difficulty of dictionary encoding in multithreaded conversions is that > ideally you end up with one dictionary at the end. So you have two options: > * Implement a concurrent hashing scheme -- for low cardinality dictionaries, > the overhead associated with mutex contention will not be meaningful, for > high cardinality it can be more of a problem > * Hash each chunk separately, then normalize at the end > My guess is that a crude concurrent hash table with a mutex to protect > mutations and resizes is going to outperform the latter -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3379) [C++] Implement regex/multichar delimiter tokenizer
[ https://issues.apache.org/jira/browse/ARROW-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3379: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Implement regex/multichar delimiter tokenizer > --- > > Key: ARROW-3379 > URL: https://issues.apache.org/jira/browse/ARROW-3379 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: csv, datasets > Fix For: 0.15.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files
[ https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3424: Labels: datasets parquet (was: parquet) > [Python] Improved workflow for loading an arbitrary collection of Parquet > files > --- > > Key: ARROW-3424 > URL: https://issues.apache.org/jira/browse/ARROW-3424 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: datasets, parquet > Fix For: 0.14.0 > > > See SO question for use case: > https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3408: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Add option to CSV reader to dictionary encode individual columns or all > string / binary columns > - > > Key: ARROW-3408 > URL: https://issues.apache.org/jira/browse/ARROW-3408 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: datasets > Fix For: 0.15.0 > > > For many datasets, dictionary encoding everything can result in drastically > lower memory usage and subsequently better performance in doing analytics > One difficulty of dictionary encoding in multithreaded conversions is that > ideally you end up with one dictionary at the end. So you have two options: > * Implement a concurrent hashing scheme -- for low cardinality dictionaries, > the overhead associated with mutex contention will not be meaningful, for > high cardinality it can be more of a problem > * Hash each chunk separately, then normalize at the end > My guess is that a crude concurrent hash table with a mutex to protect > mutations and resizes is going to outperform the latter -- This message was sent by Atlassian JIRA (v7.6.3#76005)
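As a rough illustration of what the requested option could look like from Python, the sketch below uses the auto_dict_encode convert option that pyarrow eventually exposed; relative to this ticket's timeframe it is an assumption, and 'data.csv' and 'city' are placeholders. Per-column control can be approximated by passing an explicit dictionary type in column_types.

{code}
import pyarrow as pa
from pyarrow import csv

convert_options = csv.ConvertOptions(
    # Dictionary-encode all string/binary columns automatically.
    auto_dict_encode=True,
    # Or request dictionary encoding for a single named column.
    column_types={'city': pa.dictionary(pa.int32(), pa.string())},
)
table = csv.read_csv('data.csv', convert_options=convert_options)
print(table.schema)
{code}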
[jira] [Updated] (ARROW-3401) [C++] Pluggable statistics collector API for unconvertible CSV values
[ https://issues.apache.org/jira/browse/ARROW-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3401: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Pluggable statistics collector API for unconvertible CSV values > - > > Key: ARROW-3401 > URL: https://issues.apache.org/jira/browse/ARROW-3401 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > It would be useful to be able to collect statistics (e.g. distinct value > counts) about values in a column of a CSV file that cannot be converted to a > desired data type. > When conversion fails, the converters can call into an abstract API like > {code} > statistics_->CannotConvert(token, size); > {code} > or something similar -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3406) [C++] Create a caching memory pool implementation
[ https://issues.apache.org/jira/browse/ARROW-3406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3406: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Create a caching memory pool implementation > - > > Key: ARROW-3406 > URL: https://issues.apache.org/jira/browse/ARROW-3406 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.0 >Reporter: Antoine Pitrou >Priority: Minor > Fix For: 0.15.0 > > > A caching memory pool implementation would be able to recycle freed memory > blocks instead of returning them to the system immediately. Two different > policies may be chosen: > * either an unbounded cache > * or a size-limited cache, perhaps with some kind of LRU mechanism > Such a feature might help e.g. for CSV parsing, when reading and parsing data > into temporary memory buffers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
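A minimal pure-Python sketch of the unbounded-cache policy described above (not Arrow's C++ MemoryPool API): freed blocks are parked in per-size free lists and handed back on the next allocation of the same size.

{code}
from collections import defaultdict

class CachingPool:
    """Toy model of a caching allocator: recycle freed blocks by size."""

    def __init__(self):
        self._free = defaultdict(list)   # size -> list of reusable buffers

    def allocate(self, size):
        cached = self._free[size]
        return cached.pop() if cached else bytearray(size)

    def free(self, buf):
        # Instead of returning memory to the system, keep it for reuse.
        self._free[len(buf)].append(buf)

pool = CachingPool()
a = pool.allocate(1 << 20)
pool.free(a)
b = pool.allocate(1 << 20)
assert a is b   # the second allocation reuses the cached block
{code}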
[jira] [Updated] (ARROW-4259) [Plasma] CI failure in test_plasma_tf_op
[ https://issues.apache.org/jira/browse/ARROW-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4259: Fix Version/s: (was: 0.14.0) > [Plasma] CI failure in test_plasma_tf_op > > > Key: ARROW-4259 > URL: https://issues.apache.org/jira/browse/ARROW-4259 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma, Continuous Integration, Python >Reporter: Wes McKinney >Priority: Major > Labels: ci-failure > > Recently-appeared failure on master: > https://travis-ci.org/apache/arrow/jobs/479378188#L7108 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4286) [C++/R] Namespace vendored Boost
[ https://issues.apache.org/jira/browse/ARROW-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4286: Fix Version/s: (was: 0.14.0) > [C++/R] Namespace vendored Boost > > > Key: ARROW-4286 > URL: https://issues.apache.org/jira/browse/ARROW-4286 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Packaging, R >Reporter: Uwe L. Korn >Priority: Major > > For R, we vendor Boost and thus also include the symbols privately in our > modules. While they are private, some things like virtual destructors can > still interfere with other packages that vendor Boost. We should also > namespace the vendored Boost as we do in the manylinux1 packaging: > https://github.com/apache/arrow/blob/0f8bd747468dd28c909ef823bed77d8082a5b373/python/manylinux1/scripts/build_boost.sh#L28 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4217) [Plasma] Remove custom object metadata
[ https://issues.apache.org/jira/browse/ARROW-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4217: Fix Version/s: (was: 0.14.0) > [Plasma] Remove custom object metadata > -- > > Key: ARROW-4217 > URL: https://issues.apache.org/jira/browse/ARROW-4217 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Affects Versions: 0.11.1 >Reporter: Philipp Moritz >Assignee: Philipp Moritz >Priority: Minor > > Currently, Plasma supports custom metadata for objects. This doesn't seem to > be used at the moment, and removing it will simplify the interface and > implementation of plasma. Removing the custom metadata will also make > eviction to other blob stores easier (most other stores don't support custom > metadata). > My personal use case was to store arrow schemata in there, but they are now > stored as part of the object itself. > If nobody else is using this, I'd suggest removing it. If people really want > metadata, they could always store it as a separate object if desired. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4220) [Python] Add buffered input and output stream ASV benchmarks with simulated high latency IO
[ https://issues.apache.org/jira/browse/ARROW-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852570#comment-16852570 ] Wes McKinney commented on ARROW-4220: - cc [~jorisvandenbossche] > [Python] Add buffered input and output stream ASV benchmarks with simulated > high latency IO > --- > > Key: ARROW-4220 > URL: https://issues.apache.org/jira/browse/ARROW-4220 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > Follow up to ARROW-3126 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4283) [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?
[ https://issues.apache.org/jira/browse/ARROW-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4283: Fix Version/s: (was: 0.14.0) > [Python] Should RecordBatchStreamReader/Writer be AsyncIterable? > > > Key: ARROW-4283 > URL: https://issues.apache.org/jira/browse/ARROW-4283 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Paul Taylor >Priority: Minor > > Filing this issue after a discussion today with [~xhochy] about how to > implement streaming pyarrow http services. I had attempted to use both Flask > and [aiohttp|https://aiohttp.readthedocs.io/en/stable/streams.html]'s > streaming interfaces because they seemed familiar, but no dice. I have no > idea how hard this would be to add -- supporting all the asynciterable > primitives in JS was non-trivial. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4309) [Release] gen_apidocs docker-compose task is out of date
[ https://issues.apache.org/jira/browse/ARROW-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4309: Fix Version/s: (was: 0.14.0) > [Release] gen_apidocs docker-compose task is out of date > > > Key: ARROW-4309 > URL: https://issues.apache.org/jira/browse/ARROW-4309 > Project: Apache Arrow > Issue Type: Bug > Components: Developer Tools, Documentation >Reporter: Wes McKinney >Priority: Major > Labels: docker > > This needs to be updated to build with CUDA support (which in turn will > require the host machine to have nvidia-docker), among other things -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-4302) [C++] Add OpenSSL to C++ build toolchain
[ https://issues.apache.org/jira/browse/ARROW-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-4302. - Resolution: Fixed > [C++] Add OpenSSL to C++ build toolchain > > > Key: ARROW-4302 > URL: https://issues.apache.org/jira/browse/ARROW-4302 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Deepak Majeti >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > This is needed for encryption support for Parquet, among other things. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4301) [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva submodule
[ https://issues.apache.org/jira/browse/ARROW-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852571#comment-16852571 ] Wes McKinney commented on ARROW-4301: - [~pravindra] any ideas about this? This will get us again in 0.14 if it is not fixed > [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva > submodule > --- > > Key: ARROW-4301 > URL: https://issues.apache.org/jira/browse/ARROW-4301 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva, Java >Reporter: Wes McKinney >Assignee: Praveen Kumar Desabandu >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See > https://github.com/apache/arrow/commit/a486db8c1476be1165981c4fe22996639da8e550. > This is breaking the build so I'm going to patch manually -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4465) [Rust] [DataFusion] Add support for ORDER BY
[ https://issues.apache.org/jira/browse/ARROW-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4465: Fix Version/s: (was: 0.14.0) > [Rust] [DataFusion] Add support for ORDER BY > > > Key: ARROW-4465 > URL: https://issues.apache.org/jira/browse/ARROW-4465 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Priority: Major > > As a user, I would like to be able to specify an ORDER BY clause on my query. > Work involved: > * Add OrderBy to LogicalPlan enum > * Write query planner code to translate SQL AST to OrderBy (SQL parser that > we use already supports parsing ORDER BY) > * Implement SortRelation > My high level thoughts on implementing the SortRelation: > * Create Arrow array of uint32 same size as batch and populate such that > each element contains its own index i.e. array will be 0, 1, 2, 3 > * Find a Rust crate for sorting that allows us to provide our own comparison > lambda > * Implement the comparison logic (probably can reuse existing execution code > - see filter.rs for how it implements comparison expressions) > * Use index array to store the result of the sort i.e. no need to rewrite > the whole batch, just the index > * Rewrite the batch after the sort has completed > It would also be good to see how Gandiva has implemented this > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
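The index-array approach outlined in the ticket is language-agnostic; the illustration below shows the same idea in Python with pyarrow (the ticket itself targets the Rust SortRelation): sort a 0..n-1 index by the column values, then gather rows through the sorted index instead of rewriting the whole batch. Array.take is assumed to be available.

{code}
import pyarrow as pa

column = pa.array([30, 10, 20, 5])

# Build the identity index 0..n-1 and sort it by the column's values,
# mirroring the proposed SortRelation: the batch itself is not rewritten.
indices = sorted(range(len(column)), key=lambda i: column[i].as_py())

# Gather rows through the sorted index.
print(column.take(pa.array(indices, type=pa.int32())))
{code}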
[jira] [Commented] (ARROW-4439) [C++] Improve FindBrotli.cmake
[ https://issues.apache.org/jira/browse/ARROW-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852586#comment-16852586 ] Wes McKinney commented on ARROW-4439: - [~rip@gmail.com] is this OK in master now? > [C++] Improve FindBrotli.cmake > -- > > Key: ARROW-4439 > URL: https://issues.apache.org/jira/browse/ARROW-4439 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Renat Valiullin >Assignee: Renat Valiullin >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 4h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4453) [Python] Create Cython wrappers for SparseTensor
[ https://issues.apache.org/jira/browse/ARROW-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4453: Fix Version/s: (was: 0.14.0) > [Python] Create Cython wrappers for SparseTensor > > > Key: ARROW-4453 > URL: https://issues.apache.org/jira/browse/ARROW-4453 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Philipp Moritz >Assignee: Rok Mihevc >Priority: Minor > > We should have cython wrappers for [https://github.com/apache/arrow/pull/2546] > This is related to support for > https://issues.apache.org/jira/browse/ARROW-4223 and > https://issues.apache.org/jira/browse/ARROW-4224 > I imagine the code would be similar to > https://github.com/apache/arrow/blob/5a502d281545402240e818d5fd97a9aaf36363f2/python/pyarrow/array.pxi#L748 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-4447) [C++] Investigate dynamic linking for libthrift
[ https://issues.apache.org/jira/browse/ARROW-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-4447. - Resolution: Fixed Assignee: Uwe L. Korn Thrift is now dynamically linked > [C++] Investigate dynamic linking for libthrift > -- > > Key: ARROW-4447 > URL: https://issues.apache.org/jira/browse/ARROW-4447 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.14.0 > > We're currently only linking statically against {{libthrift}}. Distributions > would often prefer a dynamic linkage to libraries where possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4470) [Python] Pyarrow using considerably more memory when reading partitioned Parquet file
[ https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4470: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Pyarrow using considerable more memory when reading partitioned > Parquet file > - > > Key: ARROW-4470 > URL: https://issues.apache.org/jira/browse/ARROW-4470 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 >Reporter: Ivan SPM >Priority: Major > Labels: datasets, parquet > Fix For: 0.15.0 > > > Hi, > I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, > with the following structure: > {{/data/myparquettable/year=2016}}{{/data/myparquettable/year=2016/myfile_1.prt}} > {{/data/myparquettable/year=2016/myfile_2.prt}} > {{/data/myparquettable/year=2016/myfile_3.prt}} > {{/data/myparquettable/year=2017}} > {{/data/myparquettable/year=2017/myfile_1.prt}} > {{/data/myparquettable/year=2017/myfile_2.prt}} > {{/data/myparquettable/year=2017/myfile_3.prt}} > and so on. I need to work with one partition, so I copied one partition to a > local filesystem: > {{hdfs fs -get /data/myparquettable/year=2017 /local/}} > so now I have some data on the local disk: > {{/local/year=2017/myfile_1.prt }}{{/local/year=2017/myfile_2.prt }} > etc.I tried to read it using Pyarrow: > {{import pyarrow.parquet as pq}}{{pq.read_parquet('/local/year=2017')}} > and it starts reading. The problem is that the local Parquet files are around > 15GB total, and I blew up my machine memory a couple of times because when > reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure > how much it will take because it never finishes. Is this expected? Is there a > workaround? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
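Not a fix for the reported blow-up, but a commonly suggested mitigation while it is investigated: read one of the local files a row group at a time instead of materializing the whole partition. The sketch below uses pyarrow's ParquetFile row-group API; the path follows the reporter's layout and is otherwise an assumption.

{code}
import pyarrow.parquet as pq

pf = pq.ParquetFile('/local/year=2017/myfile_1.prt')

# Process the file incrementally, one row group at a time,
# so peak memory stays close to a single row group's size.
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)
    # ... process `table`, then let it go out of scope ...
    del table
{code}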
[jira] [Commented] (ARROW-4479) [Plasma] Add S3 as external store for Plasma
[ https://issues.apache.org/jira/browse/ARROW-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852589#comment-16852589 ] Wes McKinney commented on ARROW-4479: - What is the status of this project? > [Plasma] Add S3 as external store for Plasma > > > Key: ARROW-4479 > URL: https://issues.apache.org/jira/browse/ARROW-4479 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Plasma >Affects Versions: 0.12.0 >Reporter: Anurag Khandelwal >Assignee: Anurag Khandelwal >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Adding S3 as an external store will allow objects to be evicted to S3 when > Plasma runs out of memory capacity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4482) [Website] Add blog archive page
[ https://issues.apache.org/jira/browse/ARROW-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4482: Fix Version/s: (was: 0.14.0) 0.15.0 > [Website] Add blog archive page > --- > > Key: ARROW-4482 > URL: https://issues.apache.org/jira/browse/ARROW-4482 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > There's no easy way to get a bulleted list of all blog posts on the Arrow > website. See example archive on my personal blog > http://wesmckinney.com/archives.html > It would be great to have such a generated archive on our website -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4473) [Website] Add instructions to do a test-deploy of Arrow website and fix bugs
[ https://issues.apache.org/jira/browse/ARROW-4473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4473: Fix Version/s: (was: 0.14.0) 0.15.0 > [Website] Add instructions to do a test-deploy of Arrow website and fix bugs > > > Key: ARROW-4473 > URL: https://issues.apache.org/jira/browse/ARROW-4473 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > This will help with testing and proofing the website. > I have noticed that there are bugs in the website when the baseurl is not a > foo.bar.baz, e.g. if you deploy at root foo.bar.baz/test-site many images and > links are broken -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4470) [Python] Pyarrow using considerably more memory when reading partitioned Parquet file
[ https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4470: Labels: datasets parquet (was: parquet) > [Python] Pyarrow using considerable more memory when reading partitioned > Parquet file > - > > Key: ARROW-4470 > URL: https://issues.apache.org/jira/browse/ARROW-4470 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 >Reporter: Ivan SPM >Priority: Major > Labels: datasets, parquet > Fix For: 0.14.0 > > > Hi, > I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, > with the following structure: > {{/data/myparquettable/year=2016}}{{/data/myparquettable/year=2016/myfile_1.prt}} > {{/data/myparquettable/year=2016/myfile_2.prt}} > {{/data/myparquettable/year=2016/myfile_3.prt}} > {{/data/myparquettable/year=2017}} > {{/data/myparquettable/year=2017/myfile_1.prt}} > {{/data/myparquettable/year=2017/myfile_2.prt}} > {{/data/myparquettable/year=2017/myfile_3.prt}} > and so on. I need to work with one partition, so I copied one partition to a > local filesystem: > {{hdfs fs -get /data/myparquettable/year=2017 /local/}} > so now I have some data on the local disk: > {{/local/year=2017/myfile_1.prt }}{{/local/year=2017/myfile_2.prt }} > etc.I tried to read it using Pyarrow: > {{import pyarrow.parquet as pq}}{{pq.read_parquet('/local/year=2017')}} > and it starts reading. The problem is that the local Parquet files are around > 15GB total, and I blew up my machine memory a couple of times because when > reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure > how much it will take because it never finishes. Is this expected? Is there a > workaround? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5452) [R] Add documentation website (pkgdown)
[ https://issues.apache.org/jira/browse/ARROW-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852326#comment-16852326 ] Wes McKinney commented on ARROW-5452: - Yeah, for generated API docs that is fine, if we start writing prose documentation for R we should consider doing it in a common place > [R] Add documentation website (pkgdown) > --- > > Key: ARROW-5452 > URL: https://issues.apache.org/jira/browse/ARROW-5452 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 10m > Remaining Estimate: 0h > > pkgdown ([https://pkgdown.r-lib.org/]) is the standard for R package > documentation websites. Build this for arrow and deploy it at > https://arrow.apache.org/docs/r. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3054) [Packaging] Tooling to enable nightly conda packages to be updated to some anaconda.org channel
[ https://issues.apache.org/jira/browse/ARROW-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3054: Fix Version/s: (was: 0.14.0) > [Packaging] Tooling to enable nightly conda packages to be updated to some > anaconda.org channel > --- > > Key: ARROW-3054 > URL: https://issues.apache.org/jira/browse/ARROW-3054 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Affects Versions: 0.10.0 >Reporter: Phillip Cloud >Assignee: Krisztian Szucs >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3082) [C++] Add SSL support for hiveserver2
[ https://issues.apache.org/jira/browse/ARROW-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3082: Fix Version/s: (was: 0.14.0) > [C++] Add SSL support for hiveserver2 > - > > Key: ARROW-3082 > URL: https://issues.apache.org/jira/browse/ARROW-3082 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: HiveServer2 > > This amounts to using the TSSLSocket in Thrift -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3806) [Python] When converting nested types to pandas, use tuples
[ https://issues.apache.org/jira/browse/ARROW-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3806: Fix Version/s: (was: 0.14.0) > [Python] When converting nested types to pandas, use tuples > --- > > Key: ARROW-3806 > URL: https://issues.apache.org/jira/browse/ARROW-3806 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.11.1 > Environment: Fedora 29, pyarrow installed with conda >Reporter: Suvayu Ali >Priority: Minor > Labels: pandas > > When converting to pandas, convert nested types (e.g. list) to tuples. > Columns with lists are difficult to query. Here are a few unsuccessful > attempts: > {code} > >>> mini > CHROMPOS IDREFALTS QUAL > 80 20 63521 rs191905748 G [A] 100 > 81 20 63541 rs117322527 C [A] 100 > 82 20 63548 rs541129280 G[GT] 100 > 83 20 63553 rs536661806 T [C] 100 > 84 20 63555 rs553463231 T [C] 100 > 85 20 63559 rs138359120 C [A] 100 > 86 20 63586 rs545178789 T [G] 100 > 87 20 63636 rs374311122 G [A] 100 > 88 20 63696 rs149160003 A [G] 100 > 89 20 63698 rs544072005 A [C] 100 > 90 20 63729 rs181483669 G [A] 100 > 91 20 63733 rs75670495 C [T] 100 > 92 20 63799rs1418258 C [T] 100 > 93 20 63808 rs76004960 G [C] 100 > 94 20 63813 rs532151719 G [A] 100 > 95 20 63857 rs543686274 CCTGGAAAGGATT [C] 100 > 96 20 63865 rs551938596 G [A] 100 > 97 20 63902 rs571779099 A [T] 100 > 98 20 63963 rs531152674 G [A] 100 > 99 20 63967 rs116770801 A [G] 100 > 10020 63977 rs199703510 C [G] 100 > 10120 64016 rs143263863 G [A] 100 > 10220 64062 rs148297240 G [A] 100 > 10320 64139 rs186497980 G [A, T] 100 > 10420 64150rs7274499 C [A] 100 > 10520 64151 rs190945171 C [T] 100 > 10620 64154 rs537656456 T [G] 100 > 10720 64175 rs116531220 A [G] 100 > 10820 64186 rs141793347 C [G] 100 > 10920 64210 rs182418654 G [C] 100 > 11020 64303 rs559929739 C [A] 100 > {code} > # I think this one fails because it tries to broadcast the comparison. > {code} > >>> mini[mini.ALTS == ["A", "T"]] > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line > 1283, in wrapper > res = na_op(values, other) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line > 1143, in na_op > result = _comp_method_OBJECT_ARRAY(op, x, y) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line > 1120, in _comp_method_OBJECT_ARRAY > result = libops.vec_compare(x, y, op) > File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare > ValueError: Arrays were different lengths: 31 vs 2 > {code} > # I think this fails due to a similar reason, but the broadcasting is > happening at a different place. 
> {code} > >>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])] > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", > line 2682, in __getitem__ > return self._getitem_array(key) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", > line 2726, in _getitem_array > indexer = self.loc._convert_to_indexer(key, axis=1) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", > line 1314, in _convert_to_indexer > indexer = check = labels.get_indexer(objarr) > File > "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", > line 3259, in get_indexer > indexer = self._engine.get_indexer(target._ndarray_values) > File "pandas/_libs/index.pyx", line 301, in > pandas._libs.index.IndexEngine.get_indexer > File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in > pandas._libs.hashtable.PyObjectHashTable.lookup > TypeError: unhashable type: 'numpy.ndarray' > >>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head() > 80 [True, False] > 81 [True, False] > 82[False, False] > 83
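A workaround along the lines the ticket suggests can be applied on the pandas side today: convert the list column to tuples, which are hashable and can be filtered with the usual idioms. This is only an illustrative sketch with made-up column data, not the proposed pyarrow change.

{code}
import pandas as pd

df = pd.DataFrame({'ALTS': [['A'], ['A', 'T'], ['GT']]})

# Tuples are hashable, so exact-match and membership filters work
# once the list column has been converted.
alts = df.ALTS.apply(tuple)
print(df[alts.isin([('A', 'T')])])           # rows whose ALTS is exactly ['A', 'T']
print(df[alts.isin([('A',), ('GT',)])])      # rows matching any of several values
{code}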
[jira] [Updated] (ARROW-3789) [Python] Enable calling object in Table.to_pandas to "self-destruct" for improved memory use
[ https://issues.apache.org/jira/browse/ARROW-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3789: Fix Version/s: (was: 0.14.0) > [Python] Enable calling object in Table.to_pandas to "self-destruct" for > improved memory use > > > Key: ARROW-3789 > URL: https://issues.apache.org/jira/browse/ARROW-3789 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > > One issue with using {{Table.to_pandas}} is that it results in a memory > doubling (at least, more if there are a lot of Python objects created). It > would be useful if there was an option to destroy the {{arrow::Column}} > references once they've been transferred into the target data frame. This > would render the {{pyarrow.Table}} object useless afterward -- This message was sent by Atlassian JIRA (v7.6.3#76005)
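For context, an option along these lines was eventually exposed in later pyarrow releases as a to_pandas keyword; the sketch below assumes such a release (self_destruct is not available in the 0.14 timeframe) and shows the intended usage: the Table hands its buffers over during conversion and must not be used afterwards.

{code}
import pyarrow as pa

table = pa.table({'x': list(range(1_000_000))})

# Assumes a pyarrow release with the self_destruct option (added after this ticket).
df = table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False)

# `table` is no longer usable once its columns have been released.
del table
{code}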
[jira] [Updated] (ARROW-3764) [C++] Port Python "ParquetDataset" business logic to C++
[ https://issues.apache.org/jira/browse/ARROW-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3764: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Port Python "ParquetDataset" business logic to C++ > > > Key: ARROW-3764 > URL: https://issues.apache.org/jira/browse/ARROW-3764 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: datasets, parquet > Fix For: 0.15.0 > > > Along with defining appropriate abstractions for dealing with generic > filesystems in C++, we should implement the machinery for reading multiple > Parquet files in C++ so that it can reused in GLib, R, and Ruby. Otherwise > these languages will have to reimplement things, and this would surely result > in inconsistent features, bugs in some implementations but not others -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3759) [R][CI] Build and test on Windows in Appveyor
[ https://issues.apache.org/jira/browse/ARROW-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852548#comment-16852548 ] Wes McKinney commented on ARROW-3759: - cc [~npr] > [R][CI] Build and test on Windows in Appveyor > - > > Key: ARROW-3759 > URL: https://issues.apache.org/jira/browse/ARROW-3759 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3873) [C++] Build shared libraries consistently with -fvisibility=hidden
[ https://issues.apache.org/jira/browse/ARROW-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852552#comment-16852552 ] Wes McKinney commented on ARROW-3873: - I might take another crack at this to see if it is doable, but after 0.14 > [C++] Build shared libraries consistently with -fvisibility=hidden > -- > > Key: ARROW-3873 > URL: https://issues.apache.org/jira/browse/ARROW-3873 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/2437 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
[ https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852549#comment-16852549 ] Wes McKinney commented on ARROW-3801: - cc [~jorisvandenbossche] > [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable > > > Key: ARROW-3801 > URL: https://issues.apache.org/jira/browse/ARROW-3801 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.10.0 >Reporter: Thomas Buhrmann >Priority: Major > Fix For: 0.14.0 > > > Serializing and deserializing a pandas series with categorical dtype will > make the categorical index non-writeable, which in turn trips up pandas when > e.g. reordering the categories, raising "ValueError: buffer source array is > read-only" : > {code} > import pandas as pd > import pyarrow as pa > df = pd.Series([1,2,3], dtype='category', name="c1").to_frame() > print("DType before:", repr(df.c1.dtype)) > print("Writeable:", df.c1.cat.categories.values.flags.writeable) > ro = df.c1.cat.reorder_categories([3,2,1]) > print("DType reordered:", repr(ro.dtype), "\n") > tbl = pa.Table.from_pandas(df) > df2 = tbl.to_pandas() > print("DType after:", repr(df2.c1.dtype)) > print("Writeable:", df2.c1.cat.categories.values.flags.writeable) > ro = df2.c1.cat.reorder_categories([3,2,1]) > print("DType reordered:", repr(ro.dtype), "\n") > {code} > > Outputs: > > {code:java} > DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False) > Writeable: True > DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False) > DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False) > Writeable: False > --- > ValueError Traceback (most recent call last) > in > 12 print("DType after:", repr(df2.c1.dtype)) > 13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable) > ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1]) > 15 print("DType reordered:", repr(ro.dtype), "\n") > {code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3827) [Rust] Implement UnionArray
[ https://issues.apache.org/jira/browse/ARROW-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3827: Fix Version/s: (was: 0.14.0) > [Rust] Implement UnionArray > --- > > Key: ARROW-3827 > URL: https://issues.apache.org/jira/browse/ARROW-3827 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4208) [CI/Python] Have automated tests for S3
[ https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4208: Labels: filesystem s3 (was: s3) > [CI/Python] Have automated tests for S3 > - > > Key: ARROW-4208 > URL: https://issues.apache.org/jira/browse/ARROW-4208 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: filesystem, s3 > Fix For: 0.14.0 > > > Currently we don't run S3 integration tests regularly. > Possible solutions: > - mock it within python/pytest > - simply run the S3 tests with an S3 credential provided > - create an hdfs-integration-like docker-compose setup and run an S3 mock > server (e.g.: https://github.com/adobe/S3Mock, > https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy, > https://github.com/jserver/mock-s3) > For more see discussion https://github.com/apache/arrow/pull/3286 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4095) [C++] Implement optimizations for dictionary unification where dictionaries are prefixes of the unified dictionary
[ https://issues.apache.org/jira/browse/ARROW-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4095: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Implement optimizations for dictionary unification where dictionaries > are prefixes of the unified dictionary > -- > > Key: ARROW-4095 > URL: https://issues.apache.org/jira/browse/ARROW-4095 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > In the event that the unified dictionary contains other dictionaries as > prefixes (e.g. as the result of delta dictionaries), we can avoid memory > allocation and index transposition. > See discussion at > https://github.com/apache/arrow/pull/3165#discussion_r243020982 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
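A tiny illustration of the prefix case described above (plain Python, not the C++ unification code): when an input dictionary is a prefix of the unified dictionary, its indices are already correct and no transposition buffer needs to be allocated.

{code}
unified = ['a', 'b', 'c', 'd']      # result of unifying several dictionaries
chunk_dict = ['a', 'b']             # e.g. an earlier delta dictionary
chunk_indices = [0, 1, 1, 0]

def is_prefix(small, big):
    return len(small) <= len(big) and big[:len(small)] == small

if is_prefix(chunk_dict, unified):
    # Fast path: indices are valid against the unified dictionary as-is.
    transposed = chunk_indices
else:
    # General path: remap each index through a lookup table.
    remap = {value: i for i, value in enumerate(unified)}
    transposed = [remap[chunk_dict[i]] for i in chunk_indices]

assert [unified[i] for i in transposed] == [chunk_dict[i] for i in chunk_indices]
{code}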
[jira] [Updated] (ARROW-4133) [C++/Python] ORC adapter should fail gracefully if /etc/timezone is missing instead of aborting
[ https://issues.apache.org/jira/browse/ARROW-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4133: Fix Version/s: (was: 0.14.0) > [C++/Python] ORC adapter should fail gracefully if /etc/timezone is missing > instead of aborting > --- > > Key: ARROW-4133 > URL: https://issues.apache.org/jira/browse/ARROW-4133 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: orc > > The following core was genereted by nightly build: > https://travis-ci.org/kszucs/crossbow/builds/473397855 > {code} > Core was generated by `/opt/conda/bin/python /opt/conda/bin/pytest -v > --pyargs pyarrow'. > Program terminated with signal SIGABRT, Aborted. > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 > 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. > [Current thread is 1 (Thread 0x7fea61f9e740 (LWP 179))] > (gdb) bt > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 > #1 0x7fea608c8801 in __GI_abort () at abort.c:79 > #2 0x7fea4b3483df in __gnu_cxx::__verbose_terminate_handler () > at > /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95 > #3 0x7fea4b346b16 in __cxxabiv1::__terminate (handler=) > at > /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47 > #4 0x7fea4b346b4c in std::terminate () > at > /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57 > #5 0x7fea4b346d28 in __cxxabiv1::__cxa_throw (obj=0x2039220, > tinfo=0x7fea494803d0 , > dest=0x7fea49087e52 ) > at > /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95 > #6 0x7fea49086824 in orc::getTimezoneByFilename (filename=...) > at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:704 > #7 0x7fea490868d2 in orc::getLocalTimezone () at > /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:713 > > #8 0x7fea49063e59 in > orc::RowReaderImpl::RowReaderImpl (this=0x204fe30, _contents=..., opts=...) > at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Reader.cc:185 > #9 0x7fea4906651e in orc::ReaderImpl::createRowReader (this=0x1fb41b0, > opts=...) 
> at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Reader.cc:630 > #10 0x7fea48c2d904 in > arrow::adapters::orc::ORCFileReader::Impl::ReadSchema (this=0x1270600, > opts=..., > > out=0x7ffe0ccae7b0) at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:264 > #11 0x7fea48c2e18d in arrow::adapters::orc::ORCFileReader::Impl::Read > (this=0x1270600, out=0x7ffe0ccaea00) > at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:302 > #12 0x7fea48c2a8b9 in arrow::adapters::orc::ORCFileReader::Read > (this=0x1e14d10, out=0x7ffe0ccaea00) > at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:697 > > > #13 0x7fea48218c9d in __pyx_pf_7pyarrow_4_orc_9ORCReader_12read > (__pyx_v_self=0x7fea43de8688, > __pyx_v_include_indices=0x7fea61d07b70 <_Py_NoneStruct>) at _orc.cpp:3865 > #14 0x7fea48218b31 in __pyx_pw_7pyarrow_4_orc_9ORCReader_13read > (__pyx_v_self=0x7fea43de8688, > __pyx_args=0x7fea61f5e048, __pyx_kwds=0x7fea444f78b8) at _orc.cpp:3813 > #15 0x7fea61910cbd in _PyCFunction_FastCallDict > (func_obj=func_obj@entry=0x7fea444b9558, > args=args@entry=0x7fea44a40fa8, nargs=nargs@entry=0, > kwargs=kwargs@entry=0x7fea444f78b8) > at Objects/methodobject.c:231 > #16 0x7fea61910f16 in _PyCFunction_FastCallKeywords > (func=func@entry=0x7fea444b9558, > stack=stack@entry=0x7fea44a40fa8, nargs=0, > kwnames=kwnames@entry=0x7fea47d81d30) at Objects/methodobject.c:294 > #17 0x7fea619aa0da in call_function > (pp_stack=pp_stack@entry=0x7ffe0ccaecf0, oparg=, > kwnames=kwnames@entry=0x7fea47d81d30) at Python/ceval.c:4837 > #18 0x7fea619abb46 in _PyEval_EvalFrameDefault (f=, > throwflag=) > at Python/ceval.c:3351 > #19 0x7fea619a9cde in _PyEval_EvalCodeWithName (_co=0x7fea47d9f6f0, >
[jira] [Updated] (ARROW-4090) [Python] Table.flatten() doesn't work recursively
[ https://issues.apache.org/jira/browse/ARROW-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4090: Fix Version/s: (was: 0.14.0) > [Python] Table.flatten() doesn't work recursively > - > > Key: ARROW-4090 > URL: https://issues.apache.org/jira/browse/ARROW-4090 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Francisco Sanchez >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The pyarrow.Table.flatten() function does not work recursively, > nor does it provide a parameter to do so. > {code} > test1c_data = {'level1-A': 'abc', >'level1-B': 112233, >'level1-C': {'x': 123.111, 'y': 123.222, 'z': 123.333} > } > test1c_type = pa.struct([('level1-A', pa.string()), > ('level1-B', pa.int32()), > ('level1-C', pa.struct([('x', pa.float64()), > ('y', pa.float64()), > ('z', pa.float64()) > ])) > ]) > test1c_array = pa.array([test1c_data]*5, type=test1c_type) > test1c_table = pa.Table.from_arrays([test1c_array], names=['msg']) > print('{}\n\n{}\n\n{}'.format(test1c_table.schema, > test1c_table.flatten().schema, > test1c_table.flatten().flatten().schema)) > {code} > output: > {quote}msg: struct<level1-A: string, level1-B: int32, level1-C: struct<x: double, y: double, z: double>> > child 0, level1-A: string > child 1, level1-B: int32 > child 2, level1-C: struct<x: double, y: double, z: double> > child 0, x: double > child 1, y: double > child 2, z: double > msg.level1-A: string > msg.level1-B: int32 > msg.level1-C: struct<x: double, y: double, z: double> > child 0, x: double > child 1, y: double > child 2, z: double > msg.level1-A: string > msg.level1-B: int32 > msg.level1-C.x: double > msg.level1-C.y: double > msg.level1-C.z: double > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
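Until a recursive option lands, the practical workaround is to call flatten() repeatedly. A minimal sketch, assuming only the public Table and schema APIs; the flatten_fully helper name is hypothetical and not part of pyarrow:

{code}
import pyarrow as pa

def flatten_fully(table):
    # Table.flatten() only expands one level of struct nesting per call,
    # so loop until no struct columns remain in the schema.
    while any(pa.types.is_struct(field.type) for field in table.schema):
        table = table.flatten()
    return table
{code}

Applied to the example above, two passes are needed before the struct column is fully expanded, which matches the three printed schemas.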
[jira] [Updated] (ARROW-4202) [Gandiva] use ArrayFromJson in tests
[ https://issues.apache.org/jira/browse/ARROW-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4202: Fix Version/s: (was: 0.14.0) > [Gandiva] use ArrayFromJson in tests > > > Key: ARROW-4202 > URL: https://issues.apache.org/jira/browse/ARROW-4202 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Pindikura Ravindra >Priority: Major > > Most of the Gandiva tests use wrappers over ArrayFromVector. These will > become a lot more readable if we switch to ArrayFromJSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4146) [C++] Extend visitor functions to include ArrayBuilder and allow callable visitors
[ https://issues.apache.org/jira/browse/ARROW-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4146: Fix Version/s: (was: 0.14.0) > [C++] Extend visitor functions to include ArrayBuilder and allow callable > visitors > -- > > Key: ARROW-4146 > URL: https://issues.apache.org/jira/browse/ARROW-4146 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Priority: Minor > > In addition to accepting objects with Visit methods for the visited type, > {{Visit(Array|Type)}} and {{Visit(Array|Type)Inline}} should accept objects > with overloaded call operators. > In addition for inline visitation if a visitor can only visit one of the > potential unboxings then this can be detected at compile time and the full > type_id switch can be avoided (if the unboxed object cannot be visited then > do nothing). For example: > {code} > VisitTypeInline(some_type, [](const StructType& s) { > // only execute this if some_type.id() == Type::STRUCT > }); > {code} > Finally, visit functions should be added for visiting ArrayBuilders -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4201) [C++][Gandiva] integrate test utils with arrow
[ https://issues.apache.org/jira/browse/ARROW-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4201: Fix Version/s: (was: 0.14.0) > [C++][Gandiva] integrate test utils with arrow > -- > > Key: ARROW-4201 > URL: https://issues.apache.org/jira/browse/ARROW-4201 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Pindikura Ravindra >Priority: Major > > The following tasks to be addressed as part of this Jira : > # move (or consolidate) data generators in generate_data.h to arrow > # move convenience fns in gandiva/tests/test_util.h to arrow > # move (or consolidate) EXPECT_ARROW_* fns to arrow -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4208) [CI/Python] Have automated tests for S3
[ https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4208: Fix Version/s: (was: 0.14.0) 0.15.0 > [CI/Python] Have automated tests for S3 > - > > Key: ARROW-4208 > URL: https://issues.apache.org/jira/browse/ARROW-4208 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Priority: Major > Labels: filesystem, s3 > Fix For: 0.15.0 > > > Currently we don't run S3 integration tests regularly. > Possible solutions: > - mock it within python/pytest > - simply run the s3 tests with an S3 credential provided > - create an hdfs-integration-like docker-compose setup and run an S3 mock > server (e.g.: https://github.com/adobe/S3Mock, > https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy, > https://github.com/jserver/mock-s3) > For more, see the discussion at https://github.com/apache/arrow/pull/3286 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
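If the docker-compose route is taken, a test could point pyarrow at one of the mock servers listed above through s3fs. A rough sketch, assuming a mock S3 endpoint on localhost:9000; the credentials, bucket, and file names are placeholders:

{code}
import pyarrow.parquet as pq
import s3fs

# Placeholder credentials and endpoint for a local S3 mock server.
fs = s3fs.S3FileSystem(key='test', secret='test',
                       client_kwargs={'endpoint_url': 'http://localhost:9000'})

# Read a Parquet file from the mock bucket through a file-like object.
with fs.open('test-bucket/example.parquet', 'rb') as f:
    table = pq.read_table(f)
{code}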
[jira] [Created] (ARROW-5456) [GLib][Plasma] Installed plasma-glib may be used when building documentation
Kouhei Sutou created ARROW-5456: --- Summary: [GLib][Plasma] Installed plasma-glib may be used when building documentation Key: ARROW-5456 URL: https://issues.apache.org/jira/browse/ARROW-5456 Project: Apache Arrow Issue Type: Bug Components: GLib Affects Versions: 0.13.0 Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5456) [GLib][Plasma] Installed plasma-glib may be used when building documentation
[ https://issues.apache.org/jira/browse/ARROW-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5456: -- Labels: pull-request-available (was: ) > [GLib][Plasma] Installed plasma-glib may be used when building documentation > - > > Key: ARROW-5456 > URL: https://issues.apache.org/jira/browse/ARROW-5456 > Project: Apache Arrow > Issue Type: Bug > Components: GLib >Affects Versions: 0.13.0 >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5458) Apache Arrow parallel CRC32c computation optimization
[ https://issues.apache.org/jira/browse/ARROW-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852610#comment-16852610 ] Yuqi Gu commented on ARROW-5458: PR: https://github.com/apache/arrow/pull/4427 > Apache Arrow parallel CRC32c computation optimization > - > > Key: ARROW-5458 > URL: https://issues.apache.org/jira/browse/ARROW-5458 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yuqi Gu >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > ARMv8 defines the VMULL/PMULL crypto instructions. > This patch optimizes the crc32c calculation with these instructions when > available, rather than with the original linear CRC instructions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5452) [R] Add documentation website (pkgdown)
[ https://issues.apache.org/jira/browse/ARROW-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5452: -- Labels: pull-request-available (was: ) > [R] Add documentation website (pkgdown) > --- > > Key: ARROW-5452 > URL: https://issues.apache.org/jira/browse/ARROW-5452 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > pkgdown ([https://pkgdown.r-lib.org/]) is the standard for R package > documentation websites. Build this for arrow and deploy it at > https://arrow.apache.org/docs/r. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1988) [Python] Extend flavor=spark in Parquet writing to handle INT types
[ https://issues.apache.org/jira/browse/ARROW-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1988: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Extend flavor=spark in Parquet writing to handle INT types > --- > > Key: ARROW-1988 > URL: https://issues.apache.org/jira/browse/ARROW-1988 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > See the relevant code sections at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L139 > We should cater for them in the {{pyarrow}} code and also reach out to Spark > developers so that they are supported there in the longterm. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
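For reference, the Spark flavor is already exposed on the Parquet writer; the request here is to extend what it covers to the integer types Spark's reader expects. A small usage sketch, assuming the long-standing flavor keyword of pyarrow.parquet.write_table; the file name is a placeholder:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([1, 2, 3], type=pa.int8())
table = pa.Table.from_arrays([arr], names=['small_ints'])

# flavor='spark' applies Spark-oriented compatibility settings when writing;
# this issue asks that narrow/unsigned integer types be handled there as well.
pq.write_table(table, 'spark_compatible.parquet', flavor='spark')
{code}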
[jira] [Updated] (ARROW-1987) [Website] Enable Docker-based documentation generator to build at a specific Arrow commit
[ https://issues.apache.org/jira/browse/ARROW-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1987: Fix Version/s: 0.15.0 > [Website] Enable Docker-based documentation generator to build at a specific > Arrow commit > - > > Key: ARROW-1987 > URL: https://issues.apache.org/jira/browse/ARROW-1987 > Project: Apache Arrow > Issue Type: Bug > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > Currently both the Docker setup and the Arrow repo have to be at the same > commit. It would be useful to create a checkout in the Docker image and > enable the build version to be passed in -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1989) [Python] Better UX on timestamp conversion to Pandas
[ https://issues.apache.org/jira/browse/ARROW-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852495#comment-16852495 ] Wes McKinney commented on ARROW-1989: - [~jorisvandenbossche] potentially of interest? > [Python] Better UX on timestamp conversion to Pandas > > > Key: ARROW-1989 > URL: https://issues.apache.org/jira/browse/ARROW-1989 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Priority: Major > Fix For: 0.14.0 > > > When converting timestamp columns to Pandas, users often have dates that > fall outside the range Pandas can represent with its > nanosecond representation. Currently they simply see an Arrow exception and > think that the problem is caused by Arrow. We should try to change the error > from > {code} > ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: XX > {code} > to something along the lines of > {code} > ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: > XX. This conversion is needed because Pandas only supports nanosecond > timestamps. Your data is likely out of the range that can be represented with > nanosecond resolution. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
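The message in question comes from a safe cast that would drop sub-microsecond precision. A minimal reproduction sketch; the timestamp value is illustrative:

{code}
import pyarrow as pa

# One nanosecond past 2017-01-01 00:00:00 UTC cannot be represented exactly
# in microseconds.
arr = pa.array([1483228800000000001], type=pa.timestamp('ns'))

# Raises ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data
arr.cast(pa.timestamp('us'))
{code}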
[jira] [Commented] (ARROW-2006) [C++] Add option to trim excess padding when writing IPC messages
[ https://issues.apache.org/jira/browse/ARROW-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852496#comment-16852496 ] Wes McKinney commented on ARROW-2006: - Our IPC methods lack configurability in general. We may want to introduce an IpcOptions struct > [C++] Add option to trim excess padding when writing IPC messages > - > > Key: ARROW-2006 > URL: https://issues.apache.org/jira/browse/ARROW-2006 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > This will help with situations like > [https://github.com/apache/arrow/issues/1467] where we don't really need the > extra padding bytes -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1987) [Website] Enable Docker-based documentation generator to build at a specific Arrow commit
[ https://issues.apache.org/jira/browse/ARROW-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1987: Fix Version/s: (was: 0.14.0) > [Website] Enable Docker-based documentation generator to build at a specific > Arrow commit > - > > Key: ARROW-1987 > URL: https://issues.apache.org/jira/browse/ARROW-1987 > Project: Apache Arrow > Issue Type: Bug > Components: Website >Reporter: Wes McKinney >Priority: Major > > Currently both the Docker setup and the Arrow repo have to be at the same > commit. It would be useful to create a checkout in the Docker image and > enable the build version to be passed in -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1957) [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit
[ https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1957: --- Assignee: TP Boudreau > [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit > > > Key: ARROW-1957 > URL: https://issues.apache.org/jira/browse/ARROW-1957 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.8.0 > Environment: Python 3.6.4. Mac OSX and CentOS Linux release > 7.3.1611. Pandas 0.21.1 . >Reporter: Jordan Samuels >Assignee: TP Boudreau >Priority: Minor > Labels: parquet > Fix For: 0.14.0 > > > The following code > {code} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > n=3 > df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', > freq='1n', periods=n)) > pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code} > results in: > {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: > 14832288001}} > The desired effect is that we can save nanosecond resolution without losing > precision (e.g. conversion to ms). Note that if {{freq='1u'}} is used, the > code runs properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
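Until the NANO logical type is written, the snippet above can only succeed by opting into truncation. A workaround sketch, assuming the existing coerce_timestamps and allow_truncated_timestamps options of write_table; it loses exactly the precision this issue wants to preserve:

{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

n = 3
df = pd.DataFrame({'x': range(n)},
                  index=pd.date_range('2017-01-01', freq='1n', periods=n))

# Explicitly allow truncation to microseconds instead of raising ArrowInvalid.
pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet',
               coerce_timestamps='us', allow_truncated_timestamps=True)
{code}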
[jira] [Updated] (ARROW-1959) [Python] Add option for "lossy" conversions (overflow -> null) from timestamps to datetime.datetime / pandas.Timestamp
[ https://issues.apache.org/jira/browse/ARROW-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1959: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Add option for "lossy" conversions (overflow -> null) from > timestamps to datetime.datetime / pandas.Timestamp > -- > > Key: ARROW-1959 > URL: https://issues.apache.org/jira/browse/ARROW-1959 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > See discussion in > https://stackoverflow.com/questions/47946038/overflow-error-using-datetimes-with-pyarrow -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1846) [C++] Implement "any" reduction kernel for boolean data
[ https://issues.apache.org/jira/browse/ARROW-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1846: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Implement "any" reduction kernel for boolean data > --- > > Key: ARROW-1846 > URL: https://issues.apache.org/jira/browse/ARROW-1846 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: analytics, dataframe > Fix For: 0.15.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1957) [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit
[ https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852494#comment-16852494 ] Wes McKinney commented on ARROW-1957: - [~tpboudreau] I assume this is on your critical path > [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit > > > Key: ARROW-1957 > URL: https://issues.apache.org/jira/browse/ARROW-1957 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.8.0 > Environment: Python 3.6.4. Mac OSX and CentOS Linux release > 7.3.1611. Pandas 0.21.1 . >Reporter: Jordan Samuels >Assignee: TP Boudreau >Priority: Minor > Labels: parquet > Fix For: 0.14.0 > > > The following code > {code} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > n=3 > df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', > freq='1n', periods=n)) > pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code} > results in: > {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: > 14832288001}} > The desired effect is that we can save nanosecond resolution without losing > precision (e.g. conversion to ms). Note that if {{freq='1u'}} is used, the > code runs properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1837) [Java] Unable to read unsigned integers outside signed range for bit width in integration tests
[ https://issues.apache.org/jira/browse/ARROW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852493#comment-16852493 ] Wes McKinney commented on ARROW-1837: - [~emkornfi...@gmail.com] if you are interested in unsigned integers this would benefit from some attention > [Java] Unable to read unsigned integers outside signed range for bit width in > integration tests > --- > > Key: ARROW-1837 > URL: https://issues.apache.org/jira/browse/ARROW-1837 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Wes McKinney >Priority: Blocker > Labels: columnar-format-1.0 > Fix For: 0.14.0 > > Attachments: generated_primitive.json > > > I believe this was introduced recently (perhaps in the refactors), but there > was a problem where the integration tests weren't being properly run that hid > the error from us > see https://github.com/apache/arrow/pull/1294#issuecomment-345553066 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2077) [Python] Document on how to use Storefact & Arrow to read Parquet from S3/Azure/...
[ https://issues.apache.org/jira/browse/ARROW-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2077: Fix Version/s: (was: 0.14.0) > [Python] Document on how to use Storefact & Arrow to read Parquet from > S3/Azure/... > --- > > Key: ARROW-2077 > URL: https://issues.apache.org/jira/browse/ARROW-2077 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: parquet > > We're using this happily in production, also with column projection down to > the storage layer. Others should also benefit from this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2057) [Python] Configure size of data pages in pyarrow.parquet.write_table
[ https://issues.apache.org/jira/browse/ARROW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2057: --- Assignee: (was: Uwe L. Korn) > [Python] Configure size of data pages in pyarrow.parquet.write_table > > > Key: ARROW-2057 > URL: https://issues.apache.org/jira/browse/ARROW-2057 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner, parquet > Fix For: 0.14.0 > > > It would be useful to be able to set the size of data pages (within Parquet > column chunks) from Python. The current default is set to 1MiB at > https://github.com/apache/parquet-cpp/blob/0875e43010af485e1c0b506d77d7e0edc80c66cc/src/parquet/properties.h#L81. > It might be useful in some situations to lower this for more granular access. > We should provide this value as a parameter to > {{pyarrow.parquet.write_table}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
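A sketch of what the requested knob could look like from Python; the data_page_size keyword below mirrors the C++ writer property and is the proposed addition described in the issue, not an argument that existed at the time:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_arrays([pa.array(list(range(1000000)))], names=['x'])

# Proposed: smaller data pages (here 64 KiB instead of the 1 MiB default)
# within each column chunk, for more granular reads.
pq.write_table(table, 'example.parquet', data_page_size=64 * 1024)
{code}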
[jira] [Updated] (ARROW-2098) [Python] Implement "errors as null" option when coercing Python object arrays to Arrow format
[ https://issues.apache.org/jira/browse/ARROW-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2098: Fix Version/s: (was: 0.14.0) > [Python] Implement "errors as null" option when coercing Python object arrays > to Arrow format > - > > Key: ARROW-2098 > URL: https://issues.apache.org/jira/browse/ARROW-2098 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > Inspired by > https://stackoverflow.com/questions/48611998/type-error-on-first-steps-with-apache-parquet > where the user has a string inside a mostly integer column -- This message was sent by Atlassian JIRA (v7.6.3#76005)
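To make the linked problem concrete, a sketch of the failing case alongside how an opt-in "errors as null" mode might look; the errors keyword shown in the comment is purely hypothetical:

{code}
import pyarrow as pa

values = [1, 2, 'foo', 4]

try:
    # The string element cannot be coerced to int64, so this raises.
    pa.array(values, type=pa.int64())
except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
    print(exc)

# Proposed behaviour, with a hypothetical keyword:
# pa.array(values, type=pa.int64(), errors='null')  # -> [1, 2, null, 4]
{code}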
[jira] [Commented] (ARROW-2037) [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty'
[ https://issues.apache.org/jira/browse/ARROW-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852497#comment-16852497 ] Wes McKinney commented on ARROW-2037: - cc [~jorisvandenbossche] > [Python]: Add tests for ARROW-1941 cases where pandas inferred type is 'empty' > -- > > Key: ARROW-2037 > URL: https://issues.apache.org/jira/browse/ARROW-2037 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Fix For: 0.14.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-2186) [C++] Clean up architecture specific compiler flags
[ https://issues.apache.org/jira/browse/ARROW-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-2186. --- Resolution: Not A Problem > [C++] Clean up architecture specific compiler flags > --- > > Key: ARROW-2186 > URL: https://issues.apache.org/jira/browse/ARROW-2186 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > I noticed that {{-maltivec}} is being passed to the compiler on Linux, with > an x86_64 processor. That seemed odd to me. It prompted me to look more > generally at our compiler flags related to hardware optimizations. We have > the ability to pass {{-msse3}}, but there is a {{ARROW_USE_SSE}} which is > only used as a define in some headers. There is {{ARROW_ALTIVEC}}, but no > option to pass {{-march}}. Nothing related to AVX/AVX2/AVX512. I think this > could do for an overhaul -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2130) [Python] Support converting pandas.Timestamp in pyarrow.array
[ https://issues.apache.org/jira/browse/ARROW-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2130: Fix Version/s: (was: 0.14.0) > [Python] Support converting pandas.Timestamp in pyarrow.array > - > > Key: ARROW-2130 > URL: https://issues.apache.org/jira/browse/ARROW-2130 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > > This is follow up work to ARROW-2106; since pandas.Timestamp supports > nanoseconds, this will require a slightly different code path. Tests should > also include using {{Table.from_pandas}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
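The conversion being asked for, sketched below; the expected result type follows from the nanosecond discussion in the description rather than from a particular release:

{code}
import pandas as pd
import pyarrow as pa

# pandas.Timestamp carries nanosecond precision, unlike datetime.datetime,
# so the natural inferred Arrow type is timestamp[ns].
ts = pd.Timestamp('2017-01-01 00:00:00.000000001')
arr = pa.array([ts])
print(arr.type)  # expected: timestamp[ns]
{code}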
[jira] [Updated] (ARROW-2127) [Plasma] Transfer of objects between CPUs and GPUs
[ https://issues.apache.org/jira/browse/ARROW-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2127: Fix Version/s: (was: 0.14.0) > [Plasma] Transfer of objects between CPUs and GPUs > -- > > Key: ARROW-2127 > URL: https://issues.apache.org/jira/browse/ARROW-2127 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Philipp Moritz >Priority: Major > > It should be possible to transfer an object that was created on the CPU to > the GPU and vice versa. One natural implementation is to introduce a flag to > plasma::Get that specifies where the object should end up and then transfer > the object under the hood and return the appropriate buffer. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2041) [Python] pyarrow.serialize has high overhead for list of NumPy arrays
[ https://issues.apache.org/jira/browse/ARROW-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2041: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] pyarrow.serialize has high overhead for list of NumPy arrays > - > > Key: ARROW-2041 > URL: https://issues.apache.org/jira/browse/ARROW-2041 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Richard Shin >Priority: Minor > Labels: Performance > Fix For: 0.15.0 > > > {{Python 2.7.12 (default, Nov 20 2017, 18:23:56)}} > {{[GCC 5.4.0 20160609] on linux2}} > {{Type "help", "copyright", "credits" or "license" for more information.}} > {{>>> import pyarrow as pa, numpy as np}} > {{>>> arrays = [np.arange(100, dtype=np.int32) for _ in range(1)]}} > {{>>> with open('test.pyarrow', 'w') as f:}} > {{... f.write(pa.serialize(arrays).to_buffer().to_pybytes())}} > {{...}} > {{>>> import cPickle as pickle}} > {{>>> pickle.dump(arrays, open('test.pkl', 'w'), pickle.HIGHEST_PROTOCOL)}} > test.pyarrow is 6.2 MB, while test.pkl is only 4.2 MB. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1848) [Python] Add documentation examples for reading single Parquet files and datasets from HDFS
[ https://issues.apache.org/jira/browse/ARROW-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1848: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Add documentation examples for reading single Parquet files and > datasets from HDFS > --- > > Key: ARROW-1848 > URL: https://issues.apache.org/jira/browse/ARROW-1848 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: filesystem, parquet > Fix For: 0.15.0 > > > see > https://stackoverflow.com/questions/47443151/read-a-parquet-files-from-hdfs-using-pyarrow -- This message was sent by Atlassian JIRA (v7.6.3#76005)
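The sort of example the documentation could show, sketched with the legacy pyarrow.hdfs API of this era; the host, port, and paths are placeholders:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

# Connect to HDFS (placeholder namenode and port).
fs = pa.hdfs.connect(host='namenode', port=8020)

# Single Parquet file, read through a file-like object.
with fs.open('/data/example.parquet', 'rb') as f:
    table = pq.read_table(f)

# Directory of Parquet files read as one dataset.
dataset = pq.ParquetDataset('/data/example_dataset/', filesystem=fs)
table = dataset.read()
{code}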
[jira] [Updated] (ARROW-2939) [Python] Provide links to documentation pages for old versions
[ https://issues.apache.org/jira/browse/ARROW-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2939: Summary: [Python] Provide links to documentation pages for old versions (was: [Python] API documentation version doesn't match latest on PyPI) > [Python] Provide links to documentation pages for old versions > -- > > Key: ARROW-2939 > URL: https://issues.apache.org/jira/browse/ARROW-2939 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Ian Robertson >Priority: Minor > Labels: documentation > Fix For: 0.14.0 > > > Hey folks, apologies if this isn't the right place to raise this. In poking > around the web documentation (for pyarrow specifically), it looks like the > auto-generated API docs contain commits past the release of 0.9.0. For > example: > * > [https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.column] > * Contains differences merged here: > [https://github.com/apache/arrow/pull/1923] > * But latest pypi/conda versions of pyarrow are 0.9.0, which don't include > that change. > Not sure if the docs are auto-built off master somewhere, I couldn't find > anything about building docs in the docs itself. I would guess that you may > want some of the usage docs to be published in between releases if they're > not about new functionality, but the API reference being out of date can be > confusing. Is it possible to anchor the API docs to the latest released > version? Or even something like how Pandas has a whole bunch of old versions > still available? (e.g. [https://pandas.pydata.org/pandas-docs/stable/] vs. > old versions like [http://pandas.pydata.org/pandas-docs/version/0.17.0/]) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2984) [JS] Refactor release verification script to share code with main source release verification script
[ https://issues.apache.org/jira/browse/ARROW-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852519#comment-16852519 ] Wes McKinney commented on ARROW-2984: - To close this, let us remove the old JavaScript release scripts > [JS] Refactor release verification script to share code with main source > release verification script > > > Key: ARROW-2984 > URL: https://issues.apache.org/jira/browse/ARROW-2984 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > There is some possible code duplication. See discussion in ARROW-2977 > https://github.com/apache/arrow/pull/2369 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3052) [C++] Detect ORC system packages
[ https://issues.apache.org/jira/browse/ARROW-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852517#comment-16852517 ] Wes McKinney commented on ARROW-3052: - ORC is now in conda-forge https://github.com/conda-forge/orc-feedstock > [C++] Detect ORC system packages > > > Key: ARROW-3052 > URL: https://issues.apache.org/jira/browse/ARROW-3052 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > See > https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake#L155. > After the CMake refactor it is possible to use built ORC packages with > {{$ORC_HOME}} but not detected like the other toolchain dependencies -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3016) [C++] Add ability to enable call stack logging for each memory allocation
[ https://issues.apache.org/jira/browse/ARROW-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3016: Fix Version/s: (was: 0.14.0) > [C++] Add ability to enable call stack logging for each memory allocation > - > > Key: ARROW-3016 > URL: https://issues.apache.org/jira/browse/ARROW-3016 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > It is possible to gain programmatic access to the call stack in C/C++, e.g. > https://eli.thegreenplace.net/2015/programmatic-access-to-the-call-stack-in-c/ > It would be valuable to have a debugging option to log the sizes of memory > allocations as well as showing the call stack where that allocation is > performed. In complex programs, this could help determine the origin of a > memory leak -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3702) [R] POSIXct mapped to DateType not TimestampType?
[ https://issues.apache.org/jira/browse/ARROW-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852546#comment-16852546 ] Wes McKinney commented on ARROW-3702: - cc [~npr] > [R] POSIXct mapped to DateType not TimestampType? > - > > Key: ARROW-3702 > URL: https://issues.apache.org/jira/browse/ARROW-3702 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Javier Luraschi >Priority: Major > Fix For: 0.14.0 > > > Why was POSIXct mapped to > [DateType|https://arrow.apache.org/docs/cpp/classarrow_1_1_date_type.html#a6aea1fcfd9f998e8fa50f5ae62dbd7e6] > not > [TimestampType|https://arrow.apache.org/docs/cpp/classarrow_1_1_timestamp_type.html#a88e0ba47b82571b3fc3798b6c099499b]? > What are the pros and cons of each approach? > This is mostly to interoperate with Spark, which chose to map POSIXct to > Timestamps since in Spark, unlike Arrow, dates do not have a time component. > There is a way to make this work in Spark with POSIXct mapped to DateType by > mapping DateType to timestamps, so mostly looking to understand tradeoffs. > One particular question: timestamps in Arrow seem to support timezones, so > wouldn't it make more sense to map POSIXct to timestamps? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3706) [Rust] Add record batch reader trait.
[ https://issues.apache.org/jira/browse/ARROW-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3706: Fix Version/s: (was: 0.14.0) > [Rust] Add record batch reader trait. > - > > Key: ARROW-3706 > URL: https://issues.apache.org/jira/browse/ARROW-3706 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > > Add an RecordBatchReader trait. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3686) [Python] Support for masked arrays in to/from numpy
[ https://issues.apache.org/jira/browse/ARROW-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852545#comment-16852545 ] Wes McKinney commented on ARROW-3686: - cc [~jorisvandenbossche] > [Python] Support for masked arrays in to/from numpy > --- > > Key: ARROW-3686 > URL: https://issues.apache.org/jira/browse/ARROW-3686 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.11.1 >Reporter: Maarten Breddels >Priority: Major > Fix For: 0.14.0 > > > Again, in this PR for vaex: > [https://github.com/maartenbreddels/vaex/pull/116] I support masked arrays, > it would be nice if this goes into pyarrow. If this approach looks good I > could do a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3705) [Python] Add "nrows" argument to parquet.read_table to read an indicated number of rows from a file instead of the whole file
[ https://issues.apache.org/jira/browse/ARROW-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3705: Labels: datasets parquet (was: parquet) > [Python] Add "nrows" argument to parquet.read_table to read an indicated number of > rows from a file instead of the whole file > - > > Key: ARROW-3705 > URL: https://issues.apache.org/jira/browse/ARROW-3705 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: datasets, parquet > Fix For: 0.14.0 > > > This is patterned after {{nrows}} in {{pandas.read_csv}}, > inspired by > https://stackoverflow.com/questions/53152671/how-to-read-sample-records-parquet-file-in-s3 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
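A sketch of the requested call next to the closest existing workaround; the nrows keyword does not exist yet, and reading a single row group only approximates it (the file name is a placeholder):

{code}
import pyarrow.parquet as pq

# Proposed (not yet implemented):
# table = pq.read_table('data.parquet', nrows=1000)

# Existing approximation: read only the first row group instead of the file.
pf = pq.ParquetFile('data.parquet')
sample = pf.read_row_group(0)
{code}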
[jira] [Updated] (ARROW-3655) [Gandiva] switch away from default_memory_pool
[ https://issues.apache.org/jira/browse/ARROW-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3655: Fix Version/s: (was: 0.14.0) > [Gandiva] switch away from default_memory_pool > -- > > Key: ARROW-3655 > URL: https://issues.apache.org/jira/browse/ARROW-3655 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Pindikura Ravindra >Priority: Major > > After changes to ARROW-3519, Gandiva uses default_memory_pool for some > allocations. This needs to be replaced with the pool passed in the Evaluate > call. > > Also, change signatures of all Evaluate APIs (both in project and filter) to > take a pool argument. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3709) [CI/Docker/Python] Plasma tests are failing in the docker-compose setup
[ https://issues.apache.org/jira/browse/ARROW-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3709: Fix Version/s: (was: 0.14.0) > [CI/Docker/Python] Plasma tests are failing in the docker-compose setup > --- > > Key: ARROW-3709 > URL: https://issues.apache.org/jira/browse/ARROW-3709 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > Labels: docker > > {code} > rc = proc.poll() > if rc is not None: > raise RuntimeError("plasma_store exited unexpectedly with " > > "code %d" % (rc,)) > E RuntimeError: plasma_store exited > unexpectedly with code 127 > opt/conda/lib/python3.6/site-packages/pyarrow-0.11.1.dev62+g669c5bca-py3.6-linux-x86_64.egg/pyarrow/plasma.py:138: > RuntimeError > Captured stderr call > - > /opt/conda/lib/python3.6/site-packages/pyarrow-0.11.1.dev62+g669c5bca-py3.6-linux-x86_64.egg/pyarrow/plasma_store_server: > error while loading shared libraries: libboost_system.so.1.68.0: cannot open > shared object file: No such file or dir > ectory > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3730) [Python] Output a representation of pyarrow.Schema that can be used to reconstruct a schema in a script
[ https://issues.apache.org/jira/browse/ARROW-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852547#comment-16852547 ] Wes McKinney commented on ARROW-3730: - cc [~jorisvandenbossche] > [Python] Output a representation of pyarrow.Schema that can be used to > reconstruct a schema in a script > --- > > Key: ARROW-3730 > URL: https://issues.apache.org/jira/browse/ARROW-3730 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > This would be like what {{__repr__}} is used for in many built-in Python > types, or a schema as a list of tuples that can be passed to > {{pyarrow.schema}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
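The list-of-tuples form is already accepted by pyarrow.schema, so the request is essentially for a representation that emits it back. A sketch of the desired round trip:

{code}
import pyarrow as pa

schema = pa.schema([('id', pa.int64()),
                    ('name', pa.string()),
                    ('score', pa.float64())])

# Desired: a representation that can be pasted back into a script, e.g.
#   pa.schema([('id', pa.int64()), ('name', pa.string()), ('score', pa.float64())])
# rather than the current multi-line string form.
print(repr(schema))
{code}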
[jira] [Updated] (ARROW-3758) [R] Build R library on Windows, document build instructions for Windows developers
[ https://issues.apache.org/jira/browse/ARROW-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3758: Fix Version/s: (was: 0.14.0) 0.15.0 > [R] Build R library on Windows, document build instructions for Windows > developers > -- > > Key: ARROW-3758 > URL: https://issues.apache.org/jira/browse/ARROW-3758 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3710) [CI/Python] Run nightly tests against pandas master
[ https://issues.apache.org/jira/browse/ARROW-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3710: Fix Version/s: (was: 0.14.0) > [CI/Python] Run nightly tests against pandas master > --- > > Key: ARROW-3710 > URL: https://issues.apache.org/jira/browse/ARROW-3710 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Follow-up of [https://github.com/apache/arrow/pull/2758] and > https://github.com/apache/arrow/pull/2755 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4759) [Rust] [DataFusion] It should be possible to share an execution context between threads
[ https://issues.apache.org/jira/browse/ARROW-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4759: Fix Version/s: (was: 0.14.0) > [Rust] [DataFusion] It should be possible to share an execution context > between threads > --- > > Key: ARROW-4759 > URL: https://issues.apache.org/jira/browse/ARROW-4759 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust, Rust - DataFusion >Affects Versions: 0.12.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > I am working on a PR for this now. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4429) Add git rebase tips to the 'Contributing' page in the developer docs
[ https://issues.apache.org/jira/browse/ARROW-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4429: Fix Version/s: (was: 0.14.0) > Add git rebase tips to the 'Contributing' page in the developer docs > > > Key: ARROW-4429 > URL: https://issues.apache.org/jira/browse/ARROW-4429 > Project: Apache Arrow > Issue Type: Task > Components: Documentation >Reporter: Tanya Schlusser >Priority: Major > > A recent discussion on the listserv (link below) asked about how contributors > should handle rebasing. It would be helpful if the tips made it into the > developer documentation somehow. I suggest in the ["Contributing to Apache > Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] > page—currently a wiki, but hopefully eventually part of the Sphinx docs > ARROW-4427. > Here is the relevant thread: > [https://lists.apache.org/thread.html/c74d8027184550b8d9041e3f2414b517ffb76ccbc1d5aa4563d364b6@%3Cdev.arrow.apache.org%3E] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4752) [Rust] Add explicit SIMD vectorization for the divide kernel
[ https://issues.apache.org/jira/browse/ARROW-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4752: Fix Version/s: (was: 0.14.0) > [Rust] Add explicit SIMD vectorization for the divide kernel > > > Key: ARROW-4752 > URL: https://issues.apache.org/jira/browse/ARROW-4752 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)