Re: Datasets and Java

2019-11-26 Thread Hongze Zhang
Hi Wes and Micah, Thanks for your kind reply. Micah: We don't use Spark (vectorized) parquet reader because it is a pure Java implementation. Performance could be worse than doing similar work natively. Another reason is we may need to integrate some other specific data sources with

[Result] [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-26 Thread Micah Kornfield
The vote carries with 3 binding +1 votes, 1 non-binding +1 vote and 1 non-binding +.5 vote. To follow up, I will: 1. Open up JIRAs for work items in reference implementations (C++/Java) 2. Merge the pull request containing the specification changes. Thanks, Micah On Tue, Nov 26, 2019 at

Re: [DISCUSS][C++/Python] Bazel example

2019-11-26 Thread Micah Kornfield
Hi Antoine, > My question would be: what happens after the PR is merged? Are > developers supposed to keep the Bazel setup working in addition to > CMake? Or is there a dedicated maintainer (you? :-)) to fix regressions > when they happen? In the short term, I would be willing to be a dedicated

Re: Datasets and Java

2019-11-26 Thread Micah Kornfield
Hi Hongze, To add to Wes's point, there are already some efforts to do JNI for ORC (which needs to be integrated with CI) and some open PRs for Parquet in the project. However, given that you are using Spark I would expect there is already dataset functionality that is equivalent to the dataset

Re: Unions: storing type_ids or type_codes?

2019-11-26 Thread Fan Liya
Hi Antoine, For Java, the physical child id is the same as the logical type code, as the index of each child vector is the code (ordinal) of the vector's minor type. This leads to a problem, that only a single vector for each type can exist in a union vector, so strictly speaking, the Java

[jira] [Created] (ARROW-7268) Propagate `custom_metadata` field from IPC message

2019-11-26 Thread Martin Grund (Jira)
Martin Grund created ARROW-7268: --- Summary: Propagate `custom_metadata` field from IPC message Key: ARROW-7268 URL: https://issues.apache.org/jira/browse/ARROW-7268 Project: Apache Arrow Issue

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0

2019-11-26 Thread Wes McKinney
OK, so the proposal is not only to drop support for Ubuntu 14.04 but also to stop supporting gcc < 4.9, is that right? Since manylinux1 is gcc 4.8.5 as long as the _libraries_ build then that is okay. I don't know what the implications of dropping manylinux1 (in favor of manylinux2010) would be

Re: Datasets and Java

2019-11-26 Thread Wes McKinney
hi Hongze, The Datasets functionality is indeed extremely useful, and it may make sense to have it available in many languages eventually. With Java, I would raise the issue that things are comparatively weaker there when it comes to actually reading the files themselves. Whereas we have

Re: Union type ids - signed or unsigned?

2019-11-26 Thread Antoine Pitrou
Thanks for all the answers. The assumptions about union types in C++ code are fixed in https://github.com/apache/arrow/pull/5892 Regards Antoine. Le 25/11/2019 à 16:41, Wes McKinney a écrit : > On Mon, Nov 25, 2019 at 9:25 AM Antoine Pitrou wrote: >> >> On Mon, 25 Nov 2019 09:12:21 -0600

[jira] [Created] (ARROW-7267) [CI] [C++] Tests not run on "AMD64 Windows 2019 C++"

2019-11-26 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7267: - Summary: [CI] [C++] Tests not run on "AMD64 Windows 2019 C++" Key: ARROW-7267 URL: https://issues.apache.org/jira/browse/ARROW-7267 Project: Apache Arrow

Re: Is FileSystem._isfilestore considered public?

2019-11-26 Thread Antoine Pitrou
Generally speaking, this API is obsolete (though not formally deprecated yet). So we don't envision changing it significantly in the future. We hope that in the near future the new pyarrow FileSystem API will be usable directly in pyarrow.parquet. Regards Antoine. Le 26/11/2019 à 15:34, Tom

Re: Non-chunked large files / hdf5 support

2019-11-26 Thread Francois Saint-Jacques
Hello Maarten, In theory, you could provide a custom mmap-allocator and use the builder facility. Since the array is still in "build-phase" and not sealed, it should be fine if mremap changes the pointer address. This might fail in practice since the allocator is also used for auxiliary data,

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0

2019-11-26 Thread Antoine Pitrou
I'd rather drop 14.04 than spend time maintaining kludges for old compilers. Regards Antoine. On Tue, 26 Nov 2019 17:24:58 +0900 (JST) Sutou Kouhei wrote: > OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901 > > In > "Re: [NIGHTLY] Arrow Build Report

Non-chunked large files / hdf5 support

2019-11-26 Thread Maarten Breddels
In vaex I always write the data to hdf5 as 1 large chunk (per column). The reason is that it allows the mmapped columns to be exposed as a single numpy array (talking numerical data only for now), which many people are quite comfortable with. The strategy for vaex to write unchunked data is to

Re: Strategy for mixing large_string and string with chunked arrays

2019-11-26 Thread Maarten Breddels
Op di 26 nov. 2019 om 15:02 schreef Wes McKinney : > hi Maarten > > I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based > on this. > > I think that normalizing to a common type (which would require casting > the offsets buffer, but not the data -- which can be shared -- so not

[jira] [Created] (ARROW-7266) dictionary_encode() of a slice gives wrong result

2019-11-26 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-7266: -- Summary: dictionary_encode() of a slice gives wrong result Key: ARROW-7266 URL: https://issues.apache.org/jira/browse/ARROW-7266 Project: Apache Arrow Issue

Is FileSystem._isfilestore considered public?

2019-11-26 Thread Tom Augspurger
Hi, In https://github.com/dask/dask/issues/5526, we're seeing an issue stemming from a hack to ensure compatibility for Pyarrow. The details aren't too important. The core of the issue is that the Pyarrow parquet writer makes a couple checks for `FileSystem._isfilestore` via

Re: Strategy for mixing large_string and string with chunked arrays

2019-11-26 Thread Wes McKinney
hi Maarten I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based on this. I think that normalizing to a common type (which would require casting the offsets buffer, but not the data -- which can be shared -- so not too wasteful) during concatenation would be the approach I

Re: Unions: storing type_ids or type_codes?

2019-11-26 Thread Francois Saint-Jacques
It seems that the array_union_test.cc does the latter, look at how `expected_types` is constructed. I opened https://issues.apache.org/jira/browse/ARROW-7265 . Wes, is the intended usage of type_ids to allow a producer to pass a subset columns of unions without modifying the type codes? François

[jira] [Created] (ARROW-7265) [Format][C++] Clarify the usage of typeIds in Union type documentation

2019-11-26 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7265: - Summary: [Format][C++] Clarify the usage of typeIds in Union type documentation Key: ARROW-7265 URL: https://issues.apache.org/jira/browse/ARROW-7265

Strategy for mixing large_string and string with chunked arrays

2019-11-26 Thread Maarten Breddels
Hi Arrow devs, Small intro: I'm the main Vaex developer, an out of core dataframe library for Python - https://github.com/vaexio/vaex -, and we're looking into moving Vaex to use Apache Arrow for the data structure. At the beginning of this year, we added string support in Vaex, which required 64

[NIGHTLY] Arrow Build Report for Job nightly-2019-11-26-0

2019-11-26 Thread Crossbow
Arrow Build Report for Job nightly-2019-11-26-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0 Failed Tasks: - test-conda-python-2.7-pandas-master: URL:

Datasets and Java

2019-11-26 Thread Hongze Zhang
Hi all, Recently the datasets API has been improved a lot and I found some of the new features are very useful to my own work. For example, to me an important one is the fix of ARROW-6952[1]. And as I currently work on Java/Scala projects like Spark, I am now investigating a way to call some of

[jira] [Created] (ARROW-7264) [Java] RangeEqualsVisitor type check is not correct

2019-11-26 Thread Ji Liu (Jira)
Ji Liu created ARROW-7264: - Summary: [Java] RangeEqualsVisitor type check is not correct Key: ARROW-7264 URL: https://issues.apache.org/jira/browse/ARROW-7264 Project: Apache Arrow Issue Type: Bug

Re: [DISCUSS][C++/Python] Bazel example

2019-11-26 Thread Antoine Pitrou
Hi Micah, Le 26/11/2019 à 05:52, Micah Kornfield a écrit : > > After going through this exercise I put together a list of pros and cons > below. > > I would like to hear from other devs: > 1. Their opinions on setting this up as an alternative system (I'm willing > to invest some more time

Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-26 Thread Sutou Kouhei
+1 (binding) In "[VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)" on Wed, 20 Nov 2019 20:41:57 -0800, Micah Kornfield wrote: > Hello, > As discussed on [1], I've proposed clarifications in a PR [2] that > clarifies: > > 1. It is not

[jira] [Created] (ARROW-7263) [C++][Gandiva] Implement locate and position functions

2019-11-26 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-7263: - Summary: [C++][Gandiva] Implement locate and position functions Key: ARROW-7263 URL: https://issues.apache.org/jira/browse/ARROW-7263 Project: Apache Arrow

[jira] [Created] (ARROW-7262) [C++][Gandiva] Implement replace function in Gandiva

2019-11-26 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-7262: - Summary: [C++][Gandiva] Implement replace function in Gandiva Key: ARROW-7262 URL: https://issues.apache.org/jira/browse/ARROW-7262 Project: Apache Arrow

[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type

2019-11-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7261: Summary: [Python] Python support for fixed size list type Key: ARROW-7261 URL: https://issues.apache.org/jira/browse/ARROW-7261 Project: Apache Arrow

Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0

2019-11-26 Thread Sutou Kouhei
OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901 In "Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 Nov 2019 21:23:34 -0600, Wes McKinney wrote: > I'd be interested to maintain gcc 4.8 support for a time yet but I'm > interested in the

[jira] [Created] (ARROW-7260) [CI] Ubuntu 14.04 test is failed by user defined literal

2019-11-26 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7260: --- Summary: [CI] Ubuntu 14.04 test is failed by user defined literal Key: ARROW-7260 URL: https://issues.apache.org/jira/browse/ARROW-7260 Project: Apache Arrow