Hi Wes and Micah,
Thanks for your kind reply.
Micah: We don't use Spark's (vectorized) Parquet reader because it is a pure Java
implementation; performance could be worse than doing similar work
natively. Another reason is that we may need to
integrate some other specific data sources with
The vote carries with 3 binding +1 votes, 1 non-binding +1 vote, and
1 non-binding +0.5 vote.
To follow up, I will:
1. Open JIRAs for work items in the reference implementations (C++/Java).
2. Merge the pull request containing the specification changes.
Thanks,
Micah
On Tue, Nov 26, 2019 at
Hi Antoine,
> My question would be: what happens after the PR is merged? Are
> developers supposed to keep the Bazel setup working in addition to
> CMake? Or is there a dedicated maintainer (you? :-)) to fix regressions
> when they happen?
In the short term, I would be willing to be a dedicated
Hi Hongze,
To add to Wes's point, there are already some efforts to do JNI for ORC
(which needs to be integrated with CI) and some open PRs for Parquet in the
project. However, given that you are using Spark, I would expect there is
already dataset functionality that is equivalent to the dataset
Hi Antoine,
For Java, the physical child id is the same as the logical type code, as
the index of each child vector is the code (ordinal) of the vector's minor
type.
This leads to a problem: only a single vector of each type can exist
in a union vector, so strictly speaking, the Java
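A tiny plain-Python model (not the real Arrow Java API; names here are illustrative) of the layout described above, where a child's slot index is the ordinal of its minor type, so a second vector of the same type has nowhere to go:

```python
# Illustrative subset of minor-type ordinals, standing in for Java's
# Types.MinorType enum; not the real values.
MINOR_TYPE_ORDINALS = {"INT": 0, "BIGINT": 1, "VARCHAR": 2}

class ToyUnionVector:
    def __init__(self):
        self.children = {}  # slot index -> child vector

    def add_child(self, minor_type, vector):
        slot = MINOR_TYPE_ORDINALS[minor_type]
        if slot in self.children:
            # Only one child per type fits: a second INT vector would
            # collide with the first, which is the limitation described.
            raise ValueError(f"union already has a child for {minor_type}")
        self.children[slot] = vector

u = ToyUnionVector()
u.add_child("INT", [1, 2, 3])
try:
    u.add_child("INT", [4, 5])
except ValueError as e:
    print(e)
```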
Martin Grund created ARROW-7268:
---
Summary: Propagate `custom_metadata` field from IPC message
Key: ARROW-7268
URL: https://issues.apache.org/jira/browse/ARROW-7268
Project: Apache Arrow
Issue
OK, so the proposal is not only to drop support for Ubuntu 14.04 but
also to stop supporting gcc < 4.9, is that right? Since manylinux1 uses
gcc 4.8.5, as long as the _libraries_ build, that is okay. I don't
know what the implications of dropping manylinux1 (in favor of
manylinux2010) would be
hi Hongze,
The Datasets functionality is indeed extremely useful, and it may make
sense to have it available in many languages eventually. With Java, I
would raise the issue that things are comparatively weaker there when
it comes to actually reading the files themselves. Whereas we have
Thanks for all the answers. The assumptions about union types in C++
code are fixed in https://github.com/apache/arrow/pull/5892
Regards
Antoine.
On 25/11/2019 at 16:41, Wes McKinney wrote:
> On Mon, Nov 25, 2019 at 9:25 AM Antoine Pitrou wrote:
>>
>> On Mon, 25 Nov 2019 09:12:21 -0600
Antoine Pitrou created ARROW-7267:
-
Summary: [CI] [C++] Tests not run on "AMD64 Windows 2019 C++"
Key: ARROW-7267
URL: https://issues.apache.org/jira/browse/ARROW-7267
Project: Apache Arrow
Generally speaking, this API is obsolete (though not formally deprecated
yet), so we don't envision changing it significantly in the future.
We hope that in the near future the new pyarrow FileSystem API will be
usable directly from pyarrow.parquet.
Regards
Antoine.
On 26/11/2019 at 15:34, Tom
Hello Maarten,
In theory, you could provide a custom mmap-allocator and use the
builder facility. Since the array is still in "build-phase" and not
sealed, it should be fine if mremap changes the pointer address. This
might fail in practice since the allocator is also used for auxiliary
data,
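A toy sketch of the idea, in plain Python with assumed names rather than the actual Arrow C++ builder/allocator interfaces: the allocator's reallocation always "moves" the buffer (mimicking mremap returning a new address), and the builder stays correct during the build phase because it only keeps the handle, never a raw pointer into the old buffer:

```python
class MovingAllocator:
    """Hypothetical allocator whose reallocate never reuses the address."""

    def allocate(self, size):
        return bytearray(size)

    def reallocate(self, buf, new_size):
        # Always return a *new* object to mimic an address change.
        moved = bytearray(new_size)
        moved[: len(buf)] = buf
        return moved

class Int64Builder:
    """Toy builder for a little-endian int64 buffer in 'build phase'."""

    def __init__(self, allocator):
        self.alloc = allocator
        self.buf = allocator.allocate(8)
        self.length = 0

    def append(self, value):
        needed = (self.length + 1) * 8
        if needed > len(self.buf):
            # Safe while building: we re-fetch the handle after the move.
            self.buf = self.alloc.reallocate(self.buf, 2 * len(self.buf))
        self.buf[self.length * 8 : needed] = value.to_bytes(8, "little")
        self.length += 1

b = Int64Builder(MovingAllocator())
for i in range(100):
    b.append(i)
```

Once the array is sealed and consumers hold raw addresses, this kind of move is no longer safe, which is the distinction the paragraph above draws.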
I'd rather drop 14.04 than spend time maintaining kludges
for old compilers.
Regards
Antoine.
On Tue, 26 Nov 2019 17:24:58 +0900 (JST)
Sutou Kouhei wrote:
> OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901
>
> In
> "Re: [NIGHTLY] Arrow Build Report
In vaex I always write the data to hdf5 as 1 large chunk (per column).
The reason is that it allows the mmapped columns to be exposed as a
single numpy array (talking numerical data only for now), which many
people are quite comfortable with.
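A stdlib stand-in for the numpy case described above, sketching how a column written as one contiguous chunk can be memory-mapped and exposed as a single typed view without copying (no numpy or pyarrow assumed; the file name is illustrative):

```python
import mmap
import os
import struct
import tempfile

# Write one column of float64 values as a single contiguous chunk.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
values = [0.5 * i for i in range(1000)]
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}d", *values))

# Map the whole file and view it as one typed array over the column.
f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
column = memoryview(mm).cast("d")
print(column[0], column[999])
```

Had the column been written as several chunks with per-chunk headers between them, no single zero-copy typed view over the values would be possible, which is the point of writing one large chunk per column.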
The strategy for vaex to write unchunked data is to
On Tue, 26 Nov 2019 at 15:02, Wes McKinney wrote:
> hi Maarten
>
> I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based
> on this.
>
> I think that normalizing to a common type (which would require casting
> the offsets buffer, but not the data -- which can be shared -- so not
Adam Hooper created ARROW-7266:
--
Summary: dictionary_encode() of a slice gives wrong result
Key: ARROW-7266
URL: https://issues.apache.org/jira/browse/ARROW-7266
Project: Apache Arrow
Issue
Hi,
In https://github.com/dask/dask/issues/5526, we're seeing an issue stemming
from a hack to ensure compatibility for Pyarrow. The details aren't too
important. The core of the issue is that the Pyarrow parquet writer makes a
couple checks for `FileSystem._isfilestore` via
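The exact check is cut off above, so purely as a hypothetical sketch of this kind of compatibility shim (none of these class or function names are the real pyarrow or Dask code): the writer probes a private attribute on the filesystem object, and the hack is for a third-party filesystem to mimic it:

```python
class ThirdPartyFS:
    """A filesystem-like object that lacks the private attribute."""

    def open(self, path, mode="rb"):
        return open(path, mode)

class CompatFS(ThirdPartyFS):
    """Shim that mimics the private attribute so the writer's check passes."""

    def _isfilestore(self):
        # Claim local-file-store semantics; purely illustrative.
        return True

def writer_check(fs):
    # Stand-in for the kind of private check a writer might perform.
    probe = getattr(fs, "_isfilestore", None)
    return bool(probe and probe())

print(writer_check(CompatFS()), writer_check(ThirdPartyFS()))
```

Relying on a private attribute like this is exactly why such hacks break across releases, which motivates the issue linked above.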
hi Maarten
I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based on this.
I think that normalizing to a common type (which would require casting
the offsets buffer, but not the data -- which can be shared -- so not
too wasteful) during concatenation would be the approach I
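A minimal plain-Python sketch of that normalization, assuming Arrow-style string columns (an offsets buffer plus a character-data buffer): only the 32-bit offsets buffer is cast to the common 64-bit type; the character data is appended byte-for-byte, never re-encoded:

```python
from array import array

def to_int64_offsets(offsets32):
    # The cast: widen each 4-byte offset to 8 bytes.
    return array("q", offsets32)

def concat_string_columns(offsets_a, data_a, offsets_b, data_b):
    out_offsets = to_int64_offsets(offsets_a)
    base = out_offsets[-1]
    # Rebase the second column's offsets past the first column's data.
    out_offsets.extend(base + o for o in offsets_b[1:])
    return out_offsets, data_a + data_b

a_off = array("i", [0, 3, 5])   # int32 offsets for "foo", "ba"
a_dat = b"fooba"
b_off = array("q", [0, 2])      # int64 offsets for "zz"
b_dat = b"zz"
off, dat = concat_string_columns(a_off, a_dat, b_off, b_dat)
```

Only the small offsets buffer is rewritten; the (potentially large) data buffers are used as-is, which is why this normalization is not too wasteful.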
It seems that array_union_test.cc does the latter; look at how
`expected_types` is constructed. I opened
https://issues.apache.org/jira/browse/ARROW-7265 .
Wes, is the intended usage of type_ids to allow a producer to pass a
subset of union child columns without modifying the type codes?
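For what it's worth, a plain-Python sketch (not an Arrow API) of the indirection that reading would permit: the type codes in the codes buffer need not equal child indices, so a producer can expose only a subset of children while leaving the codes buffer intact:

```python
# typeIds: child i carries type code type_ids[i]; codes 5 and 9 are
# arbitrary, illustrating that codes need not be 0..n-1.
type_ids = [5, 9]
children = [[10, 20, 30], ["a", "b"]]
code_to_child = {c: i for i, c in enumerate(type_ids)}

codes = [5, 9, 5, 5, 9]      # per-slot type codes (dense-union style)
offsets = [0, 0, 1, 2, 1]    # per-slot index into the selected child
values = [children[code_to_child[c]][o] for c, o in zip(codes, offsets)]
```

Without this indirection, dropping a child would force rewriting every slot in the codes buffer to keep the codes dense.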
François
Francois Saint-Jacques created ARROW-7265:
-
Summary: [Format][C++] Clarify the usage of typeIds in Union type
documentation
Key: ARROW-7265
URL: https://issues.apache.org/jira/browse/ARROW-7265
Hi Arrow devs,
Small intro: I'm the main developer of Vaex, an out-of-core dataframe
library for Python - https://github.com/vaexio/vaex - and we're
looking into moving Vaex to use Apache Arrow for its data structure.
At the beginning of this year, we added string support in Vaex, which
required 64
Arrow Build Report for Job nightly-2019-11-26-0
All tasks:
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0
Failed Tasks:
- test-conda-python-2.7-pandas-master:
URL:
Hi all,
Recently the datasets API has been improved a lot and I found some of the new
features are very useful to my own work. For example, an important one to me
is the fix of ARROW-6952 [1]. And as I currently work on Java/Scala projects like
Spark, I am now investigating a way to call some of
Ji Liu created ARROW-7264:
-
Summary: [Java] RangeEqualsVisitor type check is not correct
Key: ARROW-7264
URL: https://issues.apache.org/jira/browse/ARROW-7264
Project: Apache Arrow
Issue Type: Bug
Hi Micah,
On 26/11/2019 at 05:52, Micah Kornfield wrote:
>
> After going through this exercise I put together a list of pros and cons
> below.
>
> I would like to hear from other devs:
> 1. Their opinions on setting this up as an alternative system (I'm willing
> to invest some more time
+1 (binding)
In
"[VOTE] Clarifications and forward compatibility changes for Dictionary
Encoding (second iteration)" on Wed, 20 Nov 2019 20:41:57 -0800,
Micah Kornfield wrote:
> Hello,
> As discussed on [1], I've proposed clarifications in a PR [2] that
> clarifies:
>
> 1. It is not
Projjal Chanda created ARROW-7263:
-
Summary: [C++][Gandiva] Implement locate and position functions
Key: ARROW-7263
URL: https://issues.apache.org/jira/browse/ARROW-7263
Project: Apache Arrow
Projjal Chanda created ARROW-7262:
-
Summary: [C++][Gandiva] Implement replace function in Gandiva
Key: ARROW-7262
URL: https://issues.apache.org/jira/browse/ARROW-7262
Project: Apache Arrow
Joris Van den Bossche created ARROW-7261:
Summary: [Python] Python support for fixed size list type
Key: ARROW-7261
URL: https://issues.apache.org/jira/browse/ARROW-7261
Project: Apache Arrow
OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901
In
"Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25
Nov 2019 21:23:34 -0600,
Wes McKinney wrote:
> I'd be interested to maintain gcc 4.8 support for a time yet but I'm
> interested in the
Kouhei Sutou created ARROW-7260:
---
Summary: [CI] Ubuntu 14.04 test is failed by user defined literal
Key: ARROW-7260
URL: https://issues.apache.org/jira/browse/ARROW-7260
Project: Apache Arrow