Re: Datasets and Java

2019-11-26 Thread Hongze Zhang
Hi Wes and Micah,


Thanks for your kind replies.


Micah: We don't use Spark's (vectorized) Parquet reader because it is a pure Java 
implementation; performance could be worse than doing similar work natively. 
Another reason is that we may need to integrate some other specific data sources 
with Arrow Datasets. To limit the workload, we would like to maintain a common 
read pipeline for both those sources and widely used formats like Parquet and CSV.


Wes: Yes, the Datasets framework along with the Parquet/CSV/... reader 
implementations is entirely native, so a JNI bridge will be needed and we won't 
actually read files in Java.


Another concern of mine is how many C++ Datasets components should be bridged via 
JNI. For example, should we bridge only the ScanTask? Or bridge more components, 
including Scanner, Table, or even the DataSource discovery system? Or just bridge 
the C++ Arrow Parquet and ORC readers (as Micah said, orc-jni is already there) 
and reimplement everything else needed by Datasets in Java? This may not be easy 
to decide, but based on my limited perspective I would currently prefer to start 
from the ScanTask layer: we could leverage the valuable work already finished in 
the C++ Datasets code and would not have to maintain too much tedious JNI code. 
The real I/O would still take place inside the C++ readers when we perform a scan.
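
To make the ScanTask-level idea more concrete, below is a rough C++ sketch of
what the native side of such a JNI bridge could look like. It is only a sketch:
the NativeScanTask wrapper and the Java-side class/method names are hypothetical,
and returning each batch as IPC-serialized bytes is just one possible transport;
only arrow::ipc::SerializeRecordBatch and the plain JNI calls are existing APIs.

  // Illustrative JNI entry point; NativeScanTask and the Java class name are
  // hypothetical placeholders, not existing Arrow APIs.
  #include <jni.h>
  #include <memory>
  #include <arrow/api.h>
  #include <arrow/ipc/writer.h>

  // Hypothetical wrapper owned by the Java side through a long "handle";
  // in practice it would hold a dataset scan task / record batch iterator.
  struct NativeScanTask {
    virtual ~NativeScanTask() = default;
    virtual std::shared_ptr<arrow::RecordBatch> Next() = 0;  // nullptr when done
  };

  extern "C" JNIEXPORT jbyteArray JNICALL
  Java_org_apache_arrow_dataset_NativeScanTask_nextBatch(JNIEnv* env, jclass,
                                                         jlong handle) {
    auto* task = reinterpret_cast<NativeScanTask*>(handle);
    std::shared_ptr<arrow::RecordBatch> batch = task->Next();
    if (batch == nullptr) {
      return nullptr;  // signal end-of-stream to the Java caller
    }
    // Serialize the batch into a single IPC buffer and copy it into a
    // jbyteArray that the Java side can deserialize with its IPC readers.
    std::shared_ptr<arrow::Buffer> out;
    arrow::Status st = arrow::ipc::SerializeRecordBatch(
        *batch, arrow::default_memory_pool(), &out);
    if (!st.ok()) {
      env->ThrowNew(env->FindClass("java/lang/RuntimeException"),
                    st.ToString().c_str());
      return nullptr;
    }
    jbyteArray result = env->NewByteArray(static_cast<jsize>(out->size()));
    env->SetByteArrayRegion(result, 0, static_cast<jsize>(out->size()),
                            reinterpret_cast<const jbyte*>(out->data()));
    return result;
  }

Copying the serialized bytes is the simplest thing to get working; a later
iteration could pass buffer addresses across the boundary instead to stay
zero-copy.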


So Wes, Micah, does this match what you had in mind?


Thanks,
Hongze

At 2019-11-27 12:39:52, "Micah Kornfield"  wrote:
>Hi Hongze,
>To add to Wes's point, there are already some efforts to do JNI for ORC
>(which needs to be integrated with CI) and some open PRs for Parquet in the
>project.  However, given that you are using Spark I would expect there is
>already dataset functionality that is equivalent to the dataset API to do
>rowgroup/partition level filtering.  Can you elaborate on what problems you
>are seeing with those and what additional use cases you have?
>
>Thanks,
>Micah
>
>
>On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney  wrote:
>
>> hi Hongze,
>>
>> The Datasets functionality is indeed extremely useful, and it may make
>> sense to have it available in many languages eventually. With Java, I
>> would raise the issue that things are comparatively weaker there when
>> it comes to actually reading the files themselves. Whereas we have
>> reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet
>> in C++ the same is not true in Java. Not a deal breaker but worth
>> taking into consideration.
>>
>> I wonder aloud whether it might be worth investing in a JNI-based
>> interface to the C++ libraries as one potential approach to save on
>> development time.
>>
>> - Wes
>>
>>
>>
>> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang  wrote:
>> >
>> > Hi all,
>> >
>> >
>> > Recently the datasets API has been improved a lot and I found some of
>> the new features are very useful to my own work. For example to me a
>> important one is the fix of ARROW-6952[1]. And as I currently work on
>> Java/Scala projects like Spark, I am now investigating a way to call some
>> of the datasets APIs in Java so that I could gain performance improvement
>> from native dataset filters/projectors. Meantime I am also interested in
>> the ability of scanning different data sources provided by dataset API.
>> >
>> >
>> > Regarding using datasets in Java, my initial idea is to port (by writing
>> Java-version implementations) some of the high-level concepts in Java such
>> as DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call
>> lower level record batch iterators via JNI. This way we seem to retain
>> performance advantages from c++ dataset code.
>> >
>> >
>> > Is anyone interested in this topic also? Or is this something already on
>> the development plan? Any feedback or thoughts would be much appreciated.
>> >
>> >
>> > Best,
>> > Hongze
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/ARROW-6952
>>


[Result] [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-26 Thread Micah Kornfield
The vote carries with 3 binding +1 votes, 1 non-binding +1 vote, and
1 non-binding +0.5 vote.

To follow-up I will:
1.  Open up JIRAs for work items in reference implementations (c++/java)
2.  Merge the pull request containing the specification changes.

Thanks,
Micah

On Tue, Nov 26, 2019 at 12:50 AM Sutou Kouhei  wrote:

> +1 (binding)
>
> In 
>   "[VOTE] Clarifications and forward compatibility changes for Dictionary
> Encoding (second iteration)" on Wed, 20 Nov 2019 20:41:57 -0800,
>   Micah Kornfield  wrote:
>
> > Hello,
> > As discussed on [1], I've proposed clarifications in a PR [2] that
> > clarifies:
> >
> > 1.  It is not required that all dictionary batches occur at the beginning
> > of the IPC stream format (if a the first record batch has an all null
> > dictionary encoded column, the null column's dictionary might not be sent
> > until later in the stream).
> >
> > 2.  A second dictionary batch for the same ID that is not a "delta batch"
> > in an IPC stream indicates the dictionary should be replaced.
> >
> > 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> > dictionary batch and multiple "delta" dictionary batches. Dictionary
> > replacement is not supported in the file format.
> >
> > 4.  Add an enum to dictionary metadata for possible future changes in
> what
> > format dictionary batches can be sent. (the most likely would be an array
> > Map).  An enum is needed as a place holder to allow for
> forward
> > compatibility past the release 1.0.0.
> >
> > If accepted there will be work in all implementations to make sure that
> > they cover the edge cases highlighted and additional integration testing
> > will be needed.
> >
> > Please vote whether to accept these additions. The vote will be open for
> at
> > least 72 hours.
> >
> > [ ] +1 Accept these change to the specification
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> >
> > Thanks,
> > Micah
> >
> >
> > [1]
> >
> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
> > [2] https://github.com/apache/arrow/pull/5585
>


Re: [DISCUSS][C++/Python] Bazel example

2019-11-26 Thread Micah Kornfield
Hi Antoine,


> My question would be: what happens after the PR is merged?  Are
> developers supposed to keep the Bazel setup working in addition to
> CMake?  Or is there a dedicated maintainer (you? :-)) to fix regressions
> when they happen?

In the short term, I would be willing to be a dedicated maintainer for Mac
(and, once I get it working, for Linux as well).  I'd like to classify the
support as very experimental (not advertised in the documentation yet).  If
other devs find Bazel useful, I would expect others to help with maintenance
naturally.  If it gets to be too much for me to maintain, I'm willing to drop
support completely, since it won't be a critical part of the build
infrastructure.  Once the setup is more complete, I would also plan on adding
a CI target for it.


>  Can you give an example of circular dependency?  Can this be solved by
> having more "type_fwd.h" headers for forward declarations of opaque types?

I think the type_fwd.h might contribute to the problem. The solution would
be more granular header/compilation units when possible (or combining
targets appropriately).  An example of the problem is expression.h/.cc and
operation.h/.cc in the compute library.  Because operation.cc depends on
expression.h and expression.cc relies on operation.h, there is a cycle
between the two targets.  I fixed this by making a new header-only target
for expression.h, which the operation target depends on; the expression
target then depends on the operation target.  An alternative approach
would be to combine "expression.*" and "operation.*" into a single target.
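
For reference, the forward-declaration pattern Antoine mentions would look
roughly like this; the file name and the signatures below are only an
illustration of the pattern, not the actual compute headers:

  // expression_fwd.h -- illustrative forward declarations only.
  // A target that only mentions these types in signatures can depend on this
  // header instead of expression.h, which helps break the
  // expression <-> operation include cycle.
  #pragma once

  namespace arrow {
  namespace compute {

  class Expression;  // defined in expression.h
  class Operation;   // defined in operation.h

  }  // namespace compute
  }  // namespace arrow

  // operation.h could then declare interfaces against the opaque types:
  //   #include "expression_fwd.h"
  //   std::shared_ptr<Expression> ToExpression(const Operation& op);
  // and only operation.cc would need the full expression.h definition.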


> (also, generally, it would be desirable to use more of these, since our
> compile times have become egregious as of late - I'm currently
> considering replacing my 8-core desktop CPU with a beefier one :-/)

I'm not a huge fan of this approach in general, but since I haven't been
able to contribute on a day-to-day basis to the C++ code base, I'll let the
active contributors decide the best course here.  I thought computer
upgrades were something to look forward to ;)

> This sounds really like a bummer. Do you have to spell those out by
> hand?  Or is there some tool that infers dependencies and generates the
> declarations for you?

Yes, I had to spell them out by hand.  There is an internal tool at Google
that helps with this (I didn't use it for this PR).  There has been some
discussion of open-sourcing the tool [1], but I wouldn't expect it any time
soon.  Luckily, things are fairly well modularized at the moment, so while the
process was painful, it was not tremendously so.  Another solution would be to
have larger targets (e.g. one per directory) that use globs, which would make
this less painful but would lose some of the benefits mentioned above.

[1] https://github.com/bazelbuild/bazel/issues/6871

On Tue, Nov 26, 2019 at 1:27 AM Antoine Pitrou  wrote:

>
> Hi Micah,
>
> Le 26/11/2019 à 05:52, Micah Kornfield a écrit :
> >
> > After going through this exercise I put together a list of pros and cons
> > below.
> >
> > I would like to hear from other devs:
> > 1.  Their opinions on setting this up as an alternative system (I'm
> willing
> > to invest some more time in it).
> > 2. What people think the minimum bar for merging a PR like this should
> be?
>
> My question would be: what happens after the PR is merged?  Are
> developers supposed to keep the Bazel setup working in addition to
> CMake?  Or is there a dedicated maintainer (you? :-)) to fix regressions
> when they happen?
>
> > Pros:
> > 1.  Being able to run "bazel test python/..." and having compilation of
> all
> > python dependencies just work is a nice experience.
> > 2.  Because of the granular compilation units, it can improve developer
> > velocity. Unit tests can depend only on the sub-components they are meant
> > to test. They don't need to compile and relink arrow.so.
> > 3.  The built-in documentation it provides about visibility and
> > relationships between components is nice (its uncovered some "interesting
> > dependencies").  I didn't make heavy use of it, but its concept of
> > "visibility" makes things more explicit about what external consumers
> > should be depending on, and what inter-project components should depend
> on
> > (e.g. explicitly limit the scope of vendored code).
> > 4.  Extensions are essentially python, which might be easier to work with
> > then CMake
>
> Those sound nice.
>
> > Cons:
> > 1.  Bazel is opinionated on C++ layout.  In particular it requires some
> > workarounds to deal with circular .h/.cc dependencies.  The two main ways
> > of doing this are either increasing the size of compilable units [4] to
> > span all dependencies in the cycle, or creating separate
> > header/implementation targets, I've used both strategies in the PR.  One
> > could argue that it would be nice to reduce circular dependencies in
> > general.
>
> Can you give an example of circular dependency?  Can this be solved by
> having more "type_fwd.h" 

Re: Datasets and Java

2019-11-26 Thread Micah Kornfield
Hi Hongze,
To add to Wes's point, there are already some efforts to do JNI for ORC
(which needs to be integrated with CI) and some open PRs for Parquet in the
project.  However, given that you are using Spark I would expect there is
already dataset functionality that is equivalent to the dataset API to do
rowgroup/partition level filtering.  Can you elaborate on what problems you
are seeing with those and what additional use cases you have?

Thanks,
Micah


On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney  wrote:

> hi Hongze,
>
> The Datasets functionality is indeed extremely useful, and it may make
> sense to have it available in many languages eventually. With Java, I
> would raise the issue that things are comparatively weaker there when
> it comes to actually reading the files themselves. Whereas we have
> reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet
> in C++ the same is not true in Java. Not a deal breaker but worth
> taking into consideration.
>
> I wonder aloud whether it might be worth investing in a JNI-based
> interface to the C++ libraries as one potential approach to save on
> development time.
>
> - Wes
>
>
>
> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang  wrote:
> >
> > Hi all,
> >
> >
> > Recently the datasets API has been improved a lot and I found some of
> the new features are very useful to my own work. For example to me a
> important one is the fix of ARROW-6952[1]. And as I currently work on
> Java/Scala projects like Spark, I am now investigating a way to call some
> of the datasets APIs in Java so that I could gain performance improvement
> from native dataset filters/projectors. Meantime I am also interested in
> the ability of scanning different data sources provided by dataset API.
> >
> >
> > Regarding using datasets in Java, my initial idea is to port (by writing
> Java-version implementations) some of the high-level concepts in Java such
> as DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call
> lower level record batch iterators via JNI. This way we seem to retain
> performance advantages from c++ dataset code.
> >
> >
> > Is anyone interested in this topic also? Or is this something already on
> the development plan? Any feedback or thoughts would be much appreciated.
> >
> >
> > Best,
> > Hongze
> >
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-6952
>


Re: Unions: storing type_ids or type_codes?

2019-11-26 Thread Fan Liya
Hi Antoine,

For Java, the physical child id is the same as the logical type code, since
the index of each child vector is the code (ordinal) of that vector's minor
type. This leads to a problem: only a single vector of each type can exist
in a union vector, so strictly speaking, the Java implementation is not
consistent with the Arrow specification. (This was pointed out by Micah long
ago.)

Best,
Liya Fan


On Tue, Nov 26, 2019 at 9:59 PM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> It seems that the array_union_test.cc does the latter, look at how
> `expected_types` is constructed. I opened
> https://issues.apache.org/jira/browse/ARROW-7265 .
>
> Wes, is the intended usage of type_ids to allow a producer to pass a
> subset columns of unions without modifying the type codes?
>
> François
>
>
> On Thu, Nov 21, 2019 at 10:51 AM Antoine Pitrou 
> wrote:
> >
> >
> > Hello,
> >
> > There's some ambiguity whether a union array's "types" buffer stores
> > physical child ids, or logical type codes.
> >
> > Some of our C++ tests assume the former:
> >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123
> >
> > Some of our C++ tests assume the latter:
> >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326
> >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955
> >
> > Critically, no validation of union data is currently implemented in C++
> > (ARROW-6157).  I can't parse the Java source code.
> >
> > Regards
> >
> > Antoine.
> >
>


[jira] [Created] (ARROW-7268) Propagate `custom_metadata` field from IPC message

2019-11-26 Thread Martin Grund (Jira)
Martin Grund created ARROW-7268:
---

 Summary: Propagate `custom_metadata` field from IPC message
 Key: ARROW-7268
 URL: https://issues.apache.org/jira/browse/ARROW-7268
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Martin Grund


Right now, the custom metadata field in the Schema IPC message is not
propagated from the IPC message to the internal data type. To be closer to
parity with the other implementations, it would be good to add the necessary
logic to serialize and deserialize it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0

2019-11-26 Thread Wes McKinney
OK, so the proposal is not only to drop support for Ubuntu 14.04 but
also to stop supporting gcc < 4.9, is that right? Since manylinux1 uses
gcc 4.8.5, as long as the _libraries_ still build, that is okay. I don't
know what the implications of dropping manylinux1 (in favor of
manylinux2010) would be.

On Tue, Nov 26, 2019 at 9:45 AM Antoine Pitrou  wrote:
>
>
> I'd rather drop 14.04 rather than spend some time maintaining kludges
> for old compilers.
>
> Regards
>
> Antoine.
>
>
> On Tue, 26 Nov 2019 17:24:58 +0900 (JST)
> Sutou Kouhei  wrote:
>
> > OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901
> >
> > In 
> >   "Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 
> > 25 Nov 2019 21:23:34 -0600,
> >   Wes McKinney  wrote:
> >
> > > I'd be interested to maintain gcc 4.8 support for a time yet but I'm
> > > interested in the opinions of others
> > >
> > > On Mon, Nov 25, 2019 at 9:14 PM Sutou Kouhei  wrote:
> > >>
> > >> > - test-ubuntu-14.04-cpp:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp
> > >>
> > >> Error message:
> > >>
> > >>   /arrow/cpp/src/arrow/dataset/filter_test.cc:80:194: error: invalid 
> > >> suffix on literal; C++11 requires a space between literal and identifier 
> > >> [-Werror=literal-suffix]
> > >>  ASSERT_EQ("a"_.ToString(), "a");
> > >>   ^
> > >>
> > >> It seems that g++ on Ubuntu 14.04 is old.
> > >> I think that we can drop support for Ubuntu 14.04 because
> > >> it reaches EOL: https://ubuntu.com/about/release-cycle
> > >>
> > >> Can we remove this test job?
> > >>
> > >>
> > >> Thanks,
> > >> --
> > >> kou
> > >>
> > >> In <5ddbd09a.1c69fb81.165b7.f...@mx.google.com>
> > >>   "[NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 
> > >> Nov 2019 05:01:14 -0800 (PST),
> > >>   Crossbow  wrote:
> > >>
> > >> >
> > >> > Arrow Build Report for Job nightly-2019-11-25-0
> > >> >
> > >> > All tasks: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0
> > >> >
> > >> > Failed Tasks:
> > >> > - homebrew-cpp:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-travis-homebrew-cpp
> > >> > - test-conda-python-2.7-pandas-master:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-2.7-pandas-master
> > >> > - test-conda-python-3.7-dask-latest:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-latest
> > >> > - test-conda-python-3.7-dask-master:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-master
> > >> > - test-conda-python-3.7-hdfs-2.9.2:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-hdfs-2.9.2
> > >> > - test-conda-python-3.7-pandas-latest:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-latest
> > >> > - test-conda-python-3.7-pandas-master:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-master
> > >> > - test-conda-python-3.7-spark-master:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-spark-master
> > >> > - test-conda-python-3.7-turbodbc-latest:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-latest
> > >> > - test-conda-python-3.7-turbodbc-master:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-master
> > >> > - test-conda-python-3.7:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7
> > >> > - test-debian-10-rust-nightly-2019-09-25:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-debian-10-rust-nightly-2019-09-25
> > >> > - test-ubuntu-14.04-cpp:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp
> > >> > - test-ubuntu-fuzzit:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-fuzzit
> > >> >
> > >> > Succeeded Tasks:
> > >> > - centos-6:
> > >> >   URL: 
> > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-6
> > >> > - centos-7:
> > >> >   URL: 
> > >> > 

Re: Datasets and Java

2019-11-26 Thread Wes McKinney
hi Hongze,

The Datasets functionality is indeed extremely useful, and it may make
sense to have it available in many languages eventually. With Java, I
would raise the issue that things are comparatively weaker there when
it comes to actually reading the files themselves. Whereas we have
reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet
in C++ the same is not true in Java. Not a deal breaker but worth
taking into consideration.

I wonder aloud whether it might be worth investing in a JNI-based
interface to the C++ libraries as one potential approach to save on
development time.

- Wes



On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang  wrote:
>
> Hi all,
>
>
> Recently the datasets API has been improved a lot and I found some of the new 
> features are very useful to my own work. For example to me a important one is 
> the fix of ARROW-6952[1]. And as I currently work on Java/Scala projects like 
> Spark, I am now investigating a way to call some of the datasets APIs in Java 
> so that I could gain performance improvement from native dataset 
> filters/projectors. Meantime I am also interested in the ability of scanning 
> different data sources provided by dataset API.
>
>
> Regarding using datasets in Java, my initial idea is to port (by writing 
> Java-version implementations) some of the high-level concepts in Java such as 
> DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call lower 
> level record batch iterators via JNI. This way we seem to retain performance 
> advantages from c++ dataset code.
>
>
> Is anyone interested in this topic also? Or is this something already on the 
> development plan? Any feedback or thoughts would be much appreciated.
>
>
> Best,
> Hongze
>
>
> [1] https://issues.apache.org/jira/browse/ARROW-6952


Re: Union type ids - signed or unsigned?

2019-11-26 Thread Antoine Pitrou


Thanks for all the answers.  The assumptions about union types in C++
code are fixed in https://github.com/apache/arrow/pull/5892

Regards

Antoine.


Le 25/11/2019 à 16:41, Wes McKinney a écrit :
> On Mon, Nov 25, 2019 at 9:25 AM Antoine Pitrou  wrote:
>>
>> On Mon, 25 Nov 2019 09:12:21 -0600
>> Wes McKinney  wrote:
>>> On Mon, Nov 25, 2019 at 8:52 AM Antoine Pitrou  wrote:


 Hello,

 The spec has the following language about union type ids:
 """
 Types buffer: A buffer of 8-bit signed integers. Each type in the union
 has a corresponding type id whose values are found in this buffer. A
 union with more than 127 possible types can be modeled as a union of 
 unions.
 """
 https://arrow.apache.org/docs/format/Columnar.html#union-layout

 However, in several places the C++ code assumes type ids are unsigned.
 Java doesn't seem to implement type ids (and there is no integration
 task for union types).

 In the flatbuffers description, the type ids array is modeled as an
 array of signed 32-bit integers.

 Moreover, according to the language above, type ids should be restricted
 to the [0, 127] interval?  Which one should it be?
>>>
>>> The (optional) type ids in the metadata provide a correspondence
>>> between the union types / children and the values found in the types
>>> buffer (data). As stated in the spec, the types buffer are 8-bit
>>> signed integers. As I recall the reason that we used [ Int ] in the
>>> metadata was that the Int type is thought to be easier for languages
>>> to work with in general when serializing/deserializing the metadata.
>>
>> Ok, but is there a reason the C++ code uses `std::vector` for
>> the type codes?
> 
> Oversight on my part. Suggest we change to int8_t
> 
>> Regards
>>
>> Antoine.
>>
>>


[jira] [Created] (ARROW-7267) [CI] [C++] Tests not run on "AMD64 Windows 2019 C++"

2019-11-26 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7267:
-

 Summary: [CI] [C++] Tests not run on "AMD64 Windows 2019 C++"
 Key: ARROW-7267
 URL: https://issues.apache.org/jira/browse/ARROW-7267
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


We build the tests ({{ARROW_BUILD_TESTS=ON}}) but we don't run them:
https://github.com/apache/arrow/pull/5608/checks?check_run_id=321619958

cc [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Is FileSystem._isfilestore considered public?

2019-11-26 Thread Antoine Pitrou


Generally speaking, this API is obsolete (though not formally deprecated
yet), so we don't envision changing it significantly in the future.

We hope that in the near future the new pyarrow FileSystem API will be
usable directly from pyarrow.parquet.

Regards

Antoine.


Le 26/11/2019 à 15:34, Tom Augspurger a écrit :
> Hi,
> 
> In https://github.com/dask/dask/issues/5526, we're seeing an issue stemming
> from a hack to ensure compatibility for Pyarrow. The details aren't too
> important. The core of the issue is that the Pyarrow parquet writer makes a
> couple checks for `FileSystem._isfilestore` via `_mkdir_if_not_exists`,
> e.g. in
> https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/parquet.py#L1349-L1350
> .
> 
> Is it OK for my FileSystem subclass to override _isfilestore? Is it
> considered public?
> 
> Thanks,
> 
> Tom
> 


Re: Non-chunked large files / hdf5 support

2019-11-26 Thread Francois Saint-Jacques
Hello Maarten,

In theory, you could provide a custom mmap-allocator and use the
builder facility. Since the array is still in "build-phase" and not
sealed, it should be fine if mremap changes the pointer address. This
might fail in practice since the allocator is also used for auxiliary
data, e.g. dictionary hash table data in the case of Dictionary type.


Another solution is to create a `FixedBuilder` class where
- the number of elements is known,
- the data type is of fixed width, and
- nullability is known (whether you need an extra validity buffer).

I think sooner or later we'll need such a class.
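
A minimal sketch of what such a class could look like (it does not exist in
Arrow today; the shape below is just one way to express the idea, for a
fixed-width, non-nullable float64 column whose length is known up front):

  // Hypothetical FixedBuilder sketch -- not an existing Arrow class.
  #include <memory>
  #include <utility>
  #include <arrow/api.h>

  class FixedDoubleBuilder {
   public:
    // Allocate the data buffer once, up front, from the given pool
    // (which could be a custom mmap-backed MemoryPool).
    static arrow::Status Make(arrow::MemoryPool* pool, int64_t length,
                              std::unique_ptr<FixedDoubleBuilder>* out) {
      std::shared_ptr<arrow::Buffer> data;
      ARROW_RETURN_NOT_OK(arrow::AllocateBuffer(
          pool, length * static_cast<int64_t>(sizeof(double)), &data));
      out->reset(new FixedDoubleBuilder(length, std::move(data)));
      return arrow::Status::OK();
    }

    // No resizing or bounds checks: the length is fixed by construction.
    void UnsafeAppend(double value) {
      reinterpret_cast<double*>(data_->mutable_data())[position_++] = value;
    }

    // Wrap the buffer as a Float64 array; no nulls, so no validity buffer.
    std::shared_ptr<arrow::Array> Finish() {
      auto array_data = arrow::ArrayData::Make(
          arrow::float64(), length_, {nullptr, data_}, /*null_count=*/0);
      return arrow::MakeArray(array_data);
    }

   private:
    FixedDoubleBuilder(int64_t length, std::shared_ptr<arrow::Buffer> data)
        : length_(length), position_(0), data_(std::move(data)) {}

    int64_t length_;
    int64_t position_;
    std::shared_ptr<arrow::Buffer> data_;
  };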

François

On Tue, Nov 26, 2019 at 10:01 AM Maarten Breddels
 wrote:
>
> In vaex I always write the data to hdf5 as 1 large chunk (per column).
> The reason is that it allows the mmapped columns to be exposed as a
> single numpy array (talking numerical data only for now), which many
> people are quite comfortable with.
>
> The strategy for vaex to write unchunked data, is to first create an
> 'empty' hdf5 file (filled with zeros), mmap those huge arrays, and
> write to that in chunks.
>
> This means that in vaex I need to support mutable data (only used
> internally, vaex' default is immutable data like arrow), since I need
> to write to the memory mapped data. It also makes the exporting code
> relatively simple.
>
> I could not find a way in Arrow to get something similar done, at
> least not without having a single pa.array instance for each column. I
> think Arrow's mindset is that you should just use chunks right? Or is
> this also something that can be considered for Arrow?
>
> An alternative would be to implement Arrow in hdf5, which I basically
> do now in vaex (with limited support). Again, I'm wondering if there
> is there an interest in storing arrow data in hdf5 from the Arrow
> community?
>
> cheers,
>
> Maarten


Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0

2019-11-26 Thread Antoine Pitrou


I'd rather drop 14.04 than spend time maintaining kludges for old
compilers.

Regards

Antoine.


On Tue, 26 Nov 2019 17:24:58 +0900 (JST)
Sutou Kouhei  wrote:

> OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901
> 
> In 
>   "Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 
> Nov 2019 21:23:34 -0600,
>   Wes McKinney  wrote:
> 
> > I'd be interested to maintain gcc 4.8 support for a time yet but I'm
> > interested in the opinions of others
> > 
> > On Mon, Nov 25, 2019 at 9:14 PM Sutou Kouhei  wrote:  
> >>  
> >> > - test-ubuntu-14.04-cpp:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp
> >> >   
> >>
> >> Error message:
> >>
> >>   /arrow/cpp/src/arrow/dataset/filter_test.cc:80:194: error: invalid 
> >> suffix on literal; C++11 requires a space between literal and identifier 
> >> [-Werror=literal-suffix]
> >>  ASSERT_EQ("a"_.ToString(), "a");
> >>   ^
> >>
> >> It seems that g++ on Ubuntu 14.04 is old.
> >> I think that we can drop support for Ubuntu 14.04 because
> >> it reaches EOL: https://ubuntu.com/about/release-cycle
> >>
> >> Can we remove this test job?
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <5ddbd09a.1c69fb81.165b7.f...@mx.google.com>
> >>   "[NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 
> >> Nov 2019 05:01:14 -0800 (PST),
> >>   Crossbow  wrote:
> >>  
> >> >
> >> > Arrow Build Report for Job nightly-2019-11-25-0
> >> >
> >> > All tasks: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0
> >> >
> >> > Failed Tasks:
> >> > - homebrew-cpp:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-travis-homebrew-cpp
> >> > - test-conda-python-2.7-pandas-master:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-2.7-pandas-master
> >> > - test-conda-python-3.7-dask-latest:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-latest
> >> > - test-conda-python-3.7-dask-master:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-master
> >> > - test-conda-python-3.7-hdfs-2.9.2:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-hdfs-2.9.2
> >> > - test-conda-python-3.7-pandas-latest:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-latest
> >> > - test-conda-python-3.7-pandas-master:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-master
> >> > - test-conda-python-3.7-spark-master:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-spark-master
> >> > - test-conda-python-3.7-turbodbc-latest:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-latest
> >> > - test-conda-python-3.7-turbodbc-master:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-master
> >> > - test-conda-python-3.7:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7
> >> > - test-debian-10-rust-nightly-2019-09-25:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-debian-10-rust-nightly-2019-09-25
> >> > - test-ubuntu-14.04-cpp:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp
> >> > - test-ubuntu-fuzzit:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-fuzzit
> >> >
> >> > Succeeded Tasks:
> >> > - centos-6:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-6
> >> > - centos-7:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-7
> >> > - centos-8:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-8
> >> > - conda-linux-gcc-py27:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py27
> >> > - conda-linux-gcc-py36:
> >> >   URL: 
> >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py36
> >> > - conda-linux-gcc-py37:
> >> >   URL: 
> 

Non-chunked large files / hdf5 support

2019-11-26 Thread Maarten Breddels
In vaex I always write the data to hdf5 as 1 large chunk (per column).
The reason is that it allows the mmapped columns to be exposed as a
single numpy array (talking numerical data only for now), which many
people are quite comfortable with.

The strategy for vaex to write unchunked data is to first create an
'empty' hdf5 file (filled with zeros), mmap those huge arrays, and
write to that in chunks.
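
Stripped of the hdf5 specifics, the pattern is just "preallocate, mmap, fill in
place". A minimal POSIX sketch (raw binary file instead of hdf5, a single
float64 column, minimal error handling):

  // Sketch: preallocate a file, mmap it, then fill the column in chunks.
  #include <algorithm>
  #include <cstdint>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main() {
    const int64_t n = 100 * 1000 * 1000;       // number of float64 values
    const int64_t nbytes = n * static_cast<int64_t>(sizeof(double));

    int fd = open("column.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, nbytes) != 0) return 1;  // "empty" file of zeros

    void* addr = mmap(nullptr, nbytes, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) return 1;
    auto* column = static_cast<double*>(addr);

    const int64_t chunk = 1 << 20;             // fill ~1M values at a time
    for (int64_t start = 0; start < n; start += chunk) {
      const int64_t stop = std::min(start + chunk, n);
      for (int64_t i = start; i < stop; ++i) {
        column[i] = static_cast<double>(i);    // stand-in for the real data
      }
    }

    munmap(addr, nbytes);
    close(fd);
    return 0;
  }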

This means that in vaex I need to support mutable data (only used
internally, vaex' default is immutable data like arrow), since I need
to write to the memory mapped data. It also makes the exporting code
relatively simple.

I could not find a way in Arrow to get something similar done, at
least not without having a single pa.array instance for each column. I
think Arrow's mindset is that you should just use chunks, right? Or is
this also something that can be considered for Arrow?

An alternative would be to implement Arrow in hdf5, which I basically
do now in vaex (with limited support). Again, I'm wondering whether there
is interest from the Arrow community in storing Arrow data in hdf5.

cheers,

Maarten


Re: Strategy for mixing large_string and string with chunked arrays

2019-11-26 Thread Maarten Breddels
On Tue, Nov 26, 2019 at 15:02 Wes McKinney  wrote:

> hi Maarten
>
> I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based
> on this.
>
> I think that normalizing to a common type (which would require casting
> the offsets buffer, but not the data -- which can be shared -- so not
> too wasteful) during concatenation would be the approach I would take.
> I would be surprised if normalizing string offsets during record batch
> / table concatenation showed up as a performance or memory use issue
> relative to other kinds of operations -- in theory the
> string->large_string promotion should be relatively exceptional (< 5%
> of the time). I've found in performance tests that creating many
> smaller array chunks is faster anyway due to interplay with the memory
> allocator.
>

Yes, I think it is rare, but it does mean that if a user wants to convert a
Vaex dataframe to an Arrow table, it might use GBs of RAM (thinking ~1
billion rows). Ideally, it would use zero RAM (imagine concatenating many
large memory-mapped datasets).
I'm ok living with this limitation, but I wanted to raise it before v1.0
goes out.



>
> Of course I think we should have string kernels for both 32-bit and
> 64-bit variants. Note that Gandiva already has significant string
> kernel support (for 32-bit offsets at the moment) and there is
> discussion about pre-compiling the LLVM IR into a shared library to
> not introduce an LLVM runtime dependency, so we could maintain a
> single code path for string algorithms that can be used both in a
> JIT-ed setting as well as pre-compiled / interpreted setting. See
> https://issues.apache.org/jira/browse/ARROW-7083


That is a very interesting approach; thanks for sharing that resource. I'll
consider it.


> Note that many analytic database engines (notably: Dremio, which is
> natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
> at all and it does not seem to be an impedance in practical use. We
> have the Chunked* builder classes [1] in C++ to facilitate the
> creation of chunked binary arrays where there is concern about
> overflowing the 2GB limit.
>
> Others may have different opinions so I'll let them comment.
>

Yes, I think in many cases it's not a problem at all. Also in vaex, all the
processing happens in chunks, and no chunk will ever be that large (for the
near future...).
In vaex, when exporting to hdf5, I always write in 1 chunk, and that's
where most of my issues show up.

cheers,

Maarten


>
> - Wes
>
> [1]:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510
>
> On Tue, Nov 26, 2019 at 7:44 AM Maarten Breddels
>  wrote:
> >
> > Hi Arrow devs,
> >
> > Small intro: I'm the main Vaex developer, an out of core dataframe
> > library for Python - https://github.com/vaexio/vaex -, and we're
> > looking into moving Vaex to use Apache Arrow for the data structure.
> > At the beginning of this year, we added string support in Vaex, which
> > required 64 bit offsets. Those were not available back then, so we
> > added our own data structure for string arrays. Our first step to move
> > to Apache Arrow is to see if we can use Arrow for the data structure,
> > and later on, move the strings algorithms of Vaex to Arrow.
> >
> > (originally posted at https://github.com/apache/arrow/issues/5874)
> >
> > In vaex I can lazily concatenate dataframes without memory copy. If I
> > want to implement this using a pa.ChunkedArray, users cannot
> > concatenate dataframes that have a string column with pa.string type
> > to a dataframe that has a column with pa.large_string.
> >
> > In short, there is no arrow data structure to handle this 'mixed
> > chunked array', but I was wondering if this could change. The only way
> > out seems to cast them manually to a common type (although blocked by
> > https://issues.apache.org/jira/browse/ARROW-6071).
> > Internally I could solve this in vaex, but feedback from building a
> > DataFrame library with arrow might be useful. Also, it means I cannot
> > expose the concatenated DataFrame as an arrow table.
> >
> > Because of this, I am wondering if having two types (large_string and
> > string) is a good idea in the end since it makes type checking
> > cumbersome (having to check two types each time).  Could an option be
> > that there is only 1 string and list type, and that the width of the
> > indices/offsets can be obtained at runtime? That would also make it
> > easy to support 16 and 8-bit offsets. That would make Arrow more
> > flexible and efficient, and I guess it would play better with
> > pa.ChunkedArray.
> >
> > Regards,
> >
> > Maarten Breddels
>


[jira] [Created] (ARROW-7266) dictionary_encode() of a slice gives wrong result

2019-11-26 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-7266:
--

 Summary: dictionary_encode() of a slice gives wrong result
 Key: ARROW-7266
 URL: https://issues.apache.org/jira/browse/ARROW-7266
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.15.1
 Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4
Reporter: Adam Hooper


Steps to reproduce:

{code:python}
import pyarrow as pa
arr = pa.array(["a", "b", "b", "b"])[1:]
arr.dictionary_encode()
{code}

Expected results:

{code}
-- dictionary:
  [
"b"
  ]
-- indices:
  [
0,
0,
0
  ]
{code}

Actual results:

{code}
-- dictionary:
  [
"b",
""
  ]
-- indices:
  [
0,
0,
1
  ]
{code}

I don't know a workaround. Converting to pylist and back is too slow. Is there
a way to copy the slice to a new offset-0 StringArray that I could then
dictionary-encode? Otherwise, I'm considering building buffers by hand.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Is FileSystem._isfilestore considered public?

2019-11-26 Thread Tom Augspurger
Hi,

In https://github.com/dask/dask/issues/5526, we're seeing an issue stemming
from a hack to ensure compatibility for Pyarrow. The details aren't too
important. The core of the issue is that the Pyarrow parquet writer makes a
couple checks for `FileSystem._isfilestore` via `_mkdir_if_not_exists`,
e.g. in
https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/parquet.py#L1349-L1350
.

Is it OK for my FileSystem subclass to override _isfilestore? Is it
considered public?

Thanks,

Tom


Re: Strategy for mixing large_string and string with chunked arrays

2019-11-26 Thread Wes McKinney
hi Maarten

I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based on this.

I think that normalizing to a common type (which would require casting
the offsets buffer, but not the data -- which can be shared -- so not
too wasteful) during concatenation would be the approach I would take.
I would be surprised if normalizing string offsets during record batch
/ table concatenation showed up as a performance or memory use issue
relative to other kinds of operations -- in theory the
string->large_string promotion should be relatively exceptional (< 5%
of the time). I've found in performance tests that creating many
smaller array chunks is faster anyway due to interplay with the memory
allocator.
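
To illustrate "casting the offsets buffer, but not the data" concretely: a rough
sketch of promoting a StringArray to a LargeStringArray by widening the 32-bit
offsets into a new buffer while sharing the existing value buffer. The accessor
and constructor usage below follows the binary-array API only approximately, and
it ignores non-zero array offsets (slices) for brevity:

  // Sketch: promote string -> large_string, sharing the value data buffer.
  #include <memory>
  #include <arrow/api.h>

  arrow::Status WidenToLargeString(const arrow::StringArray& in,
                                   arrow::MemoryPool* pool,
                                   std::shared_ptr<arrow::Array>* out) {
    const int64_t n = in.length();
    // Allocate a new int64 offsets buffer with n + 1 entries.
    std::shared_ptr<arrow::Buffer> offsets;
    ARROW_RETURN_NOT_OK(arrow::AllocateBuffer(
        pool, (n + 1) * static_cast<int64_t>(sizeof(int64_t)), &offsets));
    const int32_t* src = in.raw_value_offsets();
    auto* dst = reinterpret_cast<int64_t*>(offsets->mutable_data());
    for (int64_t i = 0; i <= n; ++i) {
      dst[i] = src[i];  // widen each 32-bit offset to 64 bits
    }
    // Reuse the value data and validity buffers as-is; the string bytes
    // themselves are not copied.
    auto data = arrow::ArrayData::Make(
        arrow::large_utf8(), n,
        {in.null_bitmap(), offsets, in.value_data()}, in.null_count());
    *out = arrow::MakeArray(data);
    return arrow::Status::OK();
  }

The extra cost is only the widened offsets (8 bytes per value); the value bytes
are shared, which is what makes the promotion comparatively cheap.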

Of course I think we should have string kernels for both 32-bit and
64-bit variants. Note that Gandiva already has significant string
kernel support (for 32-bit offsets at the moment) and there is
discussion about pre-compiling the LLVM IR into a shared library to
not introduce an LLVM runtime dependency, so we could maintain a
single code path for string algorithms that can be used both in a
JIT-ed setting as well as pre-compiled / interpreted setting. See
https://issues.apache.org/jira/browse/ARROW-7083

Note that many analytic database engines (notably: Dremio, which is
natively Arrow-based) don't support exceeding the 2GB / 32-bit limit
at all and it does not seem to be an impedance in practical use. We
have the Chunked* builder classes [1] in C++ to facilitate the
creation of chunked binary arrays where there is concern about
overflowing the 2GB limit.

Others may have different opinions so I'll let them comment.

- Wes

[1]: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510

On Tue, Nov 26, 2019 at 7:44 AM Maarten Breddels
 wrote:
>
> Hi Arrow devs,
>
> Small intro: I'm the main Vaex developer, an out of core dataframe
> library for Python - https://github.com/vaexio/vaex -, and we're
> looking into moving Vaex to use Apache Arrow for the data structure.
> At the beginning of this year, we added string support in Vaex, which
> required 64 bit offsets. Those were not available back then, so we
> added our own data structure for string arrays. Our first step to move
> to Apache Arrow is to see if we can use Arrow for the data structure,
> and later on, move the strings algorithms of Vaex to Arrow.
>
> (originally posted at https://github.com/apache/arrow/issues/5874)
>
> In vaex I can lazily concatenate dataframes without memory copy. If I
> want to implement this using a pa.ChunkedArray, users cannot
> concatenate dataframes that have a string column with pa.string type
> to a dataframe that has a column with pa.large_string.
>
> In short, there is no arrow data structure to handle this 'mixed
> chunked array', but I was wondering if this could change. The only way
> out seems to cast them manually to a common type (although blocked by
> https://issues.apache.org/jira/browse/ARROW-6071).
> Internally I could solve this in vaex, but feedback from building a
> DataFrame library with arrow might be useful. Also, it means I cannot
> expose the concatenated DataFrame as an arrow table.
>
> Because of this, I am wondering if having two types (large_string and
> string) is a good idea in the end since it makes type checking
> cumbersome (having to check two types each time).  Could an option be
> that there is only 1 string and list type, and that the width of the
> indices/offsets can be obtained at runtime? That would also make it
> easy to support 16 and 8-bit offsets. That would make Arrow more
> flexible and efficient, and I guess it would play better with
> pa.ChunkedArray.
>
> Regards,
>
> Maarten Breddels


Re: Unions: storing type_ids or type_codes?

2019-11-26 Thread Francois Saint-Jacques
It seems that array_union_test.cc does the latter; look at how
`expected_types` is constructed. I opened
https://issues.apache.org/jira/browse/ARROW-7265.

Wes, is the intended usage of type_ids to allow a producer to pass a
subset columns of unions without modifying the type codes?

François


On Thu, Nov 21, 2019 at 10:51 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> There's some ambiguity whether a union array's "types" buffer stores
> physical child ids, or logical type codes.
>
> Some of our C++ tests assume the former:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123
>
> Some of our C++ tests assume the latter:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955
>
> Critically, no validation of union data is currently implemented in C++
> (ARROW-6157).  I can't parse the Java source code.
>
> Regards
>
> Antoine.
>


[jira] [Created] (ARROW-7265) [Format][C++] Clarify the usage of typeIds in Union type documentation

2019-11-26 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7265:
-

 Summary: [Format][C++] Clarify the usage of typeIds in Union type 
documentation
 Key: ARROW-7265
 URL: https://issues.apache.org/jira/browse/ARROW-7265
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


The documentation is unclear.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Strategy for mixing large_string and string with chunked arrays

2019-11-26 Thread Maarten Breddels
Hi Arrow devs,

Small intro: I'm the main Vaex developer, an out of core dataframe
library for Python - https://github.com/vaexio/vaex -, and we're
looking into moving Vaex to use Apache Arrow for the data structure.
At the beginning of this year, we added string support in Vaex, which
required 64 bit offsets. Those were not available back then, so we
added our own data structure for string arrays. Our first step to move
to Apache Arrow is to see if we can use Arrow for the data structure,
and later on, move the strings algorithms of Vaex to Arrow.

(originally posted at https://github.com/apache/arrow/issues/5874)

In vaex I can lazily concatenate dataframes without memory copy. If I
want to implement this using a pa.ChunkedArray, users cannot
concatenate dataframes that have a string column with pa.string type
to a dataframe that has a column with pa.large_string.

In short, there is no arrow data structure to handle this 'mixed
chunked array', but I was wondering if this could change. The only way
out seems to cast them manually to a common type (although blocked by
https://issues.apache.org/jira/browse/ARROW-6071).
Internally I could solve this in vaex, but feedback from building a
DataFrame library with arrow might be useful. Also, it means I cannot
expose the concatenated DataFrame as an arrow table.

Because of this, I am wondering if having two types (large_string and
string) is a good idea in the end since it makes type checking
cumbersome (having to check two types each time).  Could an option be
that there is only 1 string and list type, and that the width of the
indices/offsets can be obtained at runtime? That would also make it
easy to support 16 and 8-bit offsets. That would make Arrow more
flexible and efficient, and I guess it would play better with
pa.ChunkedArray.

Regards,

Maarten Breddels


[NIGHTLY] Arrow Build Report for Job nightly-2019-11-26-0

2019-11-26 Thread Crossbow


Arrow Build Report for Job nightly-2019-11-26-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0

Failed Tasks:
- test-conda-python-2.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-2.7-pandas-master
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-dask-master
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7
- test-debian-10-rust-nightly-2019-09-25:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-debian-10-rust-nightly-2019-09-25
- test-ubuntu-14.04-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-ubuntu-14.04-cpp
- wheel-manylinux1-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp27m
- wheel-manylinux1-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp27mu
- wheel-manylinux1-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp35m
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp36m
- wheel-manylinux1-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp37m
- wheel-manylinux2010-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp27m
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp27mu
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp35m
- wheel-manylinux2010-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp36m
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp37m
- wheel-osx-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-osx-cp35m

Succeeded Tasks:
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-centos-6
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-centos-7
- centos-8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-centos-8
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-linux-gcc-py27
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-linux-gcc-py37
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-osx-clang-py27
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-osx-clang-py37
- conda-win-vs2015-py36:
  URL: 

Datasets and Java

2019-11-26 Thread Hongze Zhang
Hi all,


Recently the datasets API has been improved a lot, and I found some of the new
features very useful for my own work. For example, an important one for me is
the fix of ARROW-6952 [1]. As I currently work on Java/Scala projects like
Spark, I am now investigating a way to call some of the datasets APIs from Java
so that I could gain performance improvements from native dataset
filters/projectors. Meanwhile, I am also interested in the ability to scan
different data sources provided by the dataset API.


Regarding using datasets in Java, my initial idea is to port (by writing
Java implementations) some of the high-level concepts such as
DataSourceDiscovery/DataSet/Scanner/FileFormat, and then create and call the
lower-level record batch iterators via JNI. This way we would seem to retain
the performance advantages of the C++ dataset code.


Is anyone interested in this topic also? Or is this something already on the 
development plan? Any feedback or thoughts would be much appreciated.


Best,
Hongze


[1] https://issues.apache.org/jira/browse/ARROW-6952

[jira] [Created] (ARROW-7264) [Java] RangeEqualsVisitor type check is not correct

2019-11-26 Thread Ji Liu (Jira)
Ji Liu created ARROW-7264:
-

 Summary: [Java] RangeEqualsVisitor type check is not correct
 Key: ARROW-7264
 URL: https://issues.apache.org/jira/browse/ARROW-7264
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.15.1
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{RangeEqualsVisitor}} generally only checks the type once and keeps
the result, to avoid repeated type checking; see
{code:java}
typeCompareResult = 
left.getField().getType().equals(right.getField().getType());
{code}
This only compares the {{ArrowType}}, and for complex types this may cause
unexpected behavior: for example, two {{List}} vectors with different child
types would be considered type-equal, because the comparison does not consider
their child fields.

We should compare the Field here instead. To make it more extensible, we use
{{TypeEqualsVisitor}} to compare Fields; this way, one can choose whether to
check names or metadata as well.

 

Also provide a test for ListVector to validate this change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS][C++/Python] Bazel example

2019-11-26 Thread Antoine Pitrou


Hi Micah,

Le 26/11/2019 à 05:52, Micah Kornfield a écrit :
> 
> After going through this exercise I put together a list of pros and cons
> below.
> 
> I would like to hear from other devs:
> 1.  Their opinions on setting this up as an alternative system (I'm willing
> to invest some more time in it).
> 2. What people think the minimum bar for merging a PR like this should be?

My question would be: what happens after the PR is merged?  Are
developers supposed to keep the Bazel setup working in addition to
CMake?  Or is there a dedicated maintainer (you? :-)) to fix regressions
when they happen?

> Pros:
> 1.  Being able to run "bazel test python/..." and having compilation of all
> python dependencies just work is a nice experience.
> 2.  Because of the granular compilation units, it can improve developer
> velocity. Unit tests can depend only on the sub-components they are meant
> to test. They don't need to compile and relink arrow.so.
> 3.  The built-in documentation it provides about visibility and
> relationships between components is nice (it's uncovered some "interesting
> dependencies").  I didn't make heavy use of it, but its concept of
> "visibility" makes things more explicit about what external consumers
> should be depending on, and what inter-project components should depend on
> (e.g. explicitly limiting the scope of vendored code).
> 4.  Extensions are essentially Python, which might be easier to work with
> than CMake.

Those sound nice.

> Cons:
> 1.  Bazel is opinionated on C++ layout.  In particular it requires some
> workarounds to deal with circular .h/.cc dependencies.  The two main ways
> of doing this are either increasing the size of compilable units [4] to
> span all dependencies in the cycle, or creating separate
> header/implementation targets, I've used both strategies in the PR.  One
> could argue that it would be nice to reduce circular dependencies in
> general.

Can you give an example of circular dependency?  Can this be solved by
having more "type_fwd.h" headers for forward declarations of opaque types?

(also, generally, it would be desirable to use more of these, since our
compile times have become egregious as of late - I'm currently
considering replacing my 8-core desktop CPU with a beefier one :-/)

> 4.  It is more verbose to configure than CMake (each compilation unit needs
> to be spelled out with dependencies).

This sounds really like a bummer. Do you have to spell those out by
hand?  Or is there some tool that infers dependencies and generates the
declarations for you?

Regards

Antoine.


Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-26 Thread Sutou Kouhei
+1 (binding)

In 
  "[VOTE] Clarifications and forward compatibility changes for Dictionary 
Encoding (second iteration)" on Wed, 20 Nov 2019 20:41:57 -0800,
  Micah Kornfield  wrote:

> Hello,
> As discussed on [1], I've proposed clarifications in a PR [2] that
> clarifies:
> 
> 1.  It is not required that all dictionary batches occur at the beginning
> of the IPC stream format (if the first record batch has an all-null
> dictionary-encoded column, the null column's dictionary might not be sent
> until later in the stream).
> 
> 2.  A second dictionary batch for the same ID that is not a "delta batch"
> in an IPC stream indicates the dictionary should be replaced.
> 
> 3.  Clarifies that the file format can only contain one "non-delta"
> dictionary batch and multiple "delta" dictionary batches. Dictionary
> replacement is not supported in the file format.
> 
> 4.  Adds an enum to dictionary metadata for possible future changes in the
> format in which dictionary batches can be sent (the most likely being a
> Map-type array). An enum is needed as a placeholder to allow for forward
> compatibility past the 1.0.0 release.
> 
> If accepted there will be work in all implementations to make sure that
> they cover the edge cases highlighted and additional integration testing
> will be needed.
> 
> Please vote whether to accept these additions. The vote will be open for at
> least 72 hours.
> 
> [ ] +1 Accept these changes to the specification
> [ ] +0
> [ ] -1 Do not accept the changes because...
> 
> Thanks,
> Micah
> 
> 
> [1]
> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
> [2] https://github.com/apache/arrow/pull/5585
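
For illustration, a minimal sketch of the replacement-vs-delta bookkeeping a 
stream reader would need under points 1-3 above. The {{DictionaryBatchMessage}} 
type is a made-up stand-in, not the actual Java IPC API:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Made-up stand-in for the metadata a reader sees for a dictionary batch.
class DictionaryBatchMessage {
  long id;             // dictionary id
  boolean isDelta;     // true -> append to the existing dictionary
  List<Object> values; // decoded dictionary values
}

class StreamDictionaryState {
  private final Map<Long, List<Object>> dictionaries = new HashMap<>();

  void onDictionaryBatch(DictionaryBatchMessage batch) {
    if (batch.isDelta) {
      // Delta batch: append entries to the existing dictionary.
      dictionaries.computeIfAbsent(batch.id, k -> new ArrayList<>())
          .addAll(batch.values);
    } else {
      // Non-delta batch for an id already seen: replace the dictionary
      // (point 2). In the file format this case is not allowed (point 3).
      dictionaries.put(batch.id, new ArrayList<>(batch.values));
    }
  }

  List<Object> lookup(long id) {
    // May be null early in a stream: an all-null dictionary-encoded column
    // may not have had its dictionary sent yet (point 1).
    return dictionaries.get(id);
  }
}
{code}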


[jira] [Created] (ARROW-7263) [C++][Gandiva] Implement locate and position functions

2019-11-26 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-7263:
-

 Summary: [C++][Gandiva] Implement locate and position functions
 Key: ARROW-7263
 URL: https://issues.apache.org/jira/browse/ARROW-7263
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Projjal Chanda
Assignee: Projjal Chanda


Add {{int32 locate(utf8, utf8, int32)}} and {{int32 locate(utf8, utf8)}} 
functions. Same for {{position}}.





[jira] [Created] (ARROW-7262) [C++][Gandiva] Implement replace function in Gandiva

2019-11-26 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-7262:
-

 Summary: [C++][Gandiva] Implement replace function in Gandiva
 Key: ARROW-7262
 URL: https://issues.apache.org/jira/browse/ARROW-7262
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Projjal Chanda
Assignee: Projjal Chanda


Add a _utf8 replace(utf8, utf8, utf8)_ function in Gandiva.





[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type

2019-11-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7261:


 Summary: [Python] Python support for fixed size list type
 Key: ARROW-7261
 URL: https://issues.apache.org/jira/browse/ARROW-7261
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is 
not yet exposed in Python.





Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0

2019-11-26 Thread Sutou Kouhei
OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901

In 
  "Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 
Nov 2019 21:23:34 -0600,
  Wes McKinney  wrote:

> I'd be interested in maintaining gcc 4.8 support for a time yet, but I'm
> interested in the opinions of others
> 
> On Mon, Nov 25, 2019 at 9:14 PM Sutou Kouhei  wrote:
>>
>> > - test-ubuntu-14.04-cpp:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp
>>
>> Error message:
>>
>>   /arrow/cpp/src/arrow/dataset/filter_test.cc:80:194: error: invalid suffix 
>> on literal; C++11 requires a space between literal and identifier 
>> [-Werror=literal-suffix]
>>  ASSERT_EQ("a"_.ToString(), "a");
>>   ^
>>
>> It seems that g++ on Ubuntu 14.04 is too old.
>> I think that we can drop support for Ubuntu 14.04 because
>> it has reached EOL: https://ubuntu.com/about/release-cycle
>>
>> Can we remove this test job?
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <5ddbd09a.1c69fb81.165b7.f...@mx.google.com>
>>   "[NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 Nov 
>> 2019 05:01:14 -0800 (PST),
>>   Crossbow  wrote:
>>
>> >
>> > Arrow Build Report for Job nightly-2019-11-25-0
>> >
>> > All tasks: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0
>> >
>> > Failed Tasks:
>> > - homebrew-cpp:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-travis-homebrew-cpp
>> > - test-conda-python-2.7-pandas-master:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-2.7-pandas-master
>> > - test-conda-python-3.7-dask-latest:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-latest
>> > - test-conda-python-3.7-dask-master:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-master
>> > - test-conda-python-3.7-hdfs-2.9.2:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-hdfs-2.9.2
>> > - test-conda-python-3.7-pandas-latest:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-latest
>> > - test-conda-python-3.7-pandas-master:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-master
>> > - test-conda-python-3.7-spark-master:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-spark-master
>> > - test-conda-python-3.7-turbodbc-latest:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-latest
>> > - test-conda-python-3.7-turbodbc-master:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-master
>> > - test-conda-python-3.7:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7
>> > - test-debian-10-rust-nightly-2019-09-25:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-debian-10-rust-nightly-2019-09-25
>> > - test-ubuntu-14.04-cpp:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp
>> > - test-ubuntu-fuzzit:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-fuzzit
>> >
>> > Succeeded Tasks:
>> > - centos-6:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-6
>> > - centos-7:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-7
>> > - centos-8:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-8
>> > - conda-linux-gcc-py27:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py27
>> > - conda-linux-gcc-py36:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py36
>> > - conda-linux-gcc-py37:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py37
>> > - conda-osx-clang-py27:
>> >   URL: 
>> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-osx-clang-py27
>> > - conda-osx-clang-py36:
>> >   URL: 
>> > 

[jira] [Created] (ARROW-7260) [CI] Ubuntu 14.04 test is failed by user defined literal

2019-11-26 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-7260:
---

 Summary: [CI] Ubuntu 14.04 test is failed by user defined literal
 Key: ARROW-7260
 URL: https://issues.apache.org/jira/browse/ARROW-7260
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


https://circleci.com/gh/ursa-labs/crossbow/5329

{noformat}
  /arrow/cpp/src/arrow/dataset/filter_test.cc:80:194: error: invalid suffix on 
literal; C++11 requires a space between literal and identifier 
[-Werror=literal-suffix]
 ASSERT_EQ("a"_.ToString(), "a");
  ^
{noformat}


