Re: Datasets and Java
Hi Wes and Micah,

Thanks for your kind replies.

Micah: We don't use Spark's (vectorized) Parquet reader because it is a pure Java implementation, and its performance could be worse than doing similar work natively. Another reason is that we may need to integrate some other specific data sources with Arrow Datasets; to limit the workload, we would like to maintain a common read pipeline for both those sources and widely used ones like Parquet and CSV.

Wes: Yes, the Datasets framework and the Parquet/CSV/... reader implementations are entirely native, so a JNI bridge will be needed and we won't actually read files in Java. Another concern of mine is how many C++ Datasets components should be bridged via JNI. For example, bridge the ScanTask only? Or bridge more components, including Scanner, Table, even the DataSource discovery system? Or just bridge the C++ Arrow Parquet and ORC readers (as Micah said, orc-jni is already there) and reimplement everything Datasets needs in Java? This is not easy to decide, but based on my limited perspective I would prefer to start from the ScanTask layer: that way we could leverage some of the valuable work finished in C++ Datasets without having to maintain too much tedious JNI code, and the real IO still takes place inside the C++ readers when we do a scan operation.

So Wes, Micah, is this similar to what you had in mind?

Thanks, Hongze

At 2019-11-27 12:39:52, "Micah Kornfield" wrote: >Hi Hongze, >To add to Wes's point, there are already some efforts to do JNI for ORC >(which needs to be integrated with CI) and some open PRs for Parquet in the >project. However, given that you are using Spark I would expect there is >already dataset functionality that is equivalent to the dataset API to do >rowgroup/partition level filtering. Can you elaborate on what problems you >are seeing with those and what additional use cases you have? 
> >Thanks, >Micah > > >On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney wrote: > >> hi Hongze, >> >> The Datasets functionality is indeed extremely useful, and it may make >> sense to have it available in many languages eventually. With Java, I >> would raise the issue that things are comparatively weaker there when >> it comes to actually reading the files themselves. Whereas we have >> reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet >> in C++ the same is not true in Java. Not a deal breaker but worth >> taking into consideration. >> >> I wonder aloud whether it might be worth investing in a JNI-based >> interface to the C++ libraries as one potential approach to save on >> development time. >> >> - Wes >> >> >> >> On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang wrote: >> > >> > Hi all, >> > >> > >> > Recently the datasets API has been improved a lot and I found some of >> the new features are very useful to my own work. For example to me a >> important one is the fix of ARROW-6952[1]. And as I currently work on >> Java/Scala projects like Spark, I am now investigating a way to call some >> of the datasets APIs in Java so that I could gain performance improvement >> from native dataset filters/projectors. Meantime I am also interested in >> the ability of scanning different data sources provided by dataset API. >> > >> > >> > Regarding using datasets in Java, my initial idea is to port (by writing >> Java-version implementations) some of the high-level concepts in Java such >> as DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call >> lower level record batch iterators via JNI. This way we seem to retain >> performance advantages from c++ dataset code. >> > >> > >> > Is anyone interested in this topic also? Or is this something already on >> the development plan? Any feedback or thoughts would be much appreciated. >> > >> > >> > Best, >> > Hongze >> > >> > >> > [1] https://issues.apache.org/jira/browse/ARROW-6952 >>
[Result] [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)
The vote carries with 3 binding +1 votes, 1 non-binding +1 vote, and 1 non-binding +0.5 vote. To follow up I will: 1. Open JIRAs for work items in the reference implementations (C++/Java). 2. Merge the pull request containing the specification changes. Thanks, Micah On Tue, Nov 26, 2019 at 12:50 AM Sutou Kouhei wrote: > +1 (binding) > > In > "[VOTE] Clarifications and forward compatibility changes for Dictionary > Encoding (second iteration)" on Wed, 20 Nov 2019 20:41:57 -0800, > Micah Kornfield wrote: > > > Hello, > > As discussed on [1], I've proposed clarifications in a PR [2] that > > clarifies: > > > > 1. It is not required that all dictionary batches occur at the beginning > > of the IPC stream format (if the first record batch has an all null > > dictionary encoded column, the null column's dictionary might not be sent > > until later in the stream). > > > > 2. A second dictionary batch for the same ID that is not a "delta batch" > > in an IPC stream indicates the dictionary should be replaced. > > > > 3. Clarifies that the file format can only contain 1 "NON-delta" > > dictionary batch and multiple "delta" dictionary batches. Dictionary > > replacement is not supported in the file format. > > > > 4. Add an enum to dictionary metadata for possible future changes in > what > > format dictionary batches can be sent. (the most likely would be an array > > Map). An enum is needed as a placeholder to allow for > forward > > compatibility past the release 1.0.0. > > > > If accepted there will be work in all implementations to make sure that > > they cover the edge cases highlighted and additional integration testing > > will be needed. > > > > Please vote whether to accept these additions. The vote will be open for > at > > least 72 hours. > > > > [ ] +1 Accept these changes to the specification > > [ ] +0 > > [ ] -1 Do not accept the changes because... 
> > > > Thanks, > > Micah > > > > > > [1] > > > https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E > > [2] https://github.com/apache/arrow/pull/5585 >
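The stream semantics clarified in points 2 and 3 above can be sketched in a few lines of Python (an illustrative model only, not Arrow library code):

```python
# Model of the clarified IPC *stream* semantics: a delta batch appends
# to the current dictionary for an id; a non-delta batch for an existing
# id replaces it. (Per point 3, replacement is NOT allowed in the file
# format -- there, at most one non-delta batch per id may appear.)
dictionaries = {}

def on_dictionary_batch(dict_id, values, is_delta):
    if is_delta:
        dictionaries.setdefault(dict_id, []).extend(values)
    else:
        dictionaries[dict_id] = list(values)  # replacement

on_dictionary_batch(0, ["a", "b"], is_delta=False)
on_dictionary_batch(0, ["c"], is_delta=True)    # delta: append
assert dictionaries[0] == ["a", "b", "c"]
on_dictionary_batch(0, ["x"], is_delta=False)   # non-delta: replace
assert dictionaries[0] == ["x"]
```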
Re: [DISCUSS][C++/Python] Bazel example
Hi Antoine,

> My question would be: what happens after the PR is merged? Are > developers supposed to keep the Bazel setup working in addition to > CMake? Or is there a dedicated maintainer (you? :-)) to fix regressions > when they happen?

In the short term, I would be willing to be a dedicated maintainer for Mac (and, once I get it working, for Linux as well). I'd like to classify the support as very experimental (not advertised in the documentation yet). If other devs find Bazel useful, I would expect others to help with maintenance naturally. If it gets too much for me to maintain, I'm willing to drop support completely, since it won't be a critical part of the build infrastructure. Once the setup is more complete, I would plan on adding a CI target for it as well.

> Can you give an example of circular dependency? Can this be solved by > having more "type_fwd.h" headers for forward declarations of opaque types?

I think the type_fwd.h might contribute to the problem. The solution would be more granular header/compilation units when possible (or combining targets appropriately). An example of the problem is expression.h/.cc and operation.h/.cc in the compute library. Because operation.cc depends on expression.h and expression.cc relies on operation.h, there is a cycle between the two targets. I fixed this by making a new header-only target for expression.h, which the operation target depends on. Then the expression target depends on the operation target. An alternative approach would be to combine "expression.*" and "operation.*" into a single target.

> (also, generally, it would be desirable to use more of these, since our > compile times have become egregious as of late - I'm currently > considering replacing my 8-core desktop CPU with a beefier one :-/)

I'm not a huge fan of this approach in general, but since I haven't been able to contribute on a day-to-day basis to the C++ code base, I'll let the active contributors decide the best course here. 
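The header-only-target fix described above might look roughly like this in a BUILD file (a sketch only; target names and attributes are assumptions, not Arrow's actual build files):

```
# expression.h gets its own header-only target, so operation can
# depend on the header without pulling in expression.cc (which
# itself needs operation) -- breaking the cycle.
cc_library(
    name = "expression_headers",
    hdrs = ["expression.h"],
)

cc_library(
    name = "operation",
    srcs = ["operation.cc"],
    hdrs = ["operation.h"],
    deps = [":expression_headers"],
)

cc_library(
    name = "expression",
    srcs = ["expression.cc"],
    deps = [
        ":expression_headers",  # its own header lives in the header-only target
        ":operation",           # expression.cc uses operation.h
    ],
)
```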
I thought computer upgrades were something to look forward to ;)

> This sounds really like a bummer. Do you have to spell those out by hand? Or is there some tool that infers dependencies and generates the declarations for you?

Yes, I had to spell them out by hand. There is an internal tool at Google that helps with it (I didn't use it for this PR). There has been some discussion of open-sourcing the tool [1], but I wouldn't expect it any time soon. Luckily things are fairly well modularized at the moment, so while painful, it was not tremendously so. Another solution would be to have larger targets (e.g. one per directory) that use globs, which would make it less painful, but this loses some of the benefits mentioned above.

[1] https://github.com/bazelbuild/bazel/issues/6871

On Tue, Nov 26, 2019 at 1:27 AM Antoine Pitrou wrote: > > Hi Micah, > > Le 26/11/2019 à 05:52, Micah Kornfield a écrit : > > > > After going through this exercise I put together a list of pros and cons > > below. > > > > I would like to hear from other devs: > > 1. Their opinions on setting this up as an alternative system (I'm > willing > > to invest some more time in it). > > 2. What people think the minimum bar for merging a PR like this should > be? > > My question would be: what happens after the PR is merged? Are > developers supposed to keep the Bazel setup working in addition to > CMake? Or is there a dedicated maintainer (you? :-)) to fix regressions > when they happen? > > > Pros: > > 1. Being able to run "bazel test python/..." and having compilation of > all > > python dependencies just work is a nice experience. > > 2. Because of the granular compilation units, it can improve developer > > velocity. Unit tests can depend only on the sub-components they are meant > > to test. They don't need to compile and relink arrow.so. > > 3. 
The built-in documentation it provides about visibility and > > relationships between components is nice (it's uncovered some "interesting > > dependencies"). I didn't make heavy use of it, but its concept of > > "visibility" makes things more explicit about what external consumers > > should be depending on, and what inter-project components should depend > on > > (e.g. explicitly limit the scope of vendored code). > > 4. Extensions are essentially python, which might be easier to work with > > than CMake > > Those sound nice. > > > Cons: > > 1. Bazel is opinionated on C++ layout. In particular it requires some > > workarounds to deal with circular .h/.cc dependencies. The two main ways > > of doing this are either increasing the size of compilable units [4] to > > span all dependencies in the cycle, or creating separate > > header/implementation targets, I've used both strategies in the PR. One > > could argue that it would be nice to reduce circular dependencies in > > general. > > Can you give an example of circular dependency? Can this be solved by > having more "type_fwd.h"
Re: Datasets and Java
Hi Hongze, To add to Wes's point, there are already some efforts to do JNI for ORC (which needs to be integrated with CI) and some open PRs for Parquet in the project. However, given that you are using Spark I would expect there is already dataset functionality that is equivalent to the dataset API to do rowgroup/partition level filtering. Can you elaborate on what problems you are seeing with those and what additional use cases you have? Thanks, Micah On Tue, Nov 26, 2019 at 1:10 PM Wes McKinney wrote: > hi Hongze, > > The Datasets functionality is indeed extremely useful, and it may make > sense to have it available in many languages eventually. With Java, I > would raise the issue that things are comparatively weaker there when > it comes to actually reading the files themselves. Whereas we have > reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet > in C++ the same is not true in Java. Not a deal breaker but worth > taking into consideration. > > I wonder aloud whether it might be worth investing in a JNI-based > interface to the C++ libraries as one potential approach to save on > development time. > > - Wes > > > > On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang wrote: > > > > Hi all, > > > > > > Recently the datasets API has been improved a lot and I found some of > the new features are very useful to my own work. For example to me a > important one is the fix of ARROW-6952[1]. And as I currently work on > Java/Scala projects like Spark, I am now investigating a way to call some > of the datasets APIs in Java so that I could gain performance improvement > from native dataset filters/projectors. Meantime I am also interested in > the ability of scanning different data sources provided by dataset API. 
> > > > > > Regarding using datasets in Java, my initial idea is to port (by writing > Java-version implementations) some of the high-level concepts in Java such > as DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call > lower level record batch iterators via JNI. This way we seem to retain > performance advantages from c++ dataset code. > > > > > > Is anyone interested in this topic also? Or is this something already on > the development plan? Any feedback or thoughts would be much appreciated. > > > > > > Best, > > Hongze > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-6952 >
Re: Unions: storing type_ids or type_codes?
Hi Antoine,

For Java, the physical child id is the same as the logical type code, as the index of each child vector is the code (ordinal) of the vector's minor type. This leads to a problem: only a single vector of each type can exist in a union vector, so, strictly speaking, the Java implementation is not consistent with the Arrow specification. (Micah pointed this out long ago.)

Best, Liya Fan

On Tue, Nov 26, 2019 at 9:59 PM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > It seems that the array_union_test.cc does the latter, look at how > `expected_types` is constructed. I opened > https://issues.apache.org/jira/browse/ARROW-7265 . > > Wes, is the intended usage of type_ids to allow a producer to pass a > subset columns of unions without modifying the type codes? > > François > > > On Thu, Nov 21, 2019 at 10:51 AM Antoine Pitrou > wrote: > > > > > > Hello, > > > > There's some ambiguity whether a union array's "types" buffer stores > > physical child ids, or logical type codes. > > > > Some of our C++ tests assume the former: > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123 > > > > Some of our C++ tests assume the latter: > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326 > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955 > > > > Critically, no validation of union data is currently implemented in C++ > > (ARROW-6157). I can't parse the Java source code. > > > > Regards > > > > Antoine. > > >
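The distinction under discussion, logical type codes in the types buffer resolved to physical child indices via the union's typeIds metadata, can be sketched in a few lines of Python (illustrative values only, not Arrow library code):

```python
# Union metadata: child i carries logical type code type_ids[i].
type_ids = [5, 10]          # typeIds from the Union metadata
types_buffer = [5, 10, 5]   # per-slot logical codes (int8 per the spec)

# Resolve each slot's logical code to a physical child index:
physical = [type_ids.index(code) for code in types_buffer]
assert physical == [0, 1, 0]

# The Java layout described above is the degenerate case where
# type_ids[i] == i, i.e. the logical code *is* the child index.
```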
[jira] [Created] (ARROW-7268) Propagate `custom_metadata` field from IPC message
Martin Grund created ARROW-7268: --- Summary: Propagate `custom_metadata` field from IPC message Key: ARROW-7268 URL: https://issues.apache.org/jira/browse/ARROW-7268 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Martin Grund Right now, the custom metadata field in the Schema IPC message is not propagated from the IPC message to the internal data type. To move closer to parity with the other implementations, it would be good to add the necessary logic to serialize and deserialize it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0
OK, so the proposal is not only to drop support for Ubuntu 14.04 but also to stop supporting gcc < 4.9, is that right? Since manylinux1 is gcc 4.8.5 as long as the _libraries_ build then that is okay. I don't know what the implications of dropping manylinux1 (in favor of manylinux2010) would be On Tue, Nov 26, 2019 at 9:45 AM Antoine Pitrou wrote: > > > I'd rather drop 14.04 rather than spend some time maintaining kludges > for old compilers. > > Regards > > Antoine. > > > On Tue, 26 Nov 2019 17:24:58 +0900 (JST) > Sutou Kouhei wrote: > > > OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901 > > > > In > > "Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, > > 25 Nov 2019 21:23:34 -0600, > > Wes McKinney wrote: > > > > > I'd be interested to maintain gcc 4.8 support for a time yet but I'm > > > interested in the opinions of others > > > > > > On Mon, Nov 25, 2019 at 9:14 PM Sutou Kouhei wrote: > > >> > > >> > - test-ubuntu-14.04-cpp: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp > > >> > > >> Error message: > > >> > > >> /arrow/cpp/src/arrow/dataset/filter_test.cc:80:194: error: invalid > > >> suffix on literal; C++11 requires a space between literal and identifier > > >> [-Werror=literal-suffix] > > >> ASSERT_EQ("a"_.ToString(), "a"); > > >> ^ > > >> > > >> It seems that g++ on Ubuntu 14.04 is old. > > >> I think that we can drop support for Ubuntu 14.04 because > > >> it reaches EOL: https://ubuntu.com/about/release-cycle > > >> > > >> Can we remove this test job? 
> > >> > > >> > > >> Thanks, > > >> -- > > >> kou > > >> > > >> In <5ddbd09a.1c69fb81.165b7.f...@mx.google.com> > > >> "[NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 > > >> Nov 2019 05:01:14 -0800 (PST), > > >> Crossbow wrote: > > >> > > >> > > > >> > Arrow Build Report for Job nightly-2019-11-25-0 > > >> > > > >> > All tasks: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0 > > >> > > > >> > Failed Tasks: > > >> > - homebrew-cpp: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-travis-homebrew-cpp > > >> > - test-conda-python-2.7-pandas-master: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-2.7-pandas-master > > >> > - test-conda-python-3.7-dask-latest: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-latest > > >> > - test-conda-python-3.7-dask-master: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-master > > >> > - test-conda-python-3.7-hdfs-2.9.2: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-hdfs-2.9.2 > > >> > - test-conda-python-3.7-pandas-latest: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-latest > > >> > - test-conda-python-3.7-pandas-master: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-master > > >> > - test-conda-python-3.7-spark-master: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-spark-master > > >> > - test-conda-python-3.7-turbodbc-latest: > > 
>> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-latest > > >> > - test-conda-python-3.7-turbodbc-master: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-master > > >> > - test-conda-python-3.7: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7 > > >> > - test-debian-10-rust-nightly-2019-09-25: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-debian-10-rust-nightly-2019-09-25 > > >> > - test-ubuntu-14.04-cpp: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp > > >> > - test-ubuntu-fuzzit: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-fuzzit > > >> > > > >> > Succeeded Tasks: > > >> > - centos-6: > > >> > URL: > > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-6 > > >> > - centos-7: > > >> > URL: > > >> >
Re: Datasets and Java
hi Hongze, The Datasets functionality is indeed extremely useful, and it may make sense to have it available in many languages eventually. With Java, I would raise the issue that things are comparatively weaker there when it comes to actually reading the files themselves. Whereas we have reasonably fast Arrow-based interfaces to CSV, JSON, ORC, and Parquet in C++ the same is not true in Java. Not a deal breaker but worth taking into consideration. I wonder aloud whether it might be worth investing in a JNI-based interface to the C++ libraries as one potential approach to save on development time. - Wes On Tue, Nov 26, 2019 at 5:54 AM Hongze Zhang wrote: > > Hi all, > > > Recently the datasets API has been improved a lot and I found some of the new > features are very useful to my own work. For example to me a important one is > the fix of ARROW-6952[1]. And as I currently work on Java/Scala projects like > Spark, I am now investigating a way to call some of the datasets APIs in Java > so that I could gain performance improvement from native dataset > filters/projectors. Meantime I am also interested in the ability of scanning > different data sources provided by dataset API. > > > Regarding using datasets in Java, my initial idea is to port (by writing > Java-version implementations) some of the high-level concepts in Java such as > DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call lower > level record batch iterators via JNI. This way we seem to retain performance > advantages from c++ dataset code. > > > Is anyone interested in this topic also? Or is this something already on the > development plan? Any feedback or thoughts would be much appreciated. > > > Best, > Hongze > > > [1] https://issues.apache.org/jira/browse/ARROW-6952
Re: Union type ids - signed or unsigned?
Thanks for all the answers. The assumptions about union types in C++ code are fixed in https://github.com/apache/arrow/pull/5892 Regards Antoine. Le 25/11/2019 à 16:41, Wes McKinney a écrit : > On Mon, Nov 25, 2019 at 9:25 AM Antoine Pitrou wrote: >> >> On Mon, 25 Nov 2019 09:12:21 -0600 >> Wes McKinney wrote: >>> On Mon, Nov 25, 2019 at 8:52 AM Antoine Pitrou wrote: Hello, The spec has the following language about union type ids: """ Types buffer: A buffer of 8-bit signed integers. Each type in the union has a corresponding type id whose values are found in this buffer. A union with more than 127 possible types can be modeled as a union of unions. """ https://arrow.apache.org/docs/format/Columnar.html#union-layout However, in several places the C++ code assumes type ids are unsigned. Java doesn't seem to implement type ids (and there is no integration task for union types). In the flatbuffers description, the type ids array is modeled as an array of signed 32-bit integers. Moreover, according to the language above, type ids should be restricted to the [0, 127] interval? Which one should it be? >>> >>> The (optional) type ids in the metadata provide a correspondence >>> between the union types / children and the values found in the types >>> buffer (data). As stated in the spec, the types buffer are 8-bit >>> signed integers. As I recall the reason that we used [ Int ] in the >>> metadata was that the Int type is thought to be easier for languages >>> to work with in general when serializing/deserializing the metadata. >> >> Ok, but is there a reason the C++ code uses `std::vector` for >> the type codes? > > Oversight on my part. Suggest we change to int8_t > >> Regards >> >> Antoine. >> >>
[jira] [Created] (ARROW-7267) [CI] [C++] Tests not run on "AMD64 Windows 2019 C++"
Antoine Pitrou created ARROW-7267: - Summary: [CI] [C++] Tests not run on "AMD64 Windows 2019 C++" Key: ARROW-7267 URL: https://issues.apache.org/jira/browse/ARROW-7267 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Antoine Pitrou We build the tests ({{ARROW_BUILD_TESTS=ON}}) but we don't run them: https://github.com/apache/arrow/pull/5608/checks?check_run_id=321619958 cc [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Is FileSystem._isfilestore considered public?
Generally speaking, this API is obsolete (though not formally deprecated yet), so we don't envision changing it significantly in the future. We hope that in the near future the new pyarrow FileSystem API will be usable directly from pyarrow.parquet. Regards Antoine. On 26/11/2019 at 15:34, Tom Augspurger wrote: > Hi, > > In https://github.com/dask/dask/issues/5526, we're seeing an issue stemming > from a hack to ensure compatibility for Pyarrow. The details aren't too > important. The core of the issue is that the Pyarrow parquet writer makes a > couple checks for `FileSystem._isfilestore` via `_mkdir_if_not_exists`, > e.g. in > https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/parquet.py#L1349-L1350 > . > > Is it OK for my FileSystem subclass to override _isfilestore? Is it > considered public? > > Thanks, > > Tom >
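The pattern Tom asks about can be illustrated with a self-contained stub (this is NOT pyarrow's real implementation; only the method names `_isfilestore` and `_mkdir_if_not_exists` come from the thread):

```python
# Stub base class standing in for pyarrow's legacy FileSystem.
class FileSystem:
    def _isfilestore(self):
        # The parquet writer consults this before creating directories.
        return False

    def mkdir(self, path):
        self.created = getattr(self, "created", []) + [path]

def _mkdir_if_not_exists(fs, path):
    # Mirrors the check linked above: only "file store" filesystems
    # get directories created for them.
    if fs._isfilestore():
        fs.mkdir(path)

# A subclass overriding _isfilestore to opt in to directory creation:
class MyFileSystem(FileSystem):
    def _isfilestore(self):
        return True

fs = MyFileSystem()
_mkdir_if_not_exists(fs, "part=0")
assert fs.created == ["part=0"]
```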
Re: Non-chunked large files / hdf5 support
Hello Maarten,

In theory, you could provide a custom mmap-allocator and use the builder facility. Since the array is still in "build-phase" and not sealed, it should be fine if mremap changes the pointer address. This might fail in practice since the allocator is also used for auxiliary data, e.g. dictionary hash table data in the case of Dictionary type.

Another solution is to create a `FixedBuilder` class where
- the number of elements is known
- the data type is of fixed width
- nullability is known (whether you need an extra buffer).

I think sooner or later we'll need such a class.

François

On Tue, Nov 26, 2019 at 10:01 AM Maarten Breddels wrote: > > In vaex I always write the data to hdf5 as 1 large chunk (per column). > The reason is that it allows the mmapped columns to be exposed as a > single numpy array (talking numerical data only for now), which many > people are quite comfortable with. > > The strategy for vaex to write unchunked data, is to first create an > 'empty' hdf5 file (filled with zeros), mmap those huge arrays, and > write to that in chunks. > > This means that in vaex I need to support mutable data (only used > internally, vaex' default is immutable data like arrow), since I need > to write to the memory mapped data. It also makes the exporting code > relatively simple. > > I could not find a way in Arrow to get something similar done, at > least not without having a single pa.array instance for each column. I > think Arrow's mindset is that you should just use chunks right? Or is > this also something that can be considered for Arrow? > > An alternative would be to implement Arrow in hdf5, which I basically > do now in vaex (with limited support). Again, I'm wondering if there > is there an interest in storing arrow data in hdf5 from the Arrow > community? > > cheers, > > Maarten
Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0
I'd rather drop 14.04 rather than spend some time maintaining kludges for old compilers. Regards Antoine. On Tue, 26 Nov 2019 17:24:58 +0900 (JST) Sutou Kouhei wrote: > OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901 > > In > "Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 > Nov 2019 21:23:34 -0600, > Wes McKinney wrote: > > > I'd be interested to maintain gcc 4.8 support for a time yet but I'm > > interested in the opinions of others > > > > On Mon, Nov 25, 2019 at 9:14 PM Sutou Kouhei wrote: > >> > >> > - test-ubuntu-14.04-cpp: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp > >> > > >> > >> Error message: > >> > >> /arrow/cpp/src/arrow/dataset/filter_test.cc:80:194: error: invalid > >> suffix on literal; C++11 requires a space between literal and identifier > >> [-Werror=literal-suffix] > >> ASSERT_EQ("a"_.ToString(), "a"); > >> ^ > >> > >> It seems that g++ on Ubuntu 14.04 is old. > >> I think that we can drop support for Ubuntu 14.04 because > >> it reaches EOL: https://ubuntu.com/about/release-cycle > >> > >> Can we remove this test job? 
> >> > >> > >> Thanks, > >> -- > >> kou > >> > >> In <5ddbd09a.1c69fb81.165b7.f...@mx.google.com> > >> "[NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 > >> Nov 2019 05:01:14 -0800 (PST), > >> Crossbow wrote: > >> > >> > > >> > Arrow Build Report for Job nightly-2019-11-25-0 > >> > > >> > All tasks: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0 > >> > > >> > Failed Tasks: > >> > - homebrew-cpp: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-travis-homebrew-cpp > >> > - test-conda-python-2.7-pandas-master: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-2.7-pandas-master > >> > - test-conda-python-3.7-dask-latest: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-latest > >> > - test-conda-python-3.7-dask-master: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-master > >> > - test-conda-python-3.7-hdfs-2.9.2: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-hdfs-2.9.2 > >> > - test-conda-python-3.7-pandas-latest: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-latest > >> > - test-conda-python-3.7-pandas-master: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-master > >> > - test-conda-python-3.7-spark-master: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-spark-master > >> > - test-conda-python-3.7-turbodbc-latest: > >> > URL: > >> > 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-latest > >> > - test-conda-python-3.7-turbodbc-master: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-master > >> > - test-conda-python-3.7: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7 > >> > - test-debian-10-rust-nightly-2019-09-25: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-debian-10-rust-nightly-2019-09-25 > >> > - test-ubuntu-14.04-cpp: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp > >> > - test-ubuntu-fuzzit: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-fuzzit > >> > > >> > Succeeded Tasks: > >> > - centos-6: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-6 > >> > - centos-7: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-7 > >> > - centos-8: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-8 > >> > - conda-linux-gcc-py27: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py27 > >> > - conda-linux-gcc-py36: > >> > URL: > >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py36 > >> > - conda-linux-gcc-py37: > >> > URL: >
Non-chunked large files / hdf5 support
In vaex I always write the data to hdf5 as one large chunk (per column). The reason is that it allows the mmapped columns to be exposed as a single numpy array (talking numerical data only for now), which many people are quite comfortable with. The strategy for vaex to write unchunked data is to first create an 'empty' hdf5 file (filled with zeros), mmap those huge arrays, and write to them in chunks. This means that in vaex I need to support mutable data (only used internally; vaex's default is immutable data, like arrow), since I need to write to the memory-mapped data. It also makes the exporting code relatively simple. I could not find a way in Arrow to get something similar done, at least not while keeping a single pa.array instance for each column. I think Arrow's mindset is that you should just use chunks, right? Or is this also something that can be considered for Arrow? An alternative would be to implement Arrow in hdf5, which I basically do now in vaex (with limited support). Again, I'm wondering if there is interest from the Arrow community in storing arrow data in hdf5? cheers, Maarten
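The preallocate-then-fill pattern described above can be sketched with a plain numpy memmap standing in for the hdf5 dataset (with h5py one would preallocate a dataset of the final shape in the same way); the function name and chunking are illustrative, not vaex's actual code:

```python
import numpy as np

def export_column(path, n, chunk_size=4):
    """Sketch of vaex's export strategy: preallocate one large on-disk
    column, then fill the mutable mapped memory chunk by chunk."""
    # 1. create the 'empty' file up front (zero-filled, at its final size)
    col = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))
    # 2. write into the memory map in chunks (this is the mutable step)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        col[start:stop] = np.arange(start, stop, dtype=np.float64)
    col.flush()
    # 3. readers later mmap the same file read-only and see one single,
    #    contiguous (unchunked) array per column
    return np.memmap(path, dtype=np.float64, mode="r", shape=(n,))
```

The key property is step 3: because the column was written as one contiguous block, it can be exposed as a single numpy array with no chunk boundaries.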
Re: Strategy for mixing large_string and string with chunked arrays
Op di 26 nov. 2019 om 15:02 schreef Wes McKinney : > hi Maarten > > I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based > on this. > > I think that normalizing to a common type (which would require casting > the offsets buffer, but not the data -- which can be shared -- so not > too wasteful) during concatenation would be the approach I would take. > I would be surprised if normalizing string offsets during record batch > / table concatenation showed up as a performance or memory use issue > relative to other kinds of operations -- in theory the > string->large_string promotion should be relatively exceptional (< 5% > of the time). I've found in performance tests that creating many > smaller array chunks is faster anyway due to interplay with the memory > allocator. > Yes, I think it is rare, but it does mean that if a user wants to convert a Vaex dataframe to an Arrow table it might use GB's of RAM (thinking ~1 billion rows). Ideally, it would use zero RAM (imagine concatenating many large memory-mapped datasets). I'm ok living with this limitation, but I wanted to raise it before v1.0 goes out. > > Of course I think we should have string kernels for both 32-bit and > 64-bit variants. Note that Gandiva already has significant string > kernel support (for 32-bit offsets at the moment) and there is > discussion about pre-compiling the LLVM IR into a shared library to > not introduce an LLVM runtime dependency, so we could maintain a > single code path for string algorithms that can be used both in a > JIT-ed setting as well as pre-compiled / interpreted setting. See > https://issues.apache.org/jira/browse/ARROW-7083 That is a very interesting approach, thanks for sharing that resource, I'll consider that. > Note that many analytic database engines (notably: Dremio, which is > natively Arrow-based) don't support exceeding the 2GB / 32-bit limit > at all and it does not seem to be an impedance in practical use. 
We > have the Chunked* builder classes [1] in C++ to facilitate the > creation of chunked binary arrays where there is concern about > overflowing the 2GB limit. > > Others may have different opinions so I'll let them comment. > Yes, I think in many cases it's not a problem at all. Also in vaex, all the processing happens in chunks, and no chunk will ever be that large (for the near future...). In vaex, when exporting to hdf5, I always write in 1 chunk, and that's where most of my issues show up. cheers, Maarten > > - Wes > > [1]: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510 > > On Tue, Nov 26, 2019 at 7:44 AM Maarten Breddels > wrote: > > > > Hi Arrow devs, > > > > Small intro: I'm the main Vaex developer, an out of core dataframe > > library for Python - https://github.com/vaexio/vaex -, and we're > > looking into moving Vaex to use Apache Arrow for the data structure. > > At the beginning of this year, we added string support in Vaex, which > > required 64 bit offsets. Those were not available back then, so we > > added our own data structure for string arrays. Our first step to move > > to Apache Arrow is to see if we can use Arrow for the data structure, > > and later on, move the strings algorithms of Vaex to Arrow. > > > > (originally posted at https://github.com/apache/arrow/issues/5874) > > > > In vaex I can lazily concatenate dataframes without memory copy. If I > > want to implement this using a pa.ChunkedArray, users cannot > > concatenate dataframes that have a string column with pa.string type > > to a dataframe that has a column with pa.large_string. > > > > In short, there is no arrow data structure to handle this 'mixed > > chunked array', but I was wondering if this could change. The only way > > out seems to cast them manually to a common type (although blocked by > > https://issues.apache.org/jira/browse/ARROW-6071). 
> > Internally I could solve this in vaex, but feedback from building a > > DataFrame library with arrow might be useful. Also, it means I cannot > > expose the concatenated DataFrame as an arrow table. > > > > Because of this, I am wondering if having two types (large_string and > > string) is a good idea in the end since it makes type checking > > cumbersome (having to check two types each time). Could an option be > > that there is only 1 string and list type, and that the width of the > > indices/offsets can be obtained at runtime? That would also make it > > easy to support 16 and 8-bit offsets. That would make Arrow more > > flexible and efficient, and I guess it would play better with > > pa.ChunkedArray. > > > > Regards, > > > > Maarten Breddels >
[jira] [Created] (ARROW-7266) dictionary_encode() of a slice gives wrong result
Adam Hooper created ARROW-7266: -- Summary: dictionary_encode() of a slice gives wrong result Key: ARROW-7266 URL: https://issues.apache.org/jira/browse/ARROW-7266 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.15.1 Environment: Docker on Linux 5.2.18-200.fc30.x86_64; Python 3.7.4 Reporter: Adam Hooper Steps to reproduce: {code:python} import pyarrow as pa arr = pa.array(["a", "b", "b", "b"])[1:] arr.dictionary_encode() {code} Expected results: {code} -- dictionary: [ "b" ] -- indices: [ 0, 0, 0 ] {code} Actual results: {code} -- dictionary: [ "b", "" ] -- indices: [ 0, 0, 1 ] {code} I don't know a workaround. Converting to pylist and back is too slow. Is there a way to copy the slice to a new offset-0 StringArray that I could then dictionary-encode? Otherwise, I'm considering building buffers by hand -- This message was sent by Atlassian Jira (v8.3.4#803005)
Is FileSystem._isfilestore considered public?
Hi, In https://github.com/dask/dask/issues/5526, we're seeing an issue stemming from a hack to ensure compatibility with Pyarrow. The details aren't too important. The core of the issue is that the Pyarrow parquet writer makes a couple of checks for `FileSystem._isfilestore` via `_mkdir_if_not_exists`, e.g. in https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/parquet.py#L1349-L1350 . Is it OK for my FileSystem subclass to override _isfilestore? Is it considered public? Thanks, Tom
Re: Strategy for mixing large_string and string with chunked arrays
hi Maarten I opened https://issues.apache.org/jira/browse/ARROW-7245 in part based on this. I think that normalizing to a common type (which would require casting the offsets buffer, but not the data -- which can be shared -- so not too wasteful) during concatenation would be the approach I would take. I would be surprised if normalizing string offsets during record batch / table concatenation showed up as a performance or memory use issue relative to other kinds of operations -- in theory the string->large_string promotion should be relatively exceptional (< 5% of the time). I've found in performance tests that creating many smaller array chunks is faster anyway due to interplay with the memory allocator. Of course I think we should have string kernels for both 32-bit and 64-bit variants. Note that Gandiva already has significant string kernel support (for 32-bit offsets at the moment) and there is discussion about pre-compiling the LLVM IR into a shared library to not introduce an LLVM runtime dependency, so we could maintain a single code path for string algorithms that can be used both in a JIT-ed setting as well as pre-compiled / interpreted setting. See https://issues.apache.org/jira/browse/ARROW-7083 Note that many analytic database engines (notably: Dremio, which is natively Arrow-based) don't support exceeding the 2GB / 32-bit limit at all and it does not seem to be an impedance in practical use. We have the Chunked* builder classes [1] in C++ to facilitate the creation of chunked binary arrays where there is concern about overflowing the 2GB limit. Others may have different opinions so I'll let them comment. 
- Wes [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/builder_binary.h#L510 On Tue, Nov 26, 2019 at 7:44 AM Maarten Breddels wrote: > > Hi Arrow devs, > > Small intro: I'm the main Vaex developer, an out of core dataframe > library for Python - https://github.com/vaexio/vaex -, and we're > looking into moving Vaex to use Apache Arrow for the data structure. > At the beginning of this year, we added string support in Vaex, which > required 64 bit offsets. Those were not available back then, so we > added our own data structure for string arrays. Our first step to move > to Apache Arrow is to see if we can use Arrow for the data structure, > and later on, move the strings algorithms of Vaex to Arrow. > > (originally posted at https://github.com/apache/arrow/issues/5874) > > In vaex I can lazily concatenate dataframes without memory copy. If I > want to implement this using a pa.ChunkedArray, users cannot > concatenate dataframes that have a string column with pa.string type > to a dataframe that has a column with pa.large_string. > > In short, there is no arrow data structure to handle this 'mixed > chunked array', but I was wondering if this could change. The only way > out seems to cast them manually to a common type (although blocked by > https://issues.apache.org/jira/browse/ARROW-6071). > Internally I could solve this in vaex, but feedback from building a > DataFrame library with arrow might be useful. Also, it means I cannot > expose the concatenated DataFrame as an arrow table. > > Because of this, I am wondering if having two types (large_string and > string) is a good idea in the end since it makes type checking > cumbersome (having to check two types each time). Could an option be > that there is only 1 string and list type, and that the width of the > indices/offsets can be obtained at runtime? That would also make it > easy to support 16 and 8-bit offsets. 
That would make Arrow more > flexible and efficient, and I guess it would play better with > pa.ChunkedArray. > > Regards, > > Maarten Breddels
Re: Unions: storing type_ids or type_codes?
It seems that array_union_test.cc does the latter; look at how `expected_types` is constructed. I opened https://issues.apache.org/jira/browse/ARROW-7265 . Wes, is the intended usage of type_ids to allow a producer to pass a subset of union columns without modifying the type codes? François On Thu, Nov 21, 2019 at 10:51 AM Antoine Pitrou wrote: > > > Hello, > > There's some ambiguity whether a union array's "types" buffer stores > physical child ids, or logical type codes. > > Some of our C++ tests assume the former: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L107-L123 > > Some of our C++ tests assume the latter: > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array_union_test.cc#L311-L326 > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/json_simple_test.cc#L943-L955 > > Critically, no validation of union data is currently implemented in C++ > (ARROW-6157). I can't parse the Java source code. > > Regards > > Antoine. >
[jira] [Created] (ARROW-7265) [Format][C++] Clarify the usage of typeIds in Union type documentation
Francois Saint-Jacques created ARROW-7265: - Summary: [Format][C++] Clarify the usage of typeIds in Union type documentation Key: ARROW-7265 URL: https://issues.apache.org/jira/browse/ARROW-7265 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques The documentation is unclear.
Strategy for mixing large_string and string with chunked arrays
Hi Arrow devs, Small intro: I'm the main Vaex developer, an out of core dataframe library for Python - https://github.com/vaexio/vaex -, and we're looking into moving Vaex to use Apache Arrow for the data structure. At the beginning of this year, we added string support in Vaex, which required 64 bit offsets. Those were not available back then, so we added our own data structure for string arrays. Our first step to move to Apache Arrow is to see if we can use Arrow for the data structure, and later on, move the strings algorithms of Vaex to Arrow. (originally posted at https://github.com/apache/arrow/issues/5874) In vaex I can lazily concatenate dataframes without memory copy. If I want to implement this using a pa.ChunkedArray, users cannot concatenate dataframes that have a string column with pa.string type to a dataframe that has a column with pa.large_string. In short, there is no arrow data structure to handle this 'mixed chunked array', but I was wondering if this could change. The only way out seems to cast them manually to a common type (although blocked by https://issues.apache.org/jira/browse/ARROW-6071). Internally I could solve this in vaex, but feedback from building a DataFrame library with arrow might be useful. Also, it means I cannot expose the concatenated DataFrame as an arrow table. Because of this, I am wondering if having two types (large_string and string) is a good idea in the end since it makes type checking cumbersome (having to check two types each time). Could an option be that there is only 1 string and list type, and that the width of the indices/offsets can be obtained at runtime? That would also make it easy to support 16 and 8-bit offsets. That would make Arrow more flexible and efficient, and I guess it would play better with pa.ChunkedArray. Regards, Maarten Breddels
[NIGHTLY] Arrow Build Report for Job nightly-2019-11-26-0
Arrow Build Report for Job nightly-2019-11-26-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0 Failed Tasks: - test-conda-python-2.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-2.7-pandas-master - test-conda-python-3.7-dask-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-dask-latest - test-conda-python-3.7-dask-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-dask-master - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-hdfs-2.9.2 - test-conda-python-3.7-pandas-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-pandas-latest - test-conda-python-3.7-pandas-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-pandas-master - test-conda-python-3.7-spark-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-spark-master - test-conda-python-3.7-turbodbc-latest: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-turbodbc-latest - test-conda-python-3.7-turbodbc-master: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7-turbodbc-master - test-conda-python-3.7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-conda-python-3.7 - test-debian-10-rust-nightly-2019-09-25: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-debian-10-rust-nightly-2019-09-25 - test-ubuntu-14.04-cpp: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-circle-test-ubuntu-14.04-cpp - wheel-manylinux1-cp27m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp27m - wheel-manylinux1-cp27mu: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp27mu - wheel-manylinux1-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp35m - wheel-manylinux1-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp36m - wheel-manylinux1-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux1-cp37m - wheel-manylinux2010-cp27m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp27m - wheel-manylinux2010-cp27mu: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp27mu - wheel-manylinux2010-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp35m - wheel-manylinux2010-cp36m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp36m - wheel-manylinux2010-cp37m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-manylinux2010-cp37m - wheel-osx-cp35m: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-travis-wheel-osx-cp35m Succeeded Tasks: - centos-6: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-centos-6 - centos-7: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-centos-7 - centos-8: URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-centos-8 - conda-linux-gcc-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-linux-gcc-py27 - conda-linux-gcc-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-linux-gcc-py36 - conda-linux-gcc-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-linux-gcc-py37 - conda-osx-clang-py27: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-osx-clang-py27 - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-osx-clang-py36 - conda-osx-clang-py37: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-26-0-azure-conda-osx-clang-py37 - conda-win-vs2015-py36: URL:
Datasets and Java
Hi all, Recently the datasets API has been improved a lot and I found some of the new features very useful for my own work. For example, to me an important one is the fix of ARROW-6952[1]. And as I currently work on Java/Scala projects like Spark, I am now investigating a way to call some of the datasets APIs in Java so that I could gain a performance improvement from native dataset filters/projectors. Meanwhile, I am also interested in the ability to scan different data sources provided by the dataset API. Regarding using datasets in Java, my initial idea is to port (by writing Java-version implementations) some of the high-level concepts in Java, such as DataSourceDiscovery/DataSet/Scanner/FileFormat, then create and call lower-level record batch iterators via JNI. This way we seem to retain the performance advantages of the C++ dataset code. Is anyone else interested in this topic? Or is this something already on the development plan? Any feedback or thoughts would be much appreciated. Best, Hongze [1] https://issues.apache.org/jira/browse/ARROW-6952
[jira] [Created] (ARROW-7264) [Java] RangeEqualsVisitor type check is not correct
Ji Liu created ARROW-7264: - Summary: [Java] RangeEqualsVisitor type check is not correct Key: ARROW-7264 URL: https://issues.apache.org/jira/browse/ARROW-7264 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 0.15.1 Reporter: Ji Liu Assignee: Ji Liu Currently {{RangeEqualsVisitor}} generally checks the type only once and keeps the result to avoid repeated type checking, see {code:java} typeCompareResult = left.getField().getType().equals(right.getField().getType()); {code} This only compares {{ArrowType}}, and for complex types this may cause unexpected behavior: for example, two {{List}} vectors with different child types would compare as type-equal because their child fields are not considered. We should compare {{Field}} here instead, and to make it more extensible, we use {{TypeEqualsVisitor}} to compare fields; this way, one can choose whether to also check names or metadata. Also provide a test for ListVector to validate this change.
Re: [DISCUSS][C++/Python] Bazel example
Hi Micah, On 26/11/2019 at 05:52, Micah Kornfield wrote: > > After going through this exercise I put together a list of pros and cons > below. > > I would like to hear from other devs: > 1. Their opinions on setting this up as an alternative system (I'm willing > to invest some more time in it). > 2. What people think the minimum bar for merging a PR like this should be? My question would be: what happens after the PR is merged? Are developers supposed to keep the Bazel setup working in addition to CMake? Or is there a dedicated maintainer (you? :-)) to fix regressions when they happen? > Pros: > 1. Being able to run "bazel test python/..." and having compilation of all > python dependencies just work is a nice experience. > 2. Because of the granular compilation units, it can improve developer > velocity. Unit tests can depend only on the sub-components they are meant > to test. They don't need to compile and relink arrow.so. > 3. The built-in documentation it provides about visibility and > relationships between components is nice (it's uncovered some "interesting > dependencies"). I didn't make heavy use of it, but its concept of > "visibility" makes things more explicit about what external consumers > should be depending on, and what inter-project components should depend on > (e.g. explicitly limit the scope of vendored code). > 4. Extensions are essentially python, which might be easier to work with > than CMake Those sound nice. > Cons: > 1. Bazel is opinionated on C++ layout. In particular it requires some > workarounds to deal with circular .h/.cc dependencies. The two main ways > of doing this are either increasing the size of compilable units [4] to > span all dependencies in the cycle, or creating separate > header/implementation targets; I've used both strategies in the PR. One > could argue that it would be nice to reduce circular dependencies in > general. Can you give an example of circular dependency?
Can this be solved by having more "type_fwd.h" headers for forward declarations of opaque types? (Also, generally, it would be desirable to use more of these, since our compile times have become egregious as of late - I'm currently considering replacing my 8-core desktop CPU with a beefier one :-/) > 4. It is more verbose to configure than CMake (each compilation unit needs > to be spelled out with dependencies). This really sounds like a bummer. Do you have to spell those out by hand? Or is there some tool that infers dependencies and generates the declarations for you? Regards Antoine.
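For readers unfamiliar with Bazel, the granularity and visibility being discussed look roughly like this hypothetical BUILD fragment (illustrative target names, not Arrow's actual layout):

```starlark
# One target per component; a unit test depends only on what it tests,
# so editing buffer.cc relinks this test without rebuilding arrow.so.
cc_library(
    name = "buffer",
    srcs = ["buffer.cc"],
    hdrs = ["buffer.h"],
    # visibility documents which packages may depend on this component
    visibility = ["//cpp/src/arrow:__subpackages__"],
)

cc_test(
    name = "buffer_test",
    srcs = ["buffer_test.cc"],
    deps = [":buffer"],
)
```

This also illustrates the verbosity cost Micah mentions: every target's sources and dependencies are spelled out explicitly rather than inferred.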
Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)
+1 (binding) In "[VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)" on Wed, 20 Nov 2019 20:41:57 -0800, Micah Kornfield wrote: > Hello, > As discussed on [1], I've proposed clarifications in a PR [2] that > clarifies: > > 1. It is not required that all dictionary batches occur at the beginning > of the IPC stream format (if the first record batch has an all-null > dictionary-encoded column, the null column's dictionary might not be sent > until later in the stream). > > 2. A second dictionary batch for the same ID that is not a "delta batch" > in an IPC stream indicates the dictionary should be replaced. > > 3. Clarifies that the file format can only contain 1 "NON-delta" > dictionary batch and multiple "delta" dictionary batches. Dictionary > replacement is not supported in the file format. > > 4. Add an enum to dictionary metadata for possible future changes in what > format dictionary batches can be sent (the most likely would be an array > Map). An enum is needed as a placeholder to allow for forward > compatibility past the release 1.0.0. > > If accepted there will be work in all implementations to make sure that > they cover the edge cases highlighted, and additional integration testing > will be needed. > > Please vote whether to accept these additions. The vote will be open for at > least 72 hours. > > [ ] +1 Accept these changes to the specification > [ ] +0 > [ ] -1 Do not accept the changes because... > > Thanks, > Micah > > > [1] > https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E > [2] https://github.com/apache/arrow/pull/5585
[jira] [Created] (ARROW-7263) [C++][Gandiva] Implement locate and position functions
Projjal Chanda created ARROW-7263: - Summary: [C++][Gandiva] Implement locate and position functions Key: ARROW-7263 URL: https://issues.apache.org/jira/browse/ARROW-7263 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Projjal Chanda Assignee: Projjal Chanda Add {{int32 locate(utf8, utf8, int32)}} and {{int32 locate(utf8, utf8)}} functions. Same for {{position}}.
[jira] [Created] (ARROW-7262) [C++][Gandiva] Implement replace function in Gandiva
Projjal Chanda created ARROW-7262: - Summary: [C++][Gandiva] Implement replace function in Gandiva Key: ARROW-7262 URL: https://issues.apache.org/jira/browse/ARROW-7262 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Projjal Chanda Assignee: Projjal Chanda Add the _utf8 replace(utf8, utf8, utf8)_ function in Gandiva.
[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type
Joris Van den Bossche created ARROW-7261: Summary: [Python] Python support for fixed size list type Key: ARROW-7261 URL: https://issues.apache.org/jira/browse/ARROW-7261 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is not yet exposed in Python.
Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0
OK. I submitted a pull request: https://github.com/apache/arrow/pull/5901 In "Re: [NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 Nov 2019 21:23:34 -0600, Wes McKinney wrote: > I'd be interested to maintain gcc 4.8 support for a time yet but I'm > interested in the opinions of others > > On Mon, Nov 25, 2019 at 9:14 PM Sutou Kouhei wrote: >> >> > - test-ubuntu-14.04-cpp: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp >> >> Error message: >> >> /arrow/cpp/src/arrow/dataset/filter_test.cc:80:194: error: invalid suffix >> on literal; C++11 requires a space between literal and identifier >> [-Werror=literal-suffix] >> ASSERT_EQ("a"_.ToString(), "a"); >> ^ >> >> It seems that g++ on Ubuntu 14.04 is old. >> I think that we can drop support for Ubuntu 14.04 because >> it reaches EOL: https://ubuntu.com/about/release-cycle >> >> Can we remove this test job? >> >> >> Thanks, >> -- >> kou >> >> In <5ddbd09a.1c69fb81.165b7.f...@mx.google.com> >> "[NIGHTLY] Arrow Build Report for Job nightly-2019-11-25-0" on Mon, 25 Nov >> 2019 05:01:14 -0800 (PST), >> Crossbow wrote: >> >> > >> > Arrow Build Report for Job nightly-2019-11-25-0 >> > >> > All tasks: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0 >> > >> > Failed Tasks: >> > - homebrew-cpp: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-travis-homebrew-cpp >> > - test-conda-python-2.7-pandas-master: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-2.7-pandas-master >> > - test-conda-python-3.7-dask-latest: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-latest >> > - test-conda-python-3.7-dask-master: >> > URL: >> > 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-dask-master >> > - test-conda-python-3.7-hdfs-2.9.2: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-hdfs-2.9.2 >> > - test-conda-python-3.7-pandas-latest: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-latest >> > - test-conda-python-3.7-pandas-master: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-pandas-master >> > - test-conda-python-3.7-spark-master: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-spark-master >> > - test-conda-python-3.7-turbodbc-latest: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-latest >> > - test-conda-python-3.7-turbodbc-master: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7-turbodbc-master >> > - test-conda-python-3.7: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-conda-python-3.7 >> > - test-debian-10-rust-nightly-2019-09-25: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-debian-10-rust-nightly-2019-09-25 >> > - test-ubuntu-14.04-cpp: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-14.04-cpp >> > - test-ubuntu-fuzzit: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-circle-test-ubuntu-fuzzit >> > >> > Succeeded Tasks: >> > - centos-6: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-6 >> > - 
centos-7: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-7 >> > - centos-8: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-centos-8 >> > - conda-linux-gcc-py27: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py27 >> > - conda-linux-gcc-py36: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py36 >> > - conda-linux-gcc-py37: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-linux-gcc-py37 >> > - conda-osx-clang-py27: >> > URL: >> > https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-11-25-0-azure-conda-osx-clang-py27 >> > - conda-osx-clang-py36: >> > URL: >> >
[jira] [Created] (ARROW-7260) [CI] Ubuntu 14.04 test is failed by user defined literal
Kouhei Sutou created ARROW-7260: --- Summary: [CI] Ubuntu 14.04 test is failed by user defined literal Key: ARROW-7260 URL: https://issues.apache.org/jira/browse/ARROW-7260 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Kouhei Sutou Assignee: Kouhei Sutou https://circleci.com/gh/ursa-labs/crossbow/5329 {noformat} /arrow/cpp/src/arrow/dataset/filter_test.cc:80:194: error: invalid suffix on literal; C++11 requires a space between literal and identifier [-Werror=literal-suffix] ASSERT_EQ("a"_.ToString(), "a"); ^ {noformat}