Build system discussion for Arrow (and Orc?)
Greetings, I'm emailing after discussion with Wes at AnacondaCon today. I work for Anaconda, and I've recently been trying to package Arrow for Anaconda. The current CMake configuration seems to strongly impose a vendoring/hermetic approach. We find that approach difficult to integrate with our system, which relies on package modularity and is essentially anti-vendoring. Specifically, Orc imposing -Werror led to consistently failing Arrow builds on OSX, and because Arrow vendored Orc through CMake, my only option was to attempt a build, let it fail, patch -Werror out of Orc, and rebuild. I would like to contribute patches to Arrow's CMake files, and ideally also to Orc's CMake files, that will allow us to switch between the hermetic approach and using externally provided dependencies. Orc would become a separate package for us, and Arrow would depend on it. This brings me to two questions: 1. Is this a welcome change, or should we just carry patches locally? 2. Assuming the change is welcome, what is the preferred method for submitting changes? Github PR(s)? Best, Michael
Re: rust using nightly channel
Yes, so maybe we need a conditional compilation method so that the user can choose. On Tue, Apr 10, 2018 at 9:42 PM Andy Grove wrote: > My opinion is that we should continue to support Rust stable since there > are users who can only use Arrow if it works with Rust stable. > > However, maybe it is possible to provide an API so that users can provide > their own allocators and in that case they could choose to use nightly? > > It's a bit more work for us, but gives users more choice. > > Also, SIMD and alloc are both going to be stabilized very soon anyway so we > might not have to wait too long. > > Thanks, > > Andy. > > > > On Tue, Apr 10, 2018 at 4:38 AM, Renjie Liu > wrote: > > > Hi: > > Can we use experimental features in the nightly channel? There are many > useful > > features that can only be used in the nightly channel, e.g. the Alloc API, > since > > arrow requires control over low level primitives such as memory > allocation, > > simd execution, etc. > > > > > > -- > > Liu, Renjie > > Software Engineer, MVAD > > > -- Liu, Renjie Software Engineer, MVAD
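A rough sketch of the conditional-compilation idea proposed above: gate the nightly-only allocation path behind a Cargo feature so stable users get a fallback. The feature name `nightly_alloc` and the function are hypothetical illustrations, not actual Arrow code.

```rust
// Assumes a hypothetical Cargo feature declared in Cargo.toml as:
//   [features]
//   nightly_alloc = []

#[cfg(feature = "nightly_alloc")]
fn alloc_buffer(len: usize) -> Vec<u8> {
    // A nightly build could route through the unstable `Alloc` API
    // here for custom aligned allocation (elided in this sketch).
    vec![0u8; len]
}

#[cfg(not(feature = "nightly_alloc"))]
fn alloc_buffer(len: usize) -> Vec<u8> {
    // Stable fallback: plain Vec allocation.
    vec![0u8; len]
}

fn main() {
    let buf = alloc_buffer(64);
    assert_eq!(buf.len(), 64);
    println!("allocated {} bytes", buf.len());
}
```

Stable users would build with default features; nightly users would opt in with `cargo build --features nightly_alloc`.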
[jira] [Created] (ARROW-2445) Add documentation and make some fields private
Andy Grove created ARROW-2445: - Summary: Add documentation and make some fields private Key: ARROW-2445 URL: https://issues.apache.org/jira/browse/ARROW-2445 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 0.10.0 A first pass at adding rustdoc comments, making some struct fields private, and adding accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
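The pattern this ticket describes might look like the following illustrative sketch; the `Field` struct and its accessors here are hypothetical, not Arrow's actual definitions.

```rust
/// Hypothetical example of the change described in ARROW-2445:
/// struct fields made private, exposed via documented accessors.
pub struct Field {
    name: String,   // previously `pub`
    nullable: bool, // previously `pub`
}

impl Field {
    pub fn new(name: &str, nullable: bool) -> Self {
        Field { name: name.to_string(), nullable }
    }

    /// Returns the field's name.
    pub fn name(&self) -> &str {
        &self.name
    }

    /// Returns whether the field may contain null values.
    pub fn nullable(&self) -> bool {
        self.nullable
    }
}

fn main() {
    let f = Field::new("id", false);
    assert_eq!(f.name(), "id");
    assert!(!f.nullable());
}
```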
Re: Tensor column types in arrow
Thanks, I’ll create a jira and google doc. I agree those are the main questions to iron out. If there’s a desire to avoid scope creeping this in before 1.0, I think in parallel I’ll start a conversation with the spark community about using the existing FixedSizeBinary type plus some custom metadata to provide serialization for their ML UDTs, and let them know that in the future if this is added to arrow, they could switch that implementation to use those arrow types instead. On Tue, Apr 10, 2018 at 19:18 Wes McKinney wrote: > The simplest thing would be to have a "tensor" or "ndarray" type where > each cell has the same shape. This would amount to adding the current > "Tensor" Flatbuffers table to the Type union in > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194 > > The benefit of each cell having the same shape is that the > physical representation is FixedSizeBinary. > > Some caveats / notes: > > * We have a prior unresolved discussion about our approach to logical > types. I could argue that this might fall into the same bucket of > logical types. I don't think we should merge any patches related to > this issue until we resolve that discussion > > * Using FixedSizeBinary as the physical representation constrains > value sizes to 2GB (product of shape) because the FixedSizeBinary > metadata uses int for the byteWidth. We might consider changing this > to long (64 bits), but that's a separate discussion > > * If we permitted each cell to have a different shape, then we would > need to use Binary (vs. FixedSizeBinary), which would limit the entire > size of a column to 2GB of total tensor data. 
This could be mitigated > by introducing LargeBinary (64 bit offsets), but this requires > additional discussion (there is a JIRA about this already from some > time ago) > > Given that we are still falling short of a complete implementation of > other Arrow types (unions, intervals, fixed size lists), I urge all to > be deliberate about not piling on more technical debt / format > implementation shortfall if it can be avoided -- so a solution to this > might be to have a patch for initial Tensor/Ndarray value support that > is implemented in Java and/or C++ > > How about creating a JIRA about this broad topic and creating a Google > doc with a proposed implementation approach for discussion? > > Thanks > Wes > > On Tue, Apr 10, 2018 at 5:48 PM, Li Jin wrote: > > What do people think whether "shape" should be included as a optional > part > > of schema metadata or a required part of the schema itself? > > > > I feel having it be required might be too restrictive for interop with > > other systems. > > > > On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh wrote: > > > >> My gut feeling is that such a column type should specify both the shape > and > >> primitive type of all values in the column. I can’t think of a common > use > >> case that requires differently shaped tensors in a single column. > >> > >> Can anyone here come up with such a use case? > >> > >> If not, I can try to draft a proposal change to the spec that adds these > >> types. The next question is whether such a change can make it in (with > c++ > >> and java implementations) before 1.0. > >> On Mon, Apr 9, 2018 at 17:36 Wes McKinney wrote: > >> > >> > > As far as I know, there is an implementation of tensor type in > >> > C++/Python already. Should we just finalize the spec and add > >> implementation > >> > to Java? > >> > > >> > There is nothing specified yet as far as a *column* of > >> > ndarrays/tensors. 
We defined Tensor metadata for the purposes of > >> > IPC/serialization but made no effort to incorporate such data into the > >> > columnar format. > >> > > >> > There are likely many ways to implement column whose values are > >> > ndarrays, each cell with its own shape. Whether we would want to > >> > permit each cell to have a different ndarray cell type is another > >> > question (i.e. would we want to constrain every cell in a column to > >> > contain ndarrays of a particular type, like float64) > >> > > >> > So there's a couple of questions > >> > > >> > * How to represent the data using the columnar format > >> > * How to incorporate ndarray metadata into columnar schemas > >> > > >> > - Wes > >> > > >> > On Mon, Apr 9, 2018 at 5:30 PM, Li Jin wrote: > >> > > As far as I know, there is an implementation of tensor type in > >> C++/Python > >> > > already. Should we just finalize the spec and add implementation to > >> Java? > >> > > > >> > > On the Spark side, it's probably more complicated as Vector and > Matrix > >> > are > >> > > not "first class" types in Spark SQL. Spark ML implements them as > UDT > >> > > (user-defined types) so it's not clear how to make Spark/Arrow > >> converter > >> > > work with them. > >> > > > >> > > I wonder if Bryan and Holden have some more thoughts on that? >
Re: Tensor column types in arrow
The simplest thing would be to have a "tensor" or "ndarray" type where each cell has the same shape. This would amount to adding the current "Tensor" Flatbuffers table to the Type union in https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194 The benefit of each cell having the same shape is that the physical representation is FixedSizeBinary. Some caveats / notes: * We have a prior unresolved discussion about our approach to logical types. I could argue that this might fall into the same bucket of logical types. I don't think we should merge any patches related to this issue until we resolve that discussion * Using FixedSizeBinary as the physical representation constrains value sizes to 2GB (product of shape) because the FixedSizeBinary metadata uses int for the byteWidth. We might consider changing this to long (64 bits), but that's a separate discussion * If we permitted each cell to have a different shape, then we would need to use Binary (vs. FixedSizeBinary), which would limit the entire size of a column to 2GB of total tensor data. This could be mitigated by introducing LargeBinary (64 bit offsets), but this requires additional discussion (there is a JIRA about this already from some time ago) Given that we are still falling short of a complete implementation of other Arrow types (unions, intervals, fixed size lists), I urge all to be deliberate about not piling on more technical debt / format implementation shortfall if it can be avoided -- so a solution to this might be to have a patch for initial Tensor/Ndarray value support that is implemented in Java and/or C++ How about creating a JIRA about this broad topic and creating a Google doc with a proposed implementation approach for discussion? Thanks Wes On Tue, Apr 10, 2018 at 5:48 PM, Li Jin wrote: > What do people think whether "shape" should be included as an optional part > of schema metadata or a required part of the schema itself? 
> > I feel having it be required might be too restrictive for interop with > other systems. > > On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh wrote: > >> My gut feeling is that such a column type should specify both the shape and >> primitive type of all values in the column. I can’t think of a common use >> case that requires differently shaped tensors in a single column. >> >> Can anyone here come up with such a use case? >> >> If not, I can try to draft a proposal change to the spec that adds these >> types. The next question is whether such a change can make it in (with c++ >> and java implementations) before 1.0. >> On Mon, Apr 9, 2018 at 17:36 Wes McKinney wrote: >> >> > > As far as I know, there is an implementation of tensor type in >> > C++/Python already. Should we just finalize the spec and add >> implementation >> > to Java? >> > >> > There is nothing specified yet as far as a *column* of >> > ndarrays/tensors. We defined Tensor metadata for the purposes of >> > IPC/serialization but made no effort to incorporate such data into the >> > columnar format. >> > >> > There are likely many ways to implement column whose values are >> > ndarrays, each cell with its own shape. Whether we would want to >> > permit each cell to have a different ndarray cell type is another >> > question (i.e. would we want to constrain every cell in a column to >> > contain ndarrays of a particular type, like float64) >> > >> > So there's a couple of questions >> > >> > * How to represent the data using the columnar format >> > * How to incorporate ndarray metadata into columnar schemas >> > >> > - Wes >> > >> > On Mon, Apr 9, 2018 at 5:30 PM, Li Jin wrote: >> > > As far as I know, there is an implementation of tensor type in >> C++/Python >> > > already. Should we just finalize the spec and add implementation to >> Java? >> > > >> > > On the Spark side, it's probably more complicated as Vector and Matrix >> > are >> > > not "first class" types in Spark SQL. 
Spark ML implements them as UDT >> > > (user-defined types) so it's not clear how to make Spark/Arrow >> converter >> > > work with them. >> > > >> > > I wonder if Bryan and Holden have some more thoughts on that? >> > > >> > > Li >> > > >> > > On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh >> wrote: >> > > >> > >> Hi all, >> > >> >> > >> I’ve been doing some work lately with Spark’s ML interfaces, which >> > include >> > >> sparse and dense Vector and Matrix types, backed on the Scala side by >> > >> Breeze. Using these interfaces, you can construct DataFrames whose >> > column >> > >> types are vectors and matrices, and though the API isn’t terribly >> rich, >> > it >> > >> is possible to run Python UDFs over such a DataFrame and get numpy >> > ndarrays >> > >> out of each row. However, if you’re using Spark’s Arrow serialization >> > >> between the executor and Python workers, you get this >> > >> UnsupportedOperationException: >> > >>
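The 2GB constraint discussed in this thread follows from FixedSizeBinary storing byteWidth as a 32-bit int; a quick sketch of the arithmetic (the helper function is a hypothetical illustration, not Arrow code):

```rust
// byteWidth for a fixed-shape tensor cell is product(shape) * element
// size; it must fit in FixedSizeBinary's 32-bit int byteWidth field.
fn fixed_size_byte_width(shape: &[u64], elem_size: u64) -> Option<i32> {
    let bytes = shape.iter().product::<u64>() * elem_size;
    if bytes <= i32::max_value() as u64 {
        Some(bytes as i32)
    } else {
        None // exceeds the ~2GB limit of the int byteWidth
    }
}

fn main() {
    // A 1024 x 1024 float64 cell fits easily (8 MiB)...
    assert_eq!(fixed_size_byte_width(&[1024, 1024], 8), Some(8_388_608));
    // ...but a 20000 x 20000 float64 cell (3.2 GB) would overflow.
    assert_eq!(fixed_size_byte_width(&[20000, 20000], 8), None);
}
```

This is also why Wes suggests that widening byteWidth to a 64-bit long would be a separate format discussion.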
[jira] [Created] (ARROW-2443) [Python] Conversion from pandas of empty categorical fails with ArrowInvalid
Florian Jetter created ARROW-2443: - Summary: [Python] Conversion from pandas of empty categorical fails with ArrowInvalid Key: ARROW-2443 URL: https://issues.apache.org/jira/browse/ARROW-2443 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.9.0 Reporter: Florian Jetter The conversion of an empty pandas categorical raises an exception. Before version `0.9.0` this was possible {code:java} import pandas as pd import pyarrow as pa pa.Table.from_pandas(pd.DataFrame({'cat': pd.Categorical([])})){code} raises: {{ArrowInvalid: Dictionary indices must have non-zero length}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Tasks for upcoming Hackathons & Sprints
On 10/04/2018 at 15:43, Uwe L. Korn wrote: > Seems like I'm not allowed to make public filters. I will ask INFRA about > what I can do. > > You'll find the results by querying for `labels = beginner AND project = > Arrow AND status = open` in JIRA. Yes, I've added a couple of beginner tickets. Regards Antoine. > > Uwe > > On Tue, Apr 10, 2018, at 3:33 PM, Antoine Pitrou wrote: >> >> Hi Uwe, >> >> On Mon, 09 Apr 2018 17:28:50 +0200 >> "Uwe L. Korn" wrote: >>> >>> To get people on board and have things they can work on, I'm collecting >>> possible tasks. To make these tasks visible, we should flag simple things >>> that everyone could work on with a "beginner" label in JIRA so they appear >>> in https://issues.apache.org/jira/issues/?filter=12343593 >> >> That link doesn't work. >> >> Regards >> >> Antoine.
Re: Tasks for upcoming Hackathons & Sprints
Seems like I'm not allowed to make public filters. I will ask INFRA about what I can do. You'll find the results by querying for `labels = beginner AND project = Arrow AND status = open` in JIRA. Uwe On Tue, Apr 10, 2018, at 3:33 PM, Antoine Pitrou wrote: > > Hi Uwe, > > On Mon, 09 Apr 2018 17:28:50 +0200 > "Uwe L. Korn" wrote: > > > > To get people on board and have things they can work on, I'm collecting > > possible tasks. To make these tasks visible, we should flag simple things > > that everyone could work on with a "beginner" label in JIRA so they appear > > in https://issues.apache.org/jira/issues/?filter=12343593 > > That link doesn't work. > > Regards > > Antoine.
Re: rust using nightly channel
My opinion is that we should continue to support Rust stable since there are users who can only use Arrow if it works with Rust stable. However, maybe it is possible to provide an API so that users can provide their own allocators and in that case they could choose to use nightly? It's a bit more work for us, but gives users more choice. Also, SIMD and alloc are both going to be stabilized very soon anyway so we might not have to wait too long. Thanks, Andy. On Tue, Apr 10, 2018 at 4:38 AM, Renjie Liu wrote: > Hi: > Can we use experimental features in the nightly channel? There are many useful > features that can only be used in the nightly channel, e.g. the Alloc API, since > arrow requires control over low level primitives such as memory allocation, > simd execution, etc. > > > -- > Liu, Renjie > Software Engineer, MVAD >
Re: Tasks for upcoming Hackathons & Sprints
Hi Uwe, On Mon, 09 Apr 2018 17:28:50 +0200 "Uwe L. Korn" wrote: > > To get people on board and have things they can work on, I'm collecting > possible tasks. To make these tasks visible, we should flag simple things > that everyone could work on with a "beginner" label in JIRA so they appear in > https://issues.apache.org/jira/issues/?filter=12343593 That link doesn't work. Regards Antoine.
[jira] [Created] (ARROW-2442) [C++] Disambiguate Builder::Append overloads
Antoine Pitrou created ARROW-2442: - Summary: [C++] Disambiguate Builder::Append overloads Key: ARROW-2442 URL: https://issues.apache.org/jira/browse/ARROW-2442 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Antoine Pitrou See discussion in [https://github.com/apache/arrow/pull/1852#discussion_r179919627] There are various {{Append()}} overloads in Builder and subclasses, some of which append one value, some of which append multiple values at once. The API might be clearer and less error-prone if multiple-append variants were named differently, for example {{AppendValues()}}. Especially with the pointer-taking variants, it's probably easy to call the wrong overload by mistake. The existing methods would have to go through a deprecation cycle. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict
Andy Grove created ARROW-2441: - Summary: [Rust] Builder::slice_mut assertions are too strict Key: ARROW-2441 URL: https://issues.apache.org/jira/browse/ARROW-2441 Project: Apache Arrow Issue Type: Bug Reporter: Andy Grove Fix For: 0.10.0 The assertions only allow slicing up to the builder length, rather than up to the builder capacity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
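An illustrative sketch of the issue with a toy `Builder` (not the real Arrow Rust type): slicing should be allowed up to the builder's allocated capacity, not just its current length.

```rust
// Toy builder over i32 values; the real Arrow Rust Builder differs.
struct Builder {
    data: Vec<i32>,
}

impl Builder {
    fn with_capacity(capacity: usize) -> Self {
        Builder { data: Vec::with_capacity(capacity) }
    }

    /// Mutable slice into the builder's buffer. The too-strict version
    /// would assert `end <= self.data.len()`; this relaxed version
    /// permits any range within the allocated capacity.
    fn slice_mut(&mut self, start: usize, end: usize) -> &mut [i32] {
        assert!(end <= self.data.capacity(), "slice beyond capacity");
        if end > self.data.len() {
            self.data.resize(end, 0); // zero-initialize up to `end`
        }
        &mut self.data[start..end]
    }
}

fn main() {
    let mut b = Builder::with_capacity(10);
    // Length is 0, but capacity is 10, so this slice should be allowed.
    let s = b.slice_mut(0, 5);
    s[4] = 42;
    assert_eq!(s[4], 42);
}
```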
[jira] [Created] (ARROW-2440) [Rust] Implement ListBuilder
Andy Grove created ARROW-2440: - Summary: [Rust] Implement ListBuilder Key: ARROW-2440 URL: https://issues.apache.org/jira/browse/ARROW-2440 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 0.10.0 Implement ListBuilder -- This message was sent by Atlassian JIRA (v7.6.3#76005)
rust using nightly channel
Hi: Can we use experimental features in the nightly channel? There are many useful features that can only be used in the nightly channel, e.g. the Alloc API, since arrow requires control over low level primitives such as memory allocation, simd execution, etc. -- Liu, Renjie Software Engineer, MVAD
Re: Allowing every JIRA user to assign issues to themselves
Hi Uwe, I believe projects have had problems with spam in the past, but we could give it a shot and disable it if there is spam. Wes On Tue, Apr 10, 2018 at 5:19 AM Uwe L. Korn wrote: > Hello all, > > we currently have many new contributors. This is very exciting, but a trap > that catches every new contributor is that they cannot assign issues by > themselves but must be added to the contributors role by a PMC. > > Would it be ok for all if we give contributor permission to everyone on the > Arrow JIRA project? > > Uwe >
Re: Rust Arrow status and plans for this week
Hello Uwe: My JIRA id is liurenjie1024 and it seems that I have been given contributor permission. On Tue, Apr 10, 2018 at 3:00 PM Uwe L. Korn wrote: > Hello Andy, > > this is very exciting. Once we have basic documentation, we should have a > look at streamlining the release process in the ASF infrastructure so > making releases is straightforward. We have a small collection of scripts > to do this for the main release and the JS release that we should be able > to adapt to the Rust part of the project. I could simply make the > respective JIRAs for that or we have a small chat first about the ASF > release process. > > > My next area of interest personally is the IPC mechanism and interop > > testing with other languages, especially Java. > > This is a very important step for all our implementations. We have an > integration test setup in > https://github.com/apache/arrow/tree/master/integration where we test the > compatibility of all Arrow implementations to each other to verify that > they all have the same understanding of the data structures. > > Uwe > > On Mon, Apr 9, 2018, at 3:26 PM, Andy Grove wrote: > > Over the weekend I added preliminary Parquet support to DataFusion (it > only > > supports int/float primitives and UTF8 so far). This was possible due to > > the great work happening with the parquet-rs crate. > > > > Integrating this with the current Rust version of Arrow was simple and I > > have now started running benchmarks (and we now have some benchmark code > > checked into the Arrow project). > > > > Now that the basic functionality is stable enough to support this use > case > > I am going to focus on quality this week and start improving unit tests > and > > adding documentation. > > > > I think we might be at the point where it makes sense to start > discussing a > > first official release and maybe a roadmap for the Rust library? 
> > > > My next area of interest personally is the IPC mechanism and interop > > testing with other languages, especially Java. > > > > Thanks, > > > > Andy. > -- Liu, Renjie Software Engineer, MVAD
Allowing every JIRA user to assign issues to themselves
Hello all, we currently have many new contributors. This is very exciting, but a trap that catches every new contributor is that they cannot assign issues by themselves but must be added to the contributors role by a PMC. Would it be ok for all if we give contributor permission to everyone on the Arrow JIRA project? Uwe
[jira] [Created] (ARROW-2439) [Rust] Run license header checks also in Rust CI entry
Uwe L. Korn created ARROW-2439: -- Summary: [Rust] Run license header checks also in Rust CI entry Key: ARROW-2439 URL: https://issues.apache.org/jira/browse/ARROW-2439 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Uwe L. Korn Fix For: 0.10.0 Currently we only audit license headers in the C++ builds. We should also do this in the Rust Travis entry. The overhead for them is so minimal that we can do it twice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2438) [Rust] memory_pool.rs misses license header
Uwe L. Korn created ARROW-2438: -- Summary: [Rust] memory_pool.rs misses license header Key: ARROW-2438 URL: https://issues.apache.org/jira/browse/ARROW-2438 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Uwe L. Korn Fix For: 0.10.0 Travis output: {code} NOT APPROVED: rust/src/memory_pool.rs (apache-arrow/rust/src/memory_pool.rs): false 1 unapproved licences. Check rat report: rat.txt {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2437) [C++] Change of arrow::ipc::ReadMessage signature breaks ABI compatibility
Uwe L. Korn created ARROW-2437: -- Summary: [C++] Change of arrow::ipc::ReadMessage signature breaks ABI compatibility Key: ARROW-2437 URL: https://issues.apache.org/jira/browse/ARROW-2437 Project: Apache Arrow Issue Type: Bug Reporter: Uwe L. Korn Fix For: 0.9.1 We changed the signature of the method from {code} ReadMessage ( arrow::io::InputStream* file, std::unique_ptr<Message>* message ) {code} to {code} ReadMessage ( arrow::io::InputStream* file, std::unique_ptr<Message>* message, bool aligned ) {code} We should add the old signature back so that the 0.9.1 release is ABI compatible with 0.9.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Rust Arrow status and plans for this week
Hello Andy, this is very exciting. Once we have basic documentation, we should have a look at streamlining the release process in the ASF infrastructure so making releases is straight-forward. We have a small collection of scripts to do this for the main release and the JS release that we should be able to adapt to the Rust part of the project. I could simply make the respective JIRAs for that or we have a small chat first about the ASF release process. > My next area of interest personally is the IPC mechanism and interop > testing with other languages, especially Java. This is a very important step for all our implementations. We have an integration test setup in https://github.com/apache/arrow/tree/master/integration where we test the compatibility of all Arrow implementations to each other to verify that they all have the same understanding of the data structures. Uwe On Mon, Apr 9, 2018, at 3:26 PM, Andy Grove wrote: > Over the weekend I added preliminary Parquet support to DataFusion (it only > supports int/float primitives and UTF8 so far). This was possible due to > the great work happening with the parquet-rs crate. > > Integrating this with the current Rust version of Arrow was simple and I > have now started running benchmarks (and we now have some benchmark code > checked into the Arrow project). > > Now that the basic functionality is stable enough to support this use case > I am going to focus on quality this week and start improving unit tests and > adding documentation. > > I think we might be at the point where it makes sense to start discussing a > first official release and maybe a roadmap for the Rust library? > > My next area of interest personally is the IPC mechanism and interop > testing with other languages, especially Java. > > Thanks, > > Andy.
Re: Rust Arrow status and plans for this week
Hello Renjie, I can give you contributor permissions on JIRA so you can assign issues to yourself. I would need to know your JIRA id for that. Code contributions happen per pull request on github. Just fork the project, open a new branch and once it's ready: make a pull request to the main arrow repository. Cheers Uwe On Tue, Apr 10, 2018, at 4:38 AM, Renjie Liu wrote: > Cool! > I'm also trying to use arrow-rs in my project and would like to contribute > to arrow-rs, can anybody give me contributor permission? > > On Tue, Apr 10, 2018 at 10:31 AM Jacques Nadeau wrote: > > > Super cool, congrats on the progress! > > > > The IPC/interop is top priority for me as well. > > > > On Mon, Apr 9, 2018 at 6:26 AM, Andy Grove wrote: > > > > > Over the weekend I added preliminary Parquet support to DataFusion (it > > only > > > supports int/float primitives and UTF8 so far). This was possible due to > > > the great work happening with the parquet-rs crate. > > > > > > Integrating this with the current Rust version of Arrow was simple and I > > > have now started running benchmarks (and we now have some benchmark code > > > checked into the Arrow project). > > > > > > Now that the basic functionality is stable enough to support this use > > case > > > I am going to focus on quality this week and start improving unit tests > > and > > > adding documentation. > > > > > > I think we might be at the point where it makes sense to start > > discussing a > > > first official release and maybe a roadmap for the Rust library? > > > > > > My next area of interest personally is the IPC mechanism and interop > > > testing with other languages, especially Java. > > > > > > Thanks, > > > > > > Andy. > > > > > > -- > Liu, Renjie > Software Engineer, MVAD