Build system discussion for Arrow (and Orc?)

2018-04-10 Thread Michael Sarahan
Greetings,

I'm emailing after discussion with Wes at AnacondaCon today.  I work for
Anaconda, and I've recently been trying to package Arrow for Anaconda.  The
current CMake configuration seems to strongly impose a vendoring/hermetic
approach.  We find that approach to be difficult to integrate with our
system, which relies on package modularity and basically anti-vendoring.
Specifically, the -Werror flag in Orc made every Arrow build fail on OSX,
and because Arrow vendors Orc through CMake, my only option was to attempt
a build, let it fail, patch -Werror out of Orc, and then rebuild.  I would
like to contribute patches to Arrow's CMake files, and ideally also to
Orc's CMake files, that will allow us to switch between the hermetic
approach and using externally provided dependencies.  Orc would become a
separate package for us, and Arrow would depend on it.  This brings me to
two questions:

1. Is this a welcome change, or should we just carry patches locally?
2. Assuming change is welcome, what is the preferred method for submitting
changes?  GitHub PR(s)?

Best,
Michael


Re: rust using nightly channel

2018-04-10 Thread Renjie Liu
Yes, so maybe we need a conditional compilation method so that the user can
choose.

On Tue, Apr 10, 2018 at 9:42 PM Andy Grove  wrote:

> My opinion is that we should continue to support Rust stable since there
> are users who can only use Arrow if it works with Rust stable.
>
> However, maybe it is possible to provide an API so that users can provide
> their own allocators and in that case they could choose to use nightly?
>
> It's a bit more work for us, but gives users more choice.
>
> Also, SIMD and alloc are both going to be stabilized very soon anyway so we
> might not have to wait too long.
>
> Thanks,
>
> Andy.
>
>
>
> On Tue, Apr 10, 2018 at 4:38 AM, Renjie Liu wrote:
>
> > Hi:
> > Can we use experimental features in nightly channel? There are many useful
> > features that can only be used in nightly channel, e.g. the Alloc api, since
> > arrow requires control over low level primitives such as memory allocation,
> > simd execution, etc.
> >
> >
> > --
> > Liu, Renjie
> > Software Engineer, MVAD
> >
>
-- 
Liu, Renjie
Software Engineer, MVAD


[jira] [Created] (ARROW-2445) Add documentation and make some fields private

2018-04-10 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2445:
-

 Summary: Add documentation and make some fields private
 Key: ARROW-2445
 URL: https://issues.apache.org/jira/browse/ARROW-2445
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.10.0


A first pass at adding rustdoc comments, making some struct fields private, and 
adding accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Tensor column types in arrow

2018-04-10 Thread Leif Walsh
Thanks, I’ll create a jira and google doc. I agree those are the main
questions to iron out.

If there’s a desire to avoid scope creeping this in before 1.0, I think in
parallel I’ll start a conversation with the spark community about using the
existing FixedSizeBinary type plus some custom metadata to provide
serialization for their ML UDTs, and let them know that in the future if
this is added to arrow, they could switch that implementation to use those
arrow types instead.


On Tue, Apr 10, 2018 at 19:18 Wes McKinney  wrote:

> The simplest thing would be to have a "tensor" or "ndarray" type where
> each cell has the same shape. This would amount to adding the current
> "Tensor" Flatbuffers table to the Type union in
>
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>
> The benefit of having each cell having the same shape is that the
> physical representation is FixedSizeBinary.
>
> Some caveats / notes:
>
> * We have a prior unresolved discussion about our approach to logical
> types. I could argue that this might fall into the same bucket of
> logical types. I don't think we should merge any patches related to
> this issue until we resolve that discussion
>
> * Using FixedSizeBinary as the physical representation constrains
> value sizes to 2GB (product of shape) because the FixedSizeBinary
> metadata uses int for the byteWidth. We might consider changing this
> to long (64 bits), but that's a separate discussion
>
> * If we permitted each cell to have a different shape, then we would
> need to use Binary (vs. FixedSizeBinary), which would limit the entire
> size of a column to 2GB of total tensor data. This could be mitigated
> by introducing LargeBinary (64 bit offsets), but this requires
> additional discussion (there is a JIRA about this already from some
> time ago)
>
> Given that we are still falling short of a complete implementation of
> other Arrow types (unions, intervals, fixed size lists), I urge all to
> be deliberate about not piling on more technical debt / format
> implementation shortfall if it can be avoided -- so a solution to this
> might be to have a patch for initial Tensor/Ndarray value support that
> is implemented in Java and/or C++
>
> How about creating a JIRA about this broad topic and creating a Google
> doc with a proposed implementation approach for discussion?
>
> Thanks
> Wes
>
> On Tue, Apr 10, 2018 at 5:48 PM, Li Jin  wrote:
> > What do people think: should "shape" be included as an optional part of
> > the schema metadata or as a required part of the schema itself?
> >
> > I feel having it be required might be too restrictive for interop with
> > other systems.
> >
> > On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh  wrote:
> >
> >> My gut feeling is that such a column type should specify both the shape
> >> and primitive type of all values in the column. I can’t think of a common
> >> use case that requires differently shaped tensors in a single column.
> >>
> >> Can anyone here come up with such a use case?
> >>
> >> If not, I can try to draft a proposal change to the spec that adds these
> >> types. The next question is whether such a change can make it in (with
> >> c++ and java implementations) before 1.0.
> >> On Mon, Apr 9, 2018 at 17:36 Wes McKinney  wrote:
> >>
> >> > > As far as I know, there is an implementation of tensor type in
> >> > > C++/Python already. Should we just finalize the spec and add
> >> > > implementation to Java?
> >> >
> >> > There is nothing specified yet as far as a *column* of
> >> > ndarrays/tensors. We defined Tensor metadata for the purposes of
> >> > IPC/serialization but made no effort to incorporate such data into the
> >> > columnar format.
> >> >
> >> > There are likely many ways to implement a column whose values are
> >> > ndarrays, each cell with its own shape. Whether we would want to
> >> > permit each cell to have a different ndarray cell type is another
> >> > question (i.e. would we want to constrain every cell in a column to
> >> > contain ndarrays of a particular type, like float64)
> >> >
> >> > So there's a couple of questions
> >> >
> >> > * How to represent the data using the columnar format
> >> > * How to incorporate ndarray metadata into columnar schemas
> >> >
> >> > - Wes
> >> >
> >> > On Mon, Apr 9, 2018 at 5:30 PM, Li Jin  wrote:
> >> > > As far as I know, there is an implementation of tensor type in
> >> > > C++/Python already. Should we just finalize the spec and add
> >> > > implementation to Java?
> >> > >
> >> > > On the Spark side, it's probably more complicated as Vector and
> >> > > Matrix are not "first class" types in Spark SQL. Spark ML implements
> >> > > them as UDT (user-defined types) so it's not clear how to make
> >> > > Spark/Arrow converter work with them.
> >> > >
> >> > > I wonder if Bryan and Holden have some more thoughts on that?
> 

Re: Tensor column types in arrow

2018-04-10 Thread Wes McKinney
The simplest thing would be to have a "tensor" or "ndarray" type where
each cell has the same shape. This would amount to adding the current
"Tensor" Flatbuffers table to the Type union in

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194

The benefit of having each cell having the same shape is that the
physical representation is FixedSizeBinary.

Some caveats / notes:

* We have a prior unresolved discussion about our approach to logical
types. I could argue that this might fall into the same bucket of
logical types. I don't think we should merge any patches related to
this issue until we resolve that discussion

* Using FixedSizeBinary as the physical representation constrains
value sizes to 2GB (product of shape) because the FixedSizeBinary
metadata uses int for the byteWidth. We might consider changing this
to long (64 bits), but that's a separate discussion

* If we permitted each cell to have a different shape, then we would
need to use Binary (vs. FixedSizeBinary), which would limit the entire
size of a column to 2GB of total tensor data. This could be mitigated
by introducing LargeBinary (64 bit offsets), but this requires
additional discussion (there is a JIRA about this already from some
time ago)
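
To make the 2GB limit on a single cell concrete, here is a small arithmetic
sketch in Python. It assumes only what is stated above, namely that byteWidth
is a signed 32-bit int; the function names are made up for illustration and
are not part of any Arrow API.

```python
import math

# Largest value a signed 32-bit byteWidth can hold (assumption from the text).
INT32_MAX = 2**31 - 1


def tensor_byte_width(shape, item_size):
    """Bytes needed for one cell: product of the shape times element size."""
    return math.prod(shape) * item_size


def fits_fixed_size_binary(shape, item_size):
    """True if one cell of this shape fits in a 32-bit byteWidth."""
    return tensor_byte_width(shape, item_size) <= INT32_MAX


# A 1024 x 1024 float64 cell needs 8388608 bytes (8 MiB) and fits.
small = tensor_byte_width((1024, 1024), 8)

# A 16384 x 16384 float64 cell needs 2147483648 bytes (2**31), one byte
# more than a signed 32-bit int can express, so it does not fit.
large = tensor_byte_width((16384, 16384), 8)
```

With a 64-bit byteWidth the second case would be representable, which is the
change floated above as a separate discussion.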

Given that we are still falling short of a complete implementation of
other Arrow types (unions, intervals, fixed size lists), I urge all to
be deliberate about not piling on more technical debt / format
implementation shortfall if it can be avoided -- so a solution to this
might be to have a patch for initial Tensor/Ndarray value support that
is implemented in Java and/or C++

How about creating a JIRA about this broad topic and creating a Google
doc with a proposed implementation approach for discussion?

Thanks
Wes

On Tue, Apr 10, 2018 at 5:48 PM, Li Jin  wrote:
> What do people think: should "shape" be included as an optional part of
> the schema metadata or as a required part of the schema itself?
>
> I feel having it be required might be too restrictive for interop with
> other systems.
>
> On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh  wrote:
>
>> My gut feeling is that such a column type should specify both the shape and
>> primitive type of all values in the column. I can’t think of a common use
>> case that requires differently shaped tensors in a single column.
>>
>> Can anyone here come up with such a use case?
>>
>> If not, I can try to draft a proposal change to the spec that adds these
>> types. The next question is whether such a change can make it in (with c++
>> and java implementations) before 1.0.
>> On Mon, Apr 9, 2018 at 17:36 Wes McKinney  wrote:
>>
>> > > As far as I know, there is an implementation of tensor type in
>> > > C++/Python already. Should we just finalize the spec and add
>> > > implementation to Java?
>> >
>> > There is nothing specified yet as far as a *column* of
>> > ndarrays/tensors. We defined Tensor metadata for the purposes of
>> > IPC/serialization but made no effort to incorporate such data into the
>> > columnar format.
>> >
>> > There are likely many ways to implement a column whose values are
>> > ndarrays, each cell with its own shape. Whether we would want to
>> > permit each cell to have a different ndarray cell type is another
>> > question (i.e. would we want to constrain every cell in a column to
>> > contain ndarrays of a particular type, like float64)
>> >
>> > So there's a couple of questions
>> >
>> > * How to represent the data using the columnar format
>> > * How to incorporate ndarray metadata into columnar schemas
>> >
>> > - Wes
>> >
>> > On Mon, Apr 9, 2018 at 5:30 PM, Li Jin  wrote:
>> > > As far as I know, there is an implementation of tensor type in
>> > > C++/Python already. Should we just finalize the spec and add
>> > > implementation to Java?
>> > >
>> > > On the Spark side, it's probably more complicated as Vector and Matrix
>> > > are not "first class" types in Spark SQL. Spark ML implements them as
>> > > UDT (user-defined types) so it's not clear how to make Spark/Arrow
>> > > converter work with them.
>> > >
>> > > I wonder if Bryan and Holden have some more thoughts on that?
>> > >
>> > > Li
>> > >
>> > > On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> I’ve been doing some work lately with Spark’s ML interfaces, which
>> > >> include sparse and dense Vector and Matrix types, backed on the Scala
>> > >> side by Breeze. Using these interfaces, you can construct DataFrames
>> > >> whose column types are vectors and matrices, and though the API isn’t
>> > >> terribly rich, it is possible to run Python UDFs over such a DataFrame
>> > >> and get numpy ndarrays out of each row. However, if you’re using
>> > >> Spark’s Arrow serialization between the executor and Python workers,
>> > >> you get this UnsupportedOperationException:
>> > >> 

[jira] [Created] (ARROW-2443) [Python] Conversion from pandas of empty categorical fails with ArrowInvalid

2018-04-10 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-2443:
-

 Summary: [Python] Conversion from pandas of empty categorical 
fails with ArrowInvalid
 Key: ARROW-2443
 URL: https://issues.apache.org/jira/browse/ARROW-2443
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Florian Jetter


Converting an empty pandas categorical to an Arrow table raises an exception. 
Before version `0.9.0` this worked:
{code:python}
import pandas as pd
import pyarrow as pa
pa.Table.from_pandas(pd.DataFrame({'cat': pd.Categorical([])}))
{code}
In `0.9.0` it raises:

{{ArrowInvalid: Dictionary indices must have non-zero length}}





Re: Tasks for upcoming Hackathons & Sprints

2018-04-10 Thread Antoine Pitrou

On 10/04/2018 at 15:43, Uwe L. Korn wrote:
> Seems like I'm not allowed to make public filters. I will ask INFRA about 
> what I can do.
> 
> You'll find the results by querying for `labels = beginner AND project = 
> Arrow AND status = open` in JIRA.

Yes, I've added a couple of beginner tickets.

Regards

Antoine.


> 
> Uwe
> 
> On Tue, Apr 10, 2018, at 3:33 PM, Antoine Pitrou wrote:
>>
>> Hi Uwe,
>>
>> On Mon, 09 Apr 2018 17:28:50 +0200
>> "Uwe L. Korn"  wrote:
>>>
>>> To get people on board and have things they can work on, I'm collecting 
>>> possible tasks. To make these tasks visible, we should flag simple things 
>>> that everyone could work on with a "beginner" label in JIRA so they appear 
>>> in https://issues.apache.org/jira/issues/?filter=12343593
>>
>> That link doesn't work.
>>
>> Regards
>>
>> Antoine.


Re: Tasks for upcoming Hackathons & Sprints

2018-04-10 Thread Uwe L. Korn
Seems like I'm not allowed to make public filters. I will ask INFRA about what 
I can do.

You'll find the results by querying for `labels = beginner AND project = Arrow 
AND status = open` in JIRA.
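
For reference, the query above can be turned into a shareable link. This is a
minimal sketch, assuming the JIRA instance accepts the common `/issues/?jql=`
search URL form; that URL form and the helper name are assumptions and were
not confirmed in this thread.

```python
from urllib.parse import urlencode

# Base search URL; the "?jql=" form is an assumption, see the note above.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"

# The query exactly as given in the message.
JQL = "labels = beginner AND project = Arrow AND status = open"


def jira_search_url(jql):
    """Return a search link with the JQL string percent-encoded."""
    return JIRA_SEARCH + "?" + urlencode({"jql": jql})


url = jira_search_url(JQL)
print(url)
```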

Uwe

On Tue, Apr 10, 2018, at 3:33 PM, Antoine Pitrou wrote:
> 
> Hi Uwe,
> 
> On Mon, 09 Apr 2018 17:28:50 +0200
> "Uwe L. Korn"  wrote:
> > 
> > To get people on board and have things they can work on, I'm collecting 
> > possible tasks. To make these tasks visible, we should flag simple things 
> > that everyone could work on with a "beginner" label in JIRA so they appear 
> > in https://issues.apache.org/jira/issues/?filter=12343593
> 
> That link doesn't work.
> 
> Regards
> 
> Antoine.


Re: rust using nightly channel

2018-04-10 Thread Andy Grove
My opinion is that we should continue to support Rust stable since there
are users who can only use Arrow if it works with Rust stable.

However, maybe it is possible to provide an API so that users can provide
their own allocators and in that case they could choose to use nightly?

It's a bit more work for us, but gives users more choice.

Also, SIMD and alloc are both going to be stabilized very soon anyway so we
might not have to wait too long.

Thanks,

Andy.



On Tue, Apr 10, 2018 at 4:38 AM, Renjie Liu  wrote:

> Hi:
> Can we use experimental features in nightly channel? There are many useful
> features that can only be used in nightly channel, e.g. the Alloc api, since
> arrow requires control over low level primitives such as memory allocation,
> simd execution, etc.
>
>
> --
> Liu, Renjie
> Software Engineer, MVAD
>


Re: Tasks for upcoming Hackathons & Sprints

2018-04-10 Thread Antoine Pitrou

Hi Uwe,

On Mon, 09 Apr 2018 17:28:50 +0200
"Uwe L. Korn"  wrote:
> 
> To get people on board and have things they can work on, I'm collecting 
> possible tasks. To make these tasks visible, we should flag simple things 
> that everyone could work on with a "beginner" label in JIRA so they appear in 
> https://issues.apache.org/jira/issues/?filter=12343593

That link doesn't work.

Regards

Antoine.


[jira] [Created] (ARROW-2442) [C++] Disambiguate Builder::Append overloads

2018-04-10 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2442:
-

 Summary: [C++] Disambiguate Builder::Append overloads
 Key: ARROW-2442
 URL: https://issues.apache.org/jira/browse/ARROW-2442
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.9.0
Reporter: Antoine Pitrou


See discussion in 
[https://github.com/apache/arrow/pull/1852#discussion_r179919627]

There are various {{Append()}} overloads in Builder and subclasses, some of 
which append one value, some of which append multiple values at once.

The API might be clearer and less error-prone if multiple-append variants were 
named differently, for example {{AppendValues()}}. Especially with the 
pointer-taking variants, it's probably easy to call the wrong overload by 
mistake.

The existing methods would have to go through a deprecation cycle.
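
The naming split and the deprecation cycle can be sketched as follows. This is
an illustration in Python, not the Arrow C++ API: the class, the method names
and the `length` parameter are hypothetical stand-ins for a single-value
`Append()`, a multi-value `AppendValues()`, and the old pointer-plus-length
overload.

```python
import warnings


class ToyBuilder:
    """Hypothetical builder used only to illustrate the renaming."""

    def __init__(self):
        self.values = []

    def append_values(self, values):
        # Multi-value append under its own, unambiguous name.
        self.values.extend(values)

    def append(self, value, length=None):
        if length is not None:
            # Legacy multi-value form: still works during the deprecation
            # cycle, but warns and forwards to the new name.
            warnings.warn(
                "append(values, length) is deprecated; use append_values()",
                DeprecationWarning,
                stacklevel=2,
            )
            self.append_values(value[:length])
            return
        # Single-value append: always adds exactly one element.
        self.values.append(value)


builder = ToyBuilder()
builder.append(1)
builder.append_values([2, 3])
```

Once the deprecation period ends, the `length` branch would be removed and
`append()` would only ever add one value.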





[jira] [Created] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict

2018-04-10 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2441:
-

 Summary: [Rust] Builder::slice_mut assertions are too strict
 Key: ARROW-2441
 URL: https://issues.apache.org/jira/browse/ARROW-2441
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Andy Grove
 Fix For: 0.10.0


The assertions only allow slicing up to the builder's length, rather than up to 
its capacity.
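
The difference between the two bounds can be sketched with a toy builder. This
is a Python illustration under stated assumptions, not the Rust
implementation: the class and its `set_len` helper are hypothetical, and only
the length-versus-capacity distinction is taken from the report.

```python
class ToyBufferBuilder:
    """Hypothetical fixed-capacity byte builder; not the Arrow Rust API."""

    def __init__(self, capacity):
        self._buf = bytearray(capacity)  # storage is allocated up front
        self.len = 0                     # number of bytes considered valid

    @property
    def capacity(self):
        return len(self._buf)

    def slice_mut(self, start, end):
        # Bound the slice by capacity, not by len, so callers can write
        # into space that is allocated but not yet marked as used.
        if not 0 <= start <= end <= self.capacity:
            raise IndexError("slice exceeds builder capacity")
        return memoryview(self._buf)[start:end]

    def set_len(self, new_len):
        if not 0 <= new_len <= self.capacity:
            raise IndexError("length exceeds builder capacity")
        self.len = new_len


builder = ToyBufferBuilder(capacity=8)
view = builder.slice_mut(0, 4)  # a len-based bound (len is 0) would reject this
view[:] = b"abcd"
builder.set_len(4)
```

With the stricter length-based check, a freshly created builder could never
hand out a writable slice, which is the behaviour the report describes as too
strict.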





[jira] [Created] (ARROW-2440) [Rust] Implement ListBuilder

2018-04-10 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2440:
-

 Summary: [Rust] Implement ListBuilder
 Key: ARROW-2440
 URL: https://issues.apache.org/jira/browse/ARROW-2440
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.10.0


Implement ListBuilder





rust using nightly channel

2018-04-10 Thread Renjie Liu
Hi:
Can we use experimental features in nightly channel? There are many useful
features that can only be used in nightly channel, e.g. the Alloc api, since
arrow requires control over low level primitives such as memory allocation,
simd execution, etc.


-- 
Liu, Renjie
Software Engineer, MVAD


Re: Allowing every JIRA user to assign issues to themselves

2018-04-10 Thread Wes McKinney
Hi Uwe,

I believe projects have had problems with spam in the past, but we could
give it a shot and disable if there is spam.

Wes

On Tue, Apr 10, 2018 at 5:19 AM Uwe L. Korn  wrote:

> Hello all,
>
> we currently have many new contributors. This is very exciting, but a trap
> that catches every new contributor is that they cannot assign issues to
> themselves and must first be added to the contributor role by a PMC member.
>
> Would it be OK with everyone if we gave contributor permission to everyone
> on the Arrow JIRA project?
>
> Uwe
>


Re: Rust Arrow status and plans for this week

2018-04-10 Thread Renjie Liu
Hello Uwe:
My JIRA id is liurenjie1024 and it seems that I have been given contributor
permission.

On Tue, Apr 10, 2018 at 3:00 PM Uwe L. Korn  wrote:

> Hello Andy,
>
> this is very exciting. Once we have basic documentation, we should have a
> look at streamlining the release process in the ASF infrastructure so that
> making releases is straightforward. We have a small collection of scripts
> to do this for the main release and the JS release that we should be able
> to adapt to the Rust part of the project. I could simply create the
> respective JIRAs for that, or we could first have a short chat about the
> ASF release process.
>
> > My next area of interest personally is the IPC mechanism and interop
> > testing with other languages, especially Java.
>
> This is a very important step for all our implementations. We have an
> integration test setup in
> https://github.com/apache/arrow/tree/master/integration where we test the
> compatibility of all Arrow implementations with each other to verify that
> they all have the same understanding of the data structures.
>
> Uwe
>
> On Mon, Apr 9, 2018, at 3:26 PM, Andy Grove wrote:
> > Over the weekend I added preliminary Parquet support to DataFusion (it
> > only supports int/float primitives and UTF8 so far). This was possible
> > due to the great work happening with the parquet-rs crate.
> >
> > Integrating this with the current Rust version of Arrow was simple and I
> > have now started running benchmarks (and we now have some benchmark code
> > checked into the Arrow project).
> >
> > Now that the basic functionality is stable enough to support this use
> > case I am going to focus on quality this week and start improving unit
> > tests and adding documentation.
> >
> > I think we might be at the point where it makes sense to start
> > discussing a first official release and maybe a roadmap for the Rust
> > library?
> >
> > My next area of interest personally is the IPC mechanism and interop
> > testing with other languages, especially Java.
> >
> > Thanks,
> >
> > Andy.
>
-- 
Liu, Renjie
Software Engineer, MVAD


Allowing every JIRA user to assign issues to themselves

2018-04-10 Thread Uwe L. Korn
Hello all,

we currently have many new contributors. This is very exciting, but a trap that 
catches every new contributor is that they cannot assign issues to themselves 
and must first be added to the contributor role by a PMC member.

Would it be OK with everyone if we gave contributor permission to everyone on 
the Arrow JIRA project?

Uwe


[jira] [Created] (ARROW-2439) [Rust] Run license header checks also in Rust CI entry

2018-04-10 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2439:
--

 Summary: [Rust] Run license header checks also in Rust CI entry
 Key: ARROW-2439
 URL: https://issues.apache.org/jira/browse/ARROW-2439
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Uwe L. Korn
 Fix For: 0.10.0


Currently we only audit license headers in the C++ builds. We should also do 
this in the Rust Travis entry. The overhead for them is so minimal that we can 
do it twice.





[jira] [Created] (ARROW-2438) [Rust] memory_pool.rs misses license header

2018-04-10 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2438:
--

 Summary: [Rust] memory_pool.rs misses license header
 Key: ARROW-2438
 URL: https://issues.apache.org/jira/browse/ARROW-2438
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Uwe L. Korn
 Fix For: 0.10.0


Travis output:

{code}
NOT APPROVED: rust/src/memory_pool.rs (apache-arrow/rust/src/memory_pool.rs): 
false
1 unapproved licences. Check rat report: rat.txt
{code}





[jira] [Created] (ARROW-2437) [C++] Change of arrow::ipc::ReadMessage signature breaks ABI compatibility

2018-04-10 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2437:
--

 Summary: [C++] Change of arrow::ipc::ReadMessage signature breaks 
ABI compatibility
 Key: ARROW-2437
 URL: https://issues.apache.org/jira/browse/ARROW-2437
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Uwe L. Korn
 Fix For: 0.9.1


We changed the signature of the method from
{code}
ReadMessage ( arrow::io::InputStream* file, std::unique_ptr<Message>* message ) 
{code}
to
{code}
ReadMessage ( arrow::io::InputStream* file, std::unique_ptr<Message>* message, bool aligned ) 
{code}

We should add the old signature back so that the 0.9.1 release is ABI 
compatible with 0.9.0.





Re: Rust Arrow status and plans for this week

2018-04-10 Thread Uwe L. Korn
Hello Andy,

this is very exciting. Once we have basic documentation, we should have a look 
at streamlining the release process in the ASF infrastructure so that making 
releases is straightforward. We have a small collection of scripts to do this 
for the main release and the JS release that we should be able to adapt to the 
Rust part of the project. I could simply create the respective JIRAs for that, 
or we could first have a short chat about the ASF release process.

> My next area of interest personally is the IPC mechanism and interop
> testing with other languages, especially Java.

This is a very important step for all our implementations. We have an 
integration test setup in 
https://github.com/apache/arrow/tree/master/integration where we test the 
compatibility of all Arrow implementations with each other to verify that they 
all have the same understanding of the data structures.

Uwe

On Mon, Apr 9, 2018, at 3:26 PM, Andy Grove wrote:
> Over the weekend I added preliminary Parquet support to DataFusion (it only
> supports int/float primitives and UTF8 so far). This was possible due to
> the great work happening with the parquet-rs crate.
> 
> Integrating this with the current Rust version of Arrow was simple and I
> have now started running benchmarks (and we now have some benchmark code
> checked into the Arrow project).
> 
> Now that the basic functionality is stable enough to support this use case
> I am going to focus on quality this week and start improving unit tests and
> adding documentation.
> 
> I think we might be at the point where it makes sense to start discussing a
> first official release and maybe a roadmap for the Rust library?
> 
> My next area of interest personally is the IPC mechanism and interop
> testing with other languages, especially Java.
> 
> Thanks,
> 
> Andy.


Re: Rust Arrow status and plans for this week

2018-04-10 Thread Uwe L. Korn
Hello Renjie,

I can give you contributor permissions on JIRA so you can assign issues to 
yourself. I would need to know your JIRA id for that.

Code contributions happen via pull requests on GitHub. Just fork the project, 
create a new branch and, once it's ready, open a pull request against the main 
arrow repository.

Cheers
Uwe

On Tue, Apr 10, 2018, at 4:38 AM, Renjie Liu wrote:
> Cool!
> I'm also trying to use arrow-rs in my project and would like to contribute
> to arrow-rs, can anybody give me contributor permission?
> 
> On Tue, Apr 10, 2018 at 10:31 AM Jacques Nadeau  wrote:
> 
> > Super cool, congrats on the progress!
> >
> > The IPC/interop is top priority for me as well.
> >
> > On Mon, Apr 9, 2018 at 6:26 AM, Andy Grove  wrote:
> >
> > > Over the weekend I added preliminary Parquet support to DataFusion (it
> > > only supports int/float primitives and UTF8 so far). This was possible
> > > due to the great work happening with the parquet-rs crate.
> > >
> > > Integrating this with the current Rust version of Arrow was simple and I
> > > have now started running benchmarks (and we now have some benchmark code
> > > checked into the Arrow project).
> > >
> > > Now that the basic functionality is stable enough to support this use
> > > case I am going to focus on quality this week and start improving unit
> > > tests and adding documentation.
> > >
> > > I think we might be at the point where it makes sense to start
> > > discussing a first official release and maybe a roadmap for the Rust
> > > library?
> > >
> > > My next area of interest personally is the IPC mechanism and interop
> > > testing with other languages, especially Java.
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> >
> -- 
> Liu, Renjie
> Software Engineer, MVAD