[jira] [Created] (ARROW-2436) [Rust] Add windows CI

2018-04-09 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-2436:
--

 Summary: [Rust] Add windows CI
 Key: ARROW-2436
 URL: https://issues.apache.org/jira/browse/ARROW-2436
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Paddy Horan








Re: Rust Arrow status and plans for this week

2018-04-09 Thread Renjie Liu
Cool!
I'm also trying to use arrow-rs in my project and would like to contribute
to it. Can anybody give me contributor permission?

On Tue, Apr 10, 2018 at 10:31 AM Jacques Nadeau  wrote:

> Super cool, congrats on the progress!
>
> The IPC/interop is top priority for me as well.
>
> On Mon, Apr 9, 2018 at 6:26 AM, Andy Grove  wrote:
>
> > Over the weekend I added preliminary Parquet support to DataFusion (it
> only
> > supports int/float primitives and UTF8 so far). This was possible due to
> > the great work happening with the parquet-rs crate.
> >
> > Integrating this with the current Rust version of Arrow was simple and I
> > have now started running benchmarks (and we now have some benchmark code
> > checked into the Arrow project).
> >
> > Now that the basic functionality is stable enough to support this use
> case
> > I am going to focus on quality this week and start improving unit tests
> and
> > adding documentation.
> >
> > I think we might be at the point where it makes sense to start
> discussing a
> > first official release and maybe a roadmap for the Rust library?
> >
> > My next area of interest personally is the IPC mechanism and interop
> > testing with other languages, especially Java.
> >
> > Thanks,
> >
> > Andy.
> >
>
-- 
Liu, Renjie
Software Engineer, MVAD


Re: Rust Arrow status and plans for this week

2018-04-09 Thread Jacques Nadeau
Super cool, congrats on the progress!

The IPC/interop is top priority for me as well.

On Mon, Apr 9, 2018 at 6:26 AM, Andy Grove  wrote:

> Over the weekend I added preliminary Parquet support to DataFusion (it only
> supports int/float primitives and UTF8 so far). This was possible due to
> the great work happening with the parquet-rs crate.
>
> Integrating this with the current Rust version of Arrow was simple and I
> have now started running benchmarks (and we now have some benchmark code
> checked into the Arrow project).
>
> Now that the basic functionality is stable enough to support this use case
> I am going to focus on quality this week and start improving unit tests and
> adding documentation.
>
> I think we might be at the point where it makes sense to start discussing a
> first official release and maybe a roadmap for the Rust library?
>
> My next area of interest personally is the IPC mechanism and interop
> testing with other languages, especially Java.
>
> Thanks,
>
> Andy.
>


[jira] [Created] (ARROW-2434) [Rust] Add windows support

2018-04-09 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-2434:
--

 Summary: [Rust] Add windows support
 Key: ARROW-2434
 URL: https://issues.apache.org/jira/browse/ARROW-2434
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Paddy Horan
 Fix For: 0.10.0


Currently, `cargo test` fails on Windows.





Re: Tensor column types in arrow

2018-04-09 Thread Leif Walsh
My gut feeling is that such a column type should specify both the shape and
primitive type of all values in the column. I can’t think of a common use
case that requires differently shaped tensors in a single column.

Can anyone here come up with such a use case?

If not, I can try to draft a proposal change to the spec that adds these
types. The next question is whether such a change can make it in (with c++
and java implementations) before 1.0.
On Mon, Apr 9, 2018 at 17:36 Wes McKinney  wrote:

> > As far as I know, there is an implementation of tensor type in
> C++/Python already. Should we just finalize the spec and add implementation
> to Java?
>
> There is nothing specified yet as far as a *column* of
> ndarrays/tensors. We defined Tensor metadata for the purposes of
> IPC/serialization but made no effort to incorporate such data into the
> columnar format.
>
> There are likely many ways to implement a column whose values are
> ndarrays, each cell with its own shape. Whether we would want to
> permit each cell to have a different ndarray cell type is another
> question (i.e. would we want to constrain every cell in a column to
> contain ndarrays of a particular type, like float64)
>
> So there's a couple of questions
>
> * How to represent the data using the columnar format
> * How to incorporate ndarray metadata into columnar schemas
>
> - Wes
>
> On Mon, Apr 9, 2018 at 5:30 PM, Li Jin  wrote:
> > As far as I know, there is an implementation of tensor type in C++/Python
> > already. Should we just finalize the spec and add implementation to Java?
> >
> > On the Spark side, it's probably more complicated as Vector and Matrix
> are
> > not "first class" types in Spark SQL. Spark ML implements them as UDT
> > (user-defined types) so it's not clear how to make Spark/Arrow converter
> > work with them.
> >
> > I wonder if Bryan and Holden have some more thoughts on that?
> >
> > Li
> >
> > On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh  wrote:
> >
> >> Hi all,
> >>
> >> I’ve been doing some work lately with Spark’s ML interfaces, which
> include
> >> sparse and dense Vector and Matrix types, backed on the Scala side by
> >> Breeze. Using these interfaces, you can construct DataFrames whose
> column
> >> types are vectors and matrices, and though the API isn’t terribly rich,
> it
> >> is possible to run Python UDFs over such a DataFrame and get numpy
> ndarrays
> >> out of each row. However, if you’re using Spark’s Arrow serialization
> >> between the executor and Python workers, you get this
> >> UnsupportedOperationException:
> >> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3
> >> d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/
> >> execution/arrow/ArrowWriter.scala#L71
> >>
> >> I think it would be useful for Arrow to support something like a column
> of
> >> tensors, and I’d like to see if anyone else here is interested in such a
> >> thing.  If so, I’d like to propose adding it to the spec and getting it
> >> implemented in at least Java and C++/Python.
> >>
> >> Some initial mildly-scattered thoughts:
> >>
> >> 1. You can certainly represent these today as List<double> and
> >> List<List<double>>, but then need to do some copying to get them back
> into
> >> numpy ndarrays.
> >>
> >> 2. In some cases it might be useful to know that a column contains 3x3x4
> >> tensors, for example, and not just that there are three dimensions as
> you’d
> >> get with List<List<List<double>>>.  This could constrain what operations
> >> are meaningful (for example, in Spark you could imagine type checking
> that
> >> verifies dimension alignment for matrix multiplication).
> >>
> >> 3. You could approximate that with a FixedSizeList and metadata about
> the
> >> tensor shape.
> >>
> >> 4. But I kind of feel like this is generally useful enough that it’s
> worth
> >> having one implementation of it (well, one for each runtime) in Arrow.
> >>
> >> 5. Or, maybe everyone here thinks Spark should just do this with
> metadata?
> >>
> >> Curious to hear what you all think.
> >>
> >> Thanks,
> >> Leif
> >>
> >> --
> >> --
> >> Cheers,
> >> Leif
> >>
>
-- 
-- 
Cheers,
Leif


[jira] [Created] (ARROW-2433) [Rust] Add Builder.push_slice(&[T])

2018-04-09 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2433:
-

 Summary: [Rust] Add Builder.push_slice(&[T])
 Key: ARROW-2433
 URL: https://issues.apache.org/jira/browse/ARROW-2433
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.10.0


When populating a Builder with Utf8 data, it is more efficient to push whole 
strings as &[u8] rather than one byte at a time.

The same optimization applies to all other types too (see the sketch below).
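As a rough sketch of why (a Python analogy, since the Rust API shape here is still the proposal; the {{Builder}} class below is illustrative, not Arrow code): pushing a slice amortizes the capacity checks and copies over the whole string instead of paying them once per byte.

{code:python}
# Illustrative Python analogy of the proposed optimization, not Arrow code.
class Builder:
    def __init__(self):
        self.buf = bytearray()

    def push(self, byte):
        self.buf.append(byte)      # one capacity/bounds check per byte

    def push_slice(self, data):
        self.buf.extend(data)      # one check and one bulk copy per string

b = Builder()
for byte in b"hello":              # byte-at-a-time: five calls
    b.push(byte)
b.push_slice(b" world")            # whole slice: one call
{code}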

 





Re: Tensor column types in arrow

2018-04-09 Thread Wes McKinney
> As far as I know, there is an implementation of tensor type in C++/Python 
> already. Should we just finalize the spec and add implementation to Java?

There is nothing specified yet as far as a *column* of
ndarrays/tensors. We defined Tensor metadata for the purposes of
IPC/serialization but made no effort to incorporate such data into the
columnar format.

There are likely many ways to implement a column whose values are
ndarrays, each cell with its own shape. Whether we would want to
permit each cell to have a different ndarray cell type is another
question (i.e. would we want to constrain every cell in a column to
contain ndarrays of a particular type, like float64).

So there are a couple of questions:

* How to represent the data using the columnar format
* How to incorporate ndarray metadata into columnar schemas

- Wes

On Mon, Apr 9, 2018 at 5:30 PM, Li Jin  wrote:
> As far as I know, there is an implementation of tensor type in C++/Python
> already. Should we just finalize the spec and add implementation to Java?
>
> On the Spark side, it's probably more complicated as Vector and Matrix are
> not "first class" types in Spark SQL. Spark ML implements them as UDT
> (user-defined types) so it's not clear how to make Spark/Arrow converter
> work with them.
>
> I wonder if Bryan and Holden have some more thoughts on that?
>
> Li
>
> On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh  wrote:
>
>> Hi all,
>>
>> I’ve been doing some work lately with Spark’s ML interfaces, which include
>> sparse and dense Vector and Matrix types, backed on the Scala side by
>> Breeze. Using these interfaces, you can construct DataFrames whose column
>> types are vectors and matrices, and though the API isn’t terribly rich, it
>> is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
>> out of each row. However, if you’re using Spark’s Arrow serialization
>> between the executor and Python workers, you get this
>> UnsupportedOperationException:
>> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3
>> d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/
>> execution/arrow/ArrowWriter.scala#L71
>>
>> I think it would be useful for Arrow to support something like a column of
>> tensors, and I’d like to see if anyone else here is interested in such a
>> thing.  If so, I’d like to propose adding it to the spec and getting it
>> implemented in at least Java and C++/Python.
>>
>> Some initial mildly-scattered thoughts:
>>
>> 1. You can certainly represent these today as List<double> and
>> List<List<double>>, but then need to do some copying to get them back into
>> numpy ndarrays.
>>
>> 2. In some cases it might be useful to know that a column contains 3x3x4
>> tensors, for example, and not just that there are three dimensions as you’d
>> get with List<List<List<double>>>.  This could constrain what operations
>> are meaningful (for example, in Spark you could imagine type checking that
>> verifies dimension alignment for matrix multiplication).
>>
>> 3. You could approximate that with a FixedSizeList and metadata about the
>> tensor shape.
>>
>> 4. But I kind of feel like this is generally useful enough that it’s worth
>> having one implementation of it (well, one for each runtime) in Arrow.
>>
>> 5. Or, maybe everyone here thinks Spark should just do this with metadata?
>>
>> Curious to hear what you all think.
>>
>> Thanks,
>> Leif
>>
>> --
>> --
>> Cheers,
>> Leif
>>


Re: Tensor column types in arrow

2018-04-09 Thread Leif Walsh
The tensor type in the C++ API is a stand-alone object AFAICT; Phillip and
I were unable to construct an Arrow column of them. I agree that it's a
good starting point. One interpretation of what I'm suggesting is that we
take it as the reference implementation, add it to the spec, and write the
Java implementation.
On Mon, Apr 9, 2018 at 17:30 Li Jin  wrote:

> As far as I know, there is an implementation of tensor type in C++/Python
> already. Should we just finalize the spec and add implementation to Java?
>
> On the Spark side, it's probably more complicated as Vector and Matrix are
> not "first class" types in Spark SQL. Spark ML implements them as UDT
> (user-defined types) so it's not clear how to make Spark/Arrow converter
> work with them.
>
> I wonder if Bryan and Holden have some more thoughts on that?
>
> Li
>
> On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh  wrote:
>
> > Hi all,
> >
> > I’ve been doing some work lately with Spark’s ML interfaces, which
> include
> > sparse and dense Vector and Matrix types, backed on the Scala side by
> > Breeze. Using these interfaces, you can construct DataFrames whose column
> > types are vectors and matrices, and though the API isn’t terribly rich,
> it
> > is possible to run Python UDFs over such a DataFrame and get numpy
> ndarrays
> > out of each row. However, if you’re using Spark’s Arrow serialization
> > between the executor and Python workers, you get this
> > UnsupportedOperationException:
> > https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3
> > d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/
> > execution/arrow/ArrowWriter.scala#L71
> >
> > I think it would be useful for Arrow to support something like a column
> of
> > tensors, and I’d like to see if anyone else here is interested in such a
> > thing.  If so, I’d like to propose adding it to the spec and getting it
> > implemented in at least Java and C++/Python.
> >
> > Some initial mildly-scattered thoughts:
> >
> > 1. You can certainly represent these today as List<double> and
> > List<List<double>>, but then need to do some copying to get them back
> into
> > numpy ndarrays.
> >
> > 2. In some cases it might be useful to know that a column contains 3x3x4
> > tensors, for example, and not just that there are three dimensions as
> you’d
> > get with List<List<List<double>>>.  This could constrain what operations
> > are meaningful (for example, in Spark you could imagine type checking
> that
> > verifies dimension alignment for matrix multiplication).
> >
> > 3. You could approximate that with a FixedSizeList and metadata about the
> > tensor shape.
> >
> > 4. But I kind of feel like this is generally useful enough that it’s
> worth
> > having one implementation of it (well, one for each runtime) in Arrow.
> >
> > 5. Or, maybe everyone here thinks Spark should just do this with
> metadata?
> >
> > Curious to hear what you all think.
> >
> > Thanks,
> > Leif
> >
> > --
> > --
> > Cheers,
> > Leif
> >
>
-- 
-- 
Cheers,
Leif


Re: Tensor column types in arrow

2018-04-09 Thread Li Jin
As far as I know, there is an implementation of tensor type in C++/Python
already. Should we just finalize the spec and add implementation to Java?

On the Spark side, it's probably more complicated as Vector and Matrix are
not "first class" types in Spark SQL. Spark ML implements them as UDT
(user-defined types) so it's not clear how to make Spark/Arrow converter
work with them.

I wonder if Bryan and Holden have some more thoughts on that?

Li

On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh  wrote:

> Hi all,
>
> I’ve been doing some work lately with Spark’s ML interfaces, which include
> sparse and dense Vector and Matrix types, backed on the Scala side by
> Breeze. Using these interfaces, you can construct DataFrames whose column
> types are vectors and matrices, and though the API isn’t terribly rich, it
> is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
> out of each row. However, if you’re using Spark’s Arrow serialization
> between the executor and Python workers, you get this
> UnsupportedOperationException:
> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3
> d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/
> execution/arrow/ArrowWriter.scala#L71
>
> I think it would be useful for Arrow to support something like a column of
> tensors, and I’d like to see if anyone else here is interested in such a
> thing.  If so, I’d like to propose adding it to the spec and getting it
> implemented in at least Java and C++/Python.
>
> Some initial mildly-scattered thoughts:
>
> 1. You can certainly represent these today as List<double> and
> List<List<double>>, but then need to do some copying to get them back into
> numpy ndarrays.
>
> 2. In some cases it might be useful to know that a column contains 3x3x4
> tensors, for example, and not just that there are three dimensions as you’d
> get with List<List<List<double>>>.  This could constrain what operations
> are meaningful (for example, in Spark you could imagine type checking that
> verifies dimension alignment for matrix multiplication).
>
> 3. You could approximate that with a FixedSizeList and metadata about the
> tensor shape.
>
> 4. But I kind of feel like this is generally useful enough that it’s worth
> having one implementation of it (well, one for each runtime) in Arrow.
>
> 5. Or, maybe everyone here thinks Spark should just do this with metadata?
>
> Curious to hear what you all think.
>
> Thanks,
> Leif
>
> --
> --
> Cheers,
> Leif
>


Tensor column types in arrow

2018-04-09 Thread Leif Walsh
Hi all,

I’ve been doing some work lately with Spark’s ML interfaces, which include
sparse and dense Vector and Matrix types, backed on the Scala side by
Breeze. Using these interfaces, you can construct DataFrames whose column
types are vectors and matrices, and though the API isn’t terribly rich, it
is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
out of each row. However, if you’re using Spark’s Arrow serialization
between the executor and Python workers, you get this
UnsupportedOperationException:
https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71

I think it would be useful for Arrow to support something like a column of
tensors, and I’d like to see if anyone else here is interested in such a
thing.  If so, I’d like to propose adding it to the spec and getting it
implemented in at least Java and C++/Python.

Some initial mildly-scattered thoughts:

1. You can certainly represent these today as List<double> and
List<List<double>>, but then need to do some copying to get them back into
numpy ndarrays.

2. In some cases it might be useful to know that a column contains 3x3x4
tensors, for example, and not just that there are three dimensions as you’d
get with List<List<List<double>>>.  This could constrain what operations
are meaningful (for example, in Spark you could imagine type checking that
verifies dimension alignment for matrix multiplication).

3. You could approximate that with a FixedSizeList and metadata about the
tensor shape (see the sketch after this list).

4. But I kind of feel like this is generally useful enough that it’s worth
having one implementation of it (well, one for each runtime) in Arrow.

5. Or, maybe everyone here thinks Spark should just do this with metadata?
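To make point 3 concrete, here is a minimal sketch using today's pyarrow API (hedged: fixed-size lists postdate the 0.9.0 release this thread is contemporary with): a column of 2x2 float64 tensors stored as a fixed-size list of four values per cell, with the shape carried as field metadata.

{code:python}
import pyarrow as pa

# Each cell is a flattened 2x2 tensor; the shape lives in field metadata.
tensor_type = pa.list_(pa.float64(), 4)      # fixed-size list, 4 = 2*2
field = pa.field("tensor", tensor_type, metadata={"shape": "2,2"})

arr = pa.array([[1.0, 2.0, 3.0, 4.0],
                [5.0, 6.0, 7.0, 8.0]], type=tensor_type)
table = pa.Table.from_arrays([arr], schema=pa.schema([field]))

# Getting ndarrays back means reading the metadata and reshaping by hand,
# which is exactly the gap a first-class tensor column type would close.
shape = tuple(int(d) for d in table.schema[0].metadata[b"shape"].split(b","))
tensors = arr.values.to_numpy().reshape((-1,) + shape)
{code}

A real spec-level type would make the shape part of the type itself rather than stringly-typed metadata.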

Curious to hear what you all think.

Thanks,
Leif

-- 
-- 
Cheers,
Leif


[jira] [Created] (ARROW-2432) [Python] from_pandas fails when converting decimals if contain None

2018-04-09 Thread Bryan Cutler (JIRA)
Bryan Cutler created ARROW-2432:
---

 Summary: [Python] from_pandas fails when converting decimals if 
contain None
 Key: ARROW-2432
 URL: https://issues.apache.org/jira/browse/ARROW-2432
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Bryan Cutler


Using from_pandas to convert decimals fails if it encounters a value of {{None}}. 
For example:
{code:python}
In [1]: import pyarrow as pa
...: import pandas as pd
...: from decimal import Decimal
...:

In [2]: s_dec = pd.Series([Decimal('3.14'), None])

In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-3> in <module>()
> 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))

array.pxi in pyarrow.lib.Array.from_pandas()

array.pxi in pyarrow.lib.array()

error.pxi in pyarrow.lib.check_status()

error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
object of type NoneType but can only handle these types: decimal.Decimal

In [4]: s_dec
Out[4]:
0    3.14
1    None
dtype: object{code}

The above error is raised when the decimal type is specified explicitly. When no 
type is specified, a segfault occurs.

This previously worked in 0.8.0.





[jira] [Created] (ARROW-2431) [Rust] Schema fidelity

2018-04-09 Thread Maximilian Roos (JIRA)
Maximilian Roos created ARROW-2431:
--

 Summary: [Rust] Schema fidelity
 Key: ARROW-2431
 URL: https://issues.apache.org/jira/browse/ARROW-2431
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Maximilian Roos


ref [https://github.com/apache/arrow/pull/1829#discussion_r179248743]

Currently our traits are not faithful to 
[https://arrow.apache.org/docs/metadata.html].

For example, we nest `Field`s in the `DataType` (aka `type`) attribute of the 
parent `Field`, rather than having the type be `Struct` and the children in a 
separate `children` parameter.

 

Is this OK, assuming that we can read and write accurate schemas? Or should we 
move towards having the Schema trait be consistent with the metadata spec?
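For reference, a paraphrase of the spec's Field layout (from Schema.fbs), written as plain Python data purely for illustration: the parent's type is just the `Struct` tag, and the children are a separate list of Fields.

{code:python}
# Paraphrase of the metadata spec's Field message as illustrative Python data
# (not real Arrow code): children are a sibling list, not part of the type.
parent_field = {
    "name": "person",
    "nullable": True,
    "type": "Struct",      # the type carries no child Fields itself
    "children": [
        {"name": "name", "nullable": False, "type": "Utf8",  "children": []},
        {"name": "age",  "nullable": True,  "type": "Int32", "children": []},
    ],
}
{code}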





[jira] [Created] (ARROW-2430) MVP for branch based packaging automation

2018-04-09 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2430:
--

 Summary: MVP for branch based packaging automation
 Key: ARROW-2430
 URL: https://issues.apache.org/jira/browse/ARROW-2430
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs


Described in 
https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit





Re: What do people think about a one day get together?

2018-04-09 Thread Julian Hyde
+1 The Arrow community would benefit greatly from a conference/unconference.

Remember not to schedule it too close to ApacheCon. 

Julian


> On Apr 9, 2018, at 10:18 AM, Jacques Nadeau  wrote:
> 
> Hey all, given that several people are busy in June, let's wait until the
> fall. I'll take a look at the schedule of things and throw out a new idea
> in the next few months.
> 
> Thanks!
> Jacques
> 
> 
> On Wed, Apr 4, 2018 at 10:17 AM, Wes McKinney  wrote:
> 
>> I'm +1 on the idea of an Arrow conference. June is a pretty busy month
>> for me, so I would prefer another time, but I can arrange to make it
>> if we do it then.
>> 
>> - Wes
>> 
>> On Wed, Apr 4, 2018 at 1:06 PM, Siddharth Teotia 
>> wrote:
>>> +1. I would love to attend.
>>> 
>>> On Tue, Apr 3, 2018 at 4:18 PM, Kevin Moore  wrote:
>>> 
 Sounds great. Quilt Data may be able to sponsor some of the refreshment
 costs.
 
 
 Kevin Moore
 CEO, Quilt Data, Inc.
 ke...@quiltdata.io | LinkedIn
 (415) 497-7895
 
 
 Manage Data like Code
 quiltdata.com
 
 On Tue, Apr 3, 2018 at 1:41 PM, Li Jin  wrote:
 
> I'd love to attend. I will be around for Spark Summit.
> 
> Li
> 
> 
> On Tue, Apr 3, 2018 at 11:48 AM, Jacques Nadeau 
> wrote:
> 
>> Hey All,
>> 
>> In light of growing interest in Apache Arrow over the past year and
>> the
>> great response to the meetup talk invitation I sent last week, I was
>> thinking it may be time to hold a single day conference focused on
>> the
>> project. Wes and I have previously thrown this idea around and it
>> seems
>> like it might be a good time to get something started. Some of my
>> colleagues did an investigation on how and when we could do this.
>> I'm
>> raising this to you all now to get people's thoughts.
>> 
>> 
>> A rough sketch of what Wes and I have bounced around:
>> 
>> *One day developer-focused event on Apache Arrow in San Francisco,
>> June
> 7,
>> just after Spark Summit (open to other dates, but it would be nice
>> for
>> folks attending the conference to stay one extra day for Arrow).
>> 
>> * Focus on interesting use cases and applications of Arrow. We could
 also
>> use this event to discuss/plan/present about movement to Arrow 1.0
>> this
>> year and beyond.
>> 
>> *Goal of 100-200 attendees.
>> 
>> *Dremio can offer to organize the event (venue, logistics,
 registrations,
>> etc). The goal would be to keep ticket costs very modest to
>> encourage
>> attendance (eg, $50). Opportunity for sponsorship by vendors to help
> drive
>> down costs (eg, refreshments).
>> 
>> *Still need to determine a venue but probably something downtown SF
> nearish
>> Moscone.
>> 
>> *PMC or appointed sub-committee could review talk submissions. We
>> could
> use
>> something like EasyChair to make this as simple as possible.
>> 
>> What do people think? I think this could be good to continue to
>> drive
 and
>> grow the community in a positive way.
>> 
>> thanks,
>> Jacques
>> 
> 
 
>> 



Re: What do people think about a one day get together?

2018-04-09 Thread Jacques Nadeau
Hey all, given that several people are busy in June, let's wait until the
fall. I'll take a look at the schedule of things and throw out a new idea
in the next few months.

Thanks!
Jacques


On Wed, Apr 4, 2018 at 10:17 AM, Wes McKinney  wrote:

> I'm +1 on the idea of an Arrow conference. June is a pretty busy month
> for me, so I would prefer another time, but I can arrange to make it
> if we do it then.
>
> - Wes
>
> On Wed, Apr 4, 2018 at 1:06 PM, Siddharth Teotia 
> wrote:
> > +1. I would love to attend.
> >
> > On Tue, Apr 3, 2018 at 4:18 PM, Kevin Moore  wrote:
> >
> >> Sounds great. Quilt Data may be able to sponsor some of the refreshment
> >> costs.
> >>
> >> 
> >> Kevin Moore
> >> CEO, Quilt Data, Inc.
> >> ke...@quiltdata.io | LinkedIn
> >> (415) 497-7895
> >>
> >>
> >> Manage Data like Code
> >> quiltdata.com
> >>
> >> On Tue, Apr 3, 2018 at 1:41 PM, Li Jin  wrote:
> >>
> >> > I'd love to attend. I will be around for Spark Summit.
> >> >
> >> > Li
> >> >
> >> >
> >> > On Tue, Apr 3, 2018 at 11:48 AM, Jacques Nadeau 
> >> > wrote:
> >> >
> >> > > Hey All,
> >> > >
> >> > > In light of growing interest in Apache Arrow over the past year and
> the
> >> > > great response to the meetup talk invitation I sent last week, I was
> >> > > thinking it may be time to hold a single day conference focused on
> the
> >> > > project. Wes and I have previously thrown this idea around and it
> seems
> >> > > like it might be a good time to get something started. Some of my
> >> > > colleagues did an investigation on how and when we could do this.
> I'm
> >> > > raising this to you all now to get people's thoughts.
> >> > >
> >> > >
> >> > > A rough sketch of what Wes and I have bounced around:
> >> > >
> >> > > *One day developer-focused event on Apache Arrow in San Francisco,
> June
> >> > 7,
> >> > > just after Spark Summit (open to other dates, but it would be nice
> for
> >> > > folks attending the conference to stay one extra day for Arrow).
> >> > >
> >> > > * Focus on interesting use cases and applications of Arrow. We could
> >> also
> >> > > use this event to discuss/plan/present about movement to Arrow 1.0
> this
> >> > > year and beyond.
> >> > >
> >> > > *Goal of 100-200 attendees.
> >> > >
> >> > > *Dremio can offer to organize the event (venue, logistics,
> >> registrations,
> >> > > etc). The goal would be to keep ticket costs very modest to
> encourage
> >> > > attendance (eg, $50). Opportunity for sponsorship by vendors to help
> >> > drive
> >> > > down costs (eg, refreshments).
> >> > >
> >> > > *Still need to determine a venue but probably something downtown SF
> >> > nearish
> >> > > Moscone.
> >> > >
> >> > > *PMC or appointed sub-committee could review talk submissions. We
> could
> >> > use
> >> > > something like EasyChair to make this as simple as possible.
> >> > >
> >> > > What do people think? I think this could be good to continue to
> drive
> >> and
> >> > > grow the community in a positive way.
> >> > >
> >> > > thanks,
> >> > > Jacques
> >> > >
> >> >
> >>
>


[jira] [Created] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back

2018-04-09 Thread Dave Challis (JIRA)
Dave Challis created ARROW-2429:
---

 Summary: [Python] Timestamp unit in schema changes when writing to 
Parquet file then reading back
 Key: ARROW-2429
 URL: https://issues.apache.org/jira/browse/ARROW-2429
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
 Environment: Mac OS High Sierra
PyArrow 0.9.0 (py36_1)
Python
Reporter: Dave Challis


When creating an Arrow table from a Pandas DataFrame, the table schema contains 
a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it 
back, the schema of the table read instead contains a field with type 
`timestamp[us]`.

 
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')



print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}
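For context, Parquet's original type system has no nanosecond timestamp, which is why the unit comes back as microseconds. A hedged sketch of the {{write_table}} options that control this in pyarrow (availability varies by release; check the docs for the version you run):

{code:python}
import pyarrow.parquet as pq

# Keep nanosecond timestamps by writing them as deprecated INT96 values
# (the representation Impala and older Spark expect):
pq.write_table(table, 'foo.parquet', use_deprecated_int96_timestamps=True)

# Or coerce explicitly, e.g. to milliseconds, instead of the default:
pq.write_table(table, 'foo.parquet', coerce_timestamps='ms')
{code}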





[jira] [Created] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion

2018-04-09 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2428:
--

 Summary: [Python] Support ExtensionArrays in to_pandas conversion
 Key: ARROW-2428
 URL: https://issues.apache.org/jira/browse/ARROW-2428
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe L. Korn
 Fix For: 1.0.0


With the next release of Pandas, it will be possible to define custom column 
types that back a {{pandas.Series}}. Thus we will not be able to cover all 
possible column types in the {{to_pandas}} conversion by default as we won't be 
aware of all extension arrays.

To enable users to create {{ExtensionArray}} instances from Arrow columns in 
the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}} 
call where they can overload the default conversion routines with the ones that 
produce their {{ExtensionArray}} instances.

This should avoid additional copies in the case where we would otherwise first 
convert the Arrow column into a default Pandas column (probably of object type) 
and the user would afterwards convert it to a more efficient 
{{ExtensionArray}}. This hook will be especially useful when you build 
{{ExtensionArrays}} whose storage is backed by Arrow.

The meta-issue that tracks the implementation inside of Pandas is: 
https://github.com/pandas-dev/pandas/issues/19696
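A hypothetical sketch of what such a hook could look like at the call site (the {{types_mapper}} name mirrors what much later pyarrow releases ended up shipping, and is used here purely as an illustration):

{code:python}
import pyarrow as pa
import pandas as pd

# Map an Arrow type to a pandas ExtensionDtype during to_pandas, instead of
# first materializing a default (object-typed) column and converting again.
def my_types_mapper(arrow_type):
    if arrow_type == pa.int64():
        return pd.Int64Dtype()   # a nullable-integer ExtensionArray dtype
    return None                  # None falls back to the default conversion

table = pa.table({"x": pa.array([1, 2, None])})
df = table.to_pandas(types_mapper=my_types_mapper)
{code}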





Tasks for upcoming Hackathons & Sprints

2018-04-09 Thread Uwe L. Korn
Hello all,

in the next weeks and months some of us are taking part in Hackathons and 
Sprints and hope to attract new people to Arrow development. This includes:

* MAN AHL Hackathon in 2 weeks: https://www.ahl.com/hackathon
* PyCon US in May: https://us.pycon.org/2018/community/sprints/
* PyCon DE in October: https://de.pycon.org/

To get people on board and have things they can work on, I'm collecting 
possible tasks. To make these tasks visible, we should flag simple things that 
everyone could work on with a "beginner" label in JIRA so they appear in 
https://issues.apache.org/jira/issues/?filter=12343593. If you can think of 
tasks that are good for getting started with Arrow development but have no 
ticket yet, please create one even if you don't plan to work on it. It might 
be a good pointer for someone else on how to get started.

Cheers
Uwe


[jira] [Created] (ARROW-2427) [C++] ReadAt implementations suboptimal

2018-04-09 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2427:
-

 Summary: [C++] ReadAt implementations suboptimal
 Key: ARROW-2427
 URL: https://issues.apache.org/jira/browse/ARROW-2427
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.9.0
Reporter: Antoine Pitrou


The {{ReadAt}} implementations for at least {{OSFile}} and {{MemoryMappedFile}} 
take the file lock and seek. They could instead read directly from the given 
offset, allowing concurrent I/O from multiple threads.
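The idea, sketched in Python with POSIX {{pread}} (the real fix is in Arrow's C++ file classes; this only shows why a positional read needs neither the lock nor the seek):

{code:python}
import os
from concurrent.futures import ThreadPoolExecutor

with open('data.bin', 'wb') as f:          # small test file
    f.write(os.urandom(4096 * 8))

# os.pread reads at an explicit offset without touching the shared file
# position, so concurrent readers need no lock and never seek.
fd = os.open('data.bin', os.O_RDONLY)
with ThreadPoolExecutor() as pool:
    chunks = list(pool.map(lambda off: os.pread(fd, 4096, off),
                           range(0, 4096 * 8, 4096)))
os.close(fd)
{code}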





[jira] [Created] (ARROW-2426) [CI] glib build failure

2018-04-09 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2426:
-

 Summary: [CI] glib build failure
 Key: ARROW-2426
 URL: https://issues.apache.org/jira/browse/ARROW-2426
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Antoine Pitrou


The glib build on Travis-CI fails:

[https://travis-ci.org/apache/arrow/jobs/364123364#L6840]

{code}
==> Installing gobject-introspection
==> Downloading 
https://homebrew.bintray.com/bottles/gobject-introspection-1.56.0_1.sierra.bottle.tar.gz
==> Pouring gobject-introspection-1.56.0_1.sierra.bottle.tar.gz
  /usr/local/Cellar/gobject-introspection/1.56.0_1: 173 files, 9.8MB
Installing gobject-introspection has failed!
{code}





[jira] [Created] (ARROW-2425) [Rust] Array::from missing mapping for u8 type

2018-04-09 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2425:
-

 Summary: [Rust] Array::from missing mapping for u8 type
 Key: ARROW-2425
 URL: https://issues.apache.org/jira/browse/ARROW-2425
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.10.0


Macros are used to implement Array::from for each primitive type, but the 
mapping for u8 was missing.





[jira] [Created] (ARROW-2423) [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects

2018-04-09 Thread Dave Challis (JIRA)
Dave Challis created ARROW-2423:
---

 Summary: [Python] PyArrow datatypes raise ValueError on equality 
checks against non-PyArrow objects
 Key: ARROW-2423
 URL: https://issues.apache.org/jira/browse/ARROW-2423
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
 Environment: Mac OS High Sierra
PyArrow 0.9.0 (py36_1)
Python 3.6.3
Reporter: Dave Challis


Checking a PyArrow datatype object for equality with non-PyArrow datatypes 
causes a `ValueError` to be raised, rather than either returning a True/False 
value, or returning 
[NotImplemented|https://docs.python.org/3/library/constants.html#NotImplemented]
 if the comparison isn't implemented.

E.g. attempting to call:
{code:python}
import pyarrow
pyarrow.int32() == 'foo'
{code}
results in:
{code:python}
Traceback (most recent call last):
  File "types.pxi", line 1221, in pyarrow.lib.type_for_alias
KeyError: 'foo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "t.py", line 2, in 
pyarrow.int32() == 'foo'
  File "types.pxi", line 90, in pyarrow.lib.DataType.__richcmp__
  File "types.pxi", line 113, in pyarrow.lib.DataType.equals
  File "types.pxi", line 1223, in pyarrow.lib.type_for_alias
ValueError: No type alias for foo
{code}
The expected outcome for the above would be for the comparison to return 
`False`, as that's the general behaviour for comparisons between objects of 
different types (e.g. `1 == 'foo'` or `object() == 12.4` both return `False`).
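For reference, the usual Python pattern that produces that behaviour, sketched (not pyarrow's actual code; {{_equals}} is a stand-in for the type-aware comparison):

{code:python}
class DataType:
    def __eq__(self, other):
        if not isinstance(other, DataType):
            return NotImplemented    # Python falls back, so `== 'foo'` is False
        return self._equals(other)   # stand-in for the real comparison

print(DataType() == 'foo')           # False, no exception raised
{code}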





Rust Arrow status and plans for this week

2018-04-09 Thread Andy Grove
Over the weekend I added preliminary Parquet support to DataFusion (it only
supports int/float primitives and UTF8 so far). This was possible due to
the great work happening with the parquet-rs crate.

Integrating this with the current Rust version of Arrow was simple and I
have now started running benchmarks (and we now have some benchmark code
checked into the Arrow project).

Now that the basic functionality is stable enough to support this use case
I am going to focus on quality this week and start improving unit tests and
adding documentation.

I think we might be at the point where it makes sense to start discussing a
first official release and maybe a roadmap for the Rust library?

My next area of interest personally is the IPC mechanism and interop
testing with other languages, especially Java.

Thanks,

Andy.


[jira] [Created] (ARROW-2422) Support more filter operators on Hive partitioned Parquet files

2018-04-09 Thread Julius Neuffer (JIRA)
Julius Neuffer created ARROW-2422:
-

 Summary: Support more filter operators on Hive partitioned Parquet 
files
 Key: ARROW-2422
 URL: https://issues.apache.org/jira/browse/ARROW-2422
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Julius Neuffer


After implementing basic filters ('=', '!=') on Hive-partitioned Parquet files 
(ARROW-2401), I'll extend them ('>', '<', '<=', '>=') in a new PR on GitHub.
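Usage sketch of the {{filters}} argument that ARROW-2401 added to {{ParquetDataset}}, extended with the range operators (the dataset path and partition key are invented for illustration):

{code:python}
import pyarrow.parquet as pq

# Hive layout such as /data/events/year=2017/part-0.parquet, year=2018/, ...
dataset = pq.ParquetDataset('/data/events',
                            filters=[('year', '>', '2017'),
                                     ('year', '<=', '2019')])
table = dataset.read()
{code}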





[jira] [Created] (ARROW-2420) [Rust] Memory is never released

2018-04-09 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2420:
-

 Summary: [Rust] Memory is never released
 Key: ARROW-2420
 URL: https://issues.apache.org/jira/browse/ARROW-2420
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.10.0


Another embarrassing bug ... the code was calling the wrong method to release 
memory, so memory was never actually freed.

I have added some benchmarks that test the performance of creating arrays (and 
dropping them), and these are working well now that the memory bug is fixed.





[jira] [Created] (ARROW-2419) [Site] Website generation depends on local timezone

2018-04-09 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2419:
-

 Summary: [Site] Website generation depends on local timezone
 Key: ARROW-2419
 URL: https://issues.apache.org/jira/browse/ARROW-2419
 Project: Apache Arrow
  Issue Type: Bug
  Components: Website
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


See discussion at 
https://github.com/apache/arrow/pull/1853#issuecomment-379670199


