date:20190814

Hi Micah,

Thanks for the good points.
I agree with you that we should improve the efficiency of algorithms.

This is related to another improvement: reduce the if/switch statements in
the code.

To account for the edge cases, can we remove the set methods and leave
only the get method?
This is because for some scenarios, we have vectors that can either be int
vectors or float vectors.

The common interface for float4 and float8 sounds good. Let's do it in
another issue.

Best,
Liya Fan

On Thu, Aug 15, 2019 at 12:49 PM Micah Kornfield 
wrote:

> Hi Liya Fan,
> I'm not sure if this is a good idea.  First, floating point operations have
> more edge cases than integer arithmetic (e.g. dealing with NaNs).  Second,
> and I apologize that I've been remiss in thinking this through on reviews,
> but I think we should be thinking about how to make algorithms/operations
> as efficient as possible.  In this regard, putting everything behind an
> interface prevents the JVM JIT from inlining them effectively.
>
> It might not be a bad idea to have a common interface for just the
> Float4 and Float8 vectors, but I'd like to get other people's thoughts on
> this.
>
> Thanks,
> Micah
>
> On Wed, Aug 14, 2019 at 7:26 PM Fan Liya  wrote:
>
> > Dear all,
> >
> >
> >
> > We want to provide an interface for all vectors with numeric types (small
> > int, float4, float8, etc). This interface will make it convenient for many
> > operations on a vector, like average, sum, variance, etc. With this
> > interface, the client code will be greatly simplified, with many
> > branches/switch removed.
> >
> >
> >
> > The design is similar to BaseIntVector (the interface for all integer
> > vectors). We provide 3 methods for setting & getting numeric values:
> >
> >  setWithPossibleRounding
> >
> >  setSafeWithPossibleRounding
> >
> >  getValueAsDouble
> >
> >
> >
> > Please give some comments. Thanks a lot.
> >
> >
> >
> > Best,
> >
> > Liya Fan
> >
>

Re: [Java] CI builds failing on master

Hi Ji,

Thanks for fixing this.

Best,
Liya Fan

On Thu, Aug 15, 2019 at 12:50 PM Micah Kornfield 
wrote:

> I just merged this.  Thank you Ji Liu.
>
> On Wed, Aug 14, 2019 at 4:50 PM Ji Liu  wrote:
>
> > Hi, Wes, as described in JIRA, this was introduced by our recent two
> > patches, I have just submitted a PR[1] to fix this. Thanks for tracking
> > this issue.
> >
> >
> > Thanks,
> > Ji Liu
> >
> > [1] https://github.com/apache/arrow/pull/5090
> >
> >
> > --
> > From:Wes McKinney 
> > Send Time:2019年8月15日(星期四) 06:16
> > To:dev 
> > Subject:[Java] CI builds failing on master
> >
> > We've got some Java-related build failures occurring on master
> >
> > https://travis-ci.org/apache/arrow/jobs/571998256
> >
> > Since we build the Java library in some of the C++/Python builds
> > sorting this out is fairly urgent so we can continue to merge patches.
> >
> > I opened
> >
> > https://issues.apache.org/jira/browse/ARROW-6241
> >
> > to track
> >
> > thanks
> > Wes
>

Re: Timeline for 0.15.0 release

Hi Wes,
>
> Do these need to be dependent on the 64-bit array length discussion?

We could hack something that can read the lower 32-bit range, so I guess
not, but this leaves a bad taste in my mouth.  I think there is likely
still enough time to have the discussion and get these implemented, one way
or another.

> For the record, I don't think we should hold a major release hostage
> if we aren't able to complete various feature milestones in time.
> Since it's been about 5-6 weeks since 0.14.0 we're coming close to the
> desired 8-10 week timeline for major releases, so if we need to have
> 0.16.0 prior to 1.0.0, I think that is OK also.

I agree with the time based milestones in practice, but we are backpedaling
on the intent to keep type parity between the two reference
implementations.  At least the way I read the previous threads on the
topic, I thought there was lazy consensus that in lieu of requiring that working
implementations in Java and C++ be checked in at the same time, we would
rely on the release as a forcing function for parity.

Thanks,
Micah

On Wed, Aug 14, 2019 at 11:32 AM Antoine Pitrou  wrote:

>
> Agreed with Wes.
>
> Regards
>
> Antoine.
>
>
> Le 14/08/2019 à 20:30, Wes McKinney a écrit :
> > For the record, I don't think we should hold a major release hostage
> > if we aren't able to complete various feature milestones in time.
> > Since it's been about 5-6 weeks since 0.14.0 we're coming close to the
> > desired 8-10 week timeline for major releases, so if we need to have
> > 0.16.0 prior to 1.0.0, I think that is OK also.
> >
> > On Wed, Aug 14, 2019 at 11:45 AM Wes McKinney 
> wrote:
> >>
> >> On Wed, Aug 14, 2019 at 11:43 AM Micah Kornfield 
> wrote:
> >>>
> 
> >>>> is there anything else that has come up that
> >>>> definitely needs to happen before we can release again?
> >>>
> >>> We need to decide on a way forward for LargeList, LargeBinary, etc,
> types...
> >>>
> >>
> >> Do these need to be dependent on the 64-bit array length discussion?
> >> They seem somewhat orthogonal to me. If we have to release 0.15.0
> >> without the Java side of these, that's OK with me, since reaching
> >> format implementation completeness is more of a 1.0.0 concern
> >>
> >>> On Tue, Aug 13, 2019 at 8:27 PM Wes McKinney 
> wrote:
> >>>
> >>>> hi folks,
> >>>>
> >>>> Since there have been a number of fairly serious issues (e.g.
> >>>> ARROW-6060) since 0.14.1 that have been fixed I think we should start
> >>>> planning of the next major release. Note that we still have some
> >>>> format-related work (the Flatbuffers alignment issue) that ought to be
> >>>> resolved (not a small task since it affects 4 or 5 implementations),
> >>>> but aside from that, is there anything else that has come up that
> >>>> definitely needs to happen before we can release again?
> >>>>
> >>>> I would say cutting a release somewhere around the US Labor Day
> >>>> holiday (~the week after or so) would be called for.
> >>>>
> >>>> Thanks,
> >>>> Wes
> >>>>
>

Re: [Java] CI builds failing on master

I just merged this.  Thank you Ji Liu.

On Wed, Aug 14, 2019 at 4:50 PM Ji Liu  wrote:

> Hi, Wes, as described in JIRA, this was introduced by our recent two
> patches, I have just submitted a PR[1] to fix this. Thanks for tracking
> this issue.
>
>
> Thanks,
> Ji Liu
>
> [1] https://github.com/apache/arrow/pull/5090
>
>
> --
> From:Wes McKinney 
> Send Time:2019年8月15日(星期四) 06:16
> To:dev 
> Subject:[Java] CI builds failing on master
>
> We've got some Java-related build failures occurring on master
>
> https://travis-ci.org/apache/arrow/jobs/571998256
>
> Since we build the Java library in some of the C++/Python builds
> sorting this out is fairly urgent so we can continue to merge patches.
>
> I opened
>
> https://issues.apache.org/jira/browse/ARROW-6241
>
> to track
>
> thanks
> Wes

Re: [DISCUSS][Java] Provide an interface for numeric vectors

Hi Liya Fan,
I'm not sure if this is a good idea.  First, floating point operations have
more edge cases than integer arithmetic (e.g. dealing with NaNs).  Second,
and I apologize that I've been remiss in thinking this through on reviews,
but I think we should be thinking about how to make algorithms/operations
as efficient as possible.  In this regard, putting everything behind an
interface prevents the JVM JIT from inlining them effectively.

It might not be a bad idea to have a common interface for just the
Float4 and Float8 vectors, but I'd like to get other people's thoughts on
this.

Thanks,
Micah
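
To make the first point above concrete, here is a minimal sketch of how a
single NaN changes the result of simple aggregations on doubles, something
integer arithmetic never has to handle. The class and method names are
hypothetical and are not part of Arrow; the NaN-skipping policy is only one
possible choice, not something this thread decided.

```java
// Illustration only: NaN handling that a generic numeric interface would
// have to define for float vectors but not for integer vectors.
final class NaNEdgeCases {
  // Naive sum: one NaN makes the whole result NaN.
  static double naiveSum(double[] values) {
    double total = 0.0;
    for (double v : values) {
      total += v;
    }
    return total;
  }

  // One possible policy (an assumption, not a decision): skip NaN inputs.
  static double sumSkippingNaN(double[] values) {
    double total = 0.0;
    for (double v : values) {
      if (!Double.isNaN(v)) {
        total += v;
      }
    }
    return total;
  }

  // A comparison-based max silently ignores NaN, because every ordered
  // comparison against NaN is false. Math.max would return NaN instead.
  static double maxByComparison(double[] values) {
    double best = Double.NEGATIVE_INFINITY;
    for (double v : values) {
      if (v > best) {
        best = v;
      }
    }
    return best;
  }
}
```

The two max strategies disagree on the same input, which is the kind of
choice each algorithm would need to make explicitly.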

On Wed, Aug 14, 2019 at 7:26 PM Fan Liya  wrote:

> Dear all,
>
>
>
> We want to provide an interface for all vectors with numeric types (small
> int, float4, float8, etc). This interface will make it convenient for many
> operations on a vector, like average, sum, variance, etc. With this
> interface, the client code will be greatly simplified, with many
> branches/switch removed.
>
>
>
> The design is similar to BaseIntVector (the interface for all integer
> vectors). We provide 3 methods for setting & getting numeric values:
>
>  setWithPossibleRounding
>
>  setSafeWithPossibleRounding
>
>  getValueAsDouble
>
>
>
> Please give some comments. Thanks a lot.
>
>
>
> Best,
>
> Liya Fan
>

Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-14 Thread Ji Liu

My original thought was that introducing a new interface makes the hierarchy a
little confusing (FixedSizeListVector->BaseListVector,
ListVector->BaseRepeatedVector->BaseListVector) and that we should try to avoid
introducing new classes.

And you are right, FixedSizeListVector should not include an offsetBuffer. I am
fine with the approach you suggested: if BaseRepeatedVector/ListVector and
FixedSizeListVector both inherit from a new interface (BaseListVector), they can be
used interchangeably in dictionary encoding. Let's see if others have different
opinions.

Thanks,
Ji Liu
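
A minimal sketch of the shape discussed above, assuming a common parent
interface that exposes only element ranges. The name BaseListVector and the
getStartIndex/getEndIndex idea come from this thread; the signatures and the
toy classes are hypothetical and are not the Arrow implementation.

```java
// Hypothetical common parent: readers only need the element range of a list.
interface BaseListVectorSketch {
  int getStartIndex(int index);
  int getEndIndex(int index);
}

// Variable-size lists find their ranges in an offset buffer.
final class ToyListVector implements BaseListVectorSketch {
  private final int[] offsets; // valueCount + 1 entries

  ToyListVector(int[] offsets) {
    this.offsets = offsets;
  }

  public int getStartIndex(int index) {
    return offsets[index];
  }

  public int getEndIndex(int index) {
    return offsets[index + 1];
  }
}

// Fixed-size lists compute their ranges, so no offset buffer is needed.
final class ToyFixedSizeListVector implements BaseListVectorSketch {
  private final int listSize;

  ToyFixedSizeListVector(int listSize) {
    this.listSize = listSize;
  }

  public int getStartIndex(int index) {
    return index * listSize;
  }

  public int getEndIndex(int index) {
    return (index + 1) * listSize;
  }
}

// Code such as dictionary encoding can then treat both kinds uniformly.
final class ListReader {
  static int elementCount(BaseListVectorSketch vector, int index) {
    return vector.getEndIndex(index) - vector.getStartIndex(index);
  }
}
```

With this shape neither class has to override a method just to throw an
exception, and the fixed-size class never allocates offsets.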

--
From:Micah Kornfield 
Send Time:2019年8月14日(星期三) 15:10
To:Ji Liu 
Cc:dev 
Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

> You are right, the main difference between FixedSizeListVector and ListVector
> is the offsetBuffer, but I think this could be avoided by overriding
> allocateNewSafe(), which calls allocateOffsetBuffer() in
> BaseRepeatedValueVector; in this way, offsetBuffer in FixedSizeListVector will
> remain allocator.getEmpty().

I think there are other methods on List that FixedSizeList shouldn't be
implementing as well.   In an ideal world, I think the parent class/interface
would be called ListVector and there would then be specific children of
FixedSizeList and VariableSizeList.  I think that is too big a change to
something core, but we should try to keep the relationship in that shape, so we
don't need to override methods just to throw NotSupportedExceptions.

On Sun, Aug 11, 2019 at 7:35 AM Ji Liu  wrote:

Thanks Jacques. To avoid complex call paths for getObject, we should keep
getObject in both classes. I'll also check the other methods.

Thanks,
Ji Liu

--
From:Jacques Nadeau 
Send Time:2019年8月11日(星期日) 21:43
To:dev ; Ji Liu 
Cc:emkornfield 
Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

We tried to get away from this kind of back and forth with subclassing as
much as possible. (call getObject on base class which then calls getIndex
on child class which then calls something else on base class). I haven't
looked through the code but let's try to avoid having complex call paths
for the vectors.

On Sat, Aug 10, 2019 at 6:07 PM Ji Liu  wrote:

> Hi Micah, thanks for your suggestion.
> You are right, the main difference between FixedSizeListVector and
> ListVector is the offsetBuffer, but I think this could be avoided by
> overriding allocateNewSafe(), which calls allocateOffsetBuffer() in
> BaseRepeatedValueVector; in this way, offsetBuffer in FixedSizeListVector
> will remain allocator.getEmpty().
>
> Meanwhile, we could add getStartIndex(int index)/getEndIndex(int index)
> APIs to handle the data-reading logic in each class, which could be used in
> getObject(int index) or the encoding parts. What's more, no new interface needs
> to be introduced.
>
> What do you think?
>
>
> Thanks,
> Ji Liu
>
>
> --
> From:Micah Kornfield 
> Send Time:2019年8月11日(星期日) 08:47
> To:dev ; Ji Liu 
> Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from
> ListVector
>
> Hi Ji Liu,
> I think having a common interface/base-class for the two makes sense from a
> data-reading perspective (but I don't have historical context).
>
> I think the change would need to be something above
> BaseRepeatedValueVector, since the FixedSizeListVector doesn't contain an
> offset buffer, and that field is contained on BaseRepeatedValueVector.
>
> Thanks,
> Micah
> On Sat, Aug 10, 2019 at 5:25 PM Ji Liu  wrote:
> Hi, all
>
>  While working on the issue to implement dictionary-encoded subfields[1]
> [2], I found that FixedSizeListVector does not extend ListVector (thanks to
> Micah for pointing this out; I am curious why FixedSizeListVector was
> implemented this way). Since FixedSizeListVector is a specific case of
> ListVector, should we make the former extend the latter to reduce the
> duplicated logic in these two classes and in the writer/reader classes?
>
>
>  Thanks,
>  Ji Liu
>
>  [1] https://issues.apache.org/jira/browse/ARROW-1175
>  [2] https://github.com/apache/arrow/pull/4972
>
>

Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

Congratulations, Sebastien!

Best,
Liya Fan

On Thu, Aug 15, 2019 at 10:47 AM Ji Liu  wrote:

> Congrats Sebastian!
>
>
> --
> From:Micah Kornfield 
> Send Time:2019年8月15日(星期四) 10:46
> To:dev@arrow.apache.org 
> Subject:Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet
>
> Congrats.  Well deserved.
>
> On Wednesday, August 14, 2019, paddy horan  wrote:
>
> > Congrats Sebastian!
> >
> > From: Wes McKinney 
> > Sent: Tuesday, August 13, 2019 4:54 PM
> > To: dev@arrow.apache.org
> > Subject: [ANNOUNCE] New Arrow PMC member: Sebastien Binet
> >
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Sebastien Binet to become a PMC member and we are pleased to announce
> > that Sebastien has accepted.
> >
> > Congratulations and welcome!
> >
>

[jira] [Created] (ARROW-6246) [Website] Add link to R documentation site

2019-08-14 Thread Neal Richardson (JIRA)

Neal Richardson created ARROW-6246:
--

 Summary: [Website] Add link to R documentation site
 Key: ARROW-6246
 URL: https://issues.apache.org/jira/browse/ARROW-6246
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson


ARROW-6139 added the R documentation at /docs/r/, but we still need to link to 
it from the website header.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

2019-08-14 Thread Ji Liu

Congrats Sebastien!


--
From:Micah Kornfield 
Send Time:2019年8月15日(星期四) 10:46
To:dev@arrow.apache.org 
Subject:Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

Congrats.  Well deserved.

On Wednesday, August 14, 2019, paddy horan  wrote:

> Congrats Sebastian!
>
> From: Wes McKinney 
> Sent: Tuesday, August 13, 2019 4:54 PM
> To: dev@arrow.apache.org
> Subject: [ANNOUNCE] New Arrow PMC member: Sebastien Binet
>
> The Project Management Committee (PMC) for Apache Arrow has invited
> Sebastien Binet to become a PMC member and we are pleased to announce
> that Sebastien has accepted.
>
> Congratulations and welcome!
>

Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

Congrats.  Well deserved.

On Wednesday, August 14, 2019, paddy horan  wrote:

> Congrats Sebastian!
>
> From: Wes McKinney 
> Sent: Tuesday, August 13, 2019 4:54 PM
> To: dev@arrow.apache.org
> Subject: [ANNOUNCE] New Arrow PMC member: Sebastien Binet
>
> The Project Management Committee (PMC) for Apache Arrow has invited
> Sebastien Binet to become a PMC member and we are pleased to announce
> that Sebastien has accepted.
>
> Congratulations and welcome!
>

Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

2019-08-14 Thread paddy horan

Congrats Sebastien!

From: Wes McKinney 
Sent: Tuesday, August 13, 2019 4:54 PM
To: dev@arrow.apache.org
Subject: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

The Project Management Committee (PMC) for Apache Arrow has invited
Sebastien Binet to become a PMC member and we are pleased to announce
that Sebastien has accepted.

Congratulations and welcome!

[jira] [Created] (ARROW-6245) [DISCUSS][Java] Provide an interface for numeric vectors

2019-08-14 Thread Liya Fan (JIRA)

Liya Fan created ARROW-6245:
---

 Summary: [DISCUSS][Java] Provide an interface for numeric vectors
 Key: ARROW-6245
 URL: https://issues.apache.org/jira/browse/ARROW-6245
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We want to provide an interface for all vectors with numeric types (small int, 
float4, float8, etc). This interface will make it convenient for many 
operations on a vector, like average, sum, variance, etc. With this interface, 
the client code will be greatly simplified, with many branches/switch removed.

 

The design is similar to BaseIntVector (the interface for all integer vectors). 
We provide 3 methods for setting & getting numeric values:

 setWithPossibleRounding

 setSafeWithPossibleRounding

 getValueAsDouble



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[DISCUSS][Java] Provide an interface for numeric vectors

Dear all,



We want to provide an interface for all vectors with numeric types (small
int, float4, float8, etc). This interface will make it convenient for many
operations on a vector, like average, sum, variance, etc. With this
interface, the client code will be greatly simplified, with many
branches/switch removed.



The design is similar to BaseIntVector (the interface for all integer
vectors). We provide 3 methods for setting & getting numeric values:

 setWithPossibleRounding

 setSafeWithPossibleRounding

 getValueAsDouble



Please give some comments. Thanks a lot.



Best,

Liya Fan
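
A minimal sketch of what such an interface could look like, assuming the
three method names from the proposal. The interface name, the signatures,
getValueCount, and the toy array-backed vectors are all hypothetical; they
are not Arrow classes and say nothing about how Arrow would implement this.

```java
// Hypothetical interface; only the three set/get method names come from
// the proposal, the signatures are assumed.
interface NumericVectorSketch {
  int getValueCount();
  void setWithPossibleRounding(int index, double value);
  void setSafeWithPossibleRounding(int index, double value);
  double getValueAsDouble(int index);
}

// Toy float4-like vector backed by a float[] (not Arrow's Float4Vector).
final class ToyFloat4Vector implements NumericVectorSketch {
  private float[] data;
  private int valueCount;

  ToyFloat4Vector(int capacity) {
    data = new float[capacity];
  }

  public int getValueCount() {
    return valueCount;
  }

  public void setWithPossibleRounding(int index, double value) {
    data[index] = (float) value; // narrowing to float may round
    valueCount = Math.max(valueCount, index + 1);
  }

  public void setSafeWithPossibleRounding(int index, double value) {
    if (index >= data.length) { // grow first, then set
      data = java.util.Arrays.copyOf(data, Math.max(index + 1, data.length * 2));
    }
    setWithPossibleRounding(index, value);
  }

  public double getValueAsDouble(int index) {
    return data[index];
  }
}

// Toy int vector: the setter rounds to the nearest integer.
final class ToyIntVector implements NumericVectorSketch {
  private int[] data;
  private int valueCount;

  ToyIntVector(int capacity) {
    data = new int[capacity];
  }

  public int getValueCount() {
    return valueCount;
  }

  public void setWithPossibleRounding(int index, double value) {
    data[index] = (int) Math.round(value);
    valueCount = Math.max(valueCount, index + 1);
  }

  public void setSafeWithPossibleRounding(int index, double value) {
    if (index >= data.length) {
      data = java.util.Arrays.copyOf(data, Math.max(index + 1, data.length * 2));
    }
    setWithPossibleRounding(index, value);
  }

  public double getValueAsDouble(int index) {
    return data[index];
  }
}

// Client code written once against the interface, with no type switch.
final class NumericOps {
  static double sum(NumericVectorSketch vector) {
    double total = 0.0;
    for (int i = 0; i < vector.getValueCount(); i++) {
      total += vector.getValueAsDouble(i);
    }
    return total;
  }

  static double average(NumericVectorSketch vector) {
    return sum(vector) / vector.getValueCount();
  }
}
```

The sum and average helpers work on either vector type, which is the
simplification the proposal is after; the cost is one interface call per
element.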

Re: [Java] CI builds failing on master

2019-08-14 Thread Ji Liu

Hi, Wes, as described in JIRA, this was introduced by our recent two patches, I 
have just submitted a PR[1] to fix this. Thanks for tracking this issue.


Thanks,
Ji Liu

[1] https://github.com/apache/arrow/pull/5090


--
From:Wes McKinney 
Send Time:2019年8月15日(星期四) 06:16
To:dev 
Subject:[Java] CI builds failing on master

We've got some Java-related build failures occurring on master

https://travis-ci.org/apache/arrow/jobs/571998256

Since we build the Java library in some of the C++/Python builds
sorting this out is fairly urgent so we can continue to merge patches.

I opened

https://issues.apache.org/jira/browse/ARROW-6241

to track

thanks
Wes

[VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

hi all,

As we've been discussing [1], there is a need to introduce 4 bytes of
padding into the preamble of the "encapsulated IPC message" format to
ensure that the Flatbuffers metadata payload begins on an 8-byte
aligned memory offset. The alternative to this would be for Arrow
implementations where alignment is important (e.g. C or C++) to copy
the metadata (which is not always small) into memory when it is
unaligned.

Micah has proposed to address this by adding a
4-byte "continuation" value at the beginning of the payload
having the value 0x. The reason to do it this way is that
old clients will see an invalid length (what is currently the
first 4 bytes of the message -- a 32-bit little endian signed
integer indicating the metadata length) rather than potentially
crashing on a valid length. We also propose to expand the "end of
stream" marker used in the stream and file format from 4 to 8
bytes. This has the additional effect of aligning the file footer
defined in File.fbs.

This would be a backwards incompatible protocol change, so older Arrow
libraries would not be able to read these new messages. Maintaining
forward compatibility (reading data produced by older libraries) would
be possible as we can reason that a value other than the continuation
value was produced by an older library (and then validate the
Flatbuffer message of course). Arrow implementations could offer a
backward compatibility mode for the sake of old readers if they desire
(this may also assist with testing).
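
The reader-side check described above can be sketched as follows. This is
an illustration under stated assumptions, not the code from the PR: the
class and method names are hypothetical, and because the marker value is
cut off in this archived copy, the sketch assumes 0xFFFFFFFF as given in the
published Arrow IPC format documentation. Read as a signed 32-bit integer,
that value is -1, which matches the "invalid length" behaviour described
for old clients.

```java
// Hypothetical sketch of reading the message prefix in either framing.
final class IpcPrefixSketch {
  // Assumed marker value (from the Arrow IPC documentation, not this email).
  static final int CONTINUATION = 0xFFFFFFFF;

  // Returns {metadataLength, prefixSizeInBytes} for the message at pos.
  static int[] readPrefix(byte[] buffer, int pos) {
    int first = readInt32LE(buffer, pos);
    if (first == CONTINUATION) {
      // New framing: 4-byte marker, then the 4-byte metadata length.
      return new int[] {readInt32LE(buffer, pos + 4), 8};
    }
    // Any other value is taken to be a length written by an older library.
    return new int[] {first, 4};
  }

  // Decodes a 32-bit little-endian signed integer.
  static int readInt32LE(byte[] b, int p) {
    return (b[p] & 0xFF)
        | ((b[p + 1] & 0xFF) << 8)
        | ((b[p + 2] & 0xFF) << 16)
        | ((b[p + 3] & 0xFF) << 24);
  }
}
```

With the 8-byte prefix, metadata that follows a message starting on an
8-byte boundary also starts on an 8-byte boundary, which is the alignment
the change is meant to provide.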

Additionally with this vote, we want to formally approve the change to
the Arrow "file" format to always write the (new 8-byte) end-of-stream
marker, which enables code that processes Arrow streams to safely read
the file's internal messages as though they were a normal stream.

The PR making these changes to the IPC documentation is here

https://github.com/apache/arrow/pull/4951

Please vote to accept these changes. This vote will be open for at
least 72 hours

[ ] +1 Adopt these Arrow protocol changes
[ ] +0
[ ] -1 I disagree because...

Here is my vote: +1

Thanks,
Wes

[1]:
https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E

Re: [DISCUSS] Add GetFlightSchema to Flight RPC

2019-08-14 Thread Ryan Murray

Hi All,

Does this require a vote? If so, what is the process for initiating one? If
not, I hope this has been enough time for feedback, and I would like to remove
the draft designation from the PR.

Best,
Ryan

On Wed, Aug 7, 2019 at 9:31 AM Ryan Murray  wrote:

> As per everyone's feedback I have renamed GetFlightSchema -> GetSchema and
> have removed the descriptor on the rpc result message. The doc has been
> updated as has the draft PR
>
> On Thu, Aug 1, 2019 at 6:32 PM Bryan Cutler  wrote:
>
>> Sounds good to me, I would just echo what others have said.
>>
>> On Thu, Aug 1, 2019 at 8:17 AM Ryan Murray  wrote:
>>
>> > Thanks Wes,
>> >
>> > The descriptor is only there to maintain a bit of symmetry with
>> > GetFlightInfo. Happy to remove it; I don't think it's necessary, and
>> > a few people already agree. Similarly with the method name: I am neutral
>> > on the naming and can call it whatever the community is happy with.
>> >
>> > Best,
>> > Ryan
>> >
>> > On Thu, Aug 1, 2019 at 3:56 PM Wes McKinney 
>> wrote:
>> >
>> > > I'm generally supportive of adding the new RPC endpoint.
>> > >
>> > > To make a couple points from the document
>> > >
>> > > * I'm not sure what the purpose of returning the FlightDescriptor is,
>> > > but I haven't thought too hard about it
>> > > * The Schema consists of a single IPC message -- dictionaries will
>> > > appear in the actual DoGet stream. To motivate why this is --
>> > > different endpoints might have different dictionaries corresponding to
>> > > fields in the schema, to have static/constant dictionaries in a
>> > > distributed Flight setting is likely to be impractical. I summarize
>> > > the issue as "dictionaries are data, not metadata".
>> > > * I would be OK calling this GetSchema instead of GetFlightSchema but
>> > > either is okay
>> > >
>> > > - Wes
>> > >
>> > > On Thu, Aug 1, 2019 at 8:08 AM David Li 
>> wrote:
>> > > >
>> > > > Hi Ryan,
>> > > >
>> > > > Thanks for writing this up! I made a couple of minor comments in the
>> > > > doc/implementation, but overall I'm in favor of having this RPC
>> > > > method.
>> > > >
>> > > > Best,
>> > > > David
>> > > >
>> > > > On 8/1/19, Ryan Murray  wrote:
>> > > > > Hi All,
>> > > > >
>> > > > > Please see the attached document for a proposed addition to the
>> > Flight
>> > > > > RPC[1]. This is the result of a previous mailing list
>> discussion[2].
>> > > > >
>> > > > > I have created the Pull Request[3] to make the proposal a little
>> more
>> > > > > concrete.
>> > > > > 
>> > > > > Please let me know if you have any questions or concerns.
>> > > > >
>> > > > > Best,
>> > > > > Ryan
>> > > > >
>> > > > > [1]:
>> > > > >
>> > >
>> >
>> https://docs.google.com/document/d/1zLdFYikk3owbKpHvJrARLMlmYpi-Ef6OJy7H90MqViA/edit?usp=sharing
>> > > > > [2]:
>> > > > >
>> > >
>> >
>> https://lists.apache.org/thread.html/3539984493cf3d4d439bef25c150fa9e09e0b43ce0afb6be378d41df@%3Cdev.arrow.apache.org%3E
>> > > > > [3]: https://github.com/apache/arrow/pull/4980
>> > > > >
>> > >
>> >
>> >
>> > --
>> >
>> > Ryan Murray  | Principal Consulting Engineer
>> >
>> > +447540852009 | rym...@dremio.com
>> >
>> > 
>> > 
>> >
>>
>
>
> --
>
> Ryan Murray  | Principal Consulting Engineer
>
> +447540852009 | rym...@dremio.com
>
> 
> 
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com



Re: [Discuss][Java] 64-bit lengths for ValueVectors

On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield  wrote:
>
> Hi Wes and Jacques,
> See responses below.
>
> With regards to the reference implementation point. It is a good point. I'm
> > on vacation this week. Unless you're pushing hard on this, can we pick this
> > up and discuss more next week?
>
>
> Sure thing, enjoy your vacation.  I think the only practical implications
> are it delays choices around implementing LargeList, LargeBinary,
> LargeString in Java, which in turn might push out the 0.15.0 release.
>

To copy the sentiments from the 0.15.0 release thread, I think it
would be best to decouple this discussion from the release timeline
given how many people we have relying on regular releases coming out.
We can continue making major 0.x releases until we're ready to
release 1.0.0.

> My stance on this is that I don't know how important it is for Java to
> > support vectors over INT32_MAX elements. The use cases enabled by
> > having very large arrays seem to be concentrated in the native code
> > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
> > on my part, though.
>
>
> A data point against this view is Spark has done work to eliminate 2GB
> memory limits on its block sizes [1].  I don't claim to understand the
> implications of this. Bryan might you have any thoughts here?  I'm OK with
> INT32_MAX, as well, I think we should think about what this means for
> adding Large types to Java and implications for reference implementations.
>
> Thanks,
> Micah
>
> [1] https://issues.apache.org/jira/browse/SPARK-6235
>
> On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau  wrote:
>
> > Hey Micah,
> >
> > Appreciate the offer on the compiling. The reality is I'm more concerned
> > about the unknowns than the compiling issue itself. Any time you've been
> > tuning for a while, changing something like this could be totally fine or
> > cause a couple of major issues. For example, we've done a very large amount
> > of work reducing heap memory footprint of the vectors. Our target is to
> > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per vector
> > (not including arrow bufs).
> >
> > With regards to the reference implementation point. It is a good point.
> > I'm on vacation this week. Unless you're pushing hard on this, can we pick
> > this up and discuss more next week?
> >
> > thanks,
> > Jacques
> >
> > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield 
> > wrote:
> >
> >> Hi Jacques,
> > >> I definitely understand these concerns and this change is risky because it
> > >> is so large.  Perhaps creating a new hierarchy might be the cleanest way
> > >> of dealing with this.  This could have other benefits like cleaning up some
> > >> cruft around dictionary encoding and "orphaned" methods.   Per past e-mail
> > >> threads I agree it is beneficial to have 2 separate reference
> > >> implementations that can communicate fully, and my intent here was to close
> > >> that gap.
> >>
> >> Trying to
> >> > determine the ramifications of these changes would be challenging and
> >> time
> >> > consuming against all the different ways we interact with the Arrow Java
> >> > library.
> >>
> >>
> >> Understood.  I took a quick look at Dremio-OSS it seems like it has a
> >> simple java build system?  If it is helpful, I can try to get a fork
> >> running that at least compiles against this PR.  My plan would be to cast
> >> any place that was changed to return a long back to an int, so in essence
> > >> the Dremio algorithms would remain 32-bit implementations.
> >>
> >> I don't  have the infrastructure to test this change properly from a
> >> distributed systems perspective, so it would still take some time from
> >> Dremio to validate for regressions.
> >>
> >> I'm not saying I'm against this but want to make sure we've
> >> > explored all less disruptive options before considering changing
> >> something
> >> > this fundamental (especially when I generally hold the view that large
> >> cell
> >> > counts against massive contiguous memory is an anti pattern to scalable
> >> > analytical processing--purely subjective of course).
> >>
> >>
> >> I'm open to other ideas here, as well. I don't think it is out of the
> >> question to leave the Java implementation as 32-bit, but if we do, then I
> >> think we should consider a different strategy for reference
> >> implementations.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau 
> >> wrote:
> >>
> >> > Hey Micah, I didn't have a particular path in mind. Was thinking more
> >> along
> >> > the lines of extra methods as opposed to separate classes.
> >> >
> >> > Arrow hasn't historically been a place where we're writing algorithms in
> >> > Java so the fact that they aren't there doesn't mean they don't exist.
> >> We
> >> > have a large amount of code that depends on the current behavior that is
> >> > deployed in hundreds of customer clusters (you can peruse our dremio
> >> repo
> >> > to see how

[jira] [Created] (ARROW-6244) [C++] Implement Partition DataSource

2019-08-14 Thread Francois Saint-Jacques (JIRA)

Francois Saint-Jacques created ARROW-6244:
-

 Summary: [C++] Implement Partition DataSource
 Key: ARROW-6244
 URL: https://issues.apache.org/jira/browse/ARROW-6244
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Francois Saint-Jacques


This is a DataSource that also has partition metadata. The end goal is to 
support filtering with a DataSelector/Filter expression. The initial 
implementation should not deal with PartitionScheme yet.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[Java] CI builds failing on master

We've got some Java-related build failures occurring on master

https://travis-ci.org/apache/arrow/jobs/571998256

Since we build the Java library in some of the C++/Python builds
sorting this out is fairly urgent so we can continue to merge patches.

I opened

https://issues.apache.org/jira/browse/ARROW-6241

to track

thanks
Wes

[jira] [Created] (ARROW-6242) [C++] Implements basic Dataset/Scanner/ScannerBuilder

2019-08-14 Thread Francois Saint-Jacques (JIRA)

Francois Saint-Jacques created ARROW-6242:
-

 Summary: [C++] Implements basic Dataset/Scanner/ScannerBuilder
 Key: ARROW-6242
 URL: https://issues.apache.org/jira/browse/ARROW-6242
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Francois Saint-Jacques


The goal of this would be to iterate over a Dataset and generate a "flattened" 
stream of RecordBatches from the union of data sources and data fragments. This 
should not bother with filtering yet.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[jira] [Created] (ARROW-6241) [Java] Failures on master

Wes McKinney created ARROW-6241:
---

 Summary: [Java] Failures on master
 Key: ARROW-6241
 URL: https://issues.apache.org/jira/browse/ARROW-6241
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Wes McKinney
 Fix For: 0.15.0


I'm getting builds failing today with errors like

{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.6.2:compile (default-compile) 
on project arrow-vector: Compilation failure: Compilation failure:
[ERROR] 
/home/travis/build/apache/arrow/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java:[356,4]
 error: cannot find symbol
[ERROR] symbol:   variable Preconditions
[ERROR] location: class ListVector
[ERROR] 
/home/travis/build/apache/arrow/java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java:[96,4]
 error: cannot find symbol
[ERROR] symbol:   variable Preconditions
[ERROR] location: class NonNullableStructVector
[ERROR] -> [Help 1]
{code}

see https://travis-ci.org/apache/arrow/jobs/571958044

Is this introduced by a recent patch?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[jira] [Created] (ARROW-6240) [Ruby] Arrow::Decimal128Array returns BigDecimal

2019-08-14 Thread Sutou Kouhei (JIRA)

Sutou Kouhei created ARROW-6240:
---

 Summary: [Ruby] Arrow::Decimal128Array returns BigDecimal
 Key: ARROW-6240
 URL: https://issues.apache.org/jira/browse/ARROW-6240
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei







[jira] [Created] (ARROW-6239) [Python][Parquet] Add examples of using HDFS filesystem and Parquet files together

Wes McKinney created ARROW-6239:
---

 Summary: [Python][Parquet] Add examples of using HDFS filesystem 
and Parquet files together
 Key: ARROW-6239
 URL: https://issues.apache.org/jira/browse/ARROW-6239
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


It seems this is not clear enough in the documentation

https://stackoverflow.com/questions/57500284/how-to-write-on-hdfs-using-pyarrow




Re: Timeline for 0.15.0 release

2019-08-14 Thread Antoine Pitrou

Agreed with Wes.

Regards

Antoine.

On 14/08/2019 at 20:30, Wes McKinney wrote:
> For the record, I don't think we should hold a major release hostage
> if we aren't able to complete various feature milestones in time.
> Since it's been about 5-6 weeks since 0.14.0 we're coming close to the
> desired 8-10 week timeline for major releases, so if we need to have
> 0.16.0 prior to 1.0.0, I think that is OK also.
> 
> On Wed, Aug 14, 2019 at 11:45 AM Wes McKinney  wrote:
>>
>> On Wed, Aug 14, 2019 at 11:43 AM Micah Kornfield  
>> wrote:
>>>

>>>> is there anything else that has come up that
>>>> definitely needs to happen before we can release again?
>>>
>>> We need to decide on a way forward for LargeList, LargeBinary, etc, types...
>>>
>>
>> Do these need to be dependent on the 64-bit array length discussion?
>> They seem somewhat orthogonal to me. If we have to release 0.15.0
>> without the Java side of these, that's OK with me, since reaching
>> format implementation completeness is more of a 1.0.0 concern
>>
>>> On Tue, Aug 13, 2019 at 8:27 PM Wes McKinney  wrote:
>>>
>>>> hi folks,
>>>>
>>>> Since there have been a number of fairly serious issues (e.g.
>>>> ARROW-6060) since 0.14.1 that have been fixed I think we should start
>>>> planning of the next major release. Note that we still have some
>>>> format-related work (the Flatbuffers alignment issue) that ought to be
>>>> resolved (not a small task since it affects 4 or 5 implementations),
>>>> but aside from that, is there anything else that has come up that
>>>> definitely needs to happen before we can release again?
>>>>
>>>> I would say cutting a release somewhere around the US Labor Day
>>>> holiday (~the week after or so) would be called for.
>>>>
>>>> Thanks,
>>>> Wes

Re: Timeline for 0.15.0 release

For the record, I don't think we should hold a major release hostage
if we aren't able to complete various feature milestones in time.
Since it's been about 5-6 weeks since 0.14.0 we're coming close to the
desired 8-10 week timeline for major releases, so if we need to have
0.16.0 prior to 1.0.0, I think that is OK also.

On Wed, Aug 14, 2019 at 11:45 AM Wes McKinney  wrote:
>
> On Wed, Aug 14, 2019 at 11:43 AM Micah Kornfield  
> wrote:
> >
> > >
> > >  is there anything else that has come up that
> > > definitely needs to happen before we can release again?
> >
> > We need to decide on a way forward for LargeList, LargeBinary, etc, types...
> >
>
> Do these need to be dependent on the 64-bit array length discussion?
> They seem somewhat orthogonal to me. If we have to release 0.15.0
> without the Java side of these, that's OK with me, since reaching
> format implementation completeness is more of a 1.0.0 concern
>
> > On Tue, Aug 13, 2019 at 8:27 PM Wes McKinney  wrote:
> >
> > > hi folks,
> > >
> > > Since there have been a number of fairly serious issues (e.g.
> > > ARROW-6060) since 0.14.1 that have been fixed I think we should start
> > > planning of the next major release. Note that we still have some
> > > format-related work (the Flatbuffers alignment issue) that ought to be
> > > resolved (not a small task since it affects 4 or 5 implementations),
> > > but aside from that, is there anything else that has come up that
> > > definitely needs to happen before we can release again?
> > >
> > > I would say cutting a release somewhere around the US Labor Day
> > > holiday (~the week after or so) would be called for.
> > >
> > > Thanks,
> > > Wes
> > >

[jira] [Created] (ARROW-6237) [R] Add option to set CXXFLAGS when compiling R package with $ARROW_R_CXXFLAGS

Wes McKinney created ARROW-6237:
---

 Summary: [R] Add option to set CXXFLAGS when compiling R package 
with $ARROW_R_CXXFLAGS
 Key: ARROW-6237
 URL: https://issues.apache.org/jira/browse/ARROW-6237
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.15.0


I want to be able to pass {{-fno-omit-frame-pointer}} through the 
{{$ARROW_R_CXXFLAGS}} environment variable.




Re: Timeline for 0.15.0 release

On Wed, Aug 14, 2019 at 11:43 AM Micah Kornfield  wrote:
>
> >
> >  is there anything else that has come up that
> > definitely needs to happen before we can release again?
>
> We need to decide on a way forward for LargeList, LargeBinary, etc, types...
>

Do these need to be dependent on the 64-bit array length discussion?
They seem somewhat orthogonal to me. If we have to release 0.15.0
without the Java side of these, that's OK with me, since reaching
format implementation completeness is more of a 1.0.0 concern

> On Tue, Aug 13, 2019 at 8:27 PM Wes McKinney  wrote:
>
> > hi folks,
> >
> > Since there have been a number of fairly serious issues (e.g.
> > ARROW-6060) since 0.14.1 that have been fixed I think we should start
> > planning of the next major release. Note that we still have some
> > format-related work (the Flatbuffers alignment issue) that ought to be
> > resolved (not a small task since it affects 4 or 5 implementations),
> > but aside from that, is there anything else that has come up that
> > definitely needs to happen before we can release again?
> >
> > I would say cutting a release somewhere around the US Labor Day
> > holiday (~the week after or so) would be called for.
> >
> > Thanks,
> > Wes
> >

Re: Timeline for 0.15.0 release

>
>  is there anything else that has come up that
> definitely needs to happen before we can release again?

We need to decide on a way forward for LargeList, LargeBinary, etc, types...

On Tue, Aug 13, 2019 at 8:27 PM Wes McKinney  wrote:

> hi folks,
>
> Since there have been a number of fairly serious issues (e.g.
> ARROW-6060) since 0.14.1 that have been fixed I think we should start
> planning of the next major release. Note that we still have some
> format-related work (the Flatbuffers alignment issue) that ought to be
> resolved (not a small task since it affects 4 or 5 implementations),
> but aside from that, is there anything else that has come up that
> definitely needs to happen before we can release again?
>
> I would say cutting a release somewhere around the US Labor Day
> holiday (~the week after or so) would be called for.
>
> Thanks,
> Wes
>

Re: Timeline for 0.15.0 release

Is there a JIRA for the issue that caused us to pull the 0.14.1
Windows Python wheel installers? If we want to have working wheels for
0.15.0 we'll need a volunteer to help address whatever was wrong with
0.14.1.

On Tue, Aug 13, 2019 at 10:26 PM Wes McKinney  wrote:
>
> hi folks,
>
> Since there have been a number of fairly serious issues (e.g.
> ARROW-6060) since 0.14.1 that have been fixed I think we should start
> planning of the next major release. Note that we still have some
> format-related work (the Flatbuffers alignment issue) that ought to be
> resolved (not a small task since it affects 4 or 5 implementations),
> but aside from that, is there anything else that has come up that
> definitely needs to happen before we can release again?
>
> I would say cutting a release somewhere around the US Labor Day
> holiday (~the week after or so) would be called for.
>
> Thanks,
> Wes

[jira] [Created] (ARROW-6236) [R] Deduplicate strings using Arrow hash tables instead of passing all values through R's global hash table

Wes McKinney created ARROW-6236:
---

 Summary: [R] Deduplicate strings using Arrow hash tables instead 
of passing all values through R's global hash table
 Key: ARROW-6236
 URL: https://issues.apache.org/jira/browse/ARROW-6236
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney


I suspect that deserialization performance from Arrow to R vectors can be 
improved by deduplicating strings locally prior to setting them into R 
character vectors. You can see our corresponding deduplication logic in Python 
(which is a slightly different situation because Python does not have a GSHT 
like R's global hash table, but we can have many references to the same object 
in an output NumPy array):

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc#L409
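
To make the idea concrete, here is a rough sketch of local deduplication in front of a shared table. It is written in Java to match the other code in this archive and is not the R or C++ implementation: the GLOBAL map, the globalLookups counter and the internGlobally and convert methods are all hypothetical stand-ins, with GLOBAL playing the role of R's global hash table. Null values are not handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; GLOBAL only imitates a shared, comparatively
// expensive string table such as the one described in the issue.
final class LocalDedupSketch {
  static final Map<String, String> GLOBAL = new HashMap<>();
  static int globalLookups = 0;

  // Every call here counts as one trip through the shared table.
  static String internGlobally(String s) {
    globalLookups++;
    String existing = GLOBAL.putIfAbsent(s, s);
    return existing == null ? s : existing;
  }

  // Deduplicate within one array first, so that each distinct value
  // reaches the shared table only once per conversion.
  static String[] convert(String[] values) {
    Map<String, String> local = new HashMap<>();
    String[] out = new String[values.length];
    for (int i = 0; i < values.length; i++) {
      out[i] = local.computeIfAbsent(values[i], LocalDedupSketch::internGlobally);
    }
    return out;
  }

  public static void main(String[] args) {
    String[] out = convert(new String[] {"a", "b", "a", "a", "b"});
    // Five values, but only two distinct ones reach the shared table.
    System.out.println(out.length + " values, " + globalLookups + " global lookups");
  }
}
```

Whether this pays off in practice depends on how many repeated values a column has and on how costly the shared table really is, so it would need benchmarking.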




[jira] [Created] (ARROW-6235) [R] Conversion from arrow::BinaryArray to R character vector not implemented

Wes McKinney created ARROW-6235:
---

 Summary: [R] Conversion from arrow::BinaryArray to R character 
vector not implemented
 Key: ARROW-6235
 URL: https://issues.apache.org/jira/browse/ARROW-6235
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Wes McKinney


See unhandled case at 

https://github.com/apache/arrow/blob/master/r/src/array__to_vector.cpp#L644




[jira] [Created] (ARROW-6234) [Java] ListVector hashCode() is not correct

2019-08-14 Thread Ji Liu (JIRA)

Ji Liu created ARROW-6234:
-

 Summary: [Java] ListVector hashCode() is not correct
 Key: ARROW-6234
 URL: https://issues.apache.org/jira/browse/ARROW-6234
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The current implementation is not correct:
{code:java}
for (int i = start; i < end; i++) {
  hash = 31 * vector.hashCode(i);
}
{code}
It should be something like:
{code:java}
hash = 31 * hash + vector.hashCode(i);
{code}
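
To show why the difference matters, here is a small standalone sketch comparing the two loops. It does not use ListVector: the class and method names are hypothetical, the per-element hashes are passed in as a plain int array, and the initial value of hash is assumed to be 0 because the snippet above does not show it.

```java
// Hypothetical stand-in for illustration; not Arrow's ListVector.
final class RangeHashSketch {
  // Form from the report: each iteration overwrites hash, so only the
  // last element in [start, end) contributes to the result.
  static int overwritingHash(int[] elementHashes, int start, int end) {
    int hash = 0; // assumed initial value
    for (int i = start; i < end; i++) {
      hash = 31 * elementHashes[i];
    }
    return hash;
  }

  // Suggested form: the previous hash is folded in, so every element
  // and its position contribute to the result.
  static int accumulatingHash(int[] elementHashes, int start, int end) {
    int hash = 0; // assumed initial value
    for (int i = start; i < end; i++) {
      hash = 31 * hash + elementHashes[i];
    }
    return hash;
  }

  public static void main(String[] args) {
    int[] a = {1, 2, 3};
    int[] b = {9, 8, 3}; // differs from a except for the last element
    // The overwriting form collides for these two ranges.
    System.out.println(overwritingHash(a, 0, 3) == overwritingHash(b, 0, 3));
    // The accumulating form tells them apart.
    System.out.println(accumulatingHash(a, 0, 3) == accumulatingHash(b, 0, 3));
  }
}
```

With the overwriting loop, any two lists that end in the same element get the same hash, which is legal for hashCode() but leads to many avoidable collisions.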




Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector