Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-10 Thread Ji Liu
Hi Micah, thanks for your suggestion.
You are right, the main difference between FixedSizeListVector and ListVector
is the offsetBuffer, but I think this could be avoided by overriding
allocateNewSafe(), which calls allocateOffsetBuffer() in
BaseRepeatedValueVector; this way, the offsetBuffer in FixedSizeListVector
would remain allocator.getEmpty().

Meanwhile, we could add getStartIndex(int index)/getEndIndex(int index) APIs
to handle the read-data logic, which could be used in getObject(int index)
or in the encoding parts. What's more, no new interface needs to be introduced.
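To make the idea concrete, here is a minimal, self-contained sketch of the shape being proposed (these are simplified stand-in classes, not the actual Arrow vectors; only getStartIndex/getEndIndex come from the proposal above, everything else is assumed for illustration). The read path depends only on the two index methods, so a fixed-size subclass can reuse it without an offset buffer:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for ListVector: read logic goes through getStartIndex()/getEndIndex().
class SimpleListVector {
    protected final int[] offsets;  // stand-in for the offsetBuffer
    protected final int[] data;     // stand-in for the inner data vector

    SimpleListVector(int[] offsets, int[] data) {
        this.offsets = offsets;
        this.data = data;
    }

    protected int getStartIndex(int index) { return offsets[index]; }

    protected int getEndIndex(int index) { return offsets[index + 1]; }

    // Shared read path: works for both variable- and fixed-size lists.
    public List<Integer> getObject(int index) {
        List<Integer> result = new ArrayList<>();
        for (int i = getStartIndex(index); i < getEndIndex(index); i++) {
            result.add(data[i]);
        }
        return result;
    }
}

// Stand-in for FixedSizeListVector: no offsets, indices are computed.
class SimpleFixedSizeListVector extends SimpleListVector {
    private final int listSize;

    SimpleFixedSizeListVector(int listSize, int[] data) {
        super(new int[0], data);  // offsets stay "empty", mirroring allocator.getEmpty()
        this.listSize = listSize;
    }

    @Override
    protected int getStartIndex(int index) { return index * listSize; }

    @Override
    protected int getEndIndex(int index) { return (index + 1) * listSize; }
}
```

In this shape, the fixed-size subclass would likewise override allocateNewSafe() to skip the offset-buffer allocation, as suggested above.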

What do you think?


Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time:2019年8月11日(星期日) 08:47
To:dev ; Ji Liu 
Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

Hi Ji Liu,
I think having a common interface/base-class for the two makes sense (but I
don't have the historical context) from a data-reading perspective. 

I think the change would need to be something above BaseRepeatedValueVector, 
since the FixedSizeListVector doesn't contain an offset buffer, and that field 
is contained on BaseRepeatedValueVector.

Thanks,
Micah
On Sat, Aug 10, 2019 at 5:25 PM Ji Liu  wrote:
Hi, all

 While working on the issue to implement dictionary-encoded subfields [1][2], I 
found that FixedSizeListVector does not extend ListVector (thanks to Micah for 
pointing this out; I am curious why FixedSizeListVector was implemented this way 
 before). Since FixedSizeListVector is a specific case of ListVector, should we 
make the former extend the latter to reduce the considerable duplicated logic in 
these two classes and in their writer/reader classes? 


 Thanks,
 Ji Liu

 [1] https://issues.apache.org/jira/browse/ARROW-1175
 [2] https://github.com/apache/arrow/pull/4972



Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-10 Thread Micah Kornfield
Hi Ji Liu,
I think having a common interface/base-class for the two makes sense (but I
don't have the historical context) from a data-reading perspective.

I think the change would need to be something above
BaseRepeatedValueVector, since the FixedSizeListVector doesn't contain an
offset buffer, and that field is contained on BaseRepeatedValueVector.

Thanks,
Micah

On Sat, Aug 10, 2019 at 5:25 PM Ji Liu  wrote:

> Hi, all
>
> While working on the issue to implement dictionary-encoded subfields [1][2],
> I found that FixedSizeListVector does not extend ListVector (thanks to Micah
> for pointing this out; I am curious why FixedSizeListVector was implemented
> this way before). Since FixedSizeListVector is a specific case of ListVector,
> should we make the former extend the latter to reduce the considerable
> duplicated logic in these two classes and in their writer/reader classes?
>
>
> Thanks,
> Ji Liu
>
> [1] https://issues.apache.org/jira/browse/ARROW-1175
> [2] https://github.com/apache/arrow/pull/4972
>
>


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-10 Thread Micah Kornfield
Hi Jacques,
I definitely understand these concerns, and this change is risky because it
is so large.  Perhaps creating a new hierarchy might be the cleanest way
of dealing with this.  It could have other benefits, like cleaning up some
cruft around dictionary encoding and "orphaned" methods.  Per past e-mail
threads I agree it is beneficial to have two separate reference
implementations that can communicate fully, and my intent here was to close
that gap.

Trying to
> determine the ramifications of these changes would be challenging and time
> consuming against all the different ways we interact with the Arrow Java
> library.


Understood.  I took a quick look at Dremio-OSS; it seems like it has a
simple Java build system?  If it is helpful, I can try to get a fork
running that at least compiles against this PR.  My plan would be to cast
any place that was changed to return a long back to an int, so in essence
the Dremio algorithms would remain 32-bit implementations.

I don't  have the infrastructure to test this change properly from a
distributed systems perspective, so it would still take some time from
Dremio to validate for regressions.

I'm not saying I'm against this but want to make sure we've
> explored all less disruptive options before considering changing something
> this fundamental (especially when I generally hold the view that large cell
> counts against massive contiguous memory is an anti pattern to scalable
> analytical processing--purely subjective of course).


I'm open to other ideas here, as well. I don't think it is out of the
question to leave the Java implementation as 32-bit, but if we do, then I
think we should consider a different strategy for reference implementations.

Thanks,
Micah

On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau  wrote:

> Hey Micah, I didn't have a particular path in mind. Was thinking more along
> the lines of extra methods as opposed to separate classes.
>
> Arrow hasn't historically been a place where we're writing algorithms in
> Java so the fact that they aren't there doesn't mean they don't exist. We
> have a large amount of code that depends on the current behavior that is
> deployed in hundreds of customer clusters (you can peruse our dremio repo
> to see how extensively we leverage Arrow if interested). Trying to
> determine the ramifications of these changes would be challenging and time
> consuming against all the different ways we interact with the Arrow Java
> library. I'm not saying I'm against this but want to make sure we've
> explored all less disruptive options before considering changing something
> this fundamental (especially when I generally hold the view that large cell
> counts against massive contiguous memory is an anti pattern to scalable
> analytical processing--purely subjective of course).
>
> On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield 
> wrote:
>
> > Hi Jacques,
> > What avenue were you thinking for supporting both paths?   I didn't want
> > to pursue a different class hierarchy, because I felt like that would
> > effectively fork the code base, but that is potentially an option that
> > would allow us to have a complete reference implementation in Java that
> can
> > fully interact with C++, without major changes to this code.
> >
> > For supporting both APIs on the same classes/interfaces, I think they
> > roughly fall into three categories, changes to input parameters, changes
> to
> > output parameters and algorithm changes.
> >
> > For inputs, changing from int to long is essentially a no-op from the
> > compiler perspective.  From the limited micro-benchmarking this also
> > doesn't seem to have a performance impact.  So we could keep two versions
> > of the methods that only differ on inputs, but it is not clear what the
> > value of that would be.
> >
> > For outputs, we can't support methods "long getLength()" and "int
> > getLength()" in the same class, so we would be forced into something like
> > "long getLength(boolean dummy)" which I think is a less desirable.
> >
> > For algorithm changes, there did not appear to be too many places where
> we
> > actually loop over all elements (it is quite possible I missed something
> > here), the ones that I did find I was able to mitigate performance
> > penalties as noted above.  Some of the current implementation will get a
> > lot slower for "large arrays", but we can likely fix those later or in
> this
> > PR with a nested while loop instead of 2 for loops.
> >
> > Thanks,
> > Micah
> >
> >
> > On Saturday, August 10, 2019, Jacques Nadeau  wrote:
> >
> >> This is a pretty massive change to the apis. I wonder how nasty it would
> >> be to just support both paths. Have you evaluated how complex that
> would be?
> >>
> >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield 
> >> wrote:
> >>
> >>> After more investigation, it looks like Float8Benchmarks at least on my
> >>> machine are within the range of noise.
> >>>
> >>> For BitVectorHelper I pushed a new commit [1], seems to 

Re: [Java] Arrow PR queue build up?

2019-08-10 Thread Ji Liu
Hi Jacques, thanks for your valuable feedback.
Sorry for the lack of discussion. Some of these PRs are small changes or
bugfixes that do not warrant a discussion. You are right, some PRs turned out
to be more complex than we thought during the review process, and a discussion
on the ML/JIRA would actually help. We will avoid this situation as far as
possible in future work.


Thanks,
Ji Liu


--
From:Jacques Nadeau 
Send Time:2019年8月11日(星期日) 03:36
To:dev ; Micah Kornfield 
Subject:Re: [Java] Arrow PR queue build up?

I think one of the issues here is that there is no upfront discussion about
most of the changes that are being proposed. In most cases, a pull request
just appears without one. This makes the reviews much more intensive and time
consuming as frequently there are questions about the validity, nature or
rationale of the change. Having short design discussions before starting on
these changes would ensure that the nature of the reviews are less
involved, thus decreasing the effort associated with reviewing them.

On Thu, Aug 8, 2019 at 10:18 PM Micah Kornfield 
wrote:

> I did a pass through most of the open PRs (I might have missed one or
> two).  Most had at least a few minor comments so the backlog hasn't gone
> down that much, but I expect most will be mergeable very soon.
>
> On Thu, Aug 8, 2019 at 9:44 AM Micah Kornfield 
> wrote:
>
> > Not a full solution, but I've fallen behind a bit. I'm going to plan to
> > spend some time tonight at least reviewing PRs I've already done the
> first
> > pass on and I'll try to pickup some more.
> >
> > Having more engaged reviewers would be helpful though.
> >
> > Cheers,
> > Micah
> >
> > On Thursday, August 8, 2019, Wes McKinney  wrote:
> >
> >> hi folks,
> >>
> >> Liya Fan and Ji Liu have about 24 open Java PRs between them if I
> >> counted right -- it seems like the project is having a hard time
> >> keeping up with code reviews and merging on these. It looks to me like
> >> they are making a lot of material improvements to the Java library
> >> where previously there had not been a lot of development, so I would
> >> like to see PRs get merged faster -- any ideas how we might be able to
> >> achieve that?  I know that Micah has been spending a lot of time
> >> reviewing and giving feedback on these PRs so that is much appreciated
> >>
> >> Thanks,
> >> Wes
> >>
> >
>



[DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-10 Thread Ji Liu
Hi, all

While working on the issue to implement dictionary-encoded subfields [1][2], I 
found that FixedSizeListVector does not extend ListVector (thanks to Micah for 
pointing this out; I am curious why FixedSizeListVector was implemented this way 
before). Since FixedSizeListVector is a specific case of ListVector, should we 
make the former extend the latter to reduce the considerable duplicated logic in 
these two classes and in their writer/reader classes? 


Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-1175
[2] https://github.com/apache/arrow/pull/4972



Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-10 Thread Jacques Nadeau
Hey Micah, I didn't have a particular path in mind. Was thinking more along
the lines of extra methods as opposed to separate classes.

Arrow hasn't historically been a place where we're writing algorithms in
Java so the fact that they aren't there doesn't mean they don't exist. We
have a large amount of code that depends on the current behavior that is
deployed in hundreds of customer clusters (you can peruse our dremio repo
to see how extensively we leverage Arrow if interested). Trying to
determine the ramifications of these changes would be challenging and time
consuming against all the different ways we interact with the Arrow Java
library. I'm not saying I'm against this but want to make sure we've
explored all less disruptive options before considering changing something
this fundamental (especially when I generally hold the view that large cell
counts against massive contiguous memory is an anti pattern to scalable
analytical processing--purely subjective of course).

On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield  wrote:

> Hi Jacques,
> What avenue were you thinking for supporting both paths?   I didn't want
> to pursue a different class hierarchy, because I felt like that would
> effectively fork the code base, but that is potentially an option that
> would allow us to have a complete reference implementation in Java that can
> fully interact with C++, without major changes to this code.
>
> For supporting both APIs on the same classes/interfaces, I think they
> roughly fall into three categories, changes to input parameters, changes to
> output parameters and algorithm changes.
>
> For inputs, changing from int to long is essentially a no-op from the
> compiler perspective.  From the limited micro-benchmarking this also
> doesn't seem to have a performance impact.  So we could keep two versions
> of the methods that only differ on inputs, but it is not clear what the
> value of that would be.
>
> For outputs, we can't support methods "long getLength()" and "int
> getLength()" in the same class, so we would be forced into something like
> "long getLength(boolean dummy)" which I think is a less desirable.
>
> For algorithm changes, there did not appear to be too many places where we
> actually loop over all elements (it is quite possible I missed something
> here), the ones that I did find I was able to mitigate performance
> penalties as noted above.  Some of the current implementation will get a
> lot slower for "large arrays", but we can likely fix those later or in this
> PR with a nested while loop instead of 2 for loops.
>
> Thanks,
> Micah
>
>
> On Saturday, August 10, 2019, Jacques Nadeau  wrote:
>
>> This is a pretty massive change to the apis. I wonder how nasty it would
>> be to just support both paths. Have you evaluated how complex that would be?
>>
>> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield 
>> wrote:
>>
>>> After more investigation, it looks like Float8Benchmarks at least on my
>>> machine are within the range of noise.
>>>
>>> For BitVectorHelper I pushed a new commit [1], seems to bring the
>>> BitVectorHelper benchmarks back inline (and even with some improvement
>>> for
>>> getNullCountBenchmark).
>>>
>>> BenchmarkMode  Cnt   Score
>>>  Error
>>>  Units
>>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt5   3.821 ±
>>> 0.031
>>>  ns/op
>>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt5  14.884 ±
>>> 0.141
>>>  ns/op
>>>
>>> I applied the same pattern to other loops that I could find, and for any
>>> "for (long" loop on the critical path, I broke it up into two loops.  the
>>> first loop does iteration by integer, the second finishes off for any
>>> long
>>> values.  As a side note it seems like optimization for loops using long
>>> counters at least have a semi-recent open bug for the JVM [2]
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1]
>>>
>>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>>
>>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield 
>>> wrote:
>>>
>>> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
>>> > difference.  I didn't initially run the benchmarks with these turned on
>>> > (you can see the result from above with Float8Benchmarks).  Here are
>>> new
>>> > numbers including with the flags enabled.  It looks like using longs
>>> might
>>> > be a little bit slower, I'll see what I can do to mitigate this.
>>> >
>>> > Ravindra also volunteered to try to benchmark the changes with Dremio's
>>> > code on today's sync call.
>>> >
>>> > New
>>> >
>>> > BenchmarkMode  Cnt   Score
>>>  Error
>>> > Units
>>> >
>>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt5   4.176 ±
>>> 1.292
>>> > ns/op
>>> >
>>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt5  26.102 ±
>>> 0.700
>>> > ns/op
>>> >
>>> > 
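For reference, the two-loop pattern described in this thread (an int counter for the hot path, a long counter to finish the tail) can be sketched in isolation as follows; the class and method names are illustrative, not the actual PR code:

```java
import java.util.function.LongToIntFunction;

public class LoopSplitSketch {
    // Sum values at positions [0, length). The first loop uses an int
    // counter, which the JIT unrolls/vectorizes more reliably than a long
    // counter; the second loop finishes off any elements beyond
    // Integer.MAX_VALUE with a long counter.
    static long sum(LongToIntFunction getter, long length) {
        long total = 0;
        int intLength = (int) Math.min(length, Integer.MAX_VALUE);
        for (int i = 0; i < intLength; i++) {
            total += getter.applyAsInt(i);
        }
        for (long i = intLength; i < length; i++) {
            total += getter.applyAsInt(i);
        }
        return total;
    }
}
```

For arrays that fit in 32 bits, the long-counter loop body never executes, so the common case keeps its existing int-loop performance.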

[jira] [Created] (ARROW-6197) [GLib] Add garrow_decimal128_rescale()

2019-08-10 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-6197:
---

 Summary: [GLib] Add garrow_decimal128_rescale()
 Key: ARROW-6197
 URL: https://issues.apache.org/jira/browse/ARROW-6197
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Format] Semantics for dictionary batches in streams

2019-08-10 Thread Micah Kornfield
One example is reading data from two different Parquet files sequentially with
different dictionaries for the same column.  This could be handled by
re-encoding the data, but that seems potentially sub-optimal.

On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau  wrote:

> What situation are anticipating where you're going to be restating ids mid
> stream?
>
> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield 
> wrote:
>
>> The IPC specification [1] defines behavior when isDelta on a
>> DictionaryBatch [2] is "true".  I might have missed it in the
>> specification, but I couldn't find the interpretation for what the
>> expected
>> behavior is when isDelta=false and two dictionary batches with the
>> same ID are sent.
>>
>> It seems like there are two options:
>> 1.  Interpret the new dictionary batch as replacing the old one.
>> 2.  Regard this as an error condition.
>>
>> Based on the fact that in the "file format" dictionaries are allowed to be
>> placed in any order relative to the record batches, I assume it is the
>> second, but just wanted to make sure.
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/ipc.html
>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>>
>


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-10 Thread Micah Kornfield
Hi Jacques,
What avenue were you thinking for supporting both paths?   I didn't want to
pursue a different class hierarchy, because I felt like that would
effectively fork the code base, but that is potentially an option that
would allow us to have a complete reference implementation in Java that can
fully interact with C++, without major changes to this code.

For supporting both APIs on the same classes/interfaces, I think they
roughly fall into three categories, changes to input parameters, changes to
output parameters and algorithm changes.

For inputs, changing from int to long is essentially a no-op from the
compiler perspective.  From the limited micro-benchmarking this also
doesn't seem to have a performance impact.  So we could keep two versions
of the methods that only differ on inputs, but it is not clear what the
value of that would be.

For outputs, we can't support methods "long getLength()" and "int
getLength()" in the same class, so we would be forced into something like
"long getLength(boolean dummy)" which I think is a less desirable.

For algorithm changes, there did not appear to be too many places where we
actually loop over all elements (it is quite possible I missed something
here), the ones that I did find I was able to mitigate performance
penalties as noted above.  Some of the current implementation will get a
lot slower for "large arrays", but we can likely fix those later or in this
PR with a nested while loop instead of 2 for loops.

Thanks,
Micah


On Saturday, August 10, 2019, Jacques Nadeau  wrote:

> This is a pretty massive change to the apis. I wonder how nasty it would
> be to just support both paths. Have you evaluated how complex that would be?
>
> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield 
> wrote:
>
>> After more investigation, it looks like Float8Benchmarks at least on my
>> machine are within the range of noise.
>>
>> For BitVectorHelper I pushed a new commit [1], seems to bring the
>> BitVectorHelper benchmarks back inline (and even with some improvement for
>> getNullCountBenchmark).
>>
>> BenchmarkMode  Cnt   Score   Error
>>  Units
>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt5   3.821 ± 0.031
>>  ns/op
>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt5  14.884 ± 0.141
>>  ns/op
>>
>> I applied the same pattern to other loops that I could find, and for any
>> "for (long" loop on the critical path, I broke it up into two loops.  the
>> first loop does iteration by integer, the second finishes off for any long
>> values.  As a side note it seems like optimization for loops using long
>> counters at least have a semi-recent open bug for the JVM [2]
>>
>> Thanks,
>> Micah
>>
>> [1]
>>
>> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>>
>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield 
>> wrote:
>>
>> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
>> > difference.  I didn't initially run the benchmarks with these turned on
>> > (you can see the result from above with Float8Benchmarks).  Here are new
>> > numbers including with the flags enabled.  It looks like using longs
>> might
>> > be a little bit slower, I'll see what I can do to mitigate this.
>> >
>> > Ravindra also volunteered to try to benchmark the changes with Dremio's
>> > code on today's sync call.
>> >
>> > New
>> >
>> > BenchmarkMode  Cnt   Score
>>  Error
>> > Units
>> >
>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt5   4.176 ±
>> 1.292
>> > ns/op
>> >
>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt5  26.102 ±
>> 0.700
>> > ns/op
>> >
>> > Float8Benchmarks.copyFromBenchmark   avgt5  7.398 ± 0.084  us/op
>> >
>> > Float8Benchmarks.readWriteBenchmark  avgt5  2.711 ± 0.057  us/op
>> >
>> >
>> >
>> > old
>> >
>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt5   3.828 ±
>> 0.030
>> > ns/op
>> >
>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt5  20.611 ±
>> 0.188
>> > ns/op
>> >
>> > Float8Benchmarks.copyFromBenchmark   avgt5  6.597 ± 0.462  us/op
>> >
>> > Float8Benchmarks.readWriteBenchmark  avgt5  2.615 ± 0.027  us/op
>> >
>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya  wrote:
>> >
>> >> Hi Gonzalo,
>> >>
>> >> Thanks for sharing the performance results.
>> >> I am wondering if you have turned off the flag
>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>> >> If not, the lower throughput should be expected.
>> >>
>> >> Best,
>> >> Liya Fan
>> >>
>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield > >
>> >> wrote:
>> >>
>> >>> Hi Gonzalo,
>> >>> Thank you for the feedback.  I wasn't aware of the JIT implications.
>>  At
>> >>> least on the benchmark run they don't seem to have an impact.
>> >>>
>> >>> If there are other benchmarks that people have that can validate if
>> this
>> >>> change will be 

[jira] [Created] (ARROW-6196) [Ruby] Add support for building Arrow::TimeNNArray by .new

2019-08-10 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-6196:
---

 Summary: [Ruby] Add support for building Arrow::TimeNNArray by .new
 Key: ARROW-6196
 URL: https://issues.apache.org/jira/browse/ARROW-6196
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Ruby
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-10 Thread Jacques Nadeau
This is a pretty massive change to the apis. I wonder how nasty it would be
to just support both paths. Have you evaluated how complex that would be?

On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield 
wrote:

> After more investigation, it looks like Float8Benchmarks at least on my
> machine are within the range of noise.
>
> For BitVectorHelper I pushed a new commit [1], seems to bring the
> BitVectorHelper benchmarks back inline (and even with some improvement for
> getNullCountBenchmark).
>
> BenchmarkMode  Cnt   Score   Error
>  Units
> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt5   3.821 ± 0.031
>  ns/op
> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt5  14.884 ± 0.141
>  ns/op
>
> I applied the same pattern to other loops that I could find, and for any
> "for (long" loop on the critical path, I broke it up into two loops.  the
> first loop does iteration by integer, the second finishes off for any long
> values.  As a side note it seems like optimization for loops using long
> counters at least have a semi-recent open bug for the JVM [2]
>
> Thanks,
> Micah
>
> [1]
>
> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
>
> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield 
> wrote:
>
> > Indeed, the BoundChecking and CheckNullForGet variables can make a big
> > difference.  I didn't initially run the benchmarks with these turned on
> > (you can see the result from above with Float8Benchmarks).  Here are new
> > numbers including with the flags enabled.  It looks like using longs
> might
> > be a little bit slower, I'll see what I can do to mitigate this.
> >
> > Ravindra also volunteered to try to benchmark the changes with Dremio's
> > code on today's sync call.
> >
> > New
> >
> > BenchmarkMode  Cnt   Score
>  Error
> > Units
> >
> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt5   4.176 ±
> 1.292
> > ns/op
> >
> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt5  26.102 ±
> 0.700
> > ns/op
> >
> > Float8Benchmarks.copyFromBenchmark   avgt5  7.398 ± 0.084  us/op
> >
> > Float8Benchmarks.readWriteBenchmark  avgt5  2.711 ± 0.057  us/op
> >
> >
> >
> > old
> >
> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt5   3.828 ±
> 0.030
> > ns/op
> >
> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt5  20.611 ±
> 0.188
> > ns/op
> >
> > Float8Benchmarks.copyFromBenchmark   avgt5  6.597 ± 0.462  us/op
> >
> > Float8Benchmarks.readWriteBenchmark  avgt5  2.615 ± 0.027  us/op
> >
> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya  wrote:
> >
> >> Hi Gonzalo,
> >>
> >> Thanks for sharing the performance results.
> >> I am wondering if you have turned off the flag
> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> >> If not, the lower throughput should be expected.
> >>
> >> Best,
> >> Liya Fan
> >>
> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield 
> >> wrote:
> >>
> >>> Hi Gonzalo,
> >>> Thank you for the feedback.  I wasn't aware of the JIT implications.
>  At
> >>> least on the benchmark run they don't seem to have an impact.
> >>>
> >>> If there are other benchmarks that people have that can validate if
> this
> >>> change will be problematic I would appreciate trying to run them with
> the
> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if
> >>> there
> >>> is a change in those.
> >>>
> >>> -Micah
> >>>
> >>>
> >>>
> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> >>> golthir...@gmail.com> wrote:
> >>>
> >>> > I would recommend to take care with this kind of changes.
> >>> >
> >>> > I didn't try Arrow in more than one year, but by then the performance
> >>> was
> >>> > quite bad in comparison with plain byte buffer access
> >>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
> >>> > there are several optimizations that the JVM (specifically, C2) does
> >>> not
> >>> > apply when dealing with int instead of longs. One of the
> >>> > most commons is the loop unrolling and vectorization.
> >>> >
> >>> > * It doesn't seem the best way to reference an old email on the list,
> >>> but
> >>> > it is the only result shown by Google
> >>> >
> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya ()
> >>> > escribió:
> >>> >
> >>> >> Hi Micah,
> >>> >>
> >>> >> Thanks for your effort. The performance result looks good.
> >>> >>
> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes
> for
> >>> each
> >>> >> of length, write index, and read index).
> >>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
> >>> >> BaseVariableWidthVector, etc.
> >>> >>
> >>> >> IMO, such overheads are small enough to justify the change.
> >>> >> Let's check if there are other overheads.
> >>> >>
> >>> >> Best,
> >>> >> Liya Fan
> >>> >>
> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> 

Re: [Format] Semantics for dictionary batches in streams

2019-08-10 Thread Jacques Nadeau
What situation are anticipating where you're going to be restating ids mid
stream?

On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield 
wrote:

> The IPC specification [1] defines behavior when isDelta on a
> DictionaryBatch [2] is "true".  I might have missed it in the
> specification, but I couldn't find the interpretation for what the expected
> behavior is when isDelta=false and two dictionary batches with the
> same ID are sent.
>
> It seems like there are two options:
> 1.  Interpret the new dictionary batch as replacing the old one.
> 2.  Regard this as an error condition.
>
> Based on the fact that in the "file format" dictionaries are allowed to be
> placed in any order relative to the record batches, I assume it is the
> second, but just wanted to make sure.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/ipc.html
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>


[jira] [Created] (ARROW-6195) [C++] CMake fails with file not found error while bundling thrift if python is not installed

2019-08-10 Thread Omer Ozarslan (JIRA)
Omer Ozarslan created ARROW-6195:


 Summary: [C++] CMake fails with file not found error while 
bundling thrift if python is not installed
 Key: ARROW-6195
 URL: https://issues.apache.org/jira/browse/ARROW-6195
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Omer Ozarslan


I had this error message while I was trying to reproduce another issue in 
docker.

To reproduce:

```
FROM debian:buster
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y git build-essential cmake

WORKDIR /app
RUN git clone https://github.com/apache/arrow.git
RUN git -C arrow checkout 167cea0  # HEAD as of 10-Aug-19
WORKDIR /app/arrow/cpp/build
RUN cmake -DARROW_PARQUET=ON -DARROW_DEPENDENCY_SOURCE=BUNDLED ..
RUN cmake --build . --target thrift_ep -j 8
```

Relevant part of output:
```
Scanning dependencies of target thrift_ep
[ 66%] Creating directories for 'thrift_ep'
[ 66%] Performing download step (verify and extract) for 'thrift_ep'
CMake Error at thrift_ep-stamp/verify-thrift_ep.cmake:11 (message):
 File not found: /thrift/0.12.0/thrift-0.12.0.tar.gz
make[3]: *** [CMakeFiles/thrift_ep.dir/build.make:90: 
thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-download] Error 1 
make[2]: *** [CMakeFiles/Makefile2:916: CMakeFiles/thrift_ep.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:928: CMakeFiles/thrift_ep.dir/rule] Error 2
make: *** [Makefile:487: thrift_ep] Error 2
```

Installing python fixes the problem, but this isn't directly clear from the 
error message. The source of the issue is that execute_process in the 
get_apache_mirrors macro silently fails and returns an empty APACHE_MIRROR 
value, since PYTHON_EXECUTABLE was empty.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6194) [Java] Make DictionaryEncoder non-static making it easy to extend and reuse

2019-08-10 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6194:
-

 Summary: [Java] Make DictionaryEncoder non-static making it easy 
to extend and reuse
 Key: ARROW-6194
 URL: https://issues.apache.org/jira/browse/ARROW-6194
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As discussed in [https://github.com/apache/arrow/pull/4994].

The current static DictionaryEncoder has some limitations for extension and reuse.

Slightly change the APIs and migrate the static methods to an object-based approach.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Format] Semantics for dictionary batches in streams

2019-08-10 Thread Micah Kornfield
I should add that Option #1 above would be my preference, even though it
adds some complications (especially for the file format).

On Sat, Aug 10, 2019 at 12:12 AM Micah Kornfield 
wrote:

> The IPC specification [1] defines behavior when isDelta on a
> DictionaryBatch [2] is "true".  I might have missed it in the
> specification, but I couldn't find the interpretation for what the expected
> behavior is when isDelta=false and two dictionary batches with the
> same ID are sent.
>
> It seems like there are two options:
> 1.  Interpret the new dictionary batch as replacing the old one.
> 2.  Regard this as an error condition.
>
> Based on the fact that in the "file format" dictionaries are allowed to be
> placed in any order relative to the record batches, I assume it is the
> second, but just wanted to make sure.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/ipc.html
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>


[Format] Semantics for dictionary batches in streams

2019-08-10 Thread Micah Kornfield
The IPC specification [1] defines behavior when isDelta on a
DictionaryBatch [2] is "true".  I might have missed it in the
specification, but I couldn't find the interpretation for what the expected
behavior is when isDelta=false and two dictionary batches with the
same ID are sent.

It seems like there are two options:
1.  Interpret the new dictionary batch as replacing the old one.
2.  Regard this as an error condition.

Based on the fact that in the "file format" dictionaries are allowed to be
placed in any order relative to the record batches, I assume it is the
second, but just wanted to make sure.

Thanks,
Micah

[1] https://arrow.apache.org/docs/ipc.html
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
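For illustration, the two options differ only in how a stream reader handles a non-delta batch for an id it has already seen. A minimal stream-side sketch (plain Java collections standing in for real dictionary vectors; this is not Arrow API, just an assumption-laden model of the semantics) might look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryTracker {
    private final Map<Long, List<String>> dictionaries = new HashMap<>();

    // Option 1 semantics: isDelta=true appends to the existing dictionary,
    // while isDelta=false for a known id replaces it wholesale.
    // Option 2 would instead throw on a repeated non-delta id.
    public void onDictionaryBatch(long id, List<String> values, boolean isDelta) {
        if (isDelta) {
            dictionaries.computeIfAbsent(id, k -> new ArrayList<>()).addAll(values);
        } else {
            dictionaries.put(id, new ArrayList<>(values));
        }
    }

    public List<String> get(long id) {
        return dictionaries.get(id);
    }
}
```

Under option 1, a record batch must be decoded against whatever dictionary state precedes it in the stream, which is what complicates the file format's position-independent dictionary placement.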