[jira] [Created] (ARROW-6207) [Java] Incorporate jmh benchmarks into archery

2019-08-11 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6207:
--

 Summary: [Java] Incorporate jmh benchmarks into archery
 Key: ARROW-6207
 URL: https://issues.apache.org/jira/browse/ARROW-6207
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Benchmarking, Java
Reporter: Micah Kornfield


We should be able to detect performance regressions using archery for
Java-related benchmarks.
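
For context, the Java benchmarks are JMH classes roughly like the sketch below (a minimal illustrative example, not one of the actual Arrow benchmarks); archery would need to run such classes and compare their scores across revisions:

{code:java}
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class IntSumBenchmark {
  private int[] values;

  @Setup
  public void setup() {
    values = new int[1024];
    for (int i = 0; i < values.length; i++) {
      values[i] = i;
    }
  }

  @Benchmark
  public long sum() {
    // JMH measures the throughput/latency of this method.
    long total = 0;
    for (int v : values) {
      total += v;
    }
    return total;
  }
}
{code}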



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [ANNOUNCE] New Arrow PMC member: Micah Kornfield

2019-08-11 Thread Micah Kornfield
Thanks everyone for the good wishes!

On Fri, Aug 9, 2019 at 5:41 PM Fan Liya  wrote:

> Big congratulations! Micah
> Thank you so much for all the help!
>
> Best,
> Liya Fan
>
> On Saturday, August 10, 2019, Brian Hulette  wrote:
> > Congratulations Micah! Well deserved :)
> >
> > On Fri, Aug 9, 2019 at 9:02 AM Francois Saint-Jacques <
> > fsaintjacq...@gmail.com> wrote:
> >
> >> Congrats!
> >>
> >> well deserved.
> >>
> >> On Fri, Aug 9, 2019 at 11:12 AM Wes McKinney 
> wrote:
> >> >
> >> > The Project Management Committee (PMC) for Apache Arrow has invited
> >> > Micah Kornfield to become a PMC member and we are pleased to announce
> >> > that Micah has accepted.
> >> >
> >> > Congratulations and welcome!
> >>
> >
>


[jira] [Created] (ARROW-6206) [Java][Docs] Document environment variables/java properties

2019-08-11 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6206:
--

 Summary: [Java][Docs] Document environment variables/java 
properties
 Key: ARROW-6206
 URL: https://issues.apache.org/jira/browse/ARROW-6206
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Java
Reporter: Micah Kornfield


Specifically, "-Dio.netty.tryReflectionSetAccessible=true" for JVMs >= 9, and 
the BoundsChecking/NullChecking flags for get().
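
For illustration, a small program that prints what a given JVM was launched with (the arrow.* property names below are the ones I believe the BoundsChecking/NullCheckingForGet classes read; confirm against the source when writing the docs):

{code:java}
public final class ArrowFlagCheck {
  public static void main(String[] args) {
    // Typical launch (assumed property names):
    //   java -Dio.netty.tryReflectionSetAccessible=true \
    //        -Darrow.enable_unsafe_memory_access=true \
    //        -Darrow.enable_null_check_for_get=false ...
    String[] props = {
        "io.netty.tryReflectionSetAccessible",
        "arrow.enable_unsafe_memory_access",
        "arrow.enable_null_check_for_get"
    };
    for (String p : props) {
      // null means the property was not set on the command line
      System.out.println(p + " = " + System.getProperty(p));
    }
  }
}
{code}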




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-11 Thread Micah Kornfield
Hi Wes and Jacques,
See responses below.

> With regards to the reference implementation point. It is a good point. I'm
> on vacation this week. Unless you're pushing hard on this, can we pick this
> up and discuss more next week?


Sure thing, enjoy your vacation.  I think the only practical implication
is that it delays choices around implementing LargeList, LargeBinary, and
LargeString in Java, which in turn might push out the 0.15.0 release.

> My stance on this is that I don't know how important it is for Java to
> support vectors over INT32_MAX elements. The use cases enabled by
> having very large arrays seem to be concentrated in the native code
> world (e.g. C/C++/Rust) -- that could just be implementation-centrism
> on my part, though.


A data point against this view is that Spark has done work to eliminate the
2GB memory limit on its block sizes [1].  I don't claim to understand the
implications of this. Bryan, might you have any thoughts here?  I'm OK with
INT32_MAX as well, but I think we should think about what this means for
adding Large types to Java and the implications for reference implementations.

Thanks,
Micah

[1] https://issues.apache.org/jira/browse/SPARK-6235

On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau  wrote:

> Hey Micah,
>
> Appreciate the offer on the compiling. The reality is I'm more concerned
> about the unknowns than the compiling issue itself. Any time you've been
> tuning for a while, changing something like this could be totally fine or
> cause a couple of major issues. For example, we've done a very large amount
> of work reducing the heap memory footprint of the vectors. Our target is to
> actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per vector
> (not including arrow bufs).
>
> With regards to the reference implementation point. It is a good point.
> I'm on vacation this week. Unless you're pushing hard on this, can we pick
> this up and discuss more next week?
>
> thanks,
> Jacques
>
> On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield 
> wrote:
>
>> Hi Jacques,
>> I definitely understand these concerns and this change is risky because it
>> is so large.  Perhaps creating a new hierarchy might be the cleanest way
>> of dealing with this.  This could have other benefits like cleaning up
>> some
>> cruft around dictionary encoding and "orphaned" methods.  Per past e-mail
>> threads I agree it is beneficial to have 2 separate reference
>> implementations that can communicate fully, and my intent here was to
>> close
>> that gap.
>>
>> Trying to
>> > determine the ramifications of these changes would be challenging and
>> time
>> > consuming against all the different ways we interact with the Arrow Java
>> > library.
>>
>>
>> Understood.  I took a quick look at Dremio-OSS; it seems like it has a
>> simple java build system?  If it is helpful, I can try to get a fork
>> running that at least compiles against this PR.  My plan would be to cast
>> any place that was changed to return a long back to an int, so in essence
>> the Dremio algorithms would remain 32-bit implementations.
>>
>> I don't  have the infrastructure to test this change properly from a
>> distributed systems perspective, so it would still take some time from
>> Dremio to validate for regressions.
>>
>> I'm not saying I'm against this but want to make sure we've
>> > explored all less disruptive options before considering changing
>> something
>> > this fundamental (especially when I generally hold the view that large
>> cell
>> > counts against massive contiguous memory is an anti pattern to scalable
>> > analytical processing--purely subjective of course).
>>
>>
>> I'm open to other ideas here, as well. I don't think it is out of the
>> question to leave the Java implementation as 32-bit, but if we do, then I
>> think we should consider a different strategy for reference
>> implementations.
>>
>> Thanks,
>> Micah
>>
>> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau 
>> wrote:
>>
>> > Hey Micah, I didn't have a particular path in mind. Was thinking more
>> along
>> > the lines of extra methods as opposed to separate classes.
>> >
>> > Arrow hasn't historically been a place where we're writing algorithms in
>> > Java so the fact that they aren't there doesn't mean they don't exist.
>> We
>> > have a large amount of code that depends on the current behavior that is
>> > deployed in hundreds of customer clusters (you can peruse our dremio
>> repo
>> > to see how extensively we leverage Arrow if interested). Trying to
>> > determine the ramifications of these changes would be challenging and
>> time
>> > consuming against all the different ways we interact with the Arrow Java
>> > library. I'm not saying I'm against this but want to make sure we've
>> > explored all less disruptive options before considering changing
>> something
>> > this fundamental (especially when I generally hold the view that large
>> cell
>> > counts against massive contiguous memory is an anti pattern to scalable
>> > analytical 

Re: [Format] Semantics for dictionary batches in streams

2019-08-11 Thread Micah Kornfield
I'm not sure what you mean by record-in-dictionary-id, so it is possible
this is a solution that I just don't understand :)

The only two references to dictionary IDs that I could find are one in
Schema.fbs [1], which is attached to a column in a schema, and the one
referenced above in the DictionaryBatch defined in Message.fbs [2] for
transmitting dictionaries.  It is quite possible I missed something.

The indices into the dictionary are Int arrays in a normal record batch.
I suppose the other option is to reset the stream by sending a new schema,
but I don't think that is supported either. This is what led to my
original question.

Does no one do this today?

I think Wes did some recent work on reading dictionaries in the C++ Parquet
implementation, and he might have faced some of these issues; I'm not sure how
he dealt with it (I haven't gotten back to the Parquet code yet).

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
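
To make the ambiguity concrete, here is a toy sketch (plain Java maps, not
the Arrow API) of what the two options mean for a reader's dictionary table:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryReplacementSketch {
  public static void main(String[] args) {
    // A reader keeps one dictionary per id, filled from DictionaryBatches.
    Map<Long, List<String>> dictionaries = new HashMap<>();

    // DictionaryBatch { id = 0, isDelta = false } carrying ["a", "b"]:
    dictionaries.put(0L, List.of("a", "b"));
    // a record batch column with indices [0, 1, 0] decodes to a, b, a
    System.out.println(dictionaries.get(0L).get(1)); // b

    // A second DictionaryBatch with the same id and isDelta = false:
    // option 1 (replacement) does this; option 2 raises an error instead.
    dictionaries.put(0L, List.of("x", "y"));
    System.out.println(dictionaries.get(0L).get(1)); // y
  }
}
```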

On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau  wrote:

> Wow, you've shown how little I've thought about Arrow dictionaries for a
> while. I thought we had a dictionary id and a record-in-dictionary-id.
> Wouldn't that approach make more sense? Does no one do this today? (We
> frequently use compound values for this type of scenario...)
>
> On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield 
> wrote:
>
>> Reading data from two different parquet files sequentially with different
>> dictionaries for the same column.  This could be handled by re-encoding
>> data but that seems potentially sub-optimal.
>>
>> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau 
>> wrote:
>>
>>> What situation are you anticipating where you're going to be restating ids
>>> mid stream?
>>>
>>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield 
>>> wrote:
>>>
 The IPC specification [1] defines behavior when isDelta on a
 DictionaryBatch [2] is "true".  I might have missed it in the
 specification, but I couldn't find the interpretation for what the
 expected
 behavior is when isDelta=false and two dictionary batches with the
 same ID are sent.

 It seems like there are two options:
 1.  Interpret the new dictionary batch as replacing the old one.
 2.  Regard this as an error condition.

 Based on the fact that in the "file format" dictionaries are allowed to
 be
 placed in any order relative to the record batches, I assume it is the
 second, but just wanted to make sure.

 Thanks,
 Micah

 [1] https://arrow.apache.org/docs/ipc.html
 [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72

>>>


Re: [Format] Semantics for dictionary batches in streams

2019-08-11 Thread Jacques Nadeau
Wow, you've shown how little I've thought about Arrow dictionaries for a
while. I thought we had a dictionary id and a record-in-dictionary-id.
Wouldn't that approach make more sense? Does no one do this today? (We
frequently use compound values for this type of scenario...)

On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield 
wrote:

> Reading data from two different parquet files sequentially with different
> dictionaries for the same column.  This could be handled by re-encoding
> data but that seems potentially sub-optimal.
>
> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau 
> wrote:
>
>> What situation are you anticipating where you're going to be restating ids
>> mid stream?
>>
>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield 
>> wrote:
>>
>>> The IPC specification [1] defines behavior when isDelta on a
>>> DictionaryBatch [2] is "true".  I might have missed it in the
>>> specification, but I couldn't find the interpretation for what the
>>> expected
>>> behavior is when isDelta=false and two dictionary batches with the
>>> same ID are sent.
>>>
>>> It seems like there are two options:
>>> 1.  Interpret the new dictionary batch as replacing the old one.
>>> 2.  Regard this as an error condition.
>>>
>>> Based on the fact that in the "file format" dictionaries are allowed to
>>> be
>>> placed in any order relative to the record batches, I assume it is the
>>> second, but just wanted to make sure.
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://arrow.apache.org/docs/ipc.html
>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>>>
>>


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-11 Thread Jacques Nadeau
Hey Micah,

Appreciate the offer on the compiling. The reality is I'm more concerned
about the unknowns than the compiling issue itself. Any time you've been
tuning for a while, changing something like this could be totally fine or
cause a couple of major issues. For example, we've done a very large amount
of work reducing the heap memory footprint of the vectors. Our target is to
actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per vector
(not including arrow bufs).

With regards to the reference implementation point. It is a good point. I'm
on vacation this week. Unless you're pushing hard on this, can we pick this
up and discuss more next week?

thanks,
Jacques

On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield 
wrote:

> Hi Jacques,
> I definitely understand these concerns and this change is risky because it
> is so large.  Perhaps creating a new hierarchy might be the cleanest way
> of dealing with this.  This could have other benefits like cleaning up some
> cruft around dictionary encoding and "orphaned" methods.  Per past e-mail
> threads I agree it is beneficial to have 2 separate reference
> implementations that can communicate fully, and my intent here was to close
> that gap.
>
> Trying to
> > determine the ramifications of these changes would be challenging and
> time
> > consuming against all the different ways we interact with the Arrow Java
> > library.
>
>
> Understood.  I took a quick look at Dremio-OSS; it seems like it has a
> simple java build system?  If it is helpful, I can try to get a fork
> running that at least compiles against this PR.  My plan would be to cast
> any place that was changed to return a long back to an int, so in essence
> the Dremio algorithms would remain 32-bit implementations.
>
> I don't  have the infrastructure to test this change properly from a
> distributed systems perspective, so it would still take some time from
> Dremio to validate for regressions.
>
> I'm not saying I'm against this but want to make sure we've
> > explored all less disruptive options before considering changing
> something
> > this fundamental (especially when I generally hold the view that large
> cell
> > counts against massive contiguous memory is an anti pattern to scalable
> > analytical processing--purely subjective of course).
>
>
> I'm open to other ideas here, as well. I don't think it is out of the
> question to leave the Java implementation as 32-bit, but if we do, then I
> think we should consider a different strategy for reference
> implementations.
>
> Thanks,
> Micah
>
> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau  wrote:
>
> > Hey Micah, I didn't have a particular path in mind. Was thinking more
> along
> > the lines of extra methods as opposed to separate classes.
> >
> > Arrow hasn't historically been a place where we're writing algorithms in
> > Java so the fact that they aren't there doesn't mean they don't exist. We
> > have a large amount of code that depends on the current behavior that is
> > deployed in hundreds of customer clusters (you can peruse our dremio repo
> > to see how extensively we leverage Arrow if interested). Trying to
> > determine the ramifications of these changes would be challenging and
> time
> > consuming against all the different ways we interact with the Arrow Java
> > library. I'm not saying I'm against this but want to make sure we've
> > explored all less disruptive options before considering changing
> something
> > this fundamental (especially when I generally hold the view that large
> cell
> > counts against massive contiguous memory is an anti pattern to scalable
> > analytical processing--purely subjective of course).
> >
> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield 
> > wrote:
> >
> > > Hi Jacques,
> > > What avenue were you thinking for supporting both paths?   I didn't
> want
> > > to pursue a different class hierarchy, because I felt like that would
> > > effectively fork the code base, but that is potentially an option that
> > > would allow us to have a complete reference implementation in Java that
> > can
> > > fully interact with C++, without major changes to this code.
> > >
> > > For supporting both APIs on the same classes/interfaces, I think they
> > > roughly fall into three categories, changes to input parameters,
> changes
> > to
> > > output parameters and algorithm changes.
> > >
> > > For inputs, changing from int to long is essentially a no-op from the
> > > compiler perspective.  From the limited micro-benchmarking this also
> > > doesn't seem to have a performance impact.  So we could keep two
> versions
> > > of the methods that only differ on inputs, but it is not clear what the
> > > value of that would be.
> > >
> > > For outputs, we can't support methods "long getLength()" and "int
> > > getLength()" in the same class, so we would be forced into something
> like
> > > "long getLength(boolean dummy)" which I think is a less desirable.
> > >
> > > For algorithm changes, there did not 

[jira] [Created] (ARROW-6205) ARROW_DEPRECATED warning when including io/interfaces.h from CUDA (.cu) source

2019-08-11 Thread Mark Harris (JIRA)
Mark Harris created ARROW-6205:
--

 Summary: ARROW_DEPRECATED warning when including io/interfaces.h 
from CUDA (.cu) source
 Key: ARROW-6205
 URL: https://issues.apache.org/jira/browse/ARROW-6205
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.1
Reporter: Mark Harris


When including arrow/io/interfaces.h from a .cu source file (compiled with 
nvcc, the CUDA compiler), we get warnings like the following:

{code}
[6/58] Building CUDA obje.../csv/csv_reader_impl.cu.
arrow/install/include/arrow/io/interfaces.h(195): warning: attribute does not apply to any entity
{code}

This example is from compiling [libcudf|https://github.com/rapidsai/cudf]. 
libcudf installs these headers and includes them. This is a problem for 
libraries like libcudf that treat warnings as errors.

There is a simple fix (I will submit a PR): change this code in interfaces.h:

{code:cpp}
// TODO(kszucs): remove this after 0.13
#ifndef _MSC_VER
using WriteableFile ARROW_DEPRECATED("Use WritableFile") = WritableFile;
using ReadableFileInterface ARROW_DEPRECATED("Use RandomAccessFile") =
    RandomAccessFile;
#else
// MSVC does not like using ARROW_DEPRECATED with using declarations
using WriteableFile = WritableFile;
using ReadableFileInterface = RandomAccessFile;
#endif
{code}

Just change the ifndef to:

{code:cpp}
#if not defined(_MSC_VER) && not defined(__CUDACC__)
{code}

This fix should have no impact on other compilers.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6204) [GLib] Add garrow_array_is_in_chunked_array()

2019-08-11 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-6204:
---

 Summary: [GLib] Add garrow_array_is_in_chunked_array()
 Key: ARROW-6204
 URL: https://issues.apache.org/jira/browse/ARROW-6204
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro


This is a follow-up of 
[https://github.com/apache/arrow/pull/5047#issuecomment-520103706].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6203) [GLib] Add garrow_array_argsort()

2019-08-11 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-6203:
---

 Summary: [GLib] Add garrow_array_argsort()
 Key: ARROW-6203
 URL: https://issues.apache.org/jira/browse/ARROW-6203
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
 Fix For: 0.15.0






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6202) Exception in thread "main" org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of size 4 due to memory limit. Current allocation: 2147483646

2019-08-11 Thread Jim Northrup (JIRA)
Jim Northrup created ARROW-6202:
---

 Summary: Exception in thread "main" 
org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of size 
4 due to memory limit. Current allocation: 2147483646
 Key: ARROW-6202
 URL: https://issues.apache.org/jira/browse/ARROW-6202
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.1
Reporter: Jim Northrup



JDBC query results exceed the native heap even when using generous -Xmx settings.

For a roughly 800 megabyte CSV/flat-file result set, Arrow is unable to hold 
the contents in RAM long enough to persist them to disk, at least without 
explicit knowledge beyond the unit-test sample code.

source:
https://github.com/jnorthrup/jdbc2json/blob/master/src/main/java/com/fnreport/QueryToFeather.kt#L83


{code:java}
Exception in thread "main" org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of size 4 due to memory limit. Current allocation: 2147483646
at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:307)
at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:277)
at org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.updateVector(JdbcToArrowUtils.java:610)
at org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToFieldVector(JdbcToArrowUtils.java:462)
at org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToArrowVectors(JdbcToArrowUtils.java:396)
at org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:225)
at org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:187)
at org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:156)
at com.fnreport.QueryToFeather$Companion.go(QueryToFeather.kt:83)
at com.fnreport.QueryToFeather$Companion$main$1.invokeSuspend(QueryToFeather.kt:95)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:241)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:270)
at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:79)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:54)
at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:36)
at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
at com.fnreport.QueryToFeather$Companion.main(QueryToFeather.kt:93)
at com.fnreport.QueryToFeather.main(QueryToFeather.kt)
{code}
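
One observation (a hypothesis, not a confirmed diagnosis): 2147483646 is Integer.MAX_VALUE - 1, which suggests the allocator was created with a ~2 GB limit; Arrow buffers are allocated off-heap against the allocator limit, so raising -Xmx alone does not help. A sketch with a larger allocator limit, assuming the standard RootAllocator(long) constructor:

{code:java}
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class AllocatorLimitSketch {
  public static void main(String[] args) throws Exception {
    // An 8 GB allocator limit instead of the ~2 GB the stack trace ran into.
    try (BufferAllocator allocator = new RootAllocator(8L * 1024 * 1024 * 1024)) {
      // Pass this allocator to JdbcToArrow.sqlToArrow(...) in place of one
      // capped at Integer.MAX_VALUE.
      System.out.println("limit = " + allocator.getLimit());
    }
  }
}
{code}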





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6201) [Python] Add pyarrow.read_schema to API documentation, add prose documentation for schema serialization workflow

2019-08-11 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6201:
---

 Summary: [Python] Add pyarrow.read_schema to API documentation, 
add prose documentation for schema serialization workflow
 Key: ARROW-6201
 URL: https://issues.apache.org/jira/browse/ARROW-6201
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


Per discussion on user@ mailing list



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Proposal to move website source to arrow-site, add automatic builds

2019-08-11 Thread Wes McKinney
It looks like the git pruning is done. So we can remove the site/
directory from the main repository at some point soon.

On Thu, Aug 8, 2019 at 2:29 PM Neal Richardson
 wrote:
>
> I need a committer to make a master branch on arrow-site so that I can
> PR to it. I thought it could be just an empty orphan branch but that
> proved not to work, so a committer will need to do the following:
>
> ```
> git clone git@github.com:$YOURGITHUB/arrow.git arrow-copy
> cd arrow-copy
> git filter-branch --prune-empty --subdirectory-filter site master
> vi .git/config
> # Change remote "origin"'s URL to be git@github.com:arrow/arrow-site.git
> git push -f origin master
> ```
>
> On Thu, Aug 8, 2019 at 12:07 PM Wes McKinney  wrote:
> >
> > Yes, I think we have adequate lazy consensus. Can you spell out what
> > are the next steps?
> >
> > On Thu, Aug 8, 2019 at 2:01 PM Neal Richardson
> >  wrote:
> > >
> > > Have we reached "lazy consensus" here? No further comments in the last
> > > three days.
> > >
> > > Thanks,
> > > Neal
> > >
> > > On Mon, Aug 5, 2019 at 1:46 PM Joris Van den Bossche
> > >  wrote:
> > > >
> > > > This sounds like a good proposal to me (at least at the moment, where we 
> > > > have
> > > > separate docs and main site).
> > > > I agree that documentation should indeed stay with the code, as you 
> > > > want to
> > > > update those together in PRs. But the website is something you can
> > > > typically update separately and also might want to update independently
> > > > from code releases. And certainly if this proposal makes it easier to 
> > > > work
> > > > on the site, all the better.
> > > >
> > > > Joris
> > > >
> > > > On Mon, Aug 5, 2019 at 20:30, Wes McKinney  wrote:
> > > >
> > > > > Let's wait a little while to collect any additional opinions about 
> > > > > this.
> > > > >
> > > > > There's pretty good evidence from other Apache projects that this
> > > > > isn't too bad of an idea
> > > > >
> > > > > Apache Calcite: https://github.com/apache/calcite-site
> > > > > Apache Kafka: https://github.com/apache/kafka-site
> > > > > Apache Spark: https://github.com/apache/spark-website
> > > > >
> > > > > The Apache projects I've seen where the same repository is used for
> > > > > $FOO.apache.org tend to be ones where the documentation _is_ the
> > > > > website. I think we would need to commission a significant web design
> > > > > overhaul to be able to make our documentation page adequate as the
> > > > > landing point for visitors to https://arrow.apache.org.
> > > > >
> > > > > On Sat, Aug 3, 2019 at 3:46 PM Neal Richardson
> > > > >  wrote:
> > > > > >
> > > > > > Given the status quo, it would be difficult for this to make the 
> > > > > > Arrow
> > > > > > website less maintained. In fact, arrow-site is currently missing 
> > > > > > the
> > > > > > most recent two patches that modified the site directory in
> > > > > > apache/arrow. Having multiple manual deploy steps increases the
> > > > > > likelihood that the website stays stale.
> > > > > >
> > > > > > As someone who has been working on the arrow site lately, this
> > > > > > proposal makes it easier for me to make changes to the website 
> > > > > > because
> > > > > > I can automatically deploy my changes to a test site, and that lets
> > > > > > others in the community, who perhaps don't touch the website much,
> > > > > > verify that they're good.
> > > > > >
> > > > > > I agree that the documentation situation needs attention, but as I
> > > > > > said initially, that's orthogonal to this static site generation. 
> > > > > > I'd
> > > > > > like to work on that next, and I think these changes will make it
> > > > > > easier to do. I would not propose moving doc generation out of
> > > > > > apache/arrow--that belongs with the code.
> > > > > >
> > > > > > Neal
> > > > > >
> > > > > > On Sat, Aug 3, 2019 at 9:49 AM Wes McKinney  
> > > > > > wrote:
> > > > > > >
> > > > > > > I think that the project website and the project documentation are
> > > > > > > currently distinct entities. The current Jekyll website is 
> > > > > > > independent
> > > > > > > from the Sphinx documentation project aside from a link to the
> > > > > > > documentation from the website.
> > > > > > >
> > > > > > > I am guessing that we would want to maintain some amount of 
> > > > > > > separation
> > > > > > > between the main site at arrow.apache.org and the code / format
> > > > > > > documentation, at minimum because we may want to make 
> > > > > > > documentation
> > > > > > > available for multiple versions of the project (this has already 
> > > > > > > been
> > > > > > > cited as an issue -- when we release, we're overwriting the 
> > > > > > > previous
> > > > > > > version of the docs)
> > > > > > >
> > > > > > > On Sat, Aug 3, 2019 at 11:33 AM Antoine Pitrou 
> > > > > > > 
> > > > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > I am concerned with this.  What happens if we happen to move 
> > > > > > > > part of
> > > > > the
> > > > 

Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-11 Thread Wes McKinney
My stance on this is that I don't know how important it is for Java to
support vectors over INT32_MAX elements. The use cases enabled by
having very large arrays seem to be concentrated in the native code
world (e.g. C/C++/Rust) -- that could just be implementation-centrism
on my part, though. It's possible there are use cases where Java would
want to be able to produce very large memory regions to be exposed to
native code.

On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield  wrote:
>
> Hi Jacques,
> I definitely understand these concerns and this change is risky because it
> is so large.  Perhaps creating a new hierarchy might be the cleanest way
> of dealing with this.  This could have other benefits like cleaning up some
> cruft around dictionary encoding and "orphaned" methods.  Per past e-mail
> threads I agree it is beneficial to have 2 separate reference
> implementations that can communicate fully, and my intent here was to close
> that gap.
>
> Trying to
> > determine the ramifications of these changes would be challenging and time
> > consuming against all the different ways we interact with the Arrow Java
> > library.
>
>
> Understood.  I took a quick look at Dremio-OSS; it seems like it has a
> simple java build system?  If it is helpful, I can try to get a fork
> running that at least compiles against this PR.  My plan would be to cast
> any place that was changed to return a long back to an int, so in essence
> the Dremio algorithms would remain 32-bit implementations.
>
> I don't  have the infrastructure to test this change properly from a
> distributed systems perspective, so it would still take some time from
> Dremio to validate for regressions.
>
> I'm not saying I'm against this but want to make sure we've
> > explored all less disruptive options before considering changing something
> > this fundamental (especially when I generally hold the view that large cell
> > counts against massive contiguous memory is an anti pattern to scalable
> > analytical processing--purely subjective of course).
>
>
> I'm open to other ideas here, as well. I don't think it is out of the
> question to leave the Java implementation as 32-bit, but if we do, then I
> think we should consider a different strategy for reference implementations.
>
> Thanks,
> Micah
>
> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau  wrote:
>
> > Hey Micah, I didn't have a particular path in mind. Was thinking more along
> > the lines of extra methods as opposed to separate classes.
> >
> > Arrow hasn't historically been a place where we're writing algorithms in
> > Java so the fact that they aren't there doesn't mean they don't exist. We
> > have a large amount of code that depends on the current behavior that is
> > deployed in hundreds of customer clusters (you can peruse our dremio repo
> > to see how extensively we leverage Arrow if interested). Trying to
> > determine the ramifications of these changes would be challenging and time
> > consuming against all the different ways we interact with the Arrow Java
> > library. I'm not saying I'm against this but want to make sure we've
> > explored all less disruptive options before considering changing something
> > this fundamental (especially when I generally hold the view that large cell
> > counts against massive contiguous memory is an anti pattern to scalable
> > analytical processing--purely subjective of course).
> >
> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield 
> > wrote:
> >
> > > Hi Jacques,
> > > What avenue were you thinking for supporting both paths?   I didn't want
> > > to pursue a different class hierarchy, because I felt like that would
> > > effectively fork the code base, but that is potentially an option that
> > > would allow us to have a complete reference implementation in Java that
> > can
> > > fully interact with C++, without major changes to this code.
> > >
> > > For supporting both APIs on the same classes/interfaces, I think they
> > > roughly fall into three categories, changes to input parameters, changes
> > to
> > > output parameters and algorithm changes.
> > >
> > > For inputs, changing from int to long is essentially a no-op from the
> > > compiler perspective.  From the limited micro-benchmarking this also
> > > doesn't seem to have a performance impact.  So we could keep two versions
> > > of the methods that only differ on inputs, but it is not clear what the
> > > value of that would be.
> > >
> > > For outputs, we can't support methods "long getLength()" and "int
> > > getLength()" in the same class, so we would be forced into something like
> > > "long getLength(boolean dummy)" which I think is a less desirable.
> > >
> > > For algorithm changes, there did not appear to be too many places where
> > we
> > > actually loop over all elements (it is quite possible I missed something
> > > here); the ones that I did find, I was able to mitigate performance
> > > penalties as noted above.  Some of the current implementation will get a
> > > 
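
The output-parameter constraint above can be made concrete with two
illustrative interfaces (hypothetical names, not the actual classes):

```java
interface Vector32 {
  int getLength();
}

interface Vector64 {
  long getLength();
}

// class Both implements Vector32, Vector64 {}
// -> does not compile: the two getLength() methods differ only in return
//    type, and int/long are not covariant, so no single method satisfies
//    both. Java never overloads on return type alone, hence workarounds
//    like "long getLength(boolean dummy)".
```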

Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-11 Thread Ji Liu
Thanks Jacques; to avoid complex call paths for getObject, we should keep 
getObject in both classes. I'll also check the other methods.

Thanks,
Ji Liu


--
From:Jacques Nadeau 
Send Time: 2019-08-11 (Sunday) 21:43
To:dev ; Ji Liu 
Cc:emkornfield 
Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

We tried to get away from this kind of back and forth with subclassing as
much as possible. (call getObject on base class which then calls getIndex
on child class which then calls something else on base class). I haven't
looked through the code but let's try to avoid having complex call paths
for the vectors.

On Sat, Aug 10, 2019 at 6:07 PM Ji Liu  wrote:

> Hi Micah, thanks for your suggestion.
> You are right, the main difference between FixedSizeListVector and
> ListVector is the offsetBuffer, but I think this could be avoided by
> overriding allocateNewSafe(), which calls allocateOffsetBuffer() in
> BaseRepeatedValueVector; this way, the offsetBuffer in FixedSizeListVector
> will remain allocator.getEmpty().
>
> Meanwhile, we could add a getStartIndex(int index)/getEndIndex(int index)
> API to handle the read logic, which could be used in getObject(int index)
> or in the encoding parts. What's more, no new interface needs to be
> introduced.
>
> What do you think?
>
>
> Thanks,
> Ji Liu
>
>
> --
> From:Micah Kornfield 
> Send Time: 2019-08-11 (Sunday) 08:47
> To:dev ; Ji Liu 
> Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from
> ListVector
>
> Hi Ji Liu,
> I think having a common interface/base-class for the two makes sense from a
> reading-data perspective (though I don't have the historical context).
>
> I think the change would need to be something above
> BaseRepeatedValueVector, since the FixedSizeListVector doesn't contain an
> offset buffer, and that field is contained on BaseRepeatedValueVector.
>
> Thanks,
> Micah
> On Sat, Aug 10, 2019 at 5:25 PM Ji Liu  wrote:
> Hi, all
>
>  While working on the issue to implement dictionary-encoded subfields [1]
> [2], I found that FixedSizeListVector does not extend ListVector (thanks to
> Micah for pointing this out; I'm curious why FixedSizeListVector was
> implemented this way before). Since FixedSizeListVector is a specific case
> of ListVector, should we make the former extend the latter to reduce the
> plentiful duplicated logic in these two and their writer/reader classes?
>
>
>  Thanks,
>  Ji Liu
>
>  [1] https://issues.apache.org/jira/browse/ARROW-1175
>  [2] https://github.com/apache/arrow/pull/4972
>
>
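
For illustration, the getStartIndex/getEndIndex idea could look roughly like
this (a hypothetical sketch with plain arrays, not the actual vector classes):

```java
// A shared way to locate a list's element range, so reading logic such as
// getObject(index) can be written once for both vector types.
interface RepeatedAccessor {
  int getStartIndex(int index);
  int getEndIndex(int index);
}

// ListVector-style: the range comes from an offset buffer.
class OffsetBasedAccessor implements RepeatedAccessor {
  private final int[] offsets; // stand-in for the Arrow offset buffer

  OffsetBasedAccessor(int[] offsets) {
    this.offsets = offsets;
  }

  public int getStartIndex(int index) {
    return offsets[index];
  }

  public int getEndIndex(int index) {
    return offsets[index + 1];
  }
}

// FixedSizeListVector-style: no offset buffer, just the fixed list size.
class FixedSizeAccessor implements RepeatedAccessor {
  private final int listSize;

  FixedSizeAccessor(int listSize) {
    this.listSize = listSize;
  }

  public int getStartIndex(int index) {
    return index * listSize;
  }

  public int getEndIndex(int index) {
    return (index + 1) * listSize;
  }
}
```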


Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-11 Thread Jacques Nadeau
We tried to get away from this kind of back and forth with subclassing as
much as possible. (call getObject on base class which then calls getIndex
on child class which then calls something else on base class). I haven't
looked through the code but let's try to avoid having complex call paths
for the vectors.

On Sat, Aug 10, 2019 at 6:07 PM Ji Liu  wrote:

> Hi Micah, thanks for your suggestion.
> You are right, the main difference between FixedSizeListVector and
> ListVector is the offsetBuffer, but I think this could be avoided by
> overriding allocateNewSafe(), which calls allocateOffsetBuffer() in
> BaseRepeatedValueVector; this way, the offsetBuffer in FixedSizeListVector
> will remain allocator.getEmpty().
>
> Meanwhile, we could add a getStartIndex(int index)/getEndIndex(int index)
> API to handle the read logic, which could be used in getObject(int index)
> or in the encoding parts. What's more, no new interface needs to be
> introduced.
>
> What do you think?
>
>
> Thanks,
> Ji Liu
>
>
> --
> From:Micah Kornfield 
> Send Time: 2019-08-11 (Sunday) 08:47
> To:dev ; Ji Liu 
> Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from
> ListVector
>
> Hi Ji Liu,
> I think having a common interface/base-class for the two makes sense from a
> reading-data perspective (though I don't have the historical context).
>
> I think the change would need to be something above
> BaseRepeatedValueVector, since the FixedSizeListVector doesn't contain an
> offset buffer, and that field is contained on BaseRepeatedValueVector.
>
> Thanks,
> Micah
> On Sat, Aug 10, 2019 at 5:25 PM Ji Liu  wrote:
> Hi, all
>
>  While working on the issue to implement dictionary-encoded subfields [1]
> [2], I found that FixedSizeListVector does not extend ListVector (thanks to
> Micah for pointing this out; I'm curious why FixedSizeListVector was
> implemented this way before). Since FixedSizeListVector is a specific case
> of ListVector, should we make the former extend the latter to reduce the
> plentiful duplicated logic in these two and their writer/reader classes?
>
>
>  Thanks,
>  Ji Liu
>
>  [1] https://issues.apache.org/jira/browse/ARROW-1175
>  [2] https://github.com/apache/arrow/pull/4972
>
>


[jira] [Created] (ARROW-6200) [Java] Method getBufferSizeFor in BaseRepeatedValueVector/ListVector not correct

2019-08-11 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6200:
-

 Summary: [Java] Method getBufferSizeFor in 
BaseRepeatedValueVector/ListVector not correct
 Key: ARROW-6200
 URL: https://issues.apache.org/jira/browse/ARROW-6200
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, {{getBufferSizeFor}} in {{BaseRepeatedValueVector}} is implemented as 
below:
{code:java}
if (valueCount == 0) {
  return 0;
}
return ((valueCount + 1) * OFFSET_WIDTH) + vector.getBufferSizeFor(valueCount);
{code}
Here vector.getBufferSizeFor(valueCount) does not seem right; it should be:
{code:java}
int innerVectorValueCount = offsetBuffer.getInt(valueCount * OFFSET_WIDTH);
vector.getBufferSizeFor(innerVectorValueCount)
{code}
ListVector has the same problem.
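
For intuition, with plain arrays (illustrative, not the vector classes), the last offset entry, not the list count, gives the number of elements in the inner data vector:

{code:java}
public class OffsetExample {
  public static void main(String[] args) {
    // lists:   [[1, 2], [], [3, 4, 5]]  -> valueCount = 3
    // offsets: [0, 2, 2, 5]             -> offsets[valueCount] = 5
    int[] offsets = {0, 2, 2, 5};
    int valueCount = 3;
    int innerVectorValueCount = offsets[valueCount];
    // Buffer sizing for the inner data vector must use 5 elements, not 3.
    System.out.println(innerVectorValueCount); // 5
  }
}
{code}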



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6199) [Java] Avro adapter avoid potential resource leak.

2019-08-11 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6199:
-

 Summary: [Java] Avro adapter avoid potential resource leak.
 Key: ARROW-6199
 URL: https://issues.apache.org/jira/browse/ARROW-6199
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, the Avro consumer interface has no close API, which may cause resource 
leaks like {{AvroBytesConsumer#cacheBuffer}}.

To resolve this, make the consumers extend {{AutoCloseable}} and create 
{{CompositeAvroConsumer}} to encompass the consume and close logic.
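
A sketch of the proposed shape (hypothetical types, not the actual patch):

{code:java}
import java.util.List;

// Consumers become AutoCloseable so cached resources are always released.
interface AvroConsumer extends AutoCloseable {
  void consume() throws Exception;
}

// Encompasses both the consume and the close logic for a set of consumers.
class CompositeAvroConsumer implements AutoCloseable {
  private final List<AvroConsumer> consumers;

  CompositeAvroConsumer(List<AvroConsumer> consumers) {
    this.consumers = consumers;
  }

  void consumeAll() throws Exception {
    for (AvroConsumer consumer : consumers) {
      consumer.consume();
    }
  }

  @Override
  public void close() throws Exception {
    for (AvroConsumer consumer : consumers) {
      consumer.close(); // e.g. release AvroBytesConsumer#cacheBuffer
    }
  }
}
{code}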



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[Discussion][Java] Redesign the dictionary encoder

2019-08-11 Thread Fan Liya
Dear all,

Dictionary encoding is an important feature, so it should be implemented
with good performance.
The current Java dictionary encoder implementation is based on static
utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder,
which has heavy performance overhead, preventing it from being useful in
practice:

1. The hash table cannot be reused for encoding multiple vectors (other
data structures and results cannot be reused either).
2. The output vector should not be created/managed by the encoder (just
like in the out-of-place sorter).
3. Different scenarios require different algorithms to compute the hash
code to avoid conflicts in the hash table, but this is not supported.

Although some problems can be overcome by refactoring the current
implementation, it is difficult to do so without significantly changing the
current API.
So we propose a new design [1][2] for the dictionary encoder, to make it more
performant in practice.

We plan to implement the new dictionary encoders with stateful objects, so
many useful partial/intermediate results can be reused. The new encoders
support using different hash code algorithms in different scenarios to
achieve good performance.
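
As a toy sketch of the stateful idea (plain Java types, not the proposed
API), the encoder object owns the hash table, so encoding further vectors
reuses it, and the caller owns the output:

```java
import java.util.HashMap;
import java.util.Map;

final class StatefulEncoderSketch {
  // Reused across multiple encode calls / multiple vectors (point 1 above).
  private final Map<String, Integer> valueToIndex = new HashMap<>();

  // Returns the dictionary index for a value, assigning the next index on first use.
  int encode(String value) {
    return valueToIndex.computeIfAbsent(value, v -> valueToIndex.size());
  }

  int[] encodeAll(String[] values) {
    int[] indices = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      indices[i] = encode(values[i]);
    }
    return indices; // the caller creates/owns the output (point 2 above)
  }
}
```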

We plan to support the new encoders in the following steps:

1. implement the new dictionary encoders in the algorithm module [3][4]
2. make the old dictionary encoder deprecated
3. remove the old encoder implementations

Please give your valuable comments.

Best,
Liya Fan

[1] https://issues.apache.org/jira/browse/ARROW-5917
[2] https://issues.apache.org/jira/browse/ARROW-6184
[3] https://github.com/apache/arrow/pull/4994
[4] https://github.com/apache/arrow/pull/5058