Re: [ANNOUNCE] New Arrow PMC chair: Wes McKinney

2020-10-25 Thread Ji Liu
Congratulations, Wes!

David Li wrote on Sun, Oct 25, 2020 at 9:40 AM:

> Congratulations Wes!
>
> Best,
> David
>
> On 10/24/20, Li Jin  wrote:
> > Congrats Wes!
> >
> > On Sat, Oct 24, 2020 at 10:05 AM Ying Zhou  wrote:
> >
> >> Congratulations Wes! :)
> >>
> >> Ying
> >>
> >> > On Oct 23, 2020, at 7:35 PM, Jacques Nadeau wrote:
> >> >
> >> > I am pleased to announce that we have a new PMC chair and VP as per our
> >> > newly started tradition of rotating the chair once a year. I have
> >> > resigned and Wes was duly elected by the PMC and approved unanimously
> >> > by the board.
> >> >
> >> > Please join me in congratulating Wes!
> >> >
> >> > Jacques
> >>
> >>
> >
>


Re: [DISCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

2020-08-10 Thread Ji Liu
Hi Micah, I am afraid that is not a workable solution.
1. Currently, getFieldBuffers returns the buffers in the right order and is
used in IPC; getBuffers is not used in IPC.
2. The purpose of this PR is to use getBuffers in IPC instead. Changing
getFieldBuffers does not seem to help with this problem, since switching IPC
to getBuffers would break the IPC format.

Micah Kornfield wrote on Sat, Aug 8, 2020 at 11:50 AM:

> Thinking about this some more, I think maybe we should hold off on any
> changes to getFieldBuffers.  It should likely follow the same sort of
> pattern as getBuffers (create a new method that doesn't set reader/writer
> indices and deprecate getFieldBuffers).  But I think that can be handled
> in a separate PR?
>
> Anybody else have thoughts?
>
> -Micah
>


Re: [DISCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

2020-08-05 Thread Ji Liu
Hi Liya,
Thanks for your careful review. That was a typo: the order in getBuffers is
wrong.

Fan Liya wrote on Wed, Aug 5, 2020 at 2:14 PM:

> Hi Ji,
>
> IMO, for the correct order, the validity buffer should precede the offset
> buffer (e.g. this is the order used by BaseVariableWidthVector &
> BaseLargeVariableWidthVector).
> In ListVector#getBuffers, the offset buffer precedes the validity buffer,
> so I am a little confused why you say the order of ListVector#getBuffers is
> right?
>
> Best,
> Liya Fan
>
> On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield 
> wrote:
>
> > FWIW, I lack historical context on how these methods evolved, so I'd
> > appreciate insight from anyone who has worked on the Java codebase for a
> > longer period of time.  The current situation seems less than ideal.
> >
> >
>


[DISCUSS] How to extend the value range of the Timestamp type?

2020-08-04 Thread Ji Liu
Hi all,

The Arrow Timestamp type currently supports several TimeUnits (seconds,
milliseconds, microseconds, nanoseconds), all stored as int64. In most
cases this is enough, but if the timestamp range of an external system
exceeds int64_t::max, its values cannot be converted directly to an
Arrow Timestamp. Consider the following use case:

A timestamp in another system is stored as int64 + int32 (milliseconds and
nanoseconds) and can represent values from -00-00 to -12-31
23:59:59.9. How should we convert a type like this?
One option is probably to create an extension type with struct(int64, int32)
as storage.

Besides ExtensionType, are we considering extending Timestamp to a wider
range, or adding a new type for cases like the one above?
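
To make the range limit concrete, here is a minimal Java sketch that uses only
java.time. The class WideTimestamp and its two fields are hypothetical and only
mirror the struct(int64, int32) storage mentioned above; this is not an Arrow
API. It assumes the int32 part holds the nanoseconds within the millisecond.

```java
import java.time.Instant;

/** Hypothetical (int64 millis, int32 nanos-of-milli) timestamp; not an Arrow class. */
final class WideTimestamp {
    final long epochMillis;   // int64 part: milliseconds since the epoch
    final int nanosOfMilli;   // int32 part: 0..999_999 nanoseconds within that millisecond

    WideTimestamp(long epochMillis, int nanosOfMilli) {
        if (nanosOfMilli < 0 || nanosOfMilli > 999_999) {
            throw new IllegalArgumentException("nanosOfMilli out of range: " + nanosOfMilli);
        }
        this.epochMillis = epochMillis;
        this.nanosOfMilli = nanosOfMilli;
    }

    /** Converts to one int64 nanosecond count; throws ArithmeticException on overflow. */
    long toEpochNanosExact() {
        return Math.addExact(Math.multiplyExact(epochMillis, 1_000_000L), nanosOfMilli);
    }

    /** The largest instant that an int64 count of nanoseconds since the epoch can hold. */
    static Instant maxInt64NanosInstant() {
        return Instant.ofEpochSecond(0L, Long.MAX_VALUE);
    }
}
```

maxInt64NanosInstant() evaluates to 2262-04-11T23:47:16.854775807Z, so a value
in the year 3000, for example, makes toEpochNanosExact() throw rather than
silently wrap around.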


Thanks,
Ji Liu


[DISCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

2020-08-04 Thread Ji Liu
Hi all,


While working on ARROW-7539 [1], I ran into some problems and am not sure of
the proper way to solve them.


The issue is about no longer setting reader/writer indices in
FieldVector#getFieldBuffers, for the following reasons:

i. getBuffers sets the reader/writer indices, which is right for its purpose
of sending the data over the wire.

ii. getFieldBuffers is distinct from getBuffers; it should be for getting
access to the underlying data for higher-performance algorithms.


Currently, VectorUnloader uses getFieldBuffers to create an ArrowRecordBatch,
which is why getFieldBuffers still sets the writer/reader indices; we should
use getBuffers instead. But while making the change, we found another
problem:

The two methods do not return the validity and offset buffers in the same
order in ListVector (getBuffers's order is right), so changing the API used in
VectorUnloader creates problems with serialization/deserialization. This
results in test failures in Dremio and would break backward compatibility
with existing serialised files.
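
A minimal sketch of why the buffer order matters, assuming a reader that
assigns roles to buffers purely by position. All names here
(BufferOrderSketch, validityFirst, offsetFirst, readByPosition) are
hypothetical and stand in for the two accessors; none of them are Arrow APIs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of position-based buffer exchange; none of these names are Arrow APIs. */
final class BufferOrderSketch {
    // One byte of validity bits and simplified one-byte offsets for a 4-entry list.
    static final byte[] VALIDITY = {0b0000_1101};
    static final byte[] OFFSETS = {0, 2, 2, 2, 4};

    /** Hypothetical accessor that returns the buffers as [validity, offset]. */
    static List<byte[]> validityFirst() {
        return Arrays.asList(VALIDITY, OFFSETS);
    }

    /** Hypothetical accessor that returns the same buffers as [offset, validity]. */
    static List<byte[]> offsetFirst() {
        return Arrays.asList(OFFSETS, VALIDITY);
    }

    /** A reader that assigns roles by position only, expecting validity first. */
    static Map<String, byte[]> readByPosition(List<byte[]> wireBuffers) {
        Map<String, byte[]> roles = new LinkedHashMap<>();
        roles.put("validity", wireBuffers.get(0));
        roles.put("offset", wireBuffers.get(1));
        return roles;
    }

    /** True when the reader ends up with each buffer in the role it was written for. */
    static boolean roundTrips(List<byte[]> wireBuffers) {
        Map<String, byte[]> roles = readByPosition(wireBuffers);
        return roles.get("validity") == VALIDITY && roles.get("offset") == OFFSETS;
    }
}
```

Swapping the accessor on the writing side without changing the reader silently
swaps the two roles, which is the kind of incompatibility with already written
files described above.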


Micah proposed a solution in the PR thread [2], but it does not seem to have
reached consensus:

   1. Stop setting reader/writer indices in getFieldBuffers
   2. Deprecate getBuffers
   3. Introduce a new getIpcBuffers which is unambiguously used for writing
   record batches (i.e. in VectorUnloader).
   4. Update documentation where it makes sense based on all this
   conversation.


More details and discussion can be found in the PR. I hope to get more
feedback.



Thanks,

Ji Liu



[1] https://issues.apache.org/jira/browse/ARROW-7539

[2] https://github.com/apache/arrow/pull/6156


Re: Java Arrow to C++ Arrow and vice versa

2020-07-21 Thread Ji Liu
Hi Chathura,

The following thread has the relevant pointers:
https://lists.apache.org/thread.html/5bf70a6f1a3fa3e543a92b3217e64465a3b761ca307e8114550f9d8b@%3Cdev.arrow.apache.org%3E


Thanks,
Ji Liu



Chathura Widanage wrote on Wed, Jul 22, 2020 at 3:03 AM:

> Hi all,
>
> Was there any particular reason for not writing Java Arrow as a JNI binding
> for CPP Arrow?
>
> What is the most straightforward and efficient way to convert a java arrow
> schema/table to a JNI backed C++ arrow schema/table?
>
> Regards,
> Chathura
>


Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread Ji Liu
Thanks everyone for the warm welcome!
It's a great honor for me to be a committer. Looking forward to
contributing more to the community.

Thanks,
Ji Liu


paddy horan wrote on Fri, Jun 12, 2020 at 8:52 AM:

> Congrats!
>
> 
> From: Micah Kornfield 
> Sent: Thursday, June 11, 2020 12:59:32 PM
> To: dev 
> Subject: Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan
>
> Congratulations!
>
> On Thu, Jun 11, 2020 at 9:32 AM David Li  wrote:
>
> > Congrats Ji  & Liya!
> >
> > David
> >
> > On 6/11/20, siddharth teotia  wrote:
> > > Congratulations!
> > >
> > > On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson
> > > 
> > > wrote:
> > >
> > >> Congratulations, both!
> > >>
> > >> Neal
> > >>
> > >> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney 
> > wrote:
> > >>
> > >> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya
> > >> > Fan have been invited to be Arrow committers and they have both
> > >> > accepted.
> > >> >
> > >> > Welcome, and thank you for your contributions!
> > >> >
> > >>
> > >
> > >
> > > --
> > > *Best Regards,*
> > > *SIDDHARTH TEOTIA*
> > > *2008C6PS540G*
> > > *BITS PILANI- GOA CAMPUS*
> > >
> > > *+91 87911 75932*
> > >
> >
>


[jira] [Created] (ARROW-8870) [Java] Make sure Netty Allocator has correct behavior with empty ArrowBuf

2020-05-20 Thread Ji Liu (Jira)
Ji Liu created ARROW-8870:
-

 Summary: [Java] Make sure Netty Allocator has correct behavior 
with empty ArrowBuf
 Key: ARROW-8870
 URL: https://issues.apache.org/jira/browse/ARROW-8870
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Include a test that ensures the Netty allocator returns a byte buffer that
behaves as empty when users allocate a zero-byte buffer.
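
As a rough illustration of what "behaves as empty" could mean, here is a small
check written against java.nio.ByteBuffer. ByteBuffer is only a stand-in here:
the real test would target ArrowBuf and the Netty-based allocator, whose APIs
are not shown, and the helper name requireBehavesEmpty is hypothetical.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

/** Stand-in check on java.nio.ByteBuffer; the real test would use ArrowBuf. */
final class EmptyBufferCheck {
    /** Throws IllegalStateException unless the buffer behaves like an empty one. */
    static void requireBehavesEmpty(ByteBuffer buf) {
        if (buf.capacity() != 0) {
            throw new IllegalStateException("capacity should be 0 but was " + buf.capacity());
        }
        if (buf.hasRemaining()) {
            throw new IllegalStateException("an empty buffer should have nothing to read");
        }
        try {
            buf.get();
            throw new IllegalStateException("reading from an empty buffer should fail");
        } catch (BufferUnderflowException expected) {
            // Reading past the end is rejected, as it should be for an empty buffer.
        }
    }
}
```

Calling requireBehavesEmpty(ByteBuffer.allocate(0)) passes, while a one-byte
buffer is rejected.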



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Contributing to Arrow

2020-04-26 Thread Ji Liu
Hi Karuppayya,
Welcome! 
If you are interested in this issue, feel free to take it.


Thanks,
Ji Liu


--
From:karuppayya 
Send Time: Sat, Apr 25, 2020, 14:21
To:emkornfield 
Cc:dev 
Subject:Re: Contributing to Arrow

Hi Micah,
Thanks for letting me know.

I will ping Ji Liu on the jira, and see how I can help with the jira issue.

Thanks
Karuppayya


On Fri, 24 Apr 2020, 21:05 Micah Kornfield,  wrote:

> Hi Karuppayya,
> Welcome!
>
> The only issue I can think of off the top of my head on the Java side that
> is on the basic side is: https://issues.apache.org/jira/browse/ARROW-6931
> I'm not sure if Ji Liu is planning on working on it, you might ping Ji Liu
> on the JIRA and see if you can help out.  In particular I think having
> fluent assertions for the Java library would make our unit tests more
> readable.
>
> I'll try to see if I can find any more and/or think of other
> possibilities.  But hopefully others might have some suggestions.
>
> Thanks,
> Micah
>
> On Fri, Apr 24, 2020 at 3:41 AM Antoine Pitrou  wrote:
>
>>
>> Hi,
>>
>> Le 24/04/2020 à 01:36, karuppayya a écrit :
>> > Hi All,
>> > I am interested in contributing to the Arrow project
>> >
>> > I am planning to start with some jiras on Arrow Java component.
>> > I tried looking for jiras with component *Java* and labels *beginner*,
>> > *beginners*, *newbie.*
>>
>> We're not using the labels much, so you should just search with
>> component "Java".
>>
>> I hope Java maintainers can offer more guidance.
>>
>> > The ones that came up were either dated or already have some *WIP*.
>> >
>> > Any other suggestions on how to get started?
>> >
>> > Thanks
>> > *Karuppayya*
>> >
>>
>


[jira] [Created] (ARROW-8305) [Java] Add ExtensionType support for visitor API

2020-04-01 Thread Ji Liu (Jira)
Ji Liu created ARROW-8305:
-

 Summary: [Java] Add ExtensionType support for visitor API
 Key: ARROW-8305
 URL: https://issues.apache.org/jira/browse/ARROW-8305
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


We introduced the visitor API for comparing vectors/ranges/types in
ARROW-6211, but it does not support {{ExtensionType}} yet.





[jira] [Created] (ARROW-8171) Consider pre-allocating memory for fix-width vector in Avro adapter iterator

2020-03-20 Thread Ji Liu (Jira)
Ji Liu created ARROW-8171:
-

 Summary: Consider pre-allocating memory for fix-width vector in 
Avro adapter iterator
 Key: ARROW-8171
 URL: https://issues.apache.org/jira/browse/ARROW-8171
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu








Re: [Discuss] [Java] Implement vector diff functionality

2020-03-11 Thread Ji Liu
Hi Micah,
Thanks for your feedback. You already opened an issue for Google's Truth [1],
and it is assigned to me; I'll try to use it.

Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-6931


--
From:Micah Kornfield 
Send Time: Wed, Mar 11, 2020, 13:43
To:dev ; Ji Liu 
Subject:Re: [Discuss] [Java] Implement vector diff functionality

I'm in favor of this.  I think this can be combined with a custom matcher for 
Google's Truth [1] library, to make a lot of our unit tests much more readable

[1] https://github.com/google/truth
On Thu, Mar 5, 2020 at 11:29 PM Ji Liu  wrote:

 Hi all,
 On the C++ side, we already have array diff functionality [1] for array
equality checks and testing, which makes it easy to see the differences
between arrays and reduces debugging time.
 I think it would be good to have similar functionality on the Java side for
better testing facilities, and I created an issue to track this [2].


 Thanks,
 Ji Liu

 [1] 
https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/cpp/src/arrow/array/diff.h
 [2] https://issues.apache.org/jira/browse/ARROW-8019 

Re: [Java] Port vector validate functionality

2020-03-11 Thread Ji Liu
Hi Wes and Micah,
Thanks for your valuable suggestions. I will create sub-tasks under this
issue as follow-up work once this one is finished.

Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time: Wed, Mar 11, 2020, 13:42
To:dev 
Cc:Ji Liu 
Subject:Re: [Java] Port vector validate functionality

I agree, it would also be good to run with some of the fuzzed IPC files.
On Fri, Mar 6, 2020 at 6:20 AM Wes McKinney  wrote:
Seems useful. It may be a good idea to run within integration tests as
 an extra sanity check also

 On Fri, Mar 6, 2020 at 2:27 AM Ji Liu  wrote:
 >
 >
 > Hi all,
 > On the C++ side, we already have array validation functionality [1], but
 > there is nothing similar on the Java side.
 > Should we port this to the Java implementation? We already have a visitor
 > interface [2], so it does not seem very complicated. I created an issue to
 > track this [3].
 >
 >
 > Thanks,
 > Ji Liu
 >
 > [1] 
 > https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/cpp/src/arrow/array/validate.h
 > [2] 
 > https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/java/vector/src/main/java/org/apache/arrow/vector/compare/VectorVisitor.java
 > [3] https://issues.apache.org/jira/browse/ARROW-8020


[Java] Port vector validate functionality

2020-03-06 Thread Ji Liu

Hi all,
On the C++ side, we already have array validation functionality [1], but
there is nothing similar on the Java side.
Should we port this to the Java implementation? We already have a visitor
interface [2], so it does not seem very complicated. I created an issue to
track this [3].


Thanks,
Ji Liu

[1] 
https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/cpp/src/arrow/array/validate.h
[2] 
https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/java/vector/src/main/java/org/apache/arrow/vector/compare/VectorVisitor.java
[3] https://issues.apache.org/jira/browse/ARROW-8020

[jira] [Created] (ARROW-8020) [Java] Implement vector validate functionality

2020-03-06 Thread Ji Liu (Jira)
Ji Liu created ARROW-8020:
-

 Summary: [Java] Implement vector validate functionality 
 Key: ARROW-8020
 URL: https://issues.apache.org/jira/browse/ARROW-8020
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


On the C++ side, we already have array validation functionality, but there is
nothing similar on the Java side.

This issue is about implementing this functionality.
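
To give a feel for the kind of check such a validator performs, here is a
minimal sketch for a list-like layout, written against plain arrays. The class
ListLayoutValidator and its method are hypothetical and are not the proposed
Arrow API; the specific checks are an assumption about what "validate" would
cover for this layout.

```java
import java.util.Optional;

/** Hypothetical structural checks for a list-like layout; not an Arrow API. */
final class ListLayoutValidator {
    /** Returns a description of the first problem found, or empty if the layout is consistent. */
    static Optional<String> validate(int valueCount, int[] offsets, int childLength) {
        if (valueCount < 0) {
            return Optional.of("negative value count: " + valueCount);
        }
        // A list vector with n entries needs n + 1 offsets.
        if (offsets.length != valueCount + 1) {
            return Optional.of("expected " + (valueCount + 1) + " offsets but got " + offsets.length);
        }
        if (offsets[0] < 0) {
            return Optional.of("first offset is negative: " + offsets[0]);
        }
        // Offsets must never decrease from one entry to the next.
        for (int i = 1; i < offsets.length; i++) {
            if (offsets[i] < offsets[i - 1]) {
                return Optional.of("offsets decrease at index " + i);
            }
        }
        // The last offset must not point past the end of the child data.
        if (offsets[valueCount] > childLength) {
            return Optional.of("last offset " + offsets[valueCount]
                + " exceeds child length " + childLength);
        }
        return Optional.empty();
    }
}
```

Returning a description instead of a bare boolean keeps failures readable when
the check runs inside tests or integration runs.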





[Discuss] [Java] Implement vector diff functionality

2020-03-05 Thread Ji Liu

Hi all,
On the C++ side, we already have array diff functionality [1] for array
equality checks and testing, which makes it easy to see the differences
between arrays and reduces debugging time.
I think it would be good to have similar functionality on the Java side for
better testing facilities, and I created an issue to track this [2].


Thanks,
Ji Liu

[1] 
https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/cpp/src/arrow/array/diff.h
[2] https://issues.apache.org/jira/browse/ARROW-8019

[jira] [Created] (ARROW-8019) [Java] Implement vector diff functionality

2020-03-05 Thread Ji Liu (Jira)
Ji Liu created ARROW-8019:
-

 Summary: [Java] Implement vector diff functionality 
 Key: ARROW-8019
 URL: https://issues.apache.org/jira/browse/ARROW-8019
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


On the C++ side, we already have array diff functionality for array equality
checks and testing, which makes it easy to see the differences between arrays
and reduces debugging time. It would be good to have something similar on the
Java side for better testing facilities.
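
A very small sketch of the idea, using plain Java lists in place of vectors.
It only reports the first difference, which is much simpler than a full edit
script; the class and method names are hypothetical and not a proposed API.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical first-difference reporter; far simpler than a full diff. */
final class VectorDiffSketch {
    /** Describes the first position where the two sequences differ, or empty if they are equal. */
    static Optional<String> firstDifference(List<Integer> expected, List<Integer> actual) {
        int common = Math.min(expected.size(), actual.size());
        for (int i = 0; i < common; i++) {
            // Objects.equals treats two nulls as equal, matching null-vs-null slots.
            if (!Objects.equals(expected.get(i), actual.get(i))) {
                return Optional.of("index " + i + ": expected " + expected.get(i)
                    + " but was " + actual.get(i));
            }
        }
        if (expected.size() != actual.size()) {
            return Optional.of("length: expected " + expected.size()
                + " but was " + actual.size());
        }
        return Optional.empty();
    }
}
```

A test can then fail with a message such as "index 1: expected null but was 2"
instead of a bare "vectors are not equal".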





Re: [ANNOUNCE] New Arrow PMC member: Neal Richardson

2020-03-05 Thread Ji Liu
Congratulations!


--
From:Micah Kornfield 
Send Time: Thu, Mar 5, 2020, 23:46
To:dev 
Subject:Re: [ANNOUNCE] New Arrow PMC member: Neal Richardson

Congratulations!

On Thu, Mar 5, 2020 at 5:57 AM Fan Liya  wrote:

> Congratulations, Neal Richardson!
>
> Best,
> Liya Fan
>
> On Thu, Mar 5, 2020 at 12:51 AM Wes McKinney  wrote:
>
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Neal Richardson to become a PMC member and we are pleased to announce
> > that Neal has accepted.
> >
> > Congratulations and welcome!
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Francois Saint-Jacques

2020-03-05 Thread Ji Liu
Congratulations!


--
From:Micah Kornfield 
Send Time: Thu, Mar 5, 2020, 23:46
To:dev 
Subject:Re: [ANNOUNCE] New Arrow PMC member: Francois Saint-Jacques

Congratulations!

On Thu, Mar 5, 2020 at 5:57 AM Fan Liya  wrote:

> Congratulations,  Francois Saint-Jacques!
>
> Best,
> Liya Fan
>
> On Thu, Mar 5, 2020 at 12:52 AM Wes McKinney  wrote:
>
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Francois Saint-Jacques to become a PMC member and we are pleased to
> > announce
> > that Francois has accepted.
> >
> > Congratulations and welcome!
> >
>


[jira] [Created] (ARROW-7713) [Java] TestLeak was put at the wrong location

2020-01-28 Thread Ji Liu (Jira)
Ji Liu created ARROW-7713:
-

 Summary: [Java] TestLeak was put at the wrong location
 Key: ARROW-7713
 URL: https://issues.apache.org/jira/browse/ARROW-7713
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


It seems {{TestLeak.java}} was put in the wrong place; we should move it into
{{flight-core}}.





Re: [CI] Java build broken on master

2020-01-16 Thread Ji Liu
Thanks. I opened a PR at https://github.com/apache/arrow/pull/6216; please
help merge it once the build turns green.


--
From:Micah Kornfield 
Send Time: Fri, Jan 17, 2020, 14:53
To:Ji Liu 
Cc:dev 
Subject:Re: [CI] Java build broken on master

OK, I've opened https://issues.apache.org/jira/browse/ARROW-7599 to track.
On Thu, Jan 16, 2020 at 10:49 PM Ji Liu  wrote:

 I am working on a fix and will open a PR later.

Thanks,
Ji Liu

--
From:Micah Kornfield 
Send Time: Fri, Jan 17, 2020, 14:48
To:dev 
Subject:[CI] Java build broken on master

This was due to an unexpected conflict between two patches I just merged.
I'm going to see if I can fix this quickly, otherwise I will rollback.



Re: [CI] Java build broken on master

2020-01-16 Thread Ji Liu
 I am working on a fix and will open a PR later.

Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time: Fri, Jan 17, 2020, 14:48
To:dev 
Subject:[CI] Java build broken on master

This was due to an unexpected conflict between two patches I just merged.
I'm going to see if I can fix this quickly, otherwise I will rollback.



[jira] [Created] (ARROW-7546) [Java] Use new implementation to concat vectors values in batch

2020-01-10 Thread Ji Liu (Jira)
Ji Liu created ARROW-7546:
-

 Summary: [Java] Use new implementation to concat vectors values in 
batch
 Key: ARROW-7546
 URL: https://issues.apache.org/jira/browse/ARROW-7546
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per discussion https://github.com/apache/arrow/pull/5945#discussion_r365108806.

In ARROW-7284, we wrote a simple method to concatenate vectors. However,
ARROW-7073 is about concatenating vector values efficiently; after that PR is
merged, we should use the new implementation in {{ArrowReader}}.





[jira] [Created] (ARROW-7539) [Java] FieldVector getFieldBuffers API should not set reader/writer indices

2020-01-09 Thread Ji Liu (Jira)
Ji Liu created ARROW-7539:
-

 Summary: [Java] FieldVector getFieldBuffers API should not set 
reader/writer indices
 Key: ARROW-7539
 URL: https://issues.apache.org/jira/browse/ARROW-7539
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per discussion 
[https://github.com/apache/arrow/pull/6133#discussion_r364906302].

The fact that we have reader/writer settings in {{getFieldBuffers}} is wrong. 
To clarify, {{getFieldBuffers}} is distinct from {{getBuffers}}. The former 
should be for getting access to underlying data for higher-performance 
algorithms. The latter is for sending the data over the wire. Seems we've mixed 
up use of both.

 

Currently, {{VectorUnloader}} uses {{getFieldBuffers}} to create an
{{ArrowRecordBatch}}, which is why {{getFieldBuffers}} still sets the
writer/reader indices; we should use {{getBuffers}} instead.
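
The following sketch illustrates, by analogy only, why an accessor that sets
indices as a side effect is risky on a shared buffer. It uses
java.nio.ByteBuffer, whose position and limit play a role loosely similar to
reader and writer indices; the class and its two accessors are hypothetical
and do not reproduce the {{FieldVector}} methods.

```java
import java.nio.ByteBuffer;

/** Analogy on java.nio.ByteBuffer; position/limit stand in for reader/writer indices. */
final class IndexSideEffectSketch {
    private final ByteBuffer data = ByteBuffer.allocate(16);
    private int usedBytes;

    /** Appends one int using an absolute write, so position and limit are not touched. */
    void write(int value) {
        data.putInt(usedBytes, value);
        usedBytes += Integer.BYTES;
    }

    /** Wire-style accessor: marks the readable region [0, usedBytes) on the shared buffer. */
    ByteBuffer bufferForWire() {
        data.position(0);
        data.limit(usedBytes);
        return data;
    }

    /** Direct-access accessor: hands out the buffer without changing its indices. */
    ByteBuffer bufferForDirectAccess() {
        return data;
    }
}
```

After one write, bufferForDirectAccess() still reports a limit of 16, while
bufferForWire() shrinks the limit to 4; a later write(...) then fails with an
IndexOutOfBoundsException because the shared buffer's state was changed by what
looked like a read-only accessor.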





[jira] [Created] (ARROW-7490) [Java] Avro converter should convert attributes and props to FieldType metadata

2020-01-01 Thread Ji Liu (Jira)
Ji Liu created ARROW-7490:
-

 Summary: [Java] Avro converter should convert attributes and props 
to FieldType metadata
 Key: ARROW-7490
 URL: https://issues.apache.org/jira/browse/ARROW-7490
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, the Avro converter uses some attributes, such as “name” and “size”,
when creating vectors and discards the others.

Named types such as Record, Enum and Fixed may have attributes like “doc” and
“aliases”, which should be kept in the metadata for potential further use.

Besides, properties are also not converted properly in some cases.





[jira] [Created] (ARROW-7472) [Java] Fix some incorrect behavior in UnionListWriter

2019-12-24 Thread Ji Liu (Jira)
Ji Liu created ARROW-7472:
-

 Summary: [Java] Fix some incorrect behavior in UnionListWriter
 Key: ARROW-7472
 URL: https://issues.apache.org/jira/browse/ARROW-7472
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the {{UnionListWriter/UnionFixedSizeListWriter}} {{getField/close}}
APIs seem incorrect.





[jira] [Created] (ARROW-7467) [Java] ComplexCopier does incorrect copy for Map nullable info

2019-12-23 Thread Ji Liu (Jira)
Ji Liu created ARROW-7467:
-

 Summary: [Java] ComplexCopier does incorrect copy for Map nullable 
info
 Key: ARROW-7467
 URL: https://issues.apache.org/jira/browse/ARROW-7467
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The {{MapVector}} and its 'value' vector are nullable, and its {{structVector}} 
and 'key' vector are non-nullable.

However, the {{MapVector}} generated by ComplexCopier has all nullable fields 
which is not correct.





[jira] [Created] (ARROW-7425) [Java] PromotableWriter support writing FixedSizeList type data

2019-12-18 Thread Ji Liu (Jira)
Ji Liu created ARROW-7425:
-

 Summary: [Java] PromotableWriter support writing FixedSizeList 
type data
 Key: ARROW-7425
 URL: https://issues.apache.org/jira/browse/ARROW-7425
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


We have introduced writer API for {{FixedSizeListVector}} via ARROW-6079, but 
{{PromotableWriter}}’s support for it is incomplete.

For example, using {{UnionListWriter}} we could simply write {{List}} 
type data, but for {{List}} or {{FixedSizeList}} 
it doesn’t work.

This issue is about enhancing the {{PromotableWriter}} support for the
{{FixedSizeList}} type and adding tests to verify the cases mentioned above.





[jira] [Created] (ARROW-7406) [Java] NonNullableStructVector#hashCode should pass hasher to child vectors

2019-12-16 Thread Ji Liu (Jira)
Ji Liu created ARROW-7406:
-

 Summary: [Java] NonNullableStructVector#hashCode should pass 
hasher to child vectors
 Key: ARROW-7406
 URL: https://issues.apache.org/jira/browse/ARROW-7406
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


This was introduced by ARROW-6866, which left the hasher parameter unused in
hashCode(int index, {{ArrowBufHasher}} hasher), so the child vectors calculate
their hash codes with the default hasher, which is not correct.

This should be fixed by passing the hasher to the child vectors when
calculating the hash code.
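
A minimal sketch of the bug and the fix, with a struct modelled as an array of
int child values and the hasher modelled as an IntUnaryOperator. All names are
hypothetical; this is not the {{NonNullableStructVector}} code.

```java
import java.util.function.IntUnaryOperator;

/** Toy model of struct hashing; the hasher type and all names are hypothetical. */
final class StructHashSketch {
    static final IntUnaryOperator DEFAULT_HASHER = v -> v * 31;

    /** A child hashes its value with whatever hasher it is handed. */
    static int hashChild(int value, IntUnaryOperator hasher) {
        return hasher.applyAsInt(value);
    }

    /** Buggy variant: accepts a hasher but lets the children fall back to the default. */
    static int hashStructIgnoringHasher(int[] childValues, IntUnaryOperator hasher) {
        int hash = 0;
        for (int value : childValues) {
            hash = 31 * hash + hashChild(value, DEFAULT_HASHER);
        }
        return hash;
    }

    /** Fixed variant: forwards the caller's hasher to every child. */
    static int hashStruct(int[] childValues, IntUnaryOperator hasher) {
        int hash = 0;
        for (int value : childValues) {
            hash = 31 * hash + hashChild(value, hasher);
        }
        return hash;
    }
}
```

With a custom hasher that maps every value to 0, hashStruct returns 0 for any
input, while the buggy variant still returns the default-hasher result, which
shows that the parameter has no effect there.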





[jira] [Created] (ARROW-7405) [Java] ListVector isEmpty API is incorrect

2019-12-16 Thread Ji Liu (Jira)
Ji Liu created ARROW-7405:
-

 Summary: [Java] ListVector isEmpty API is incorrect
 Key: ARROW-7405
 URL: https://issues.apache.org/jira/browse/ARROW-7405
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the {{isEmpty}} API always returns false in
{{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not override
this method.

This leads to incorrect results. For example, a {{ListVector}} with data
[1,2], null, [], [5,6] should get [false, false, true, false] with this API,
but now it would return [false, false, false, false].
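
A sketch of the intended behaviour on plain arrays, assuming the usual layout
in which entry i spans offsets[i] to offsets[i + 1]. The class name is
hypothetical, and treating a null entry as "not empty" is an assumption taken
from the expected output quoted above.

```java
/** Hypothetical isEmpty on a list layout given as validity flags plus offsets. */
final class ListIsEmptySketch {
    /** An entry is empty when it is non-null and spans zero child elements. */
    static boolean isEmpty(boolean[] validity, int[] offsets, int index) {
        if (!validity[index]) {
            return false; // null entry: reported as not empty, following the example above
        }
        return offsets[index] == offsets[index + 1];
    }

    /** Applies isEmpty to every entry. */
    static boolean[] isEmptyAll(boolean[] validity, int[] offsets) {
        boolean[] result = new boolean[validity.length];
        for (int i = 0; i < validity.length; i++) {
            result[i] = isEmpty(validity, offsets, i);
        }
        return result;
    }
}
```

For the data [1,2], null, [], [5,6], the validity flags are
{true, false, true, true} and the offsets are {0, 2, 2, 2, 4}, which gives
[false, false, true, false].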





Re: [Result] [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-12-01 Thread Ji Liu
Thanks Micah, I'll take the Java side implementation.

Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time: Mon, Dec 2, 2019, 09:25
To:dev 
Subject:Re: [Result] [VOTE] Clarifications and forward compatibility changes 
for Dictionary Encoding (second iteration)

I've merged the PR and created ARROW-7283
<https://issues.apache.org/jira/browse/ARROW-7283> [1] to track
implementation for languages currently in the integration test.


[1] https://issues.apache.org/jira/browse/ARROW-7283

On Wed, Nov 27, 2019 at 1:03 AM Micah Kornfield 
wrote:

> The vote carries with 3 binding +1 votes, 1 non-binding +1 vote and
> 1 non-binding +.5 vote.
>
> To follow-up I will:
> 1.  Open up JIRAs for work items in reference implementations (c++/java)
> 2.  Merge the pull request containing the specification changes.
>
> Thanks,
> Micah
>
> On Tue, Nov 26, 2019 at 12:50 AM Sutou Kouhei  wrote:
>
>> +1 (binding)
>>
>> In 
>>   "[VOTE] Clarifications and forward compatibility changes for Dictionary
>> Encoding (second iteration)" on Wed, 20 Nov 2019 20:41:57 -0800,
>>   Micah Kornfield  wrote:
>>
>> > Hello,
>> > As discussed on [1], I've proposed clarifications in a PR [2] that
>> > clarifies:
>> >
>> > 1.  It is not required that all dictionary batches occur at the
>> > beginning of the IPC stream format (if the first record batch has an
>> > all-null dictionary-encoded column, the null column's dictionary might
>> > not be sent until later in the stream).
>> >
>> > 2.  A second dictionary batch for the same ID that is not a "delta
>> > batch" in an IPC stream indicates the dictionary should be replaced.
>> >
>> > 3.  Clarifies that the file format can only contain 1 "NON-delta"
>> > dictionary batch and multiple "delta" dictionary batches. Dictionary
>> > replacement is not supported in the file format.
>> >
>> > 4.  Add an enum to dictionary metadata for possible future changes in
>> > what format dictionary batches can be sent (the most likely would be an
>> > array Map).  An enum is needed as a placeholder to allow for forward
>> > compatibility past the 1.0.0 release.
>> >
>> > If accepted there will be work in all implementations to make sure that
>> > they cover the edge cases highlighted and additional integration testing
>> > will be needed.
>> >
>> > Please vote whether to accept these additions. The vote will be open
>> for at
>> > least 72 hours.
>> >
>> > [ ] +1 Accept these change to the specification
>> > [ ] +0
>> > [ ] -1 Do not accept the changes because...
>> >
>> > Thanks,
>> > Micah
>> >
>> >
>> > [1]
>> >
>> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
>> > [2] https://github.com/apache/arrow/pull/5585
>>
>


Re: Datasets and Java

2019-11-27 Thread Ji Liu
Hi Francois, 

Thanks for the proposal and your effort.
I previously made a simple JNI PoC for RecordBatch/VectorSchemaRoot interaction 
between Java and C++ [1][2].
This may help a little.


Thanks,
Ji Liu


[1] https://github.com/tianchen92/jni-poc-java
[2] https://github.com/tianchen92/jni-poc-cpp




--
From:Francois Saint-Jacques 
Send Time:2019-11-28 (Thursday) 05:08
To:dev 
Subject:Re: Datasets and Java

Hello Hongze,

The C++ implementation of dataset, notably Dataset, DataSource,
DataSourceDiscovery, and Scanner classes are not ready/designed for
distributed computing. They don't serialize and they reference by
pointer all around, thus I highly doubt that you can implement parts
in Java, and some in C++ with minimal effort and complexity. You can
think of Dataset/DataSource as similar to the Hive Metastore, but
locally (single node) and in-memory. I fail to see how one could use
it with the execution model of spark, e.g. construct all the manifests
on the driver via Dataset, Scanner and pass the ScanTask to executors
due to previous limitations. One cannot construct a ScanTask out of
thin air, it needs a DataFragment (or FileFormat in case of
FileDataFragment).

Having said that, I think I understand where you want to go. The
FileFormat::ScanFile method has what you want without the overhead of
the full dataset API. It acts as an interface to interact with file
format paired with predicate pushdown and column selection options.
This is where I would start:

- Create a JNI bridge between a C++ RecordBatch and Java VectorSchemaRoot [1]
- Create a C++ helper method `Result>
ScanFile(FileSource source, FileFormat& fmt,
std::shared_ptr options, std::shared_ptr
context)`
The goal of this method is similar to `Scanner::ToTable`, i.e. hide
the local scheduling details of ScanTask. Thus you don't need to
expose ScanTask.
- Create a JNI binding to the previous helper and all the class
dependencies to construct the parameters (FileSource, FileFormat,
ScanOptions, ScanContext).
This is where it gets cumbersome, ScanOptions has Expression which may
not be easy to build ad-hoc. FileSource needs a fs::Filesystem,
ScanContext needs a MemoryPool, etc... You may hide this via helper
methods, this is what the R binding is doing.

Your PoC can probably get away with a trivial
`Result> ScanParquetFile(std::string path, Expr&
filter, std::vector columns)` without exposing all the
details and using the "defaults". Thus you only need to wrap a method
(ScanParquetFile) and Expression in your JNI bridge.

Pros:
- Access to native file readers with uniform predicate pushdown (when
the file format supports it), and column selection options. Filtering
done natively in C++.
- Enable usage of said points in distributed computing, since the only
passed information are: the path, the expression (will need a
translation), and the list of columns. All of which are tractable to
serialize.
- Bonus, you may even get transparent access to gandiva [2]

Cons:
- No predicate pushdown on file partition, e.g. extracted from path
because this information is in the DataSource
- ScanOptions is built by ScannerBuilder, there's a lot of validation
hidden under the hood via DataSource, DataSourceDiscovery and
ScannerBuilder. It's easy to get an error with a malformed
ScanOptions.
- No access to non-file DataSource, e.g. in the future we might have
OdbcDataSource and FlightDataSource

Basically, dataset::FileFormat is meant to be a unified interface to
interact with file formats. Here's an example of such usage without
all the dataset machinery [3].

François

[1] https://issues.apache.org/jira/browse/ARROW-7272
[2] https://issues.apache.org/jira/browse/ARROW-6953
[3] 
https://github.com/apache/arrow/blob/61c8b1b80039119d5905660289dd53a3130ce898/cpp/src/arrow/dataset/file_parquet_test.cc#L345-L393










On Wed, Nov 27, 2019 at 5:17 AM Hongze Zhang  wrote:
>
> Hi Micah,
>
>
> Regarding our use cases, we'd use the API on Parquet files with some pushed 
> filters and projectors, and we'd extend the C++ Datasets code to provide 
> necessary support for our own data formats.
>
>
> > If JNI is seen as too cumbersome, another possible avenue to pursue is
> > writing a gRPC wrapper around the DataSet metadata capabilities.  One could
> > then create a facade on top of that for Java.  For data reads, I can see
> > either building a Flight server or directly use the JNI readers.
>
>
> Thanks for your suggestion but I'm not entirely getting it. Does this mean to 
> start some individual gRPC/Flight server process to deal with the 
> metadata/data exchange problem between Java and C++ Datasets? If yes, then in 
> some cases, doesn't it easily introduce bigger problems about life cycle and 
> resource management of the processes? Please correct me if I misunderstood 
> your point.
>
>
> And IM

[jira] [Created] (ARROW-7264) [Java] RangeEqualsVisitor type check is not correct

2019-11-26 Thread Ji Liu (Jira)
Ji Liu created ARROW-7264:
-

 Summary: [Java] RangeEqualsVisitor type check is not correct
 Key: ARROW-7264
 URL: https://issues.apache.org/jira/browse/ARROW-7264
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.15.1
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{RangeEqualsVisitor}} checks the type only once and keeps the 
result to avoid repeated type checking, see
{code:java}
typeCompareResult = 
left.getField().getType().equals(right.getField().getType());
{code}
This only compares {{ArrowType}}, which may cause unexpected behavior for 
complex types: for example, {{List}} and {{List}} would be considered 
type-equal because their child fields are not taken into account.

We should compare {{Field}} here instead. To make this more extensible, we use 
{{TypeEqualsVisitor}} to compare the fields, so that one can choose whether 
names or metadata are checked as well.

 

Also provide a test for ListVector to validate this change.
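
A minimal, self-contained sketch of the difference between the shallow and the deep check described above. The {{SketchField}} class and its method names are hypothetical stand-ins for illustration only; they are not Arrow's {{Field}} or {{TypeEqualsVisitor}} API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a field: a name, a top-level type name and child fields.
final class SketchField {
    final String name;
    final String type;
    final List<SketchField> children;

    SketchField(String name, String type, SketchField... children) {
        this.name = name;
        this.type = type;
        this.children = Collections.unmodifiableList(Arrays.asList(children));
    }

    // Shallow check: compares only the top-level type and ignores child fields.
    static boolean shallowTypeEquals(SketchField left, SketchField right) {
        return left.type.equals(right.type);
    }

    // Deep check: compares the type and recursively compares all child fields.
    // Names are compared only when checkName is true.
    static boolean deepTypeEquals(SketchField left, SketchField right, boolean checkName) {
        if (checkName && !left.name.equals(right.name)) {
            return false;
        }
        if (!left.type.equals(right.type) || left.children.size() != right.children.size()) {
            return false;
        }
        for (int i = 0; i < left.children.size(); i++) {
            if (!deepTypeEquals(left.children.get(i), right.children.get(i), checkName)) {
                return false;
            }
        }
        return true;
    }
}
```

With two list fields whose children differ, the shallow check reports them as equal while the deep check does not.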



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Java API for Arrow Compute

2019-11-25 Thread Ji Liu
Either JNI or pure Java compute is OK for me, and I agree that specific 
use cases and designs are needed before starting.

I think a pure Java compute layer would be easy for Java developers to use, and 
the algorithm module is a good start (thanks Liya Fan). If we finally decide to 
go this way, I can help if needed.


Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time:2019-11-26 (Tuesday) 13:02
To:dev 
Subject:Re: Java API for Arrow Compute

+1 to what Liya Fan said.  Before starting on the endeavor it would be nice to
have concrete use-cases and what we think the end-state is, and  some
light-weight design upfront.  It seems like in the Java world there are
already a lot of analysis tools, and it isn't clear to me how a "compute"
layer would fit into the ecosystem.  There was some discussion around this
on the mailing list recently [1].

As a personal preference, I would prefer trying to integrate with the C++
code base via JNI (as Wes mentioned) than build yet another set of
computation components unless there were very clear goals of what a Java
version would be looking to accomplish.  However, if there are enough
volunteers willing to support a pure Java compute, then it might be
worthwhile.

Thanks,
Micah

[1]
https://lists.apache.org/thread.html/4b2776353bbc0d74369af19b6897cd9d47caf13fb8fd38c96f2cd32e@%3Cdev.arrow.apache.org%3E

On Mon, Nov 25, 2019 at 6:07 PM Fan Liya  wrote:

> Hi Yuan,
>
> Currently, we have some APIs in the algorithm module of the Java project.
> If you have more requirements, maybe you can describe your
> requirements/scenarios, and start a discussion in the mailing list.
>
> Best,
> Liya Fan
>
>
> On Mon, Nov 25, 2019 at 11:17 PM Wes McKinney  wrote:
>
> > There is a little bit of compute work happening in Java now, but like
> > any part of the project it depends on whether volunteers want to do
> > the work.
> >
> > Since there is some JNI interfaces to C++ available already (for
> > example to Gandiva), making increased use of C++ in Java via JNI may
> > be one approach to exposing more compute functionality in Java. Some
> > developers may wish to create native Java implementations of some
> > things, though
> >
> > On Mon, Nov 25, 2019 at 8:47 AM ZHOU Yuan  wrote:
> > >
> > > Hi Arrow developers,
> > >
> > > The compute APIs (neat!) now seem to be available in C++, Python and Rust; is
> > > there any plan/work on adding the Java API?
> > >
> > > Cheers, -yuan
> >
>



Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-25 Thread Ji Liu
To clarify, we have already implemented option #1 ("It is not required that all 
dictionary batches occur at the beginning") in a previous PR [1].

So I hope this proposal will be accepted, and I would like to take on the 
follow-up work on the Java side if possible.
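
As a rough illustration of the stream semantics in the proposal (late dictionaries, delta batches and replacement), here is a minimal sketch of reader-side state. All class and method names are hypothetical, the values are plain strings, and treating a delta for an unknown ID as a new dictionary is an assumption of this sketch, not something the proposal states.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader-side dictionary state; illustrative only, not Arrow's API.
final class DictionaryStateSketch {
    private final Map<Long, List<String>> dictionaries = new HashMap<>();

    // Applies one dictionary batch received from an IPC stream.
    void onDictionaryBatch(long id, List<String> values, boolean isDelta) {
        List<String> current = dictionaries.get(id);
        if (isDelta && current != null) {
            // Delta batch: append to the existing dictionary.
            current.addAll(values);
        } else {
            // First batch, or a non-delta batch for a known ID: replace the dictionary.
            dictionaries.put(id, new ArrayList<>(values));
        }
    }

    // Returns null when the dictionary for this ID has not been sent yet,
    // which is allowed to happen before later batches in the stream.
    String lookup(long id, int index) {
        List<String> current = dictionaries.get(id);
        return current == null ? null : current.get(index);
    }
}
```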

Thanks,
Ji Liu


[1] https://github.com/apache/arrow/pull/4960


--
From:Ji Liu 
Send Time:2019-11-26 (Tuesday) 14:04
To:dev ; Micah Kornfield 
Cc:Wes McKinney 
Subject:Re: [VOTE] Clarifications and forward compatibility changes for 
Dictionary Encoding (second iteration)

+1 (non-binding)

Thanks
Ji Liu


--
From:Fan Liya 
Send Time:2019-11-26 (Tuesday) 14:01
To:dev ; Micah Kornfield 
Cc:Wes McKinney 
Subject:Re: [VOTE] Clarifications and forward compatibility changes for 
Dictionary Encoding (second iteration)

I am sorry I did not follow the thread closely (will follow up later).
However, the proposal above looks good to me.
So I am +0.5 for this.

Best,
Liya Fan

On Tue, Nov 26, 2019 at 1:12 PM Micah Kornfield 
wrote:

> Could other members of the community chime in on this?  In particular
> getting views from other language maintainers would be good.
>
> Thanks,
> Micah
>
> On Thu, Nov 21, 2019 at 12:23 PM Micah Kornfield 
> wrote:
>
> > Forgot to say,  My vote is +1 (binding).
> >
> > On Thu, Nov 21, 2019 at 12:09 PM Wes McKinney 
> wrote:
> >
> >> +1 (binding). Thanks Micah
> >>
> >> On Wed, Nov 20, 2019 at 10:42 PM Micah Kornfield  >
> >> wrote:
> >> >
> >> > Hello,
> >> > As discussed on [1], I've proposed a PR [2] that clarifies the
> >> > following:
> >> >
> >> > 1.  It is not required that all dictionary batches occur at the beginning
> >> > of the IPC stream format (if the first record batch has an all-null
> >> > dictionary-encoded column, the null column's dictionary might not be sent
> >> > until later in the stream).
> >> >
> >> > 2.  A second dictionary batch for the same ID that is not a "delta batch"
> >> > in an IPC stream indicates the dictionary should be replaced.
> >> >
> >> > 3.  Clarifies that the file format can only contain 1 "NON-delta"
> >> > dictionary batch and multiple "delta" dictionary batches. Dictionary
> >> > replacement is not supported in the file format.
> >> >
> >> > 4.  Add an enum to dictionary metadata for possible future changes in what
> >> > format dictionary batches can be sent. (the most likely would be an array
> >> > Map).  An enum is needed as a place holder to allow for forward
> >> > compatibility past the release 1.0.0.
> >> >
> >> > If accepted there will be work in all implementations to make sure that
> >> > they cover the edge cases highlighted and additional integration testing
> >> > will be needed.
> >> >
> >> > Please vote whether to accept these additions. The vote will be open for at
> >> > least 72 hours.
> >> >
> >> > [ ] +1 Accept these changes to the specification
> >> > [ ] +0
> >> > [ ] -1 Do not accept the changes because...
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> >
> >> > [1]
> >> >
> >>
> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
> >> > [2] https://github.com/apache/arrow/pull/5585
> >>
> >
>



[jira] [Created] (ARROW-7259) [Java] Support subfield encoder use different hasher

2019-11-25 Thread Ji Liu (Jira)
Ji Liu created ARROW-7259:
-

 Summary: [Java] Support subfield encoder use different hasher
 Key: ARROW-7259
 URL: https://issues.apache.org/jira/browse/ARROW-7259
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{ListSubFieldEncoder}} and {{StructSubFieldEncoder}} use the default 
hasher for calculating hashCode.

This issue enables them to use a different hasher, or even a user-defined 
hasher, for their own use cases, just like {{DictionaryEncoder}} does.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-25 Thread Ji Liu
+1 (non-binding)

Thanks
Ji Liu


--
From:Fan Liya 
Send Time:2019-11-26 (Tuesday) 14:01
To:dev ; Micah Kornfield 
Cc:Wes McKinney 
Subject:Re: [VOTE] Clarifications and forward compatibility changes for 
Dictionary Encoding (second iteration)

I am sorry I did not follow the thread closely (will follow up later).
However, the proposal above looks good to me.
So I am +0.5 for this.

Best,
Liya Fan

On Tue, Nov 26, 2019 at 1:12 PM Micah Kornfield 
wrote:

> Could other members of the community chime in on this?  In particular
> getting views from other language maintainers would be good.
>
> Thanks,
> Micah
>
> On Thu, Nov 21, 2019 at 12:23 PM Micah Kornfield 
> wrote:
>
> > Forgot to say,  My vote is +1 (binding).
> >
> > On Thu, Nov 21, 2019 at 12:09 PM Wes McKinney 
> wrote:
> >
> >> +1 (binding). Thanks Micah
> >>
> >> On Wed, Nov 20, 2019 at 10:42 PM Micah Kornfield  >
> >> wrote:
> >> >
> >> > Hello,
> >> > As discussed on [1], I've proposed a PR [2] that clarifies the
> >> > following:
> >> >
> >> > 1.  It is not required that all dictionary batches occur at the beginning
> >> > of the IPC stream format (if the first record batch has an all-null
> >> > dictionary-encoded column, the null column's dictionary might not be sent
> >> > until later in the stream).
> >> >
> >> > 2.  A second dictionary batch for the same ID that is not a "delta batch"
> >> > in an IPC stream indicates the dictionary should be replaced.
> >> >
> >> > 3.  Clarifies that the file format can only contain 1 "NON-delta"
> >> > dictionary batch and multiple "delta" dictionary batches. Dictionary
> >> > replacement is not supported in the file format.
> >> >
> >> > 4.  Add an enum to dictionary metadata for possible future changes in what
> >> > format dictionary batches can be sent. (the most likely would be an array
> >> > Map).  An enum is needed as a place holder to allow for forward
> >> > compatibility past the release 1.0.0.
> >> >
> >> > If accepted there will be work in all implementations to make sure that
> >> > they cover the edge cases highlighted and additional integration testing
> >> > will be needed.
> >> >
> >> > Please vote whether to accept these additions. The vote will be open for at
> >> > least 72 hours.
> >> >
> >> > [ ] +1 Accept these changes to the specification
> >> > [ ] +0
> >> > [ ] -1 Do not accept the changes because...
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> >
> >> > [1]
> >> >
> >>
> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
> >> > [2] https://github.com/apache/arrow/pull/5585
> >>
> >
>


[jira] [Created] (ARROW-7026) [Java] Remove assertions in MessageSerializer/vector/writer/reader

2019-10-30 Thread Ji Liu (Jira)
Ji Liu created ARROW-7026:
-

 Summary: [Java] Remove assertions in 
MessageSerializer/vector/writer/reader
 Key: ARROW-7026
 URL: https://issues.apache.org/jira/browse/ARROW-7026
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently assertions exist in many classes like 
{{MessageSerializer/JsonReader/JsonWriter/ListVector}} etc.

i. If the JVM arguments that enable assertions are not specified, these checks 
will be skipped, which can lead to potential problems.

ii. Java errors produced by failed assertions are not caught by traditional 
catch clauses.

To fix this, use {{Preconditions}} instead.
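
A small sketch contrasting the two styles of check. The {{checkArgument}} helper below is a hypothetical, hand-written stand-in in the spirit of a Preconditions utility; it is not the library's own class.

```java
// Hypothetical helper class for illustration; not Arrow's Preconditions.
final class CheckSketch {
    private CheckSketch() {}

    // Always evaluated, regardless of whether the JVM was started with -ea.
    static void checkArgument(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    // Uses `assert`: the check is skipped unless assertions are enabled, and a
    // failure raises AssertionError, which is an Error rather than an Exception.
    static int lengthWithAssert(int length) {
        assert length >= 0 : "length must be non-negative";
        return length;
    }

    // Uses an explicit check: it always runs and throws an exception that an
    // ordinary `catch (Exception e)` clause can handle.
    static int lengthWithCheck(int length) {
        checkArgument(length >= 0, "length must be non-negative");
        return length;
    }
}
```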



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7021) [Java] UnionFixedSizeListWriter decimal type should check writer index

2019-10-29 Thread Ji Liu (Jira)
Ji Liu created ARROW-7021:
-

 Summary: [Java] UnionFixedSizeListWriter decimal type should check 
writer index
 Key: ARROW-7021
 URL: https://issues.apache.org/jira/browse/ARROW-7021
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


{{UnionFixedSizeListWriter}} should check the writer index for decimal type 
(just as it does for other types) to ensure that the values written do not 
exceed listSize.

Otherwise, the writer may quietly continue to write data into its underlying 
vector even when writer.idx() > listSize * index.
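
The kind of bounds check meant here can be sketched as follows. This is a hypothetical writer over a plain long array (standing in for decimal values); the class, fields and methods are illustrative assumptions and do not mirror the actual writer implementation.

```java
// Hypothetical fixed-size list writer; only the bounds check is modelled.
final class FixedSizeListWriterSketch {
    private final int listSize;
    private final int listCount;
    private final long[] values;
    private int listIndex;      // index of the list currently being written
    private int writtenInList;  // number of values already written into that list

    FixedSizeListWriterSketch(int listSize, int listCount) {
        this.listSize = listSize;
        this.listCount = listCount;
        this.values = new long[listSize * listCount];
    }

    // Positions the writer at the start of the list with the given index.
    void startList(int index) {
        if (index < 0 || index >= listCount) {
            throw new IndexOutOfBoundsException("list index out of range: " + index);
        }
        listIndex = index;
        writtenInList = 0;
    }

    // Writes one value and fails loudly instead of spilling into the next list.
    void write(long value) {
        if (writtenInList >= listSize) {
            throw new IllegalStateException("values written exceed listSize " + listSize);
        }
        values[listIndex * listSize + writtenInList] = value;
        writtenInList++;
    }

    long get(int list, int offset) {
        return values[list * listSize + offset];
    }
}
```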



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6912) [Java] Extract a common base class for avro converter consumers

2019-10-17 Thread Ji Liu (Jira)
Ji Liu created ARROW-6912:
-

 Summary: [Java] Extract a common base class for avro converter 
consumers
 Key: ARROW-6912
 URL: https://issues.apache.org/jira/browse/ARROW-6912
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the Avro converter consumers share some common variables and methods; 
this duplication could be eliminated by extracting a common base class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6898) [Java] Fix potential memory leak in ArrowWriter and several test classes

2019-10-16 Thread Ji Liu (Jira)
Ji Liu created ARROW-6898:
-

 Summary: [Java] Fix potential memory leak in ArrowWriter and 
several test classes
 Key: ARROW-6898
 URL: https://issues.apache.org/jira/browse/ARROW-6898
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


ARROW-6040 fixed the problem that dictionary entries were required in IPC 
streams even when empty; since that fix, dictionaries are only written when 
there is at least one batch. As a result, if we write an empty stream and 
invoke ArrowWriter#close, the dictionaries are never closed (they are only 
closed after the write operation), which leads to a memory leak that is really 
hard to debug. This problem was found by 
{{TestArrowReaderWriter#testEmptyStreamInStreamingIPC}} when I tried to close 
the allocator after the test. 

 

Besides, several test classes have potential memory leaks because they do not 
close allocator/vector/buf etc.
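
A minimal sketch of the idea behind the fix, assuming a writer that owns its dictionary buffers and normally releases them after the first write. Both classes are hypothetical and only model the ownership pattern; they are not ArrowWriter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical resource that records whether it has been released.
final class TrackedBufferSketch {
    boolean closed;

    void close() {
        closed = true;
    }
}

// Hypothetical writer that owns dictionary buffers; illustrative only.
final class WriterCloseSketch {
    private final List<TrackedBufferSketch> dictionaries = new ArrayList<>();
    private boolean dictionariesReleased;

    WriterCloseSketch(List<TrackedBufferSketch> dictionaries) {
        this.dictionaries.addAll(dictionaries);
    }

    // Writing a batch sends the dictionaries first and then releases them.
    void writeBatch() {
        releaseDictionaries();
    }

    // Releasing here as well covers the empty-stream case, where writeBatch
    // is never called and the dictionaries would otherwise leak.
    void close() {
        releaseDictionaries();
    }

    private void releaseDictionaries() {
        if (dictionariesReleased) {
            return;
        }
        for (TrackedBufferSketch dictionary : dictionaries) {
            dictionary.close();
        }
        dictionariesReleased = true;
    }
}
```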



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6889) [Java] ComplexCopier enable FixedSizeList type & fix RangeEqualsVisitor StackOverFlow

2019-10-15 Thread Ji Liu (Jira)
Ji Liu created ARROW-6889:
-

 Summary: [Java] ComplexCopier enable FixedSizeList type & fix 
RangeEqualsVisitor StackOverFlow
 Key: ARROW-6889
 URL: https://issues.apache.org/jira/browse/ARROW-6889
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


i. Enable {{ComplexCopier}} to copy {{FixedSizeListVector}} values, and add 
related tests

ii. Fix {{RangeEqualsVisitor#compareFixedSizeListVectors}} StackOverFlow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6871) [Java] Enhance TransferPair related parameters check and tests

2019-10-13 Thread Ji Liu (Jira)
Ji Liu created ARROW-6871:
-

 Summary: [Java] Enhance TransferPair related parameters check and 
tests
 Key: ARROW-6871
 URL: https://issues.apache.org/jira/browse/ARROW-6871
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


{{TransferPair}}-related parameter checks in different classes have potential 
problems:

i. {{copyValueSafe}} does not check the from index; if from > valueCount, no 
error is raised.

ii. {{splitAndTransfer}} has no index checks in classes like 
{{VarcharVector}}.

iii. The {{splitAndTransfer}} index check in classes like {{UnionVector}} is not 
correct (Preconditions.checkArgument(startIndex + length <= valueCount)); 
startIndex and length should be checked separately.

iv. Some assert usages should be replaced with {{Preconditions}}.

v. More unit tests should be added to cover corner cases.
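
To illustrate item iii, here is a small sketch showing why the combined check is not sufficient and what checking the parameters separately could look like. The class and method names are hypothetical; only the quoted condition comes from the issue text.

```java
// Hypothetical helper illustrating the index checks; not Arrow's code.
final class TransferCheckSketch {
    private TransferCheckSketch() {}

    // The combined condition quoted in item iii, reproduced for comparison.
    // It accepts a negative startIndex and can be fooled by int overflow.
    static boolean combinedCheck(int startIndex, int length, int valueCount) {
        return startIndex + length <= valueCount;
    }

    // Checks each parameter on its own and avoids computing startIndex + length.
    static void checkSplitArguments(int startIndex, int length, int valueCount) {
        if (startIndex < 0 || startIndex > valueCount) {
            throw new IllegalArgumentException("invalid startIndex: " + startIndex);
        }
        if (length < 0 || length > valueCount - startIndex) {
            throw new IllegalArgumentException("invalid length: " + length);
        }
    }
}
```

For example, a start index of -5 with length 3 passes the combined condition for a vector of 10 values, but is rejected by the separate checks.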



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode

2019-10-11 Thread Ji Liu (Jira)
Ji Liu created ARROW-6853:
-

 Summary: [Java] Support vector and dictionary encoder use 
different hasher for calculating hashCode
 Key: ARROW-6853
 URL: https://issues.apache.org/jira/browse/ARROW-6853
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The Hasher interface was introduced in ARROW-5898 and now has two different 
implementations ({{MurmurHasher}} and {{SimpleHasher}}); there could be more in 
the future.

Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use 
{{SimpleHasher}} for calculating hashCode. This issue enables them to use a 
different hasher, or even a user-defined hasher, for their own use cases.
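
The general shape of a pluggable hasher can be sketched like this. The interface and classes below are hypothetical and self-contained; they do not reproduce the signatures of the actual hasher interface or hash table in the library.

```java
// Hypothetical hasher abstraction; illustrative only.
interface HasherSketch {
    int hash(byte[] data, int offset, int length);
}

// Hypothetical hash table front end that takes its hasher as a parameter
// instead of hard-coding one implementation.
final class HashTableSketch {
    // A simple polynomial hash used when the caller has no special needs.
    static final HasherSketch SIMPLE = (data, offset, length) -> {
        int result = 1;
        for (int i = offset; i < offset + length; i++) {
            result = 31 * result + data[i];
        }
        return result;
    };

    private final HasherSketch hasher;

    HashTableSketch(HasherSketch hasher) {
        this.hasher = hasher;
    }

    // Maps a value to a bucket using whichever hasher was supplied.
    int bucketFor(byte[] value, int bucketCount) {
        return Math.floorMod(hasher.hash(value, 0, value.length), bucketCount);
    }
}
```

A user-defined hasher is then just another implementation of the interface passed to the constructor.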



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6850) [Java] Jdbc converter support Null type

2019-10-11 Thread Ji Liu (Jira)
Ji Liu created ARROW-6850:
-

 Summary: [Java] Jdbc converter support Null type
 Key: ARROW-6850
 URL: https://issues.apache.org/jira/browse/ARROW-6850
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


java.sql.Types.NULL is not supported yet, since there was no NullVector in the 
Java code before.

This could be implemented after ARROW-1638 (IPC roundtrip for null type) is merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6721) [JAVA] Avro adapter benchmark only runs once in JMH

2019-09-27 Thread Ji Liu (Jira)
Ji Liu created ARROW-6721:
-

 Summary: [JAVA] Avro adapter benchmark only runs once in JMH
 Key: ARROW-6721
 URL: https://issues.apache.org/jira/browse/ARROW-6721
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The current {{AvroAdapterBenchmark}} actually runs only once during JMH 
evaluation, since the decoder is consumed by the first invocation and follow-up 
invocations return immediately.

To solve this, we use {{BinaryDecoder}} explicitly in the benchmark and reset 
its inner stream each time the test method is invoked.
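
The effect can be reproduced with a plain in-memory stream. This sketch uses only java.io and does not involve Avro or JMH; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;

// Hypothetical demonstration of why a consumed input must be reset before reuse.
final class ResetBeforeRunSketch {
    private ResetBeforeRunSketch() {}

    // Reads every remaining byte and returns how many were consumed.
    static int drain(ByteArrayInputStream in) {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Simulates two benchmark invocations over the same stream and reports
    // how many bytes each invocation actually processed.
    static int[] runTwice(boolean resetBeforeEachRun) {
        ByteArrayInputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        int[] consumed = new int[2];
        for (int i = 0; i < consumed.length; i++) {
            if (resetBeforeEachRun) {
                in.reset(); // rewinds to the start, since no mark was set
            }
            consumed[i] = drain(in);
        }
        return consumed;
    }
}
```

Without the reset, the second invocation has nothing left to read and does no real work.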



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6710) [Java] Add JDBC adapter test to cover cases which contains some null values

2019-09-26 Thread Ji Liu (Jira)
Ji Liu created ARROW-6710:
-

 Summary: [Java] Add JDBC adapter test to cover cases which 
contains some null values
 Key: ARROW-6710
 URL: https://issues.apache.org/jira/browse/ARROW-6710
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The current JDBC adapter tests only cover the cases where values are all 
non-null or all null.

However, the cases where a ResultSet has some null values are not covered 
(ARROW-6709).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Timeline for 0.15.0 release

2019-09-26 Thread Ji Liu
Hi Micah,
Hmm, unfortunately, I just found a bug in the JDBC adapter and opened a PR. 
Could this change make it into 0.15?
See https://github.com/apache/arrow/pull/5511


Thanks,
Ji Liu



--
From:Micah Kornfield 
Send Time:2019-09-26 (Thursday) 14:23
To:Neal Richardson 
Cc:"Krisztián Szűcs" ; Wes McKinney 
; dev 
Subject:Re: Timeline for 0.15.0 release

Just an FYI, I've started the RC generation process off, the last commit from
master is [1]

I am currently waiting on the crossbow builds (build-690 on
ursa-labs/crossbow).  I think this will take a little while so I will pick
it up tomorrow (Thursday).

Thanks,
Micah

[1]
https://github.com/apache/arrow/commit/07ab5083d5a2925ced6f8168b60b8fa336f4eccc

On Wed, Sep 25, 2019 at 2:07 PM Neal Richardson 
wrote:

> IMO it's too risky to add something that adds a dependency
> (aws-sdk-cpp) on the day of cutting a release.
>
> Neal
>
> On Wed, Sep 25, 2019 at 12:54 PM Krisztián Szűcs
>  wrote:
> >
> > We don't have a comprehensive documentation yet, so let's postpone it.
> >
> >
> > On Wed, Sep 25, 2019 at 9:48 PM Krisztián Szűcs <
> szucs.kriszt...@gmail.com> wrote:
> >>
> >> The S3 python bindings would be a nice addition to the release.
> >> I don't think we should block on this but the PR is ready. Opinions?
> >> https://github.com/apache/arrow/pull/5423
> >>
> >>
> >>
> >>
> >> On Wed, Sep 25, 2019 at 5:28 PM Micah Kornfield 
> wrote:
> >>>
> >>> OK, I'll start the process today.  I'll send up e-mail updates as I
> make progress.
> >>>
> >>> On Wed, Sep 25, 2019 at 8:22 AM Wes McKinney 
> wrote:
> >>>>
> >>>> Yes, all systems go as far as I'm concerned.
> >>>>
> >>>> On Wed, Sep 25, 2019 at 9:56 AM Neal Richardson
> >>>>  wrote:
> >>>> >
> >>>> > Andy's DataFusion issue and Wes's Parquet one have both been merged,
> >>>> > and it looks like the LICENSE issue is being resolved as I type. So
> >>>> > are we good to go now?
> >>>> >
> >>>> > Neal
> >>>> >
> >>>> >
> >>>> > On Tue, Sep 24, 2019 at 10:30 PM Andy Grove 
> wrote:
> >>>> > >
> >>>> > > I found a last minute issue with DataFusion (Rust) and would
> appreciate it
> >>>> > > if we could merge ARROW-6086 (PR is
> >>>> > > https://github.com/apache/arrow/pull/5494) before cutting the RC.
> >>>> > >
> >>>> > > Thanks,
> >>>> > >
> >>>> > > Andy.
> >>>> > >
> >>>> > >
> >>>> > > On Tue, Sep 24, 2019 at 6:19 PM Micah Kornfield <
> emkornfi...@gmail.com>
> >>>> > > wrote:
> >>>> > >
> >>>> > > > OK, I'm going to postpone cutting a release until tomorrow (hoping we can
> >>>> > > > get the issues resolved by then).  I'll also try to review the third-party
> >>>> > > > additions since 14.x.
> >>>> > > >
> >>>> > > > On Tue, Sep 24, 2019 at 4:20 PM Wes McKinney <
> wesmck...@gmail.com> wrote:
> >>>> > > >
> >>>> > > > > I found a licensing issue
> >>>> > > > >
> >>>> > > > > https://issues.apache.org/jira/browse/ARROW-6679
> >>>> > > > >
> >>>> > > > > It might be worth examining third party code added to the
> project
> >>>> > > > > since 0.14.x to make sure there are no other such issues.
> >>>> > > > >
> >>>> > > > > On Tue, Sep 24, 2019 at 6:10 PM Wes McKinney <
> wesmck...@gmail.com>
> >>>> > > > wrote:
> >>>> > > > > >
> >>>> > > > > > I have diagnosed the problem (Thrift "string" data must be
> UTF-8,
> >>>> > > > > > cannot be arbitrary binary) and am working on a patch right
> now
> >>>> > > > > >
> >>>> > > > > > On Tue, Sep 24, 2019 at 6:02 PM Wes McKinney <
> wesmck...@gmail.com>
> >>>> > > > > wrote:
> >>>> > > > > > >
> >>>> > > > > > > I just opened
> >

[jira] [Created] (ARROW-6662) [Java] Implement equals/approxEquals API for VectorSchemaRoot

2019-09-22 Thread Ji Liu (Jira)
Ji Liu created ARROW-6662:
-

 Summary: [Java] Implement equals/approxEquals API for 
VectorSchemaRoot
 Key: ARROW-6662
 URL: https://issues.apache.org/jira/browse/ARROW-6662
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


With the newly added visitor APIs (ARROW-6211), we could now implement 
equals/approxEquals for VectorSchemaRoot.
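
As a rough sketch of what column-wise equals/approxEquals means, the following compares two sets of columns held in plain double arrays. It is a hypothetical stand-in: a real implementation would go through the vector visitor APIs, and the names and the epsilon parameter are assumptions of this sketch.

```java
// Hypothetical column-wise comparison over plain arrays; illustrative only.
final class RootEqualsSketch {
    private RootEqualsSketch() {}

    // Two roots are approximately equal when they have the same shape and every
    // pair of values differs by at most epsilon. NaN never compares as equal.
    static boolean approxEquals(double[][] left, double[][] right, double epsilon) {
        if (left.length != right.length) {
            return false;
        }
        for (int column = 0; column < left.length; column++) {
            if (left[column].length != right[column].length) {
                return false;
            }
            for (int row = 0; row < left[column].length; row++) {
                double a = left[column][row];
                double b = right[column][row];
                if (a == b) {
                    continue; // also covers equal infinities
                }
                if (!(Math.abs(a - b) <= epsilon)) {
                    return false;
                }
            }
        }
        return true;
    }

    // Exact equality is the special case with a tolerance of zero.
    static boolean exactEquals(double[][] left, double[][] right) {
        return approxEquals(left, right, 0.0);
    }
}
```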



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6661) [Java] Implement APIs like slice to enhance VectorSchemaRoot

2019-09-22 Thread Ji Liu (Jira)
Ji Liu created ARROW-6661:
-

 Summary: [Java] Implement APIs like slice to enhance 
VectorSchemaRoot
 Key: ARROW-6661
 URL: https://issues.apache.org/jira/browse/ARROW-6661
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the Java implementation has no APIs like slice for record batches, 
unlike C++/Python.

This issue is about implementing slice/getVector/addVector/removeVector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6600) [Java] Implement dictionary-encoded subfields for Union type

2019-09-18 Thread Ji Liu (Jira)
Ji Liu created ARROW-6600:
-

 Summary: [Java] Implement dictionary-encoded subfields for Union 
type
 Key: ARROW-6600
 URL: https://issues.apache.org/jira/browse/ARROW-6600
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Implement dictionary-encoded subfields for {{Union}} type. Each child vector 
could be encodable or not.

 

Meanwhile, extract common logic into {{DictionaryEncoder}} and refactor 
List subfield encoding to keep it consistent with {{Struct/Union}} type.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-09-06 Thread Ji Liu
Hi all,

During the Java code review [1], it seems there is a problem with the current 
implementations (C++/Java etc.) when reaching EOS: the new-format EOS is 8 
bytes, but the reader only reads 4 bytes when it reaches the end of the stream, 
so the additional 4 bytes are left unread, which causes problems for 
subsequent reads.

There are some suggested options [2], listed below; we should reach a consensus 
and fix this problem before the 0.15 release.
i. For the new format, an 8-byte EOS token should look like {0x, 
0x}, so we read the continuation token first, and then know to read the 
next 4 bytes, which are then 0 to signal EOS.
ii. The reader just remembers the state, so if it reads the continuation token 
at the beginning, then it reads all 8 bytes at the end.
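
Option i can be sketched roughly as below. This is a hypothetical, self-contained reader over a ByteBuffer, not the actual reader code. It assumes a little-endian int32 prefix and uses 0xFFFFFFFF as the continuation token, which is my understanding of the new IPC format; treat both as assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical message-prefix reader; illustrative only.
final class EosSketch {
    // Assumed continuation token value (all bits set, i.e. -1 as an int).
    static final int CONTINUATION = 0xFFFFFFFF;

    private EosSketch() {}

    // Returns the metadata length of the next message, or 0 at end of stream.
    // When the first 4 bytes are the continuation token, the real length is in
    // the following 4 bytes, so an 8-byte EOS is consumed in full.
    static int readMessageLength(ByteBuffer in) {
        in.order(ByteOrder.LITTLE_ENDIAN);
        int first = in.getInt();
        if (first == CONTINUATION) {
            return in.getInt();
        }
        // Old format: the first 4 bytes are already the length (0 means EOS).
        return first;
    }
}
```

With this approach, whatever follows the EOS marker starts at the correct offset for both the old 4-byte and the new 8-byte form.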

Thanks,
Ji Liu

[1] https://github.com/apache/arrow/pull/5229
[2] https://github.com/apache/arrow/pull/5229#discussion_r321715682




--
From:Eric Erhardt 
Send Time:2019-09-05 (Thursday) 07:16
To:dev@arrow.apache.org ; Ji Liu 
Cc:emkornfield ; Paul Taylor 
Subject:RE: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte 
Flatbuffer alignment requirements (2nd vote)

The C# PR is up.

https://github.com/apache/arrow/pull/5280

Eric

-Original Message-
From: Eric Erhardt  
Sent: Wednesday, September 4, 2019 10:12 AM
To: dev@arrow.apache.org; Ji Liu 
Cc: emkornfield ; Paul Taylor 
Subject: RE: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte 
Flatbuffer alignment requirements (2nd vote)

I'm working on a PR for the C# bindings. I hope to have it up in the next day 
or two. Integration tests for C# would be a great addition at some point - it's 
been on my backlog. For now I plan on manually testing it.

-Original Message-
From: Wes McKinney 
Sent: Tuesday, September 3, 2019 10:17 PM
To: Ji Liu 
Cc: emkornfield ; dev ; Paul 
Taylor 
Subject: Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte 
Flatbuffer alignment requirements (2nd vote)

hi folks,

We now have patches up for Java, JS, and Go. How are we doing on the code 
reviews for getting these in?

Since C# implements the binary protocol, the C# developers might want to look 
at this before the 0.15.0 release also. Absent integration tests it's difficult 
to verify the C# library, though

Thanks

On Thu, Aug 29, 2019 at 8:13 AM Ji Liu  wrote:
>
> Here is the Java implementation
> https://github.com/apache/arrow/pull/5229
>
> cc @Wes McKinney @emkornfield
>
> Thanks,
> Ji Liu
>
> --
> From:Ji Liu  Send Time:2019年8月28日(星期三)
> 17:34 To:emkornfield ; dev 
>  Cc:Paul Taylor 
> Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 
> 8-byte Flatbuffer alignment requirements (2nd vote)
>
> I could take the Java implementation and will keep a close watch on this 
> issue in the next few days.
>
> Thanks,
> Ji Liu
>
>
> --
> From:Micah Kornfield  Send Time:2019年8月28日(星期三)
> 17:14 To:dev  Cc:Paul Taylor 
> 
> Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 
> 8-byte Flatbuffer alignment requirements (2nd vote)
>
> I should have integration tests with 0.14.1 generated binaries in the 
> next few days.  I think the one remaining unassigned piece of work is 
> the Java implementation; I can take that up next if no one else gets to it.
>
> On Tue, Aug 27, 2019 at 7:19 PM Wes McKinney  wrote:
>
> > Here's the C++ changes
> >
> > https://github.com/apache/arrow/pull/5211
> >
> > I'm going to create a integration branch where we can merge each 
> > patch before merging to master
> >
> > On Fri, Aug 23, 2019 at 9:03 AM Wes McKinney  wrote:
> > >
> > > It isn't implemented in C++ yet but I will try to get a patch up 
> > > for that soon (today maybe). I think we should create a branch 
> > > where we can stack the patches that implement this for each language.
> > >
> > > On Fri, Aug 23, 2019 at 4:04 AM Paul Taylor 
> > > 
> > wrote:
> > > >
> > > > I'll do the JS updates. Is it safe to 

Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-05 Thread Ji Liu
Congratulations!

Thanks,
Ji Liu


--
From:Fan Liya 
Send Time:2019年9月6日(星期五) 09:28
To:dev 
Subject:Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal 
Richardson

Big congratulations to Ben, Kenta and Neal!

Best,
Liya Fan

On Fri, Sep 6, 2019 at 5:33 AM Wes McKinney  wrote:

> hi all,
>
> on behalf of the Arrow PMC, I'm pleased to announce that Ben, Kenta,
> and Neal have accepted invitations to become Arrow committers. Welcome
> and thank you for all your contributions!
>


[jira] [Created] (ARROW-6472) [Java] ValueVector#accept may has potential cast exception

2019-09-05 Thread Ji Liu (Jira)
Ji Liu created ARROW-6472:
-

 Summary: [Java] ValueVector#accept may has potential cast exception
 Key: ARROW-6472
 URL: https://issues.apache.org/jira/browse/ARROW-6472
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per discussion 
[https://github.com/apache/arrow/pull/5195#issuecomment-528425302]

We may use API this way:
{code:java}
RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
vector3.accept(visitor, range){code}
If vector1/vector2 are, say, {{StructVector}}s and vector3 is an {{IntVector}}, 
things can go wrong: we will use {{compareBaseFixedWidthVectors()}} and perform 
incorrect type casts on vector1/vector2.
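
One possible guard is to verify the visitor's operands before casting. The sketch below uses small stand-in types rather than the real Arrow classes, and every name in it is hypothetical; it only illustrates failing fast with a clear error instead of a ClassCastException.

```java
// Stand-in types for illustration only; these are not the Arrow classes.
final class VisitorGuardSketch {
  interface Vec {
    boolean accept(RangeVisitor visitor);
  }

  static final class IntVec implements Vec {
    public boolean accept(RangeVisitor visitor) {
      return visitor.visitFixedWidth(this);
    }
  }

  static final class StructVec implements Vec {
    public boolean accept(RangeVisitor visitor) {
      return visitor.visitStruct(this);
    }
  }

  static final class RangeVisitor {
    private final Vec left;
    private final Vec right;

    RangeVisitor(Vec left, Vec right) {
      this.left = left;
      this.right = right;
    }

    boolean visitFixedWidth(IntVec caller) {
      requireType(IntVec.class);
      return true; // element-by-element comparison elided
    }

    boolean visitStruct(StructVec caller) {
      requireType(StructVec.class);
      return true; // element-by-element comparison elided
    }

    // Reject operands that do not match the type the visit method will cast to.
    private void requireType(Class<? extends Vec> expected) {
      if (!expected.isInstance(left) || !expected.isInstance(right)) {
        throw new IllegalArgumentException(
            "Visitor operands must both be " + expected.getSimpleName());
      }
    }
  }
}
```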



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6464) [Java] Refactor FixedSizeListVector#splitAndTransfer with slice API

2019-09-05 Thread Ji Liu (Jira)
Ji Liu created ARROW-6464:
-

 Summary: [Java] Refactor FixedSizeListVector#splitAndTransfer with 
slice API
 Key: ARROW-6464
 URL: https://issues.apache.org/jira/browse/ARROW-6464
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{FixedSizeListVector#splitAndTransfer}} actually uses 
{{copyValueSafe}}, which copies memory; we should use the slice API instead.

Meanwhile, {{splitAndTransfer}} in all classes should perform the index check at 
the beginning.
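
A minimal sketch of such an up-front index check, assuming the method receives a start index and a length and can see the vector's value count. The helper name and its parameters are hypothetical, not the Arrow API.

```java
// Hypothetical helper; not taken from the Arrow code base.
final class SplitRangeCheckSketch {
  /** Throws if [startIndex, startIndex + length) is not inside [0, valueCount]. */
  static void checkSplitRange(int startIndex, int length, int valueCount) {
    // Use long arithmetic so that startIndex + length cannot overflow.
    if (startIndex < 0 || length < 0 || (long) startIndex + length > valueCount) {
      throw new IllegalArgumentException(String.format(
          "Invalid range: startIndex=%d, length=%d, valueCount=%d",
          startIndex, length, valueCount));
    }
  }
}
```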



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6460) [Java] Add unit test for large avro data

2019-09-04 Thread Ji Liu (Jira)
Ji Liu created ARROW-6460:
-

 Summary: [Java] Add unit test for large avro data
 Key: ARROW-6460
 URL: https://issues.apache.org/jira/browse/ARROW-6460
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


To avoid OOM, we have implemented an iterator API in ARROW-6220.

This issue is to add tests with a large fake data set (say 6MM rows in JDBC 
adapter test) and ensure no OOMs occur.

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6452) [Java] Overrite ValueVector toString() method

2019-09-03 Thread Ji Liu (Jira)
Ji Liu created ARROW-6452:
-

 Summary: [Java] Overrite ValueVector toString() method
 Key: ARROW-6452
 URL: https://issues.apache.org/jira/browse/ARROW-6452
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the cpp {{Array#ToString}} returns a human-readable string like:

[
  1,
  2,
  3
]

But the Java {{ValueVector}} does not currently implement {{toString()}} this way.
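
As a rough illustration of the target output, the formatting could look like the sketch below. It works on a plain list of values rather than a real {{ValueVector}}, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical formatter; operates on a plain list, not an Arrow vector.
final class VectorToStringSketch {
  /** Formats the values one per line, indented, inside square brackets. */
  static String format(List<?> values) {
    StringJoiner joiner = new StringJoiner(",\n", "[\n", "\n]");
    for (Object value : values) {
      joiner.add("  " + value);
    }
    return joiner.toString();
  }
}
```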



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type

2019-08-30 Thread Ji Liu (Jira)
Ji Liu created ARROW-6401:
-

 Summary: [Java] Implement dictionary-encoded subfields for Struct 
type
 Key: ARROW-6401
 URL: https://issues.apache.org/jira/browse/ARROW-6401
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Implement dictionary-encoded subfields for Struct type.

Each child vector will have its own dictionary; the dictionary vector is of 
struct type and holds all dictionaries.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-08-29 Thread Ji Liu
Here is the Java implementation
https://github.com/apache/arrow/pull/5229

cc @Wes McKinney @emkornfield

Thanks,
Ji Liu


--
From:Ji Liu 
Send Time:2019年8月28日(星期三) 17:34
To:emkornfield ; dev 
Cc:Paul Taylor 
Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte 
Flatbuffer alignment requirements (2nd vote)

I could take the Java implementation and will keep a close watch on this issue 
in the next few days.

Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time:2019年8月28日(星期三) 17:14
To:dev 
Cc:Paul Taylor 
Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte 
Flatbuffer alignment requirements (2nd vote)

I should have integration tests with 0.14.1 generated binaries in the next
few days.  I think the one remaining unassigned piece of work is the Java
implementation; I can take that up next if no one else gets to it.

On Tue, Aug 27, 2019 at 7:19 PM Wes McKinney  wrote:

> Here's the C++ changes
>
> https://github.com/apache/arrow/pull/5211
>
> I'm going to create a integration branch where we can merge each patch
> before merging to master
>
> On Fri, Aug 23, 2019 at 9:03 AM Wes McKinney  wrote:
> >
> > It isn't implemented in C++ yet but I will try to get a patch up for
> > that soon (today maybe). I think we should create a branch where we
> > can stack the patches that implement this for each language.
> >
> > On Fri, Aug 23, 2019 at 4:04 AM Paul Taylor 
> wrote:
> > >
> > > I'll do the JS updates. Is it safe to validate against the Arrow C++
> > > integration tests?
> > >
> > >
> > > On 8/22/19 7:28 PM, Micah Kornfield wrote:
> > > > I created https://issues.apache.org/jira/browse/ARROW-6313 as a
> tracking
> > > > issue with sub-issues on the development work.  So far no-one has
> claimed
> > > > Java and Javascript tasks.
> > > >
> > > > Would it make sense to have a separate dev branch for this work?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Thu, Aug 22, 2019 at 3:24 PM Wes McKinney 
> wrote:
> > > >
> > > >> The vote carries with 4 binding +1 votes and 1 non-binding +1
> > > >>
> > > >> I'll merge the specification patch later today and we can begin
> > > >> working on implementations so we can get this done for 0.15.0
> > > >>
> > > >> On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler 
> wrote:
> > > >>> +1 (non-binding)
> > > >>>
> > > >>> On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou 
> > > >> wrote:
> > > >>>> Sorry, had forgotten to send my vote on this.
> > > >>>>
> > > >>>> +1 from me.
> > > >>>>
> > > >>>> Regards
> > > >>>>
> > > >>>> Antoine.
> > > >>>>
> > > >>>>
> > > >>>> On Wed, 14 Aug 2019 17:42:33 -0500
> > > >>>> Wes McKinney  wrote:
> > > >>>>> hi all,
> > > >>>>>
> > > >>>>> As we've been discussing [1], there is a need to introduce 4
> bytes of
> > > >>>>> padding into the preamble of the "encapsulated IPC message"
> format to
> > > >>>>> ensure that the Flatbuffers metadata payload begins on an 8-byte
> > > >>>>> aligned memory offset. The alternative to this would be for Arrow
> > > >>>>> implementations where alignment is important (e.g. C or C++) to
> copy
> > > >>>>> the metadata (which is not always small) into memory when it is
> > > >>>>> unaligned.
> > > >>>>>
> > > >>>>> Micah has proposed to address this by adding a
> > > >>>>> 4-byte "continuation" value at the beginning of the payload
> > > >>>>> having the value 0x. The reason to do it this way is that
> > > >>>>> old clients will see an invalid length (what is currently the
> > > >>>>> first 4 bytes of the message -- a 32-bit little endian signed
> > > >>>>> integer indicating the metadata length) rather than potentially
> > > >>>>> crashing on a valid length. We also propose to expand the "end of
> > > >>>>> stream" marker used i

Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-08-28 Thread Ji Liu
I could take the Java implementation and will keep a close watch on this issue 
in the next few days.

Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time:2019年8月28日(星期三) 17:14
To:dev 
Cc:Paul Taylor 
Subject:Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte 
Flatbuffer alignment requirements (2nd vote)

I should have integration tests with 0.14.1 generated binaries in the next
few days.  I think the one remaining unassigned piece of work is the Java
implementation; I can take that up next if no one else gets to it.

On Tue, Aug 27, 2019 at 7:19 PM Wes McKinney  wrote:

> Here's the C++ changes
>
> https://github.com/apache/arrow/pull/5211
>
> I'm going to create a integration branch where we can merge each patch
> before merging to master
>
> On Fri, Aug 23, 2019 at 9:03 AM Wes McKinney  wrote:
> >
> > It isn't implemented in C++ yet but I will try to get a patch up for
> > that soon (today maybe). I think we should create a branch where we
> > can stack the patches that implement this for each language.
> >
> > On Fri, Aug 23, 2019 at 4:04 AM Paul Taylor 
> wrote:
> > >
> > > I'll do the JS updates. Is it safe to validate against the Arrow C++
> > > integration tests?
> > >
> > >
> > > On 8/22/19 7:28 PM, Micah Kornfield wrote:
> > > > I created https://issues.apache.org/jira/browse/ARROW-6313 as a
> tracking
> > > > issue with sub-issues on the development work.  So far no-one has
> claimed
> > > > Java and Javascript tasks.
> > > >
> > > > Would it make sense to have a separate dev branch for this work?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Thu, Aug 22, 2019 at 3:24 PM Wes McKinney 
> wrote:
> > > >
> > > >> The vote carries with 4 binding +1 votes and 1 non-binding +1
> > > >>
> > > >> I'll merge the specification patch later today and we can begin
> > > >> working on implementations so we can get this done for 0.15.0
> > > >>
> > > >> On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler 
> wrote:
> > > >>> +1 (non-binding)
> > > >>>
> > > >>> On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou 
> > > >> wrote:
> > > >>>> Sorry, had forgotten to send my vote on this.
> > > >>>>
> > > >>>> +1 from me.
> > > >>>>
> > > >>>> Regards
> > > >>>>
> > > >>>> Antoine.
> > > >>>>
> > > >>>>
> > > >>>> On Wed, 14 Aug 2019 17:42:33 -0500
> > > >>>> Wes McKinney  wrote:
> > > >>>>> hi all,
> > > >>>>>
> > > >>>>> As we've been discussing [1], there is a need to introduce 4
> bytes of
> > > >>>>> padding into the preamble of the "encapsulated IPC message"
> format to
> > > >>>>> ensure that the Flatbuffers metadata payload begins on an 8-byte
> > > >>>>> aligned memory offset. The alternative to this would be for Arrow
> > > >>>>> implementations where alignment is important (e.g. C or C++) to
> copy
> > > >>>>> the metadata (which is not always small) into memory when it is
> > > >>>>> unaligned.
> > > >>>>>
> > > >>>>> Micah has proposed to address this by adding a
> > > >>>>> 4-byte "continuation" value at the beginning of the payload
> > > >>>>> having the value 0x. The reason to do it this way is that
> > > >>>>> old clients will see an invalid length (what is currently the
> > > >>>>> first 4 bytes of the message -- a 32-bit little endian signed
> > > >>>>> integer indicating the metadata length) rather than potentially
> > > >>>>> crashing on a valid length. We also propose to expand the "end of
> > > >>>>> stream" marker used in the stream and file format from 4 to 8
> > > >>>>> bytes. This has the additional effect of aligning the file footer
> > > >>>>> defined in File.fbs.
> > > >>>>>
> > > >>>>> This would be a backwards incompatible protocol change, so older
> > > >> Arrow
> > > >>&g

[jira] [Created] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type

2019-08-26 Thread Ji Liu (Jira)
Ji Liu created ARROW-6356:
-

 Summary: [Java] Avro adapter implement Enum type and nested Record 
type
 Key: ARROW-6356
 URL: https://issues.apache.org/jira/browse/ARROW-6356
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


Implement conversion of the avro {{Enum}} type.

Convert nested avro {{Record}} type to Arrow {{StructVector}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6311) [Java] Make ApproxEqualsVisitor accept DiffFunction to make it more flexible

2019-08-21 Thread Ji Liu (Jira)
Ji Liu created ARROW-6311:
-

 Summary: [Java] Make ApproxEqualsVisitor accept DiffFunction to 
make it more flexible
 Key: ARROW-6311
 URL: https://issues.apache.org/jira/browse/ARROW-6311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{ApproxEqualsVisitor}} accepts a single epsilon for both float and 
double comparison, and the difference calculation is always {{Math.abs}}(f1-f2).

For some cases like {{Validator}} it is not very suitable as:

i. it has different epsilon values for float/double

ii. its difference function is not Math.abs(f1-f2)

To resolve these, make this visitor accept both float/double epsilons and diff 
functions.
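
The proposed shape can be sketched as below. The interface and class names are hypothetical and do not come from the Arrow code base; the sketch only shows separate float/double epsilons combined with pluggable difference functions, with the absolute difference as the default.

```java
// Hypothetical sketch; names do not come from the Arrow code base.
final class ApproxEqualsSketch {
  @FunctionalInterface
  interface FloatDiff {
    float diff(float a, float b);
  }

  @FunctionalInterface
  interface DoubleDiff {
    double diff(double a, double b);
  }

  private final float floatEpsilon;
  private final double doubleEpsilon;
  private final FloatDiff floatDiff;
  private final DoubleDiff doubleDiff;

  ApproxEqualsSketch(float floatEpsilon, double doubleEpsilon,
      FloatDiff floatDiff, DoubleDiff doubleDiff) {
    this.floatEpsilon = floatEpsilon;
    this.doubleEpsilon = doubleEpsilon;
    this.floatDiff = floatDiff;
    this.doubleDiff = doubleDiff;
  }

  /** Default behaviour: the difference is Math.abs(f1 - f2). */
  static ApproxEqualsSketch withAbsoluteDiff(float floatEpsilon, double doubleEpsilon) {
    return new ApproxEqualsSketch(floatEpsilon, doubleEpsilon,
        (a, b) -> Math.abs(a - b), (a, b) -> Math.abs(a - b));
  }

  boolean floatsEqual(float a, float b) {
    return floatDiff.diff(a, b) <= floatEpsilon;
  }

  boolean doublesEqual(double a, double b) {
    return doubleDiff.diff(a, b) <= doubleEpsilon;
  }
}
```

A caller with different needs could then pass, for example, a relative difference such as `(a, b) -> Math.abs(a - b) / Math.abs(a)` without changing the visitor itself.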



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[Discuss] Support read/write interleaved dictionaries and batches in IPC stream

2019-08-21 Thread Ji Liu
Hi all,

Recently, when we worked on fixing an IPC-related bug on both the Java and C++ 
sides[1][2], @emkornfield found that the stream reader assumes that all 
dictionaries are at the start of the stream, which is inconsistent with the 
spec[3], which says that as long as a record batch doesn't reference a 
dictionary they can be interleaved.

The cases below should be supported; however, they crash in the current 
implementations.
i. A record batch with one dictionary-encoded column S
 1> Schema
 2> RecordBatch: S=[null, null, null, null]
 3> DictionaryBatch: ['abc', 'efg']
 4> RecordBatch: S=[0, 1, 0, 1]
ii. A record batch with two dictionary-encoded columns S1, S2
 1> Schema
 2> DictionaryBatch S1: ['ab', 'cd']
 3> RecordBatch: S1=[0, 1, 0, 1] S2=[null, null, null, null]
 4> DictionaryBatch S2: ['cc', 'dd']
 5> RecordBatch: S1=[0, 1, 0, 1] S2=[0, 1, 0, 1]

We already did some work on the Java side via [1] to make it possible to parse 
interleaved dictionaries and batches:
 i. In ArrowStreamReader, do not read all dictionaries at the start.
 ii. When loadNextBatch is called, we read the next message to decide whether 
to read dictionaries first or to read a batch directly; in the former case, we 
read all dictionaries that precede this batch.
 iii. When we read a batch, we check whether the dictionaries it needs have 
already been read; if not, we check whether the column is all null and decide 
whether to throw an exception.
In this way, whether or not they are interleaved, we can parse the stream 
properly.
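
The three steps above can be sketched as follows. This is a simplified model and not the ArrowStreamReader code: the Message type is a stand-in, and a record batch is assumed to list only the dictionary ids referenced by columns that are not entirely null.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Stand-in message model for illustration; not the Arrow Java classes.
final class InterleavedReaderSketch {
  static final class Message {
    final boolean isDictionary;
    final long dictionaryId;      // id defined by a dictionary batch
    final Set<Long> requiredIds;  // ids referenced by non-null columns of a record batch

    private Message(boolean isDictionary, long dictionaryId, Set<Long> requiredIds) {
      this.isDictionary = isDictionary;
      this.dictionaryId = dictionaryId;
      this.requiredIds = requiredIds;
    }

    static Message dictionary(long id) {
      return new Message(true, id, new HashSet<>());
    }

    static Message batch(Long... requiredIds) {
      return new Message(false, -1L, new HashSet<>(Arrays.asList(requiredIds)));
    }
  }

  private final Iterator<Message> messages;
  private final Set<Long> loadedDictionaries = new HashSet<>();

  InterleavedReaderSketch(List<Message> messages) {
    this.messages = messages.iterator();
  }

  /** Returns true if a record batch was loaded, false at end of stream. */
  boolean loadNextBatch() {
    while (messages.hasNext()) {
      Message message = messages.next();
      if (message.isDictionary) {
        // Read every dictionary that precedes the next record batch.
        loadedDictionaries.add(message.dictionaryId);
        continue;
      }
      // A batch may only reference dictionaries that have already been read.
      if (!loadedDictionaries.containsAll(message.requiredIds)) {
        throw new IllegalStateException("Record batch references an unread dictionary");
      }
      return true;
    }
    return false;
  }
}
```

Case (i) from the list of examples then maps to `batch()`, `dictionary(0L)`, `batch(0L)`: the first batch needs no dictionary because its only column is all null.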

In the future, I think we should also support writing interleaved dictionaries 
and batches in the IPC stream (I created an issue to track this[4]), but it is 
not yet clear how to implement this.
Any opinions about this are appreciated, thanks!

Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-6040
[2] https://issues.apache.org/jira/browse/ARROW-6126
[3] http://arrow.apache.org/docs/format/IPC.html#streaming-format
[4] https://issues.apache.org/jira/browse/ARROW-6308

[jira] [Created] (ARROW-6308) [Java] Support write interleaved dictionaries and batches in IPC stream

2019-08-21 Thread Ji Liu (Jira)
Ji Liu created ARROW-6308:
-

 Summary: [Java] Support write interleaved dictionaries and batches 
in IPC stream
 Key: ARROW-6308
 URL: https://issues.apache.org/jira/browse/ARROW-6308
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per discussions in the following threads, and as the 
spec ([http://arrow.apache.org/docs/format/IPC.html#streaming-format]) 
describes, as long as a record batch doesn't reference a dictionary they can be 
interleaved.

[https://github.com/apache/arrow/pull/4960]

[https://github.com/apache/arrow/pull/5146]

Currently it is possible to parse interleaved dictionaries and batches via 
ARROW-6040, but it is impossible to write data in this format.

This issue tracks this problem and should be worked on after a mailing list 
discussion.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: Timeline for 0.15.0 release

2019-08-19 Thread Ji Liu
Hi Wes, on the Java side, I can think of several bugs that need to be fixed or 
flagged.

i. ARROW-6040: Dictionary entries are required in IPC streams even when empty[1]
This one is under review now; however, through this PR we found what seems to 
be a bug in how Java reads and writes dictionaries in IPC, which is 
inconsistent with the spec[2] since it assumes all dictionaries are at the 
start of the stream (see details in the PR comments; this fix may not make it 
into version 0.15). 
@Micah Kornfield

ii. ARROW-1875: Write 64-bit ints as strings in integration test JSON files[3]
The Java side code is already checked in; the other implementations do not 
seem to be.

iii. ARROW-6202: OutOfMemory in JdbcAdapter[4]
Caused by trying to load all records in one contiguous batch; fixed by 
providing an iterator API for iterative reading in ARROW-6219[5].

Thanks,
Ji Liu

[1] https://github.com/apache/arrow/pull/4960
[2] https://arrow.apache.org/docs/ipc.html
[3] https://issues.apache.org/jira/browse/ARROW-1875
[4] https://issues.apache.org/jira/browse/ARROW-6202
[5] https://issues.apache.org/jira/browse/ARROW-6219



--
From:Wes McKinney 
Send Time:2019年8月19日(星期一) 23:03
To:dev 
Subject:Re: Timeline for 0.15.0 release

I'm going to work some on organizing the 0.15.0 backlog some this
week, if anyone wants to help with grooming (particularly for
languages other than C++/Python where I'm focusing) that would be
helpful. There have been almost 500 JIRA issues opened since the
0.14.0 release, so we should make sure to check whether there's any
regressions or other serious bugs that we should try to fix for
0.15.0.

On Thu, Aug 15, 2019 at 6:23 PM Wes McKinney  wrote:
>
> The Windows wheel issue in 0.14.1 seems to be
>
> https://issues.apache.org/jira/browse/ARROW-6015
>
> I think the root cause could be the Windows changes in
>
> https://github.com/apache/arrow/commit/223ae744cc2a12c60cecb5db593263a03c13f85a
>
> I would be appreciative if a volunteer would look into what was wrong
> with the 0.14.1 wheels on Windows. Otherwise 0.15.0 Windows wheels
> will be broken, too
>
> The bad wheels can be found at
>
> https://bintray.com/apache/arrow/python#files/python%2F0.14.1
>
> On Thu, Aug 15, 2019 at 1:28 PM Antoine Pitrou  wrote:
> >
> > On Thu, 15 Aug 2019 11:17:07 -0700
> > Micah Kornfield  wrote:
> > > >
> > > > In C++ they are
> > > > independent, we could have 32-bit array lengths and variable-length
> > > > types with 64-bit offsets if we wanted (we just wouldn't be able to
> > > > have a List child with more than INT32_MAX elements).
> > >
> > > I think the point is we could do this in C++ but we don't.  I'm not sure 
> > > we
> > > would have introduced the "Large" types if we did.
> >
> > 64-bit offsets take twice as much space as 32-bit offsets, so if you're
> > storing lots of small-ish lists or strings, 32-bit offsets are
> > preferrable.  So even with 64-bit array lengths from the start it would
> > still be beneficial to have types with 32-bit offsets.
> >
> > > Going with the limited address space in Java and calling it a reference
> > > implementation seems suboptimal. If a consumer uses a "Large" type
> > > presumably it is because they need the ability to store more than 
> > > INT32_MAX
> > > child elements in a column, otherwise it is just wasting space [1].
> >
> > Probably. Though if the individual elements (lists or strings) are
> > large, not much space is wasted in proportion, so it may be simpler in
> > such a case to always create a "Large" type array.
> >
> > > [1] I suppose theoretically there might be some performance benefits on
> > > 64-bit architectures to using the native word sizes.
> >
> > Concretely, common 64-bit architectures don't do that, as 32-bit is an
> > extremely common integer size even in high-performance code.
> >
> > Regards
> >
> > Antoine.
> >
> >



[jira] [Created] (ARROW-6289) [Java] Add empty() in UnionVector to create instance

2019-08-18 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6289:
-

 Summary: [Java] Add empty() in UnionVector to create instance
 Key: ARROW-6289
 URL: https://issues.apache.org/jira/browse/ARROW-6289
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently all complex type vectors except {{UnionVector}} have an {{empty}}() 
API to create an instance.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6288) [Java] Implement TypeEqualsVisitor comparing vector type equals considering names and metadata

2019-08-18 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6288:
-

 Summary: [Java] Implement TypeEqualsVisitor comparing vector type 
equals considering names and metadata
 Key: ARROW-6288
 URL: https://issues.apache.org/jira/browse/ARROW-6288
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently when we compare range/vector equality, we first compare the vectors' 
{{Field}}s using the equals method, so it is hard to specify whether to 
compare names or metadata.

Implementing a {{TypeEqualsVisitor}} will make type comparisons more flexible, 
as the cpp implementation does 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc#L712]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6265) [Java] Avro adapter implement Array/Map/Fixed type

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6265:
-

 Summary: [Java] Avro adapter implement Array/Map/Fixed type
 Key: ARROW-6265
 URL: https://issues.apache.org/jira/browse/ARROW-6265
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Support Array/Map/Fixed type in avro adapter.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6250) [Java] Implement ApproxEqualsVisitor comparing approx for floating point

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6250:
-

 Summary: [Java] Implement ApproxEqualsVisitor comparing approx for 
floating point
 Key: ARROW-6250
 URL: https://issues.apache.org/jira/browse/ARROW-6250
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently we already implemented {{RangeEqualsVisitor/VectorEqualsVisitor}} for 
comparing range/vector.

And ARROW-6211 is created to make {{ValueVector}} work with generic visitor.

We should also implement {{ApproxEqualsVisitor}} to compare floating point just 
like cpp does

[https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6249:
-

 Summary: [Java] Remove useless class ByteArrayWrapper
 Key: ARROW-6249
 URL: https://issues.apache.org/jira/browse/ARROW-6249
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


This class was introduced into the encoding part to compare byte[] values for 
equality.

Since we now compare value/vector equality with the visitor API added in 
ARROW-6022 instead of comparing {{getObject}} results, this class is no longer 
used.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-14 Thread Ji Liu
My original thought was that introducing a new interface makes the hierarchy a 
little confusing (FixedListVector->BaseListVector, 
ListVector->BaseRepeatedVector->BaseListVector) and that we should try to avoid 
introducing new classes.

And you are right, FixedSizeListVector should not include offsetBuffer. I am 
fine with the approach you suggested: if BaseRepeatedVector/ListVector and 
FixedSizeListVector both inherit from the new interface (BaseListVector), they 
can be used interchangeably in dictionary encoding. Let's see if others have 
different opinions.


Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time:2019年8月14日(星期三) 15:10
To:Ji Liu 
Cc:dev 
Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

You are right, the mainly difference between FixSizedListVector and ListVector 
is the offsetBuffer, but I think this could be avoided through 
allocateNewSafe() overwrite which calls allocateOffsetBuffer() in 
BaseRepeatedValueVector, in this way, offsetBuffer in FixSizedListVector will 
remain allocator.getEmpty(). 

I think there other methods that FixedSizeList shouldn't be implementing that 
are on List as well.   In an ideal world, I think the parent class/interface 
would be called ListVector and there would then be specific children of 
FixedSizeList and VariableSizeList.  I think that is too big a change to 
something core, but we should try to keep the relationship in that shape, so we 
don't need to override methods just to throw NotSupportedExceptions.
On Sun, Aug 11, 2019 at 7:35 AM Ji Liu  wrote:

Thanks Jacques, to avoid complex call paths for getObject, we should keep 
getObject for both classes. I'll also check the other methods.

Thanks,
Ji Liu

--
From:Jacques Nadeau 
Send Time:2019年8月11日(星期日) 21:43
To:dev ; Ji Liu 
Cc:emkornfield 
Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

We tried to get away from this kind of back and forth with subclassing as
much as possible. (call getObject on base class which then calls getIndex
on child class which then calls something else on base class). I haven't
looked through the code but let's try to avoid having complex call paths
for the vectors.

On Sat, Aug 10, 2019 at 6:07 PM Ji Liu  wrote:

> Hi Micah, thanks for your suggestion.
> You are right, the main difference between FixSizedListVector and
> ListVector is the offsetBuffer, but I think this could be avoided by
> overriding allocateNewSafe(), which calls allocateOffsetBuffer() in
> BaseRepeatedValueVector; in this way, the offsetBuffer in FixSizedListVector
> will remain allocator.getEmpty().
>
> Meanwhile, we could add getStartIndex(int index)/getEndIndex(int index)
> API to handle read data logic respectively which could be used in
> getObject(int index) or encoding parts. What’s more, no new interface need
> to be introduced.
>
> What do you think?
>
>
> Thanks,
> Ji Liu
>
>
> --
> From:Micah Kornfield 
> Send Time:2019年8月11日(星期日) 08:47
> To:dev ; Ji Liu 
> Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from
> ListVector
>
> Hi Ji Liu,
> I think have a common interface/base-class for the two makes sense (but
> don't have historical context) from a reading data perspective.
>
> I think the change would need to be something above
> BaseRepeatedValueVector, since the FixedSizeListVector doesn't contain an
> offset buffer, and that field is contained on BaseRepeatedValueVector.
>
> Thanks,
> Micah
> On Sat, Aug 10, 2019 at 5:25 PM Ji Liu  wrote:
> Hi, all
>
>  While working on the issue to implement dictionary-encoded subfields[1]
> [2], I found that FixedSizeListVector does not extend ListVector (thanks to
> Micah for pointing this out; I am curious why FixedSizeListVector was
> implemented this way before). Since FixedSizeListVector is a specific case of
> ListVector, should we make the former extend the latter to reduce the
> duplicated logic in these two and in the writer/reader classes?
>
>
>  Thanks,
>  Ji Liu
>
>  [1] https://issues.apache.org/jira/browse/ARROW-1175
>  [2] https://github.com/apache/arrow/pull/4972
>
>


Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

2019-08-14 Thread Ji Liu
Congrats Sebastian!


--
From:Micah Kornfield 
Send Time:2019年8月15日(星期四) 10:46
To:dev@arrow.apache.org 
Subject:Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

Congrats.  Well deserved.

On Wednesday, August 14, 2019, paddy horan  wrote:

> Congrats Sebastian!
>
> From: Wes McKinney 
> Sent: Tuesday, August 13, 2019 4:54 PM
> To: dev@arrow.apache.org
> Subject: [ANNOUNCE] New Arrow PMC member: Sebastien Binet
>
> The Project Management Committee (PMC) for Apache Arrow has invited
> Sebastien Binet to become a PMC member and we are pleased to announce
> that Sebastien has accepted.
>
> Congratulations and welcome!
>


Re: [Java] CI builds failing on master

2019-08-14 Thread Ji Liu
Hi, Wes, as described in JIRA, this was introduced by our recent two patches, I 
have just submitted a PR[1] to fix this. Thanks for tracking this issue.


Thanks,
Ji Liu

[1] https://github.com/apache/arrow/pull/5090


--
From:Wes McKinney 
Send Time:2019年8月15日(星期四) 06:16
To:dev 
Subject:[Java] CI builds failing on master

We've got some Java-related build failures occurring on master

https://travis-ci.org/apache/arrow/jobs/571998256

Since we build the Java library in some of the C++/Python builds
sorting this out is fairly urgent so we can continue to merge patches.

I opened

https://issues.apache.org/jira/browse/ARROW-6241

to track

thanks
Wes

[jira] [Created] (ARROW-6234) [Java] ListVector hashCode() is not correct

2019-08-14 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6234:
-

 Summary: [Java] ListVector hashCode() is not correct
 Key: ARROW-6234
 URL: https://issues.apache.org/jira/browse/ARROW-6234
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The current implementation is not correct:
{code:java}
for (int i = start; i < end; i++) {
  hash = 31 * vector.hashCode(i);
}
{code}
Should be something like:
{code:java}
hash = 31 * hash + vector.hashCode(i);{code}
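
To make the difference concrete, here is a small standalone sketch that applies both formulas to an array of per-element hash codes. It does not use the Arrow classes; the names are hypothetical and the initial hash value of 0 is an assumption, since the snippet above does not show it.

```java
// Standalone illustration; operates on precomputed element hashes.
final class ListHashSketch {
  /** Buggy form: each iteration overwrites the previous result. */
  static int buggyHash(int[] elementHashes) {
    int hash = 0;
    for (int elementHash : elementHashes) {
      hash = 31 * elementHash;
    }
    return hash;
  }

  /** Fixed form: each iteration folds the element into the running hash. */
  static int fixedHash(int[] elementHashes) {
    int hash = 0;
    for (int elementHash : elementHashes) {
      hash = 31 * hash + elementHash;
    }
    return hash;
  }
}
```

With the buggy form only the last element contributes, so lists such as {1, 2, 3} and {9, 9, 3} collide; the fixed form distinguishes them.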



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6218) [Java] Add UINT type test in integration to avoid potential overflow

2019-08-12 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6218:
-

 Summary: [Java] Add UINT type test in integration to avoid 
potential overflow
 Key: ARROW-6218
 URL: https://issues.apache.org/jira/browse/ARROW-6218
 Project: Apache Arrow
  Issue Type: Test
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As per discussion [https://github.com/apache/arrow/pull/5002]

For UINT types, when writing/reading JSON data in the integration tests, the 
data type is widened (i.e. Long->BigInteger, Int->Long) to avoid potential overflow.

For UINT8, for example, the write-side and read-side code looks like this:

 
{code:java}
case UINT8:

  generator.writeNumber(UInt8Vector.getNoOverflow(buffer, index));

  break;{code}
 
{code:java}
BigInteger value = parser.getBigIntegerValue();

buf.writeLong(value.longValue());
{code}
We should add a test to guard against potential overflow in the data transfer process.
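
As a rough illustration of the widening involved, the sketch below uses only JDK methods (`Integer.toUnsignedLong`, `Long.toUnsignedString`, `BigInteger.longValue`). The class and method names are hypothetical and are not part of the Arrow integration code; the point is only to show why an unsigned 64-bit value needs `BigInteger` on the JSON side and how it round-trips back to the same 64 bits.

```java
import java.math.BigInteger;

/** Illustration only: widening unsigned values so they do not appear negative. */
final class UnsignedWideningSketch {

  /** Unsigned 32-bit value stored in an int, widened to a non-negative long. */
  static long widenUInt32(int raw) {
    return Integer.toUnsignedLong(raw);
  }

  /** Unsigned 64-bit value stored in a long, widened to a non-negative BigInteger. */
  static BigInteger widenUInt64(long raw) {
    return new BigInteger(Long.toUnsignedString(raw));
  }

  /** Read side: keep the low 64 bits, which restores the original bit pattern. */
  static long narrowToUInt64Bits(BigInteger value) {
    return value.longValue();
  }

  public static void main(String[] args) {
    long maxUInt64Bits = -1L; // all 64 bits set, i.e. the largest unsigned value
    BigInteger written = widenUInt64(maxUInt64Bits);
    System.out.println(written);
    System.out.println(narrowToUInt64Bits(written) == maxUInt64Bits);
  }
}
```

A test along these lines would write the maximum unsigned value, read it back, and check that the stored bits are unchanged.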





Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-11 Thread Ji Liu
Thanks Jacques. To avoid complex call paths for getObject, we should keep 
getObject in both classes. I'll also check the other methods.

Thanks,
Ji Liu


--
From:Jacques Nadeau 
Send Time:Sunday, August 11, 2019 21:43
To:dev ; Ji Liu 
Cc:emkornfield 
Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

We tried to get away from this kind of back and forth with subclassing as
much as possible. (call getObject on base class which then calls getIndex
on child class which then calls something else on base class). I haven't
looked through the code but let's try to avoid having complex call paths
for the vectors.

On Sat, Aug 10, 2019 at 6:07 PM Ji Liu  wrote:

> Hi Micah, thanks for your suggestion.
> You are right, the main difference between FixedSizeListVector and
> ListVector is the offsetBuffer, but I think this could be handled by
> overriding allocateNewSafe(), which is what calls allocateOffsetBuffer()
> in BaseRepeatedValueVector; that way, offsetBuffer in FixedSizeListVector
> would remain allocator.getEmpty().
>
> Meanwhile, we could add getStartIndex(int index)/getEndIndex(int index)
> APIs so that each class handles its own read logic; these could be used in
> getObject(int index) or in the encoding parts. What's more, no new
> interface would need to be introduced.
>
> What do you think?
>
>
> Thanks,
> Ji Liu
>
>
> --
> From:Micah Kornfield 
> Send Time:Sunday, August 11, 2019 08:47
> To:dev ; Ji Liu 
> Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from
> ListVector
>
> Hi Ji Liu,
> I think having a common interface/base class for the two makes sense from
> a data-reading perspective (though I don't have the historical context).
>
> I think the change would need to be something above
> BaseRepeatedValueVector, since the FixedSizeListVector doesn't contain an
> offset buffer, and that field is contained on BaseRepeatedValueVector.
>
> Thanks,
> Micah
> On Sat, Aug 10, 2019 at 5:25 PM Ji Liu  wrote:
> Hi, all
>
>  While working on the issue to implement dictionary-encoded subfields [1]
> [2], I found that FixedSizeListVector does not extend ListVector (thanks
> to Micah for pointing this out; I am curious why FixedSizeListVector was
> implemented this way). Since FixedSizeListVector is a special case of
> ListVector, should we make the former extend the latter to reduce the
> duplicated logic in these two classes and their writer/reader classes?
>
>
>  Thanks,
>  Ji Liu
>
>  [1] https://issues.apache.org/jira/browse/ARROW-1175
>  [2] https://github.com/apache/arrow/pull/4972
>
>


[jira] [Created] (ARROW-6200) [Java] Method getBufferSizeFor in BaseRepeatedValueVector/ListVector not correct

2019-08-11 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6200:
-

 Summary: [Java] Method getBufferSizeFor in 
BaseRepeatedValueVector/ListVector not correct
 Key: ARROW-6200
 URL: https://issues.apache.org/jira/browse/ARROW-6200
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, {{getBufferSizeFor}} in {{BaseRepeatedValueVector}} is implemented 
as below:
{code:java}
if (valueCount == 0) {

  return 0;

}

return ((valueCount + 1) * OFFSET_WIDTH) + vector.getBufferSizeFor(valueCount);
{code}
Here {{vector.getBufferSizeFor(valueCount)}} does not seem right; it should be

 
{code:java}
int innerVectorValueCount = offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

vector.getBufferSizeFor(innerVectorValueCount)
{code}
 ListVector has the same problem.
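
The arithmetic behind this can be sketched without the Arrow classes. In the hypothetical example below, an `int[]` stands in for the offset buffer, the inner vector is assumed to hold fixed-width elements, and validity buffers are ignored; none of these names come from the Arrow code base.

```java
/** Illustration only: buffer size of a list layout modelled with plain arrays. */
final class ListBufferSizeSketch {

  static final int OFFSET_WIDTH = 4; // bytes per offset entry

  /**
   * @param offsets           offsets[i] is the start of list i; offsets[valueCount] is the
   *                          number of inner elements used by the first valueCount lists
   * @param valueCount        number of lists to size the buffers for
   * @param innerElementWidth bytes per inner element (assumed fixed width)
   */
  static int bufferSizeFor(int[] offsets, int valueCount, int innerElementWidth) {
    if (valueCount == 0) {
      return 0;
    }
    // The inner vector must be sized by the number of inner elements, not by valueCount.
    int innerVectorValueCount = offsets[valueCount];
    return (valueCount + 1) * OFFSET_WIDTH + innerVectorValueCount * innerElementWidth;
  }

  public static void main(String[] args) {
    int[] offsets = {0, 2, 2, 5}; // three lists with 2, 0 and 3 elements
    System.out.println(bufferSizeFor(offsets, 3, 4));
  }
}
```

Passing `valueCount` straight to the inner vector would size it for three elements here, although the three lists actually reference five.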





[jira] [Created] (ARROW-6199) [Java] Avro adapter avoid potential resource leak.

2019-08-11 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6199:
-

 Summary: [Java] Avro adapter avoid potential resource leak.
 Key: ARROW-6199
 URL: https://issues.apache.org/jira/browse/ARROW-6199
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, the Avro consumer interface has no close API, which may cause 
resource leaks, for example in {{AvroBytesConsumer#cacheBuffer}}.

To resolve this, make the consumer interface extend {{AutoCloseable}} and create 
{{CompositeAvroConsumer}} to encapsulate the consume and close logic. 
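
A minimal sketch of that shape, under stated assumptions: the interface and class names below are hypothetical stand-ins rather than the actual Avro adapter types, and strings replace Avro decoders so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: closeable per-field consumers wrapped by a composite. */
final class CompositeConsumerSketch {

  /** A consumer that may hold resources and therefore must be closeable. */
  interface Consumer extends AutoCloseable {
    void consume(String value);

    @Override
    void close(); // narrowed so callers need not handle a checked exception
  }

  /** Drives one consumer per field and closes all of them together. */
  static final class CompositeConsumer implements AutoCloseable {
    private final List<? extends Consumer> consumers;

    CompositeConsumer(List<? extends Consumer> consumers) {
      this.consumers = consumers;
    }

    void consume(String[] row) {
      for (int i = 0; i < consumers.size(); i++) {
        consumers.get(i).consume(row[i]);
      }
    }

    @Override
    public void close() {
      for (Consumer consumer : consumers) {
        consumer.close();
      }
    }
  }

  /** Test double that records what it saw and whether it was closed. */
  static final class RecordingConsumer implements Consumer {
    final List<String> seen = new ArrayList<>();
    boolean closed;

    @Override
    public void consume(String value) {
      seen.add(value);
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  public static void main(String[] args) {
    List<RecordingConsumer> fields = new ArrayList<>();
    fields.add(new RecordingConsumer());
    fields.add(new RecordingConsumer());
    try (CompositeConsumer composite = new CompositeConsumer(fields)) {
      composite.consume(new String[] {"a", "b"});
    }
    System.out.println(fields.get(0).closed && fields.get(1).closed);
  }
}
```

Using try-with-resources on the composite means callers cannot forget to release a buffer held by an individual consumer.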





Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-10 Thread Ji Liu
Hi Micah, thanks for your suggestion. 
You are right, the main difference between FixedSizeListVector and ListVector 
is the offsetBuffer, but I think this could be handled by overriding 
allocateNewSafe(), which is what calls allocateOffsetBuffer() in 
BaseRepeatedValueVector; that way, offsetBuffer in FixedSizeListVector would 
remain allocator.getEmpty(). 

Meanwhile, we could add getStartIndex(int index)/getEndIndex(int index) APIs so 
that each class handles its own read logic; these could be used in 
getObject(int index) or in the encoding parts. What's more, no new interface 
would need to be introduced.
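
The getStartIndex/getEndIndex idea can be sketched as follows. This is an illustration under assumptions, not code from the thread or from Arrow: plain arrays replace the vectors, and the `ListLayout` interface exists only to keep the example compact.

```java
import java.util.Arrays;

/** Illustration only: one getObject shared by two ways of locating list boundaries. */
final class ListIndexSketch {

  interface ListLayout {
    int getStartIndex(int index);

    int getEndIndex(int index);
  }

  /** Variable-size lists: boundaries come from an offsets array. */
  static ListLayout variableSize(int[] offsets) {
    return new ListLayout() {
      @Override
      public int getStartIndex(int index) {
        return offsets[index];
      }

      @Override
      public int getEndIndex(int index) {
        return offsets[index + 1];
      }
    };
  }

  /** Fixed-size lists: boundaries are computed, so no offsets are needed. */
  static ListLayout fixedSize(int listSize) {
    return new ListLayout() {
      @Override
      public int getStartIndex(int index) {
        return index * listSize;
      }

      @Override
      public int getEndIndex(int index) {
        return (index + 1) * listSize;
      }
    };
  }

  /** Shared read logic that depends only on the two index methods. */
  static int[] getObject(ListLayout layout, int[] data, int index) {
    return Arrays.copyOfRange(data, layout.getStartIndex(index), layout.getEndIndex(index));
  }

  public static void main(String[] args) {
    int[] data = {10, 20, 30, 40, 50, 60};
    System.out.println(Arrays.toString(getObject(variableSize(new int[] {0, 2, 6}), data, 1)));
    System.out.println(Arrays.toString(getObject(fixedSize(3), data, 1)));
  }
}
```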

What do you think?


Thanks,
Ji Liu


--
From:Micah Kornfield 
Send Time:Sunday, August 11, 2019 08:47
To:dev ; Ji Liu 
Subject:Re: [DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

Hi Ji Liu,
I think having a common interface/base class for the two makes sense from a 
data-reading perspective (though I don't have the historical context). 

I think the change would need to be something above BaseRepeatedValueVector, 
since the FixedSizeListVector doesn't contain an offset buffer, and that field 
is contained on BaseRepeatedValueVector.

Thanks,
Micah
On Sat, Aug 10, 2019 at 5:25 PM Ji Liu  wrote:
Hi, all

 While working on the issue to implement dictionary-encoded subfields [1] [2], I 
found that FixedSizeListVector does not extend ListVector (thanks to Micah for 
pointing this out; I am curious why FixedSizeListVector was implemented this 
way). Since FixedSizeListVector is a special case of ListVector, should we make 
the former extend the latter to reduce the duplicated logic in these two 
classes and their writer/reader classes? 


 Thanks,
 Ji Liu

 [1] https://issues.apache.org/jira/browse/ARROW-1175
 [2] https://github.com/apache/arrow/pull/4972



Re: [Java] Arrow PR queue build up?

2019-08-10 Thread Ji Liu
Hi Jacques, thanks for your valuable feedback. 
Sorry for the lack of discussion. Some of these PRs are small changes or bug 
fixes that did not seem to warrant one. You are right, though: some PRs turned 
out during review to be more complex than we expected, and discussing them on 
the ML/JIRA first would have helped. We will try to avoid this situation as 
far as possible in future work.


Thanks,
Ji Liu


--
From:Jacques Nadeau 
Send Time:Sunday, August 11, 2019 03:36
To:dev ; Micah Kornfield 
Subject:Re: [Java] Arrow PR queue build up?

I think one of the issues here is that there is no upfront discussion about
most of the changes that are being proposed. In most cases, a pull request
just appears without any prior discussion. This makes the reviews much more intensive and time
consuming as frequently there are questions about the validity, nature or
rationale of the change. Having short design discussions before starting on
these changes would ensure that the nature of the reviews are less
involved, thus decreasing the effort associated with reviewing them.

On Thu, Aug 8, 2019 at 10:18 PM Micah Kornfield 
wrote:

> I did a pass through most of the open PRs (I might have missed one or
> two).  Most had at least a few minor comments so the backlog hasn't gone
> down that much, but I expect most will be mergeable very soon.
>
> On Thu, Aug 8, 2019 at 9:44 AM Micah Kornfield 
> wrote:
>
> > Not a full solution, but I've fallen behind a bit. I'm going to plan to
> > spend some time tonight at least reviewing PRs I've already done the
> first
> > pass on and I'll try to pickup some more.
> >
> > Having more engaged reviewers would be helpful though.
> >
> > Cheers,
> > Micah
> >
> > On Thursday, August 8, 2019, Wes McKinney  wrote:
> >
> >> hi folks,
> >>
> >> Liya Fan and Ji Liu have about 24 open Java PRs between them if I
> >> counted right -- it seems like the project is having a hard time
> >> keeping up with code reviews and merging on these. It looks to me like
> >> they are making a lot of material improvements to the Java library
> >> where previously there had not been a lot of development, so I would
> >> like to see PRs get merged faster -- any ideas how we might be able to
> >> achieve that?  I know that Micah has been spending a lot of time
> >> reviewing and giving feedback on these PRs so that is much appreciated
> >>
> >> Thanks,
> >> Wes
> >>
> >
>



[DISCUSS][JAVA] Make FixedSizedListVector inherit from ListVector

2019-08-10 Thread Ji Liu
Hi, all

While working on the issue to implement dictionary-encoded subfields [1] [2], I 
found that FixedSizeListVector does not extend ListVector (thanks to Micah for 
pointing this out; I am curious why FixedSizeListVector was implemented this 
way). Since FixedSizeListVector is a special case of ListVector, should we make 
the former extend the latter to reduce the duplicated logic in these two 
classes and their writer/reader classes? 


Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-1175
[2] https://github.com/apache/arrow/pull/4972



[jira] [Created] (ARROW-6194) [Java] Make DictionaryEncoder non-static making it easy to extend and reuse

2019-08-10 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6194:
-

 Summary: [Java] Make DictionaryEncoder non-static making it easy 
to extend and reuse
 Key: ARROW-6194
 URL: https://issues.apache.org/jira/browse/ARROW-6194
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As discussed in [https://github.com/apache/arrow/pull/4994].

The current static DictionaryEncoder is hard to extend and reuse.

Change the APIs slightly and migrate the static methods to an object-based approach.





Re: [ANNOUNCE] New Arrow PMC member: Micah Kornfield

2019-08-09 Thread Ji Liu
Congrats Micah! Well deserved.

Thanks,
Ji Liu


--
From:paddy horan 
Send Time:Friday, August 9, 2019 23:31
To:dev@arrow.apache.org 
Subject:Re: [ANNOUNCE] New Arrow PMC member: Micah Kornfield

Congrats Micah!

From: Wes McKinney 
Sent: Friday, August 9, 2019 11:12 AM
To: dev@arrow.apache.org
Subject: [ANNOUNCE] New Arrow PMC member: Micah Kornfield

The Project Management Committee (PMC) for Apache Arrow has invited
Micah Kornfield to become a PMC member and we are pleased to announce
that Micah has accepted.

Congratulations and welcome!


[jira] [Created] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API

2019-08-08 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6175:
-

 Summary: [Java] Fix MapVector#getMinorType and extend 
AbstractContainerVector addOrGet complex vector API
 Key: ARROW-6175
 URL: https://issues.apache.org/jira/browse/ARROW-6175
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


i. Currently {{MapVector}} extends {{ListVector}} and inherits its 
{{getMinorType}}, which returns the wrong {{MinorType}}.

ii. {{AbstractContainerVector}} currently has only {{addOrGetList}}, 
{{addOrGetUnion}} and {{addOrGetStruct}}, which do not cover all complex types, 
such as {{MapVector}} and {{FixedSizeListVector}}.





[jira] [Created] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors

2019-08-07 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6160:
-

 Summary: [Java] AbstractStructVector#getPrimitiveVectors fails to 
work with complex child vectors
 Key: ARROW-6160
 URL: https://issues.apache.org/jira/browse/ARROW-6160
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct-type 
child vectors are searched recursively for primitive vectors; other complex 
types such as {{ListVector}} and {{UnionVector}} are treated as primitive types 
and returned directly.

For example, for Struct(List(Int), Struct(Int, Varchar)), 
{{getPrimitiveVectors}} should return {{[IntVector, IntVector, VarCharVector]}} 
instead of {{[ListVector, IntVector, VarCharVector]}}.
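
The expected behaviour amounts to a depth-first walk that descends into every complex child, not only structs. The sketch below models the vector tree with a hypothetical `Node` class (names only, no Arrow types) to show the traversal for the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: collecting leaf (primitive) nodes from a nested type tree. */
final class PrimitiveVectorsSketch {

  /** A vector is modelled as a name plus zero or more child vectors. */
  static final class Node {
    final String name;
    final List<Node> children;

    Node(String name, Node... children) {
      this.name = name;
      this.children = Arrays.asList(children);
    }

    boolean isPrimitive() {
      return children.isEmpty();
    }
  }

  static List<String> getPrimitiveVectors(Node root) {
    List<String> result = new ArrayList<>();
    collect(root, result);
    return result;
  }

  /** Recurse into every complex node, whatever its kind, and keep only the leaves. */
  private static void collect(Node node, List<String> result) {
    if (node.isPrimitive()) {
      result.add(node.name);
      return;
    }
    for (Node child : node.children) {
      collect(child, result);
    }
  }

  public static void main(String[] args) {
    Node root = new Node("StructVector",
        new Node("ListVector", new Node("IntVector")),
        new Node("StructVector", new Node("IntVector"), new Node("VarCharVector")));
    System.out.println(getPrimitiveVectors(root));
  }
}
```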





[jira] [Created] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly

2019-08-06 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6145:
-

 Summary: [Java] UnionVector created by MinorType#getNewVector 
could not keep field type info properly
 Key: ARROW-6145
 URL: https://issues.apache.org/jira/browse/ARROW-6145
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


While working on other items, I found that a {{UnionVector}} created by 
{{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} does not 
keep field type info properly. For example, if we set metadata on a {{Field}} 
in the schema, we cannot get it back from {{UnionVector#getField}}.

This is mainly because {{MinorType.Union.getNewVector}} does not pass the 
{{FieldType}} to the vector, and {{UnionVector#getField}} creates a new 
{{Field}}, which causes the inconsistency.





[jira] [Created] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions

2019-08-02 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6118:
-

 Summary: [Java] Replace google Preconditions with Arrow 
Preconditions
 Key: ARROW-6118
 URL: https://issues.apache.org/jira/browse/ARROW-6118
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


In the Java code, most places now use {{org.apache.arrow.util.Preconditions}}, 
but some places still use {{com.google.common.base.Preconditions}}.

Replace the Google Preconditions and, while doing so, remove duplicated checks.





[jira] [Created] (ARROW-6097) [Java] Avro adapter implement unions type

2019-08-01 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6097:
-

 Summary: [Java] Avro adapter implement unions type
 Key: ARROW-6097
 URL: https://issues.apache.org/jira/browse/ARROW-6097
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


Support converting union types such as ["string"], ["string", "int"], and the 
nullable ["string", "int", "null"].





[jira] [Created] (ARROW-6083) [Java] Refactor Jdbc adapter consume logic

2019-07-31 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6083:
-

 Summary: [Java] Refactor Jdbc adapter consume logic
 Key: ARROW-6083
 URL: https://issues.apache.org/jira/browse/ARROW-6083
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


The JDBC adapter's read loop over a {{ResultSet}} looks like:

while (rs.next()) {
  for (int i = 1; i <= columnCount; i++) {
    jdbcToFieldVector(
        rs,
        i,
        rs.getMetaData().getColumnType(i),
        rowCount,
        root.getVector(rsmd.getColumnName(i)),
        config);
  }
  rowCount++;
}

{{jdbcToFieldVector}} contains many switch-case branches; that is to say, for 
every single value from the ResultSet we have to evaluate many conditions.

I think we could optimize this using a consumer/delegate approach, as in the 
Avro adapter.
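
One way to picture the consumer/delegate approach: resolve the type switch once per column while setting up, then let the per-row loop call the pre-built consumers. The sketch below is a hypothetical, self-contained model (rows are `Object[]` rather than a JDBC `ResultSet`, and the type names are plain strings); it is not the adapter's actual design.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: choose a consumer per column once, not per value. */
final class ColumnConsumerSketch {

  interface ColumnConsumer {
    void consume(Object[] row);
  }

  /** The type switch runs here, once per column, before any row is read. */
  static ColumnConsumer createConsumer(String columnType, int columnIndex, List<Object> target) {
    switch (columnType) {
      case "INT":
        return row -> target.add(((Number) row[columnIndex]).intValue());
      case "VARCHAR":
        return row -> target.add(String.valueOf(row[columnIndex]));
      default:
        throw new UnsupportedOperationException("Unsupported column type: " + columnType);
    }
  }

  /** The per-row loop only delegates; it no longer inspects column types. */
  static List<List<Object>> readAll(String[] columnTypes, Object[][] rows) {
    List<List<Object>> columns = new ArrayList<>();
    List<ColumnConsumer> consumers = new ArrayList<>();
    for (int i = 0; i < columnTypes.length; i++) {
      List<Object> column = new ArrayList<>();
      columns.add(column);
      consumers.add(createConsumer(columnTypes[i], i, column));
    }
    for (Object[] row : rows) {
      for (ColumnConsumer consumer : consumers) {
        consumer.consume(row);
      }
    }
    return columns;
  }

  public static void main(String[] args) {
    Object[][] rows = {{1, "a"}, {2, "b"}};
    System.out.println(readAll(new String[] {"INT", "VARCHAR"}, rows));
  }
}
```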





[jira] [Created] (ARROW-6079) [Java] Implement/test UnionFixedSizeListWriter for FixedSizeListVector

2019-07-31 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6079:
-

 Summary: [Java] Implement/test UnionFixedSizeListWriter for 
FixedSizeListVector
 Key: ARROW-6079
 URL: https://issues.apache.org/jira/browse/ARROW-6079
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


We now have two list vectors: {{ListVector}} and {{FixedSizeListVector}}.

{{ListVector}} already has {{UnionListWriter}} for writing data. 
{{FixedSizeListVector}} has no equivalent yet, and it seems the only way for 
users to write data is to get the inner vector and set values manually.

Implementing a writer for {{FixedSizeListVector}} would be useful in some cases.

 





[jira] [Created] (ARROW-6078) [Java] Implement dictionary-encoded subfields for List type

2019-07-30 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6078:
-

 Summary: [Java] Implement dictionary-encoded subfields for List 
type
 Key: ARROW-6078
 URL: https://issues.apache.org/jira/browse/ARROW-6078
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


For example, an int-type List vector (valueCount = 5) with the data below:

10, 20
10, 20
30, 40, 50
30, 40, 50
10, 20

could be encoded to:

0, 1
0, 1
2, 3, 4
2, 3, 4
0, 1

with a list-type dictionary

10, 20, 30, 40, 50

or

10,
20,
30,
40,
50
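
The encoding in this example can be reproduced with a small, self-contained sketch. It uses `java.util` collections instead of Arrow vectors, assigns dictionary indices in order of first appearance, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: dictionary-encoding the elements inside each list. */
final class ListDictionarySketch {

  /**
   * Replaces every element with its dictionary index. New values are appended to
   * dictionaryOut in order of first appearance.
   */
  static List<List<Integer>> encode(List<List<Integer>> lists, List<Integer> dictionaryOut) {
    Map<Integer, Integer> indexOf = new HashMap<>();
    List<List<Integer>> encoded = new ArrayList<>();
    for (List<Integer> list : lists) {
      List<Integer> codes = new ArrayList<>();
      for (Integer value : list) {
        Integer code = indexOf.get(value);
        if (code == null) {
          code = dictionaryOut.size();
          indexOf.put(value, code);
          dictionaryOut.add(value);
        }
        codes.add(code);
      }
      encoded.add(codes);
    }
    return encoded;
  }

  public static void main(String[] args) {
    List<List<Integer>> lists = Arrays.asList(
        Arrays.asList(10, 20),
        Arrays.asList(10, 20),
        Arrays.asList(30, 40, 50),
        Arrays.asList(30, 40, 50),
        Arrays.asList(10, 20));
    List<Integer> dictionary = new ArrayList<>();
    System.out.println(encode(lists, dictionary));
    System.out.println(dictionary);
  }
}
```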

 





[jira] [Created] (ARROW-6035) [Java] Avro adapter support convert nullable value

2019-07-25 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6035:
-

 Summary: [Java] Avro adapter support convert nullable value
 Key: ARROW-6035
 URL: https://issues.apache.org/jira/browse/ARROW-6035
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


A specific kind of Avro union type (one with two types, one of which is the 
null type) can be converted to a nullable Arrow vector.

For instance, ["null", "string"] could be represented by a VarCharVector that 
can hold null values.





[jira] [Created] (ARROW-6022) [Java] Support equals API in ValueVector to compare two vectors equal

2019-07-24 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6022:
-

 Summary: [Java] Support equals API in ValueVector to compare two 
vectors equal
 Key: ARROW-6022
 URL: https://issues.apache.org/jira/browse/ARROW-6022
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


This feature is useful in some cases.

In ARROW-1184, {{Dictionary#equals}} does not work because this API is missing.

Moreover, we have already implemented 
{{equals(int index, ValueVector target, int targetIndex)}}, so the newly added 
API could reuse it.
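
A whole-vector comparison built on an element-level comparison could look roughly like the sketch below. The `SimpleVector` interface and its `elementEquals` method are hypothetical stand-ins for `ValueVector` and its index-based `equals`; boxed `Integer` values, with `null` for null slots, replace Arrow buffers.

```java
import java.util.Objects;

/** Illustration only: vector equality expressed through element equality. */
final class VectorEqualsSketch {

  interface SimpleVector {
    int getValueCount();

    /** Stand-in for the element-level comparison that already exists. */
    boolean elementEquals(int index, SimpleVector target, int targetIndex);
  }

  static final class IntArrayVector implements SimpleVector {
    private final Integer[] values; // a null entry models a null slot

    IntArrayVector(Integer... values) {
      this.values = values;
    }

    @Override
    public int getValueCount() {
      return values.length;
    }

    @Override
    public boolean elementEquals(int index, SimpleVector target, int targetIndex) {
      if (!(target instanceof IntArrayVector)) {
        return false;
      }
      return Objects.equals(values[index], ((IntArrayVector) target).values[targetIndex]);
    }
  }

  /** Two vectors are equal when they have the same length and equal elements. */
  static boolean vectorEquals(SimpleVector left, SimpleVector right) {
    if (left.getValueCount() != right.getValueCount()) {
      return false;
    }
    for (int i = 0; i < left.getValueCount(); i++) {
      if (!left.elementEquals(i, right, i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    SimpleVector a = new IntArrayVector(1, null, 3);
    SimpleVector b = new IntArrayVector(1, null, 3);
    System.out.println(vectorEquals(a, b));
  }
}
```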





[jira] [Created] (ARROW-6020) [Java] Refactor ByteFunctionHelper#hash with new added ArrowBufHasher

2019-07-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6020:
-

 Summary: [Java] Refactor ByteFunctionHelper#hash with new added 
ArrowBufHasher
 Key: ARROW-6020
 URL: https://issues.apache.org/jira/browse/ARROW-6020
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Some logic in these two classes is similar. We should replace the 
ByteFunctionHelper#hash logic with ArrowBufHasher, since it has a murmur hash 
implementation that could reduce hash collisions.





[jira] [Created] (ARROW-6019) [Java] Port Jdbc and Avro adapter to new directory

2019-07-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6019:
-

 Summary: [Java] Port Jdbc and Avro adapter to new directory 
 Key: ARROW-6019
 URL: https://issues.apache.org/jira/browse/ARROW-6019
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As discussed on the mailing list, adapters are different from native readers.

This issue tracks the following items:

i. Create a new "contrib" directory and move the JDBC/Avro adapters to it.

ii. Provide more description.

iii. Change the ORC reader's structure to a "converter".

cc [~emkornfi...@gmail.com]





[jira] [Created] (ARROW-5997) [Java] Support dictionary encoding for Union type

2019-07-22 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5997:
-

 Summary: [Java] Support dictionary encoding for Union type
 Key: ARROW-5997
 URL: https://issues.apache.org/jira/browse/ARROW-5997
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Ji Liu
Assignee: Ji Liu


Union is now the only type not supported by dictionary encoding.

Over the last several weeks we refactored the encoding code, and now it is time 
to support the Union type.





Re: [DISCUSS][JAVA] Designs & goals for readers/writers

2019-07-21 Thread Ji Liu
Thanks for your proposal.
Agreed that Arrow readers/writers should have high performance, like the ORC 
reader, and as mentioned above, I think the current Avro adapter should be 
positioned as an adapter rather than a native reader. I am not sure whether 
Arrow needs library-based adapters, but I have updated the current design in 
ARROW-5845 [1] for your information anyway.


Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-5845


--
From:Jacques Nadeau 
Send Time:Monday, July 22, 2019 09:16
To:dev ; Micah Kornfield 
Subject:Re: [DISCUSS][JAVA] Designs & goals for readers/writers

As I read through your responses, I think it might be useful to talk about
adapters versus native Arrow readers/writers. Adapters are something that
adapt an existing API to produce and/or consume Arrow data. A native
reader/writer is something that understands the format directly and does not
have intermediate representations or APIs the data moves through beyond
those that needs to be used to complete work.

If people want to write adapters for Arrow, I see that as useful but very
different than writing native implementations and we should try to create a
clear delineation between the two.

Further comments inline.


> Could you expand on what level of detail you would like to see a design
> document?
>

A couple paragraphs seems sufficient. This is the goals of the
implementation. We target existing functionality X. It is an adapter. Or it
is a native impl. This is the expected memory and processing
characteristics, etc.  I've never been one for huge amount of design but
I've seen a number of recent patches appear where there is no upfront
discussion. Making sure that multiple people buy into a design is the best way to
ensure long-term maintenance and use.


> I think this should be optional (the same argument below about predicates
> apply so I won't repeat them).
>

Per my comments above, maybe adapter versus native reader clarifies things.
For example, I've been working on a native avro read implementation. It is
little more than chicken scratch at this point but its goals, vision and
design are very different than the adapter that is being produced atm.


> Can you clarify the intent of this objective.  Is it mainly to tie in with
> the existing Java arrow memory book keeping?  Performance?  Something else?
>

Arrow is designed to be off-heap. If you have large variable amounts of
on-heap memory in an application, it starts to make it very hard to make
decisions about off-heap versus on-heap memory since those divisions are by
and large static in nature. It's fine for short lived applications but for
long lived applications, if you're working with a large amount of data, you
want to keep most of your memory in one pool. In the context of Arrow, this
is going to naturally be off-heap memory.


> I'm afraid this might lead to a "perfect is the enemy of the good"
> situation.  Starting off with a known good implementation of conversion to
> Arrow can allow us to both to profile hot-spots and provide a comparison of
> implementations to verify correctness.
>

I'm not clear what message we're sending as a community if we produce low
performance components. The whole point of Arrow is to increase performance, not
decrease it. I'm targeting good, not perfect. At the same time, from my
perspective, Arrow development should not be approached in the same way
that general Java app development should be. If we hold a high standard,
we'll have less total integrations initially but I think we'll solve more
real world problems.

> There is also the question of how widely adoptable we want Arrow libraries
> to be.
> It isn't surprising to me that Impala's Avro reader is an order of
> magnitude faster than the stock Java one.  As far as I know Impala's is a
> C++ implementation that does JIT with LLVM.  We could try to use it as a
> basis for converting to Arrow but I think this might limit adoption in some
> circumstances.  Some organizations/people might be hesitant to adopt the
> technology due to:
> 1.  Use of JNI.
> 2.  Use LLVM to do JIT.
>
> It seems that as long as we have a reasonably general interface to
> data-sources we should be able to optimize/refactor aggressively when
> needed.
>

This is somewhat the crux of the problem. It goes a little bit to who our
consuming audience is and what we're trying to deliver. I'll also say that
trying to build a high-quality implementation on top of low-quality
implementation or library-based adapter is worse than starting from
scratch. I believe this is especially true in Java where developers are
trained to trust hotspot and that things will be good enough. That is great
in a web app but not in systems software where we (and I expect others)
will deploy Arrow.


> >3. Propose a generalized "reader" interface as opposed to making each
>

[jira] [Created] (ARROW-5988) [Java] Avro adapter implement simple Record type

2019-07-19 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5988:
-

 Summary: [Java] Avro adapter implement simple Record type 
 Key: ARROW-5988
 URL: https://issues.apache.org/jira/browse/ARROW-5988
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


1. Implement a simple Record type which contains only primitive types.

2. Add a ByteBuffer cache in the String/Bytes consumers to reduce object creation. 





[jira] [Created] (ARROW-5986) [Java] Code cleanup for dictionary encoding

2019-07-18 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5986:
-

 Summary: [Java] Code cleanup for dictionary encoding
 Key: ARROW-5986
 URL: https://issues.apache.org/jira/browse/ARROW-5986
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


In the last few weeks, we did some refactoring of dictionary encoding.

Since the newly designed hash table for {{DictionaryEncoder}} and the 
{{hashCode}} & {{equals}} APIs in {{ValueVector}} are already checked in, some 
classes are no longer used, such as {{DictionaryEncodingHashTable}}, 
{{BaseBinaryVector}} and the related benchmarks & UTs.

Fortunately, these changes did not make it into version 0.14, so the classes 
can still be removed.





Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter

2019-07-18 Thread Ji Liu
Thanks a lot to Wes and Liya for the feedback.

Agreed that the parsing performance of CSV files is important. I just found a 
benchmark of Java CSV libraries [1][2] in which FastCSV shows a clear 
advantage. In any case, I will test it myself.


Thanks,
Ji Liu

[1] https://raw.githubusercontent.com/osiegmar/FastCSV/master/benchmark.png
[2] https://github.com/osiegmar/FastCSV


--
From:Fan Liya 
Send Time:Friday, July 19, 2019 10:14
To:dev 
Cc:Ji Liu ; Micah Kornfield 
Subject:Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter

Hi Ji,

Thanks for proposing this. CSV adapter sounds like a useful feature.

Best,
Liya Fan
On Fri, Jul 19, 2019 at 12:31 AM Wes McKinney  wrote:
We wrote a custom reader in C++ since performance of parsing CSV files
 matters a lot -- we wanted to do multi-threaded execution of
 conversion steps, also. I don't know what the performance of
 commons-csv is but it might be worth doing some benchmarks to see.

 On Thu, Jul 18, 2019 at 4:35 AM Ji Liu  wrote:
 >
 > Hi all,
 >
 > There seems to be no adapter on the Java side to convert CSV data to Arrow 
 > data, which C++ has. We already have a JDBC adapter, an ORC adapter and an 
 > Avro adapter (in progress), so I think an adapter for CSV would also be 
 > nice.
 > After a brief discussion with @Micah Kornfield, Apache commons-csv [1] seems 
 > to be an efficient CSV parser that we could potentially leverage, but I 
 > don't know if there are better options. Any input and comments would be 
 > appreciated.
 >
 > Thanks,
 > Ji Liu
 >
 > [1] https://commons.apache.org/proper/commons-csv/

