[DISCUSS] How to solve the problem of OutOfMemoryException when there is sufficient memory?

2019-05-28 Thread Fan Liya
Hi all, We frequently encounter this problem in our code: Arrow throws OutOfMemoryException even though there is sufficient memory. Let me illustrate with the following example, which is frequently used in our code: int requestSize = ...; if (requestSize <= allocator.getLimit() -

Re: [DISCUSS] PR Backlog reduction

2019-05-31 Thread Fan Liya
On the call today we discussed possibly repurposing the Spark PR > >>> dashboard application for our use > >>> > >>> * https://github.com/databricks/spark-pr-dashboard > >>> * https://spark-prs.appspot.com/ > >>> > >>> Thi

Re: [DISCUSS] PR Backlog reduction

2019-05-29 Thread Fan Liya
Sounds like a great idea. I am interested in Java PRs. Best, Liya Fan On Wed, May 29, 2019 at 1:28 PM Micah Kornfield wrote: > Sorry for the delay. I created > > https://docs.google.com/spreadsheets/d/146lDg11c5ohgVkrOglrb42a1JB0Gm1qBRbnoDlvB8QY/edit#gid=0 > as > simple way to distribute old

Re: [DISCUSS][Java] How to solve the problem of OutOfMemoryException when there is sufficient memory?

2019-06-03 Thread Fan Liya
2 allocation. @Wes > McKinney or @Jacques Nadeau may know more about this. > >> > >> However, there is likely to be more code that assumes or optimises for > this case. eg. in the vector allocation code, we round-up the request to > make full use of the power-of-2 bu

Re: [DISCUSS][Java] How to solve the problem of OutOfMemoryException when there is sufficient memory?

2019-06-03 Thread Fan Liya
ggregation of the concatenation of two string columns > and you'll see what I mean... :) > > Arrow always does part of this by sizing record batches to avoid this > exact fragmentation... > > > -- > Jacques Nadeau > CTO and Co-Founder, Dremio > > > On Mon, Jun 3, 2019 at 7:36 P

Re: [ANNOUNCE] New Arrow committer: Francois Saint-Jacques

2019-06-13 Thread Fan Liya
Congrats! On Thu, Jun 13, 2019 at 7:28 AM Abraham Elmahrek wrote: > Congrats :) . > > On Wed, Jun 12, 2019 at 4:21 PM Robert Nishihara < > robertnishih...@gmail.com> > wrote: > > > Congratulations! > > > > On Wed, Jun 12, 2019 at 4:16 PM Philipp Moritz > wrote: > > > > > Congrats François :) >

Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-12 Thread Fan Liya
t makes sense to them? > > Thanks, > Micah > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya wrote: > > > Hi all, > > > > This is concerning issue ARROW-3396. > > > > I have summarized the problem (please see if my understanding is > correct), > > and prop

Re: [Disscuss][Java] Add more check style rule for Java code

2019-06-13 Thread Fan Liya
+1 for removing unused imports On Fri, Jun 14, 2019 at 11:25 AM niki.lj wrote: > Hi all, > We recently discovered that style check rules for Java code could be > enhanced, for example, "unused imports" and "redundant modifier" checks are > not included before. > I created ARROW-5587 and

[Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-10 Thread Fan Liya
Hi all, This is concerning issue ARROW-3396. I have summarized the problem (please see if my understanding is correct), and proposed some solutions to it. Please give your valuable feedback. For details, please see:

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-09 Thread Fan Liya
. Special thanks to @Jacques Nadeau for all the suggestions and helpful comments. Best, Liya Fan On Wed, May 8, 2019 at 1:05 PM Fan Liya wrote: > Hi Jacques, > > Thanks a lot for your comments. > > I have evaluated the assembly code of original Arrow API, as well as the > unsafe API in

Re: [ANNOUNCE] New Arrow committer: Neville Dipale

2019-05-13 Thread Fan Liya
Congrats!!! On Sun, May 12, 2019 at 10:10 AM Philipp Moritz wrote: > Congrats Neville! > > On Sat, May 11, 2019 at 6:09 PM Renjie Liu > wrote: > > > Congrats! > > > > Chao Sun wrote on Sun, May 12, 2019 at 12:38 AM: > > > > > Congrats Neville! > > > > > > On Sat, May 11, 2019 at 9:36 AM Micah Kornfield >

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-09 Thread Fan Liya
but that is orthogonal. > > Thanks, > Micah > > [1] https://github.com/apache/arrow/pull/4258 > [2] > > https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java > > On Thu, May 9, 2019 at 7:22 PM Fan Liya wrote:

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-09 Thread Fan Liya
> > > > +1 on this proposal. > > > > > > ------ > > From: Fan Liya > > Sent: 2019-05-09 (Thursday) 16:33 > > To: dev > > Subject: Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow > > > > Hi all, > > > > Our previous results

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-07 Thread Fan Liya
oblem today, this seems just as > feasible. > > The other question: in a real algorithm, how much does that 30% matter? > Your benchmarks are entirely about this one call whereas real workloads are > impacted by many things and the time in writing/reading vectors is > minuscule versus

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-06 Thread Fan Liya
Hi Jacques, Thank you so much for your kind reminder. To come up with some performance data, I have set up an environment and run some micro-benchmarks. The server runs Linux, has 64 cores and has 256 GB memory. The benchmarks are simple iterations over some double vectors (the source file is

[DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-04-28 Thread Fan Liya
Hi all, We are proposing a new set of APIs in Arrow - unsafe vector APIs. The general idea is attached below, and is also accessible from our online document. Please give your valuable comments by

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-05 Thread Fan Liya
being aware of the implementation details. Best, Liya Fan On Sun, May 5, 2019 at 5:28 PM Fan Liya wrote: > Hi all, > > Thank you so much for your attention and valuable feedback. > > Please let me try to address some common questions, before answering > individual ones. >

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-05 Thread Fan Liya
e uses for turning bounds > > > checking > > > > > on/off [1]. > > > > > > > > > > Also, I think there was a comment on the JIRA, but are there any > > > > benchmarks > > > > > to show the expected improvements? My lim

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-05 Thread Fan Liya
Hi Jacques, Thanks a lot for your kind reply. Please see my comments in line. Best, Liya Fan > > > 1. How much slower is the current Arrow API, compared to directly accessing > off-heap memory? > > According to my (intuitive) experience in vectorizing Flink, the current > API is much slower, at

[Discuss][Java] Make the semantics of lastSet consistent

2019-07-04 Thread Fan Liya
There are two lastSet member variables in the code. One is in BaseVariableWidthVector and the other is in ListVector. In BaseVariableWidthVector, the lastSet refers to the last index that is actually set, while in ListVector, the lastSet refers to the next index that will be set. So there is an
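
The two conventions described above can be sketched in plain Java. This is only an illustration under stated assumptions: the tracker classes below are hypothetical and are not Arrow's BaseVariableWidthVector or ListVector; they only show why the same field name can differ by one depending on the convention.

```java
/** Hypothetical illustration only; these are not Arrow classes. */
final class LastSetSemantics {

    /** Convention A: lastSet is the last index actually written (-1 when nothing is set). */
    static final class LastIndexTracker {
        private int lastSet = -1;

        void set(int index) {
            lastSet = Math.max(lastSet, index);
        }

        int lastSet() {
            return lastSet;
        }

        /** The next free slot is one past the last written index. */
        int nextIndex() {
            return lastSet + 1;
        }
    }

    /** Convention B: lastSet is the next index that will be written (0 when nothing is set). */
    static final class NextIndexTracker {
        private int lastSet = 0;

        void set(int index) {
            lastSet = Math.max(lastSet, index + 1);
        }

        int lastSet() {
            return lastSet;
        }

        /** Under this convention the field already names the next free slot. */
        int nextIndex() {
            return lastSet;
        }
    }

    public static void main(String[] args) {
        LastIndexTracker a = new LastIndexTracker();
        NextIndexTracker b = new NextIndexTracker();
        for (int i = 0; i < 3; i++) {
            a.set(i);
            b.set(i);
        }
        // Same writes, different field values, same next free slot.
        System.out.println("A.lastSet=" + a.lastSet() + " B.lastSet=" + b.lastSet());
        System.out.println("next slot: " + a.nextIndex() + " vs " + b.nextIndex());
    }
}
```

Code that reads lastSet without knowing which convention applies is off by one, which is the inconsistency the thread raises.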

Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-27 Thread Fan Liya
h I wrote in this > release cycle which addressed this design flaw. > > Thanks > Wes > > On Wed, Jun 12, 2019 at 5:33 AM Fan Liya wrote: > > > > @Micah Kornfield Thanks a lot for your comments. > > > > In the doc, we identify 3 problems for the current d

Support an alternative memory layout for varchar/varbinary vectors

2019-07-10 Thread Fan Liya
Hi all, We are thinking of providing varchar/varbinary vectors with a different memory layout which exists in a wide range of systems. The memory layout is different from that of VarCharVector in the following ways: 1. Instead of storing (start offset, end offset), the new layout stores

[Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-10 Thread Fan Liya
Hi all, We are thinking of providing varchar/varbinary vectors with a different memory layout which exists in a wide range of systems. The memory layout is different from that of VarCharVector in the following ways: 1. Instead of storing (start offset, end offset), the new layout stores

Re: Adding a new encoding for FP data

2019-07-11 Thread Fan Liya
lead to less attempts to build > a huffman tree. It's hard to pin-point the exact reason. > > I did not try other lossless text compressors but I expect similar results. > > For code, I can polish my patches, create a Jira task and submit the > patches for review. > > >

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
compromises. I think in this case it is a necessary compromise to not allow > all kind of string representations. > > Uwe > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote: > > Hi all, > > > > > > We are thinking of providing varchar/varbinary vectors wi

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
before even > when stored consecutively. > > > > Uwe > > > > On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote: > > > Hi Korn, > > > > > > Thanks a lot for your comments. > > > > > > In my opinion, your comments make sense

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
Fan On Fri, Jul 12, 2019 at 11:07 AM Fan Liya wrote: > @Uwe L. Korn > > Thanks a lot for the suggestion. I think this is exactly what we are doing > right now. > > Best, > Liya Fan > > On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney wrote: > >> hi Liya --

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2019-07-12 Thread Fan Liya
@Antoine Pitrou, Good question. I think the answer depends on the concrete encoding scheme. For some encoding schemes, it is not a good idea to use them for in-memory data compression. For others, it is beneficial to operate directly on the compressed data. For example, it is beneficial to

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-08 Thread Fan Liya
Hi Micah, Thanks for opening this discussion. For me, most of the features are super useful, especially RLE and integer encoding. IMO, to support these new features, we need some basic algorithms first (e.g. sort and search). For example, RLE and sort are often used in combination. These new

Re: [Discuss][Java] Make the semantics of lastSet consistent

2019-07-08 Thread Fan Liya
java/org/apache/arrow/vector/complex/ListVectorHelper.java#L65 > > I'm fine with treating this as bug and fixing the external semantics too. > > > > On Thu, Jul 4, 2019 at 7:32 PM Fan Liya wrote: > > > > > There are two lastSet member variables in the code. O

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-14 Thread Fan Liya
ain this issue. Let me know if it is not clear > > I'm interested to experiment with the same thing in C++. We would have > an ExtensionArray in C++ whose values are string_view referencing > external memory, for example. > > - Wes > > On Thu, Jul 11, 2019 at 10:16 PM Fan Liya

Re: Permission Request

2019-04-22 Thread Fan Liya
the Contributor to the role after they've already > submitted a pull request > > Thanks > > On Mon, Apr 22, 2019 at 11:25 PM Fan Liya wrote: > > > > Hi Guys, > > > > I want to contribute to Apache Arrow. > > Would you please give me the permission a

Permission Request

2019-04-22 Thread Fan Liya
Hi Guys, I want to contribute to Apache Arrow. Would you please give me the permission as a contributor? My JIRA ID is fan_li_ya. Thanks a lot. Best, Liya Fan

Re: Proper way to retrigger Travis CI builds

2019-04-25 Thread Fan Liya
I am experiencing the same problem. I found a post here: https://stackoverflow.com/questions/17606874/trigger-a-travis-ci-rebuild-without-pushing-a-commit However, most of the methods do not work for me, perhaps because I do not have sufficient permissions. So reopening the PR can be a good choice for me. Best, Liya Fan

[DISCUSS][Java] Provide an interface for numeric vectors

2019-08-14 Thread Fan Liya
Dear all, We want to provide an interface for all vectors with numeric types (small int, float4, float8, etc). This interface will make it convenient for many operations on a vector, like average, sum, variance, etc. With this interface, the client code will be greatly simplified, with many

Re: [ANNOUNCE] New Arrow PMC member: Sebastien Binet

2019-08-14 Thread Fan Liya
Congratulations, Sebastien! Best, Liya Fan On Thu, Aug 15, 2019 at 10:47 AM Ji Liu wrote: > Congrats Sebastian! > > > -- > From:Micah Kornfield > Send Time: 2019-08-15 (Thursday) 10:46 > To:dev@arrow.apache.org > Subject:Re:

Re: [DISCUSS][Java] Provide an interface for numeric vectors

2019-08-14 Thread Fan Liya
on > this. > > Thanks, > Micah > > On Wed, Aug 14, 2019 at 7:26 PM Fan Liya wrote: > > > Dear all, > > > > > > > > We want to provide an interface for all vectors with numeric types (small > > int, float4, float8, etc). This interface will mak

Re: [Java] CI builds failing on master

2019-08-14 Thread Fan Liya
Hi Ji, Thanks for fixing this. Best, Liya Fan On Thu, Aug 15, 2019 at 12:50 PM Micah Kornfield wrote: > I just merged this. Thank you Ji Liu. > > On Wed, Aug 14, 2019 at 4:50 PM Ji Liu wrote: > > > Hi, Wes, as described in JIRA, this was introduced by our recent two > > patches, I have just

Re: [Discussion][Java] Redesign the dictionary encoder

2019-08-13 Thread Fan Liya
sts at > least, and the new encoders are in the algorithm package. How do you plan > on resolving the dependencies? > > [1] https://github.com/apache/arrow/pull/5055/files > > On Sun, Aug 11, 2019 at 1:18 AM Fan Liya wrote: > > > Dear all, > > > > Di

[Discussion][Java] Redesign the dictionary encoder

2019-08-11 Thread Fan Liya
Dear all, Dictionary encoding is an important feature, so it should be implemented with good performance. The current Java dictionary encoder implementation is based on static utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder, which has heavy performance overhead, preventing

Re: [DISCUSS][Java] Provide an interface for numeric vectors

2019-08-20 Thread Fan Liya
lues. The purpose is similar, to reduce unnecessary branch/switch statements in the code. Please take a look https://issues.apache.org/jira/browse/ARROW-6247 Best, Liya Fan On Thu, Aug 15, 2019 at 1:41 PM Fan Liya wrote: > Hi Micah, > > Thanks for the good points. > I agree with yo

Re: [DISCUSS][Java] Design of RLE vector

2019-08-21 Thread Fan Liya
21, 2019 at 9:34 PM Wes McKinney wrote: > hi Liya, > > Do you intend to be able to send RLE vectors using the IPC protocol? > If so, we need to spend some time on Micah's discussion about > sparseness and encodings/compression. > > - Wes > > On Wed, Aug 21, 2019

Re: [DISCUSS][Java] Design of RLE vector

2019-08-21 Thread Fan Liya
for the lengths. > > > Thanks, > Micah > > On Wed, Aug 21, 2019 at 6:50 PM Fan Liya wrote: > > > Hi Wes, > > > > Thanks for the good suggestion. > > It is intended to be sent through IPC. So it should implement > FieldVector, > > not just ValueVector

Re: [Discuss][Java] Refactor code for time related vectors

2019-08-26 Thread Fan Liya
ch duplicated code you're actually > removing. > > On Mon, Aug 26, 2019, 6:13 PM Fan Liya wrote: > > > Dear all, > > > > Currently, we have two classes of time related vectors. One class are > named > > TimeXXVector, while the other class are named Time

[DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

2019-08-28 Thread Fan Liya
Dear all, In the discussion of this PR (https://github.com/apache/arrow/pull/5073), we are faced with a problem: Normally, in a VariableWidthVector (e.g. VarCharVector), a null value is supposed to take no space in the data buffer. In particular, for a null value, we have start index == end
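
The "null takes no space" rule can be sketched with a plain offsets array, as a rough illustration rather than Arrow's implementation. The class and method names below are hypothetical; the assumption is the usual variable-width layout where entry i spans offsets[i] to offsets[i + 1] in the data buffer, so a null entry has equal start and end offsets.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; this is not Arrow's VarCharVector. */
final class NullSlotOffsets {

    /**
     * Builds an offsets array of length n + 1 for n values.
     * A null value contributes zero bytes, so its start and end offsets are equal.
     */
    static int[] buildOffsets(String[] values) {
        int[] offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            int byteLength = values[i] == null
                    ? 0
                    : values[i].getBytes(StandardCharsets.UTF_8).length;
            offsets[i + 1] = offsets[i] + byteLength;
        }
        return offsets;
    }

    /** Number of data-buffer bytes occupied by the entry at the given index. */
    static int byteLength(int[] offsets, int index) {
        return offsets[index + 1] - offsets[index];
    }

    public static void main(String[] args) {
        int[] offsets = buildOffsets(new String[] {"ab", null, "cde"});
        for (int i = 0; i < offsets.length - 1; i++) {
            System.out.println("entry " + i + " occupies " + byteLength(offsets, i) + " bytes");
        }
    }
}
```

If a null entry were allowed a non-zero span, the data buffer would contain bytes that belong to no valid value, which is the situation the thread asks whether to permit.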

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

2019-09-03 Thread Fan Liya
Hi Wes, Thanks for the effort. I will add clarifications. Best, Liya Fan On Wed, Sep 4, 2019 at 11:06 AM Wes McKinney wrote: > I opened https://issues.apache.org/jira/browse/ARROW-6451 > > On Sun, Sep 1, 2019 at 9:59 PM Fan Liya wrote: > > > > Hi Wes, > > >

[Discuss][Java] Refactor code for time related vectors

2019-08-26 Thread Fan Liya
Dear all, Currently, we have two classes of time related vectors. One class are named TimeXXVector, while the other class are named TimeStampXXVector. We found there are some redundant code for these classes, so we want to do some refactorings for them. 1. For TimeXXVector, all classes extend

[Discuss][Java] Support conversions between delta vector and partial sum vector

2019-09-01 Thread Fan Liya
Dear all, We want to support a feature for conversions between delta vector and partial sum vector. Please give your valuable feedback. Best, Liya Fan What is a delta vector/partial sum vector? Given an integer vector a with length n, its partial sum vector is another integer vector b with

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

2019-09-01 Thread Fan Liya
so, including > primitive. > > On Thu, Aug 29, 2019 at 12:50 AM Fan Liya wrote: > > > > Hi Jacques and Ravindra, > > > > Thanks for your valuable feedback. > > > > Please let me talk more about contiguous memory: > > For some operations (like memory segment

Re: [ANNOUNCE] New Arrow committer: David M Li

2019-09-01 Thread Fan Liya
Congratulations! David. Best, Liya Fan On Mon, Sep 2, 2019 at 7:51 AM David Li wrote: > Thanks all! Looking forward to continuing to work with everyone. > > Best, > David > > On 8/31/19, paddy horan wrote: > > Congrats David > > > > Get Outlook for iOS > >

[DISCUSS][Java] Design of RLE vector

2019-08-21 Thread Fan Liya
Dear all, RLE (run length encoding) is a widely used encoding/decoding technique. Compared with other encoding/decoding techniques, it is easier to work with the encoded data. We want to provide an RLE vector implementation in Arrow. The design details include: 1. RleVector implements
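
For readers unfamiliar with the technique, run-length encoding can be sketched on a plain int array. This is a minimal sketch, not the proposed RleVector design: the class name and the choice to return a (values, run lengths) pair of arrays are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not the proposed RleVector. */
final class RunLengthSketch {

    /** Returns {values, runLengths}: each run of equal adjacent inputs becomes one pair. */
    static int[][] encode(int[] input) {
        List<Integer> values = new ArrayList<>();
        List<Integer> lengths = new ArrayList<>();
        for (int v : input) {
            int last = values.size() - 1;
            if (last >= 0 && values.get(last) == v) {
                // Same value as the current run: extend the run.
                lengths.set(last, lengths.get(last) + 1);
            } else {
                // Different value: start a new run of length 1.
                values.add(v);
                lengths.add(1);
            }
        }
        return new int[][] {toArray(values), toArray(lengths)};
    }

    /** Expands (values, runLengths) back into the original sequence. */
    static int[] decode(int[] values, int[] lengths) {
        int total = 0;
        for (int len : lengths) {
            total += len;
        }
        int[] out = new int[total];
        int pos = 0;
        for (int i = 0; i < values.length; i++) {
            Arrays.fill(out, pos, pos + lengths[i], values[i]);
            pos += lengths[i];
        }
        return out;
    }

    private static int[] toArray(List<Integer> list) {
        int[] result = new int[list.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = list.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] encoded = encode(new int[] {7, 7, 7, 2, 2, 9});
        System.out.println("values  = " + Arrays.toString(encoded[0]));
        System.out.println("lengths = " + Arrays.toString(encoded[1]));
        System.out.println("decoded = " + Arrays.toString(decode(encoded[0], encoded[1])));
    }
}
```

Because each run keeps its value in the clear, operations such as counting or filtering by value can work on the encoded form directly, which is the "easier to work with" property the message mentions.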

Re: [Discuss][Java] Refactor code for time related vectors

2019-08-28 Thread Fan Liya
> Micah > > > > [1] https://github.com/apache/arrow/pull/5213 > > > > > > On Wed, Aug 28, 2019 at 12:11 PM Jacques Nadeau > > wrote: > > > > > What problem are you trying to solve? It seems like you're proposing > > > refactoring for

Re: [DISCUSS][Java] Should null values in VariableWidthVector/ListVector always takes 0 space?

2019-08-28 Thread Fan Liya
values are null or non-null. What do you think? Best, Liya Fan On Thu, Aug 29, 2019 at 1:26 PM Ravindra Pindikura wrote: > On Wed, Aug 28, 2019 at 12:32 PM Fan Liya wrote: > > > Dear all, > > > > In the discussion of this PR (https://github.com/apache/arrow/pull/50

Re: [DISCUSS][Java] Design of RLE vector

2019-08-22 Thread Fan Liya
we are gather feedback on the > proposal, so we should hold off on coding these up, until we have consensus > on the approach. > > Thanks, > Micah > > On Wed, Aug 21, 2019 at 9:22 PM Fan Liya wrote: > >> Hi Micah, >> >> Thanks for the comments. >>

Re: [Discuss][Java] Support conversions between delta vector and partial sum vector

2019-09-05 Thread Fan Liya
it is difficult to give a precise plan. However, I am going to prepare a document about the requirement/design/implementation of the algorithms. Hope that will make discussions/code review more efficient. Best, Liya Fan On Thu, Sep 5, 2019 at 11:46 AM Fan Liya wrote: > Hi Wes, > > Tha

Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-05 Thread Fan Liya
Big congratulations to Ben, Kenta and Neal! Best, Liya Fan On Fri, Sep 6, 2019 at 5:33 AM Wes McKinney wrote: > hi all, > > on behalf of the Arrow PMC, I'm pleased to announce that Ben, Kenta, > and Neal have accepted invitations to become Arrow committers. Welcome > and thank you for all your

Re: [Discuss][Java] Support conversions between delta vector and partial sum vector

2019-09-04 Thread Fan Liya
form them to delta vectors > before IPC". It sounds like you are proposing a data compression > technique. Should this be a part of the > sparseness/encoding/compression discussion? > > - Wes > > On Sun, Sep 1, 2019 at 10:14 PM Fan Liya wrote: > > > > Dear all, >

Re: [Java] CI test failures

2019-09-12 Thread Fan Liya
It seems the problem occurred when the sure-fire plugin forks: [ERROR] Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: Error occurred in starting fork, check output in log So we need to adjust the forkCount parameter in the pom.xml file? Best, Liya Fan On Thu, Sep 12,

Re: [Java] CI test failures

2019-09-12 Thread Fan Liya
I have tried many times locally, with different values of forkCount, but failed to reproduce the error. Best, Liya Fan On Thu, Sep 12, 2019 at 4:21 PM Fan Liya wrote: > It seems the problem occurred when the sure-fire plugin forks: > > [ERRO

Re: [Discuss] [Java] DateMilliVector.getObject() return type (LocalDateTime vs LocalDate)

2019-09-17 Thread Fan Liya
I think there are similar problems with other time related vectors. Best, Liya Fan On Tue, Sep 17, 2019 at 1:02 PM Micah Kornfield wrote: > Anyone have an opinion on this? Personally, I'm leaning on keeping the > existing API compatibility, but I don't feel too strongly about it. > > On Mon,

Re: [DISCUSS][Java] Design of the algorithm module

2019-09-17 Thread Fan Liya
Thanks, > Micah > > On Sat, Sep 14, 2019 at 6:57 AM Fan Liya wrote: > > > Dear all, > > > > We have prepared a document for discussing the requirements, design and > > implementation issues for the algorithm module of Java: > > > > > > > https:/

Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-07 Thread Fan Liya
Mode Cnt Score Error Units > Float8Benchmarks.copyFromBenchmark avgt 5 16.248 ± 1.409 us/op > Float8Benchmarks.readWriteBenchmark avgt 5 14.150 ± 0.084 us/op > > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya wrote: > >> Hi Micah, >> >> Thanks a lot for doing th

Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-07 Thread Fan Liya
the JVM (specifically, C2) does not > > apply when dealing with int instead of longs. One of the > > most commons is the loop unrolling and vectorization. > > > > * It doesn't seem the best way to reference an old email on the list, but > > it is the only result shown by Google

Re: BigQuery Storage API now supports Arow

2019-07-27 Thread Fan Liya
@Micah Kornfield Awesome work! Big congratulations! Best, Liya Fan On Sat, Jul 27, 2019 at 9:17 PM David Li wrote: > This is super awesome, thanks for sharing! > > I see the original thread mentioned Flight support, do you think it'd > be possible to support Flight natively? Or conversely,

Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-06 Thread Fan Liya
Hi Micah, Thanks a lot for doing this. I am a little concerned about if there is any negative performance impact on the current 32-bit-length based applications. Can we do some performance comparison on our existing benchmarks? Best, Liya Fan On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield

Re: [ANNOUNCE] New Arrow PMC member: Micah Kornfield

2019-08-09 Thread Fan Liya
Big congratulations! Micah Thank you so much for all the help! Best, Liya Fan On Saturday, August 10, 2019, Brian Hulette wrote: > Congratulations Micah! Well deserved :) > > On Fri, Aug 9, 2019 at 9:02 AM Francois Saint-Jacques < > fsaintjacq...@gmail.com> wrote: > >> Congrats! >> >> well

Re: [Memo] API Behavior changes

2019-07-22 Thread Fan Liya
@Wes McKinney, Thanks for the good suggestion. Best, Liya Fan On Mon, Jul 22, 2019 at 8:23 PM Wes McKinney wrote: > You could also use labels in JIRA to mark issues that introduce API changes > > On Mon, Jul 22, 2019 at 4:42 AM Fan Liya wrote: > > > > @Uwe L. Korn

Re: [DISCUSS][JAVA] Implement a CSV to Arrow adapter

2019-07-18 Thread Fan Liya
Hi Ji, Thanks for proposing this. CSV adapter sounds like a useful feature. Best, Liya Fan On Fri, Jul 19, 2019 at 12:31 AM Wes McKinney wrote: > We wrote a custom reader in C++ since performance of parsing CSV files > matters a lot -- we wanted to do multi-threaded execution of > conversion

[Memo] API Behavior changes

2019-07-21 Thread Fan Liya
Hi all, Let's track the API behavior changes in this email thread, so as not to forget about them for the next release. ARROW-5842: the semantics of lastSet in ListVector changes. In the past, it refers to the next index that will be set; now it

Re: [DISCUSS][Java] Design of the algorithm module

2019-09-24 Thread Fan Liya
eave more comments over the next few days. > > Thanks again for the write-up I think it will help frame a productive > conversation. > > -Micah > > On Tue, Sep 17, 2019 at 1:47 AM Fan Liya wrote: > >> Hi Micah, >> >> Thanks for your kind reminder. Comments ar

[DISCUSS][Java] Reduce the range of synchronized block when releasing an ArrowBuf

2019-09-27 Thread Fan Liya
Dear all, When releasing an ArrowBuf, we will run the following piece of code: private int decrement(int decrement) { allocator.assertOpen(); final int outcome; synchronized (allocationManager) { outcome = bufRefCnt.addAndGet(-decrement); if (outcome == 0) {

Re: Arrow for low latency IPC

2019-11-01 Thread Fan Liya
Hi Samrat, Arrow has flexible support for IPC through grpc. The cpp benchmark can be found in: https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight_benchmark.cc The java benchmark can be found in:

Re: [Java] Append multiple record batches together?

2019-11-14 Thread Fan Liya
ked array with multiple vector buffers would be > >>> ideal, similar to C++. It might take a fair amount of work to add this > but > >>> would open up a lot more functionality. As for the API, > >>> VectorSchemaRoot.concat(Collection) seems good to me. > >

Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-11-15 Thread Fan Liya
tical processing--purely subjective of course). >>>>>>>> > >> > >>>>>>>> > >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield < >>>>>>>> emkornfi...@gmail.com> >>>

Re: [Java] Append multiple record batches together?

2019-11-07 Thread Fan Liya
Hi Micah, Thanks for bringing this up. > 1. An efficient solution already exists? It seems like TransferPair implementations could possibly be improved upon or have they already been optimized? Fundamentally, memory copy is unavoidable, IMO, because the source and target memory regions are

Re: [Java] Question About Vector Allocation

2019-11-08 Thread Fan Liya
Hi Azim, I think we should be aware of two distinct concepts: 1. vector capacity: the max number of values that can be stored in the vector, without reallocation 2. vector length: the number of values actually filled in the vector For any valid vector, we always have vector length <= vector
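
The distinction between the two concepts can be sketched with a small growable buffer. This is a sketch under stated assumptions, not Arrow's allocation logic: the class name is hypothetical and the doubling growth policy is chosen only for the example.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not an Arrow vector. */
final class CapacityVsLength {
    private int[] data;
    private int length; // number of values actually filled in

    CapacityVsLength(int initialCapacity) {
        data = new int[initialCapacity];
    }

    /** Max number of values that can be stored without reallocation. */
    int capacity() {
        return data.length;
    }

    /** Number of values actually written so far; never exceeds capacity(). */
    int length() {
        return length;
    }

    /** Appends a value, reallocating (here: doubling) only when capacity is exhausted. */
    void append(int value) {
        if (length == data.length) {
            data = Arrays.copyOf(data, Math.max(1, data.length * 2));
        }
        data[length++] = value;
    }

    public static void main(String[] args) {
        CapacityVsLength vector = new CapacityVsLength(4);
        for (int i = 0; i < 5; i++) {
            vector.append(i);
            System.out.println("length=" + vector.length() + " capacity=" + vector.capacity());
        }
    }
}
```

Allocating for a requested capacity therefore says nothing about how many values are valid; the length only grows as values are written.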

Re: [VOTE] Adopt Arrow in-process C Data Interface specification

2019-12-08 Thread Fan Liya
+1, as this is useful IMO. Best, Liya Fan On Sat, Dec 7, 2019 at 12:21 PM Jacques Nadeau wrote: > -1 (binding) > > I'm voting -1 on this. I posted the thinking why on the PR. The high-level > is that I think it needs to better address the pipelined use case as right > now it fails to support

Re: Java - Spark dataframe to Arrow format

2019-12-06 Thread Fan Liya
ney > *Sent:* Thursday, December 5, 2019 6:53 AM > *To:* dev > *Cc:* Fan Liya ; > jeetendra.jais...@impetus.co.in.invalid > > *Subject:* Re: Java - Spark dataframe to Arrow format > > hi folks, > > I understand the question to be about serialization. > > see >

Re: [DISCUSS][Java] Design of the algorithm module

2019-10-14 Thread Fan Liya
ode generation is applicable to almost all scenarios to achieve >> good performance, if we are willing to pay the price of code readability. >> I will try to detail the principles for choosing these tools for >> performance improvements later. >> >> Best, >

Re: [DISCUSS][Java] Builders for java classes

2019-10-24 Thread Fan Liya
Hi Micah, IMO, we need an adapter from on-heap array to off-heap array. This is useful because many third-party Java libraries populate data to an on-heap array. And I see this API in your design: IntVectorBuilder addAll(int[] values); So I am +1 for this. Best, Liya Fan On Thu, Oct 24, 2019

Re: [ANNOUNCE] New Arrow committer: Eric Erhardt

2019-10-17 Thread Fan Liya
Congrats Eric! Best, Liya Fan On Fri, Oct 18, 2019 at 3:06 AM paddy horan wrote: > Congrats Eric! > > > From: Micah Kornfield > Sent: Thursday, October 17, 2019 12:45:15 PM > To: dev > Subject: Re: [ANNOUNCE] New Arrow committer: Eric Erhardt > > Congrats

Re: [DISCUSS][Java] Design of the algorithm module

2019-10-22 Thread Fan Liya
nd peoples' time. If > there is only going to be one user of the code it might not belong in Arrow > "proper" due to these hurdles. > > Thanks, > Micah > > [1] https://issues.apache.org/jira/browse/FLINK-10929 > > On Mon, Oct 14, 2019 at 10:38 PM Fan Liya wrote: >

Re: [Discuss][Java] Provide default for io.netty.tryReflectionSetAccessible to prevent errors

2019-11-20 Thread Fan Liya
Hi Bryan, Thanks for bringing this up. +1 for the change. I am not clear what is the right place to override the jvm property. It is possible that when we try to override it (possibly in a static block), the old property value has already been read by netty library. To avoid this problem, do we

Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding (second iteration)

2019-11-25 Thread Fan Liya
I am sorry I did not follow the thread closely (will follow up later). However, the proposal above looks good to me. So I am +0.5 for this. Best, Liya Fan On Tue, Nov 26, 2019 at 1:12 PM Micah Kornfield wrote: > Could other members of the community chime in on this? In particular > getting

Re: Dense unions: monotonic or strictly monotonic offsets?

2019-11-24 Thread Fan Liya
no conflict with repeated or non-monotonic offset values. > > On Fri, Nov 22, 2019 at 1:49 AM Fan Liya wrote: > > > > This is an interesting question. > > IMO, to support repeated values, we also need to design a "coherency > > protocol", to avoid the sc

Re: Dense unions: monotonic or strictly monotonic offsets?

2019-11-24 Thread Fan Liya
Hi Wes, Thanks for your clarification. I agree with you that the problem should be considered in the implementation level. Best, Liya Fan On Mon, Nov 25, 2019 at 10:34 AM Wes McKinney wrote: > On Sun, Nov 24, 2019 at 8:07 PM Fan Liya wrote: > > > > Hi Wes, > >

Re: Unions: storing type_ids or type_codes?

2019-11-26 Thread Fan Liya
Hi Antoine, For Java, the physical child id is the same as the logical type code, as the index of each child vector is the code (ordinal) of the vector's minor type. This leads to a problem, that only a single vector for each type can exist in a union vector, so strictly speaking, the Java

Re: Dense unions: monotonic or strictly monotonic offsets?

2019-11-21 Thread Fan Liya
This is an interesting question. IMO, to support repeated values, we also need to design a "coherency protocol", to avoid the scenario where once a value is written, the change is propagated to another slot unexpectedly. Best, Liya Fan On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield wrote: >

Re: Java API for Arrow Compute

2019-11-25 Thread Fan Liya
Hi Yuan, Currently, we have some APIs in the algorithm module of the Java project. If you have more requirements, maybe you can describe your requirements/scenarios, and start a discussion in the mailing list. Best, Liya Fan On Mon, Nov 25, 2019 at 11:17 PM Wes McKinney wrote: > There is a

[Discuss][Java] Appropriate semantics for comparing values in UnionVector

2019-11-14 Thread Fan Liya
Dear all, The problem arises from the discussion in a PR: https://github.com/apache/arrow/pull/5544#discussion_r338394941. We are trying to come up with a proper semantics to compare values in UnionVectors. According to the current logic in the code base, two values from two UnionVectors are

Re: [Java] Question About Vector Allocation

2019-11-14 Thread Fan Liya
;} > > > > Thanks, > > > Azim Afroozeh > > On Fri, Nov 8, 2019 at 10:57 AM Fan Liya wrote: > > > Hi Azim, > > > > I think we should be aware of two distinct concepts: > > > > 1. vector capacity: the max number of values that can be sto

Re: [DISCUSS][Java] Design of the algorithm module

2019-10-10 Thread Fan Liya
ll scenarios to achieve good > performance, if we are willing to pay the price of code readability. > I will try to detail the principles for choosing these tools for > performance improvements later. > > Best, > Liya Fan > > ----

Re: Possible Arrow 0.15.1 release

2019-10-12 Thread Fan Liya
I have updated ARROW-6738 to add 0.15.1. I hope this bug can be fixed in the next release. Best, Liya Fan On Sat, Oct 12, 2019 at 12:38 PM Micah Kornfield wrote: > I updated the JIRA to add 0.15.1 but ARROW-6806 seems like it should be >

[DISCUSS][Java] Enhance code style checking for Java code

2019-12-18 Thread Fan Liya
Dear all, We want to enhance the Java code style checking. This is due to a discussion in [1]. In the discussion, we found the current style checking for Java code is not sufficient. So we want to enhance it in a series of "small" steps, in order to avoid having to change too many files at once.

Re: [DISCUSS][C++] Pointer name aliasing

2019-12-22 Thread Fan Liya
IMO, this question relates to something general and fundamental. Generally, name alias leads to two results: 1) It makes writing code easier 2) It makes reading code more difficult Personally, I prefer readability to writability. However, I am wondering if we have some general principles

Re: Java - Spark dataframe to Arrow format

2019-12-05 Thread Fan Liya
Hi Jeetendra, I am not sure if I understand your question correctly. Arrow is an in-memory columnar data format, and Spark has its own in-memory data format for DataFrame, which is invisible to end users. So the Spark user has no control over the underlying in-memory layout. If you really want

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-05 Thread Fan Liya
more concrete) > > So in the USER_DEFINED case, how will the library know how to obtain > the uncompressed buffer? Is some additional metadata structure > required to provide instructions? > > On Wed, Mar 4, 2020 at 8:05 AM Fan Liya wrote: > > > > Hi Wes, > > >

Re: [ANNOUNCE] New Arrow PMC member: Neal Richardson

2020-03-05 Thread Fan Liya
Congratulations, Neal Richardson! Best, Liya Fan On Thu, Mar 5, 2020 at 12:51 AM Wes McKinney wrote: > The Project Management Committee (PMC) for Apache Arrow has invited > Neal Richardson to become a PMC member and we are pleased to announce > that Neal has accepted. > > Congratulations and

Re: [ANNOUNCE] New Arrow PMC member: Francois Saint-Jacques

2020-03-05 Thread Fan Liya
Congratulations, Francois Saint-Jacques! Best, Liya Fan On Thu, Mar 5, 2020 at 12:52 AM Wes McKinney wrote: > The Project Management Committee (PMC) for Apache Arrow has invited > Francois Saint-Jacques to become a PMC member and we are pleased to > announce > that Francois has accepted. > >

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

2020-03-04 Thread Fan Liya
Hi Wes, I am thinking of adding an option named "USER_DEFINED" (or something similar) to enum CompressionType in your proposal. IMO, this option should be used primarily in Flight. Best, Liya Fan On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney wrote: > On Tue, Mar 3, 2020, 8:1
