[jira] [Created] (ARROW-15562) [Java] FuzzIpcStream: Uncaught exception in java.base/java.nio.Bits.reserveMemory

2022-02-03 Thread Liya Fan (Jira)
Liya Fan created ARROW-15562:


 Summary: [Java] FuzzIpcStream: Uncaught exception in 
java.base/java.nio.Bits.reserveMemory
 Key: ARROW-15562
 URL: https://issues.apache.org/jira/browse/ARROW-15562
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


Detailed Report: [https://oss-fuzz.com/testcase?key=4599882936090624]

Project: arrow-java
Fuzzing Engine: libFuzzer
Fuzz Target: FuzzIpcStream
Job Type: libfuzzer_asan_arrow-java
Platform Id: linux

Crash Type: Uncaught exception
Crash Address: 
Crash State:
java.base/java.nio.Bits.reserveMemory
java.base/java.nio.DirectByteBuffer.<init>
java.base/java.nio.ByteBuffer.allocateDirect

Sanitizer: address (ASAN)

Recommended Security Severity: Low

Crash Revision: 
[https://oss-fuzz.com/revisions?job=libfuzzer_asan_arrow-java=202202010604]

Reproducer Testcase: 
[https://oss-fuzz.com/download?testcase_id=4599882936090624]
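The crash state above suggests an allocation whose size comes from untrusted fuzzer input reaching {{ByteBuffer.allocateDirect}}. A minimal defensive sketch (the 1 MiB cap and all names below are illustrative, not Arrow's actual code) bounds the length before allocating:

```java
import java.nio.ByteBuffer;

public class SafeDirectAlloc {
    // Arbitrary illustrative cap on what we will allocate for untrusted input.
    static final int MAX_MESSAGE_BYTES = 1 << 20; // 1 MiB

    static ByteBuffer allocateChecked(int untrustedLength) {
        // Reject negative or oversized lengths before they reach
        // ByteBuffer.allocateDirect and Bits.reserveMemory.
        if (untrustedLength < 0 || untrustedLength > MAX_MESSAGE_BYTES) {
            throw new IllegalArgumentException("bad message length: " + untrustedLength);
        }
        return ByteBuffer.allocateDirect(untrustedLength);
    }

    public static void main(String[] args) {
        System.out.println(allocateChecked(64).capacity()); // 64
        try {
            allocateChecked(Integer.MAX_VALUE);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```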

Issue filed automatically.

See [https://google.github.io/oss-fuzz/advanced-topics/reproducing] for 
instructions to reproduce this bug locally.
When you fix this bug, please
* mention the fix revision(s).
* state whether the bug was a short-lived regression or an old bug in any 
stable releases.
* add any other useful information.
This information can help downstream consumers.

If you need to contact the OSS-Fuzz team with a question, concern, or any other 
feedback, please file an issue at [https://github.com/google/oss-fuzz/issues]. 
Comments on individual Monorail issues are not monitored.

This bug is subject to a 90 day disclosure deadline. If 90 days elapse
without an upstream patch, then the bug report will automatically
become visible to the public.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15561) [Java] FuzzIpcStream: Uncaught exception in org.apache.arrow.memory.BaseAllocator.buffer

2022-02-03 Thread Liya Fan (Jira)
Liya Fan created ARROW-15561:


 Summary: [Java] FuzzIpcStream: Uncaught exception in 
org.apache.arrow.memory.BaseAllocator.buffer
 Key: ARROW-15561
 URL: https://issues.apache.org/jira/browse/ARROW-15561
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


Detailed Report: https://oss-fuzz.com/testcase?key=6427573486223360

Project: arrow-java
Fuzzing Engine: libFuzzer
Fuzz Target: FuzzIpcStream
Job Type: libfuzzer_asan_arrow-java
Platform Id: linux

Crash Type: Uncaught exception
Crash Address: 
Crash State:
  org.apache.arrow.memory.BaseAllocator.buffer
  org.apache.arrow.memory.RootAllocator.buffer
  org.apache.arrow.memory.BaseAllocator.buffer
  
Sanitizer: address (ASAN)

Crash Revision: 
https://oss-fuzz.com/revisions?job=libfuzzer_asan_arrow-java=202201310606

Reproducer Testcase: https://oss-fuzz.com/download?testcase_id=6427573486223360







[jira] [Created] (ARROW-15560) [Java] FuzzIpcStream: Uncaught exception in java.base/java.nio.Buffer.createCapacityException

2022-02-03 Thread Liya Fan (Jira)
Liya Fan created ARROW-15560:


 Summary: [Java] FuzzIpcStream: Uncaught exception in 
java.base/java.nio.Buffer.createCapacityException
 Key: ARROW-15560
 URL: https://issues.apache.org/jira/browse/ARROW-15560
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


Detailed Report: https://oss-fuzz.com/testcase?key=5095153130405888

Project: arrow-java
Fuzzing Engine: libFuzzer
Fuzz Target: FuzzIpcStream
Job Type: libfuzzer_asan_arrow-java
Platform Id: linux

Crash Type: Uncaught exception
Crash Address: 
Crash State:
  java.base/java.nio.Buffer.createCapacityException
  java.base/java.nio.ByteBuffer.allocate
  org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage
  
Sanitizer: address (ASAN)

Crash Revision: 
https://oss-fuzz.com/revisions?job=libfuzzer_asan_arrow-java=202201300605

Reproducer Testcase: https://oss-fuzz.com/download?testcase_id=5095153130405888







[jira] [Created] (ARROW-15559) [Java] FuzzIpcFile: Uncaught exception in org.apache.arrow.vector.types.pojo.Schema.convertSchema

2022-02-03 Thread Liya Fan (Jira)
Liya Fan created ARROW-15559:


 Summary: [Java] FuzzIpcFile: Uncaught exception in 
org.apache.arrow.vector.types.pojo.Schema.convertSchema
 Key: ARROW-15559
 URL: https://issues.apache.org/jira/browse/ARROW-15559
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


Detailed Report: https://oss-fuzz.com/testcase?key=5965184743636992

Project: arrow-java
Fuzzing Engine: libFuzzer
Fuzz Target: FuzzIpcFile
Job Type: libfuzzer_asan_arrow-java
Platform Id: linux

Crash Type: Uncaught exception
Crash Address: 
Crash State:
  org.apache.arrow.vector.types.pojo.Schema.convertSchema
  org.apache.arrow.vector.ipc.message.ArrowFooter.<init>
  org.apache.arrow.vector.ipc.ArrowFileReader.readSchema
  
Sanitizer: address (ASAN)

Crash Revision: 
https://oss-fuzz.com/revisions?job=libfuzzer_asan_arrow-java=202201300605

Reproducer Testcase: https://oss-fuzz.com/download?testcase_id=5965184743636992







[jira] [Created] (ARROW-15558) [Java] FuzzIpcFile: Uncaught exception in java.base/java.nio.Buffer.checkIndex

2022-02-03 Thread Liya Fan (Jira)
Liya Fan created ARROW-15558:


 Summary: [Java] FuzzIpcFile: Uncaught exception in 
java.base/java.nio.Buffer.checkIndex
 Key: ARROW-15558
 URL: https://issues.apache.org/jira/browse/ARROW-15558
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


Detailed Report: https://oss-fuzz.com/testcase?key=5518211972464640

Project: arrow-java
Fuzzing Engine: libFuzzer
Fuzz Target: FuzzIpcFile
Job Type: libfuzzer_asan_arrow-java
Platform Id: linux

Crash Type: Uncaught exception
Crash Address: 
Crash State:
  java.base/java.nio.Buffer.checkIndex
  java.base/java.nio.HeapByteBuffer.getInt
  com.google.flatbuffers.Table.__reset
  
Sanitizer: address (ASAN)

Crash Revision: 
https://oss-fuzz.com/revisions?job=libfuzzer_asan_arrow-java=202201300605

Reproducer Testcase: https://oss-fuzz.com/download?testcase_id=5518211972464640







[jira] [Created] (ARROW-15557) [Java] FuzzIpcFile: Uncaught exception in java.base/java.nio.HeapByteBuffer.<init>

2022-02-03 Thread Liya Fan (Jira)
Liya Fan created ARROW-15557:


 Summary: [Java] FuzzIpcFile: Uncaught exception in 
java.base/java.nio.HeapByteBuffer.<init>
 Key: ARROW-15557
 URL: https://issues.apache.org/jira/browse/ARROW-15557
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


Detailed Report: https://oss-fuzz.com/testcase?key=5015797066498048

Project: arrow-java
Fuzzing Engine: libFuzzer
Fuzz Target: FuzzIpcFile
Job Type: libfuzzer_asan_arrow-java
Platform Id: linux

Crash Type: Uncaught exception
Crash Address: 
Crash State:
  java.base/java.nio.HeapByteBuffer.<init>
  java.base/java.nio.ByteBuffer.allocate
  org.apache.arrow.vector.ipc.ArrowFileReader.readSchema
  
Sanitizer: address (ASAN)

Recommended Security Severity: Low

Crash Revision: 
https://oss-fuzz.com/revisions?job=libfuzzer_asan_arrow-java=202201300605

Reproducer Testcase: https://oss-fuzz.com/download?testcase_id=5015797066498048






[jira] [Created] (ARROW-15502) [Java] Detect exceptional footer size in Arrow file reader

2022-01-29 Thread Liya Fan (Jira)
Liya Fan created ARROW-15502:


 Summary: [Java] Detect exceptional footer size in Arrow file reader
 Key: ARROW-15502
 URL: https://issues.apache.org/jira/browse/ARROW-15502
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When a malformed Arrow file containing an extremely large footer size (much 
larger than the file size) is fed to the ArrowFileReader, our implementation 
fails to detect the problem, due to integer arithmetic overflow.

This leads to an extremely large memory allocation, eventually causing an 
OutOfMemoryError.
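A hypothetical illustration of the fix direction (not the actual ArrowFileReader code): validate the footer size against the file length in long arithmetic, so an oversized value cannot wrap around an int comparison.

```java
public class FooterCheck {
    // Hypothetical overhead: magic bytes plus the 4-byte footer-size field.
    static final long MIN_OVERHEAD = 10;

    // Compare in long arithmetic: with int math, footerSize + overhead can
    // wrap negative for a huge footerSize and slip past a naive check.
    static boolean isPlausibleFooter(long fileLength, int footerSize) {
        return footerSize >= 0 && (long) footerSize + MIN_OVERHEAD <= fileLength;
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleFooter(1024, 100));               // true
        System.out.println(isPlausibleFooter(1024, Integer.MAX_VALUE)); // false
    }
}
```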





[jira] [Created] (ARROW-15501) [Java] Support validating decimal vectors

2022-01-29 Thread Liya Fan (Jira)
Liya Fan created ARROW-15501:


 Summary: [Java] Support validating decimal vectors
 Key: ARROW-15501
 URL: https://issues.apache.org/jira/browse/ARROW-15501
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Support validating decimal vectors and check precisions.





[jira] [Created] (ARROW-14696) [Java] Reset vectors before populating JDBC data when reusing vector schema root

2021-11-12 Thread Liya Fan (Jira)
Liya Fan created ARROW-14696:


 Summary: [Java] Reset vectors before populating JDBC data when 
reusing vector schema root
 Key: ARROW-14696
 URL: https://issues.apache.org/jira/browse/ARROW-14696
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When the option of reusing the vector schema root is enabled, the vectors 
should be reset before populating data from JDBC.
Skipping the reset may cause problems when there are null values in the 
previous batch. 

For details, please see https://github.com/apache/arrow/issues/11589.





[jira] [Created] (ARROW-13811) [Java] Provide a general out-of-place sorter

2021-08-31 Thread Liya Fan (Jira)
Liya Fan created ARROW-13811:


 Summary: [Java] Provide a general out-of-place sorter
 Key: ARROW-13811
 URL: https://issues.apache.org/jira/browse/ARROW-13811
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The sorter should work for any type of vectors, with a time complexity of 
O(n*log(n)). 

Since it does not make any assumptions about the memory layout of the vector, 
its performance can be sub-optimal. So if a more specific sorter is applicable 
(e.g. {{FixedWidthInPlaceVectorSorter}}), it should be preferred. 
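A layout-agnostic out-of-place sort can be sketched by sorting an index permutation with a user-supplied comparator and then copying in sorted order; this toy version (not the actual vector sorter API) shows the O(n*log(n)) idea:

```java
import java.util.Arrays;
import java.util.Comparator;

public class OutOfPlaceSort {
    // Sort indices 0..n-1 by a comparator over element positions. Because it
    // only compares positions, it works for any element representation, which
    // is why it is general but slower than a specialized in-place sorter.
    static int[] sortedIndices(int n, Comparator<Integer> byElement) {
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, byElement); // O(n log n)
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = idx[i];
        return out;
    }

    public static void main(String[] args) {
        int[] data = {30, 10, 20};
        int[] order = sortedIndices(data.length, Comparator.comparingInt(i -> data[i]));
        System.out.println(Arrays.toString(order)); // [1, 2, 0]
    }
}
```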





[jira] [Created] (ARROW-13792) [Java] The toString representation is incorrect for unsigned integer vectors

2021-08-30 Thread Liya Fan (Jira)
Liya Fan created ARROW-13792:


 Summary: [Java] The toString representation is incorrect for 
unsigned integer vectors
 Key: ARROW-13792
 URL: https://issues.apache.org/jira/browse/ARROW-13792
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When adding the byte 0xff to a UInt1Vector, the {{toString}} method produces 
{{[-1]}}. Since the vector contains unsigned integers, the correct result 
should be {{[255]}}.
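A minimal sketch of the expected rendering, assuming the fix routes the stored byte through an unsigned widening such as {{Byte.toUnsignedInt}} (illustrative, not the actual UInt1Vector code):

```java
public class UnsignedToString {
    // The stored byte is sign-agnostic; rendering must widen without sign
    // extension so 0xff prints as 255 rather than -1.
    static String render(byte stored) {
        return String.valueOf(Byte.toUnsignedInt(stored));
    }

    public static void main(String[] args) {
        System.out.println(render((byte) 0xff)); // 255
    }
}
```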





[jira] [Created] (ARROW-13733) [Java] Allow JDBC adapters to reuse vector schema roots

2021-08-24 Thread Liya Fan (Jira)
Liya Fan created ARROW-13733:


 Summary: [Java] Allow JDBC adapters to reuse vector schema roots
 Key: ARROW-13733
 URL: https://issues.apache.org/jira/browse/ARROW-13733
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


According to the current design of the JDBC adapter, it is not possible to 
reuse the vector schema roots. That is, a new vector schema root is created and 
released for each batch.

This can cause performance problems, because in many scenarios, the client code 
only reads data in vector schema root. So the vector schema roots can be reused 
in the following cycle: populate data -> client use data -> populate data -> ...

The current design has another problem: most of the time, two alternating 
vector schema roots are kept in memory, wasting a large amount of memory, 
especially for large batches.

We solve both problems by providing a config flag that allows the user to 
reuse the vector schema roots. 
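The reuse cycle can be sketched with a stand-in batch class (all names here are hypothetical; the real adapter works on VectorSchemaRoot):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ReuseCycle {
    // Stand-in for a vector schema root holding one batch of data.
    static class Batch {
        final List<Integer> values = new ArrayList<>();
        void reset() { values.clear(); } // analogous to resetting the vectors
    }

    static int allocations = 0;

    // reuse=false: allocate a fresh Batch per iteration (the current design).
    // reuse=true: reset and repopulate one shared Batch (the proposed flag).
    static void run(boolean reuse, int batches, Consumer<Batch> client) {
        Batch shared = null;
        for (int i = 0; i < batches; i++) {
            Batch b;
            if (reuse && shared != null) {
                shared.reset();
                b = shared;
            } else {
                b = new Batch();
                allocations++;
                shared = b;
            }
            b.values.add(i);  // populate data
            client.accept(b); // client uses data before the next populate
        }
    }

    public static void main(String[] args) {
        run(false, 5, b -> {});
        System.out.println("without reuse: " + allocations + " allocations");
        allocations = 0;
        run(true, 5, b -> {});
        System.out.println("with reuse: " + allocations + " allocation");
    }
}
```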





[jira] [Created] (ARROW-13645) [Java] Allow NullVectors to have distinct field names

2021-08-17 Thread Liya Fan (Jira)
Liya Fan created ARROW-13645:


 Summary: [Java] Allow NullVectors to have distinct field names
 Key: ARROW-13645
 URL: https://issues.apache.org/jira/browse/ARROW-13645
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan
 Fix For: 6.0.0


As discussed in the ML 
(https://lists.apache.org/thread.html/r19aad8a34f63334d3fcd627106f69be13f689740b83930a4eecc17b3%40%3Cdev.arrow.apache.org%3E),
 the current implementation uses hard-coded field names. This may cause 
problems, for example, when reconstructing a usable schema from a list of 
FieldVectors.

So we resolve this problem by allowing NullVectors to have distinct field 
names. 





[jira] [Created] (ARROW-13604) [Java] Remove deprecation annotations for APIs representing unsupported operations

2021-08-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-13604:


 Summary: [Java] Remove deprecation annotations for APIs 
representing unsupported operations
 Key: ARROW-13604
 URL: https://issues.apache.org/jira/browse/ARROW-13604
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Some APIs representing unsupported operations should not be annotated 
deprecated, unless the overridden APIs are deprecated in the super 
classes/interfaces.

According to the discussion in 
https://github.com/apache/arrow/pull/10864#issuecomment-895707729, we open a 
separate issue for this. 





[jira] [Created] (ARROW-13544) [Java] Remove APIs that have been deprecated for long

2021-08-04 Thread Liya Fan (Jira)
Liya Fan created ARROW-13544:


 Summary: [Java] Remove APIs that have been deprecated for long
 Key: ARROW-13544
 URL: https://issues.apache.org/jira/browse/ARROW-13544
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For some APIs, it has been a long time since they were annotated deprecated. 
During this time, a number of releases have been published. So it is time to 
get rid of them.

Please also note that some APIs representing unsupported operations should not 
be annotated deprecated, unless the overridden APIs are deprecated in the super 
classes/interfaces. 





[jira] [Created] (ARROW-13443) [C++] Fix the incorrect mapping from flatbuf::MetadataVersion to arrow::ipc::MetadataVersion

2021-07-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-13443:


 Summary: [C++] Fix the incorrect mapping from 
flatbuf::MetadataVersion to arrow::ipc::MetadataVersion
 Key: ARROW-13443
 URL: https://issues.apache.org/jira/browse/ARROW-13443
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Liya Fan
Assignee: Liya Fan


The mapping for V3 is incorrect. 





[jira] [Created] (ARROW-13194) [Java]]

2021-06-28 Thread Liya Fan (Jira)
Liya Fan created ARROW-13194:


 Summary: [Java]]
 Key: ARROW-13194
 URL: https://issues.apache.org/jira/browse/ARROW-13194
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Liya Fan








[jira] [Created] (ARROW-13147) [Java] Respect the rounding policy when allocating vector buffers

2021-06-23 Thread Liya Fan (Jira)
Liya Fan created ARROW-13147:


 Summary: [Java] Respect the rounding policy when allocating vector 
buffers
 Key: ARROW-13147
 URL: https://issues.apache.org/jira/browse/ARROW-13147
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


According to the current implementation, the default "next power of two" 
rounding policy is assumed when allocating buffers for a vector. 

In particular, for fixed width vectors, this policy is applied for the validity 
and data buffers, and for variable width vectors, this policy is applied for 
the validity and offset buffers. 

However, this default policy is not always the one used by the allocator. When 
an alternative policy is in use, buffers allocated under the default policy 
will have inappropriate capacities, which may waste memory. 
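The divergence between policies can be illustrated as follows (the 64-byte policy is a made-up example of an alternative rounding policy, not an actual allocator option):

```java
public class RoundingPolicy {
    // Default "next power of two" rounding currently assumed for vector buffers.
    static long nextPowerOfTwo(long requested) {
        return requested <= 1 ? 1 : Long.highestOneBit(requested - 1) << 1;
    }

    // A made-up alternative policy: round up to the next 64-byte multiple.
    static long roundTo64(long requested) {
        return (requested + 63) & ~63L;
    }

    public static void main(String[] args) {
        // If the allocator rounds to 64-byte multiples but the vector assumes
        // power-of-two rounding, 2048 bytes are requested where 1088 suffice.
        System.out.println(nextPowerOfTwo(1025) + " vs " + roundTo64(1025)); // 2048 vs 1088
    }
}
```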






[jira] [Created] (ARROW-12310) [Java] ValueVector#getObject should support covariance for complex types

2021-04-09 Thread Liya Fan (Jira)
Liya Fan created ARROW-12310:


 Summary: [Java] ValueVector#getObject should support covariance 
for complex types
 Key: ARROW-12310
 URL: https://issues.apache.org/jira/browse/ARROW-12310
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Currently, the {{ValueVector#getObject}} API supports covariance for primitive 
types.
For example, {{IntVector#getObject}} returns {{Integer}} while 
{{BitVector#getObject}} returns {{Boolean}}.

For complex types, we should also support covariance. For example, 
{{ListVector#getObject}} should return a {{List}}.

This will help reduce unnecessary casts and enforce type safety. 
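Covariant return types are plain Java; a small stand-alone sketch (a hypothetical getter interface, not the ValueVector API) shows how narrowing the return type removes the cast:

```java
import java.util.List;

public class Covariance {
    interface ValueGetter {
        Object get(); // the general contract, like ValueVector#getObject
    }

    static class IntGetter implements ValueGetter {
        @Override public Integer get() { return 42; } // covariant override
    }

    static class ListGetter implements ValueGetter {
        @Override public List<Integer> get() { return List.of(1, 2, 3); }
    }

    public static void main(String[] args) {
        // No cast needed: the static return type is already List<Integer>.
        List<Integer> values = new ListGetter().get();
        System.out.println(values.size()); // 3
    }
}
```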





[jira] [Created] (ARROW-11999) [Java] Support parallel vector element search with user-specified comparator

2021-03-17 Thread Liya Fan (Jira)
Liya Fan created ARROW-11999:


 Summary: [Java] Support parallel vector element search with 
user-specified comparator
 Key: ARROW-11999
 URL: https://issues.apache.org/jira/browse/ARROW-11999
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is in response to the discussion in 
https://github.com/apache/arrow/pull/5631#discussion_r339110228

Currently, we only support parallel search with {{RangeEqualsVisitor}}, which 
does not support user-specified comparators.
We want to provide this functionality to support a wider range of 
use cases. 





[jira] [Created] (ARROW-11901) [Java] Investigate potential performance improvement of compression codec

2021-03-07 Thread Liya Fan (Jira)
Liya Fan created ARROW-11901:


 Summary: [Java] Investigate potential performance improvement of 
compression codec
 Key: ARROW-11901
 URL: https://issues.apache.org/jira/browse/ARROW-11901
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In response to the discussion in 
https://github.com/apache/arrow/pull/8949/files#r588046787

There are some performance penalties in the implementation of the compression 
codecs (e.g. data copying between heap/off-heap data). We need to revise the 
code to improve the performance. 

We should also provide some benchmarks to validate that the performance 
actually improves. 





[jira] [Created] (ARROW-11899) [Java] Refactor the compression codec implementation into core/Arrow specific parts

2021-03-07 Thread Liya Fan (Jira)
Liya Fan created ARROW-11899:


 Summary: [Java] Refactor the compression codec implementation into 
core/Arrow specific parts
 Key: ARROW-11899
 URL: https://issues.apache.org/jira/browse/ARROW-11899
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This issue is in response to the discussion in 
https://github.com/apache/arrow/pull/8949/files#r588049088

We want to refactor the compression codec related code into two parts: one for 
the core compression logic, and the other for Arrow specific logic.

This will make it easier to support other compression types. 





[jira] [Created] (ARROW-11081) [Java] Make IPC option immutable

2020-12-30 Thread Liya Fan (Jira)
Liya Fan created ARROW-11081:


 Summary: [Java] Make IPC option immutable
 Key: ARROW-11081
 URL: https://issues.apache.org/jira/browse/ARROW-11081
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


By making it immutable, the following benefits can be obtained:
1. It makes the code easier to reason about.
2. It allows JIT to make more optimizations.
3. Immutable objects can be shared, so many object allocations can be avoided.
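A minimal sketch of the immutable shape (the field names are hypothetical, not the actual IpcOption API): final fields, no setters, and with-style copies instead of mutation.

```java
public final class ImmutableIpcOption {
    private final boolean legacyFormat;
    private final int metadataVersion;

    public ImmutableIpcOption(boolean legacyFormat, int metadataVersion) {
        this.legacyFormat = legacyFormat;
        this.metadataVersion = metadataVersion;
    }

    public boolean legacyFormat() { return legacyFormat; }
    public int metadataVersion() { return metadataVersion; }

    // "Changing" a field returns a new instance instead of mutating this one,
    // so existing holders can keep sharing the original safely.
    public ImmutableIpcOption withMetadataVersion(int v) {
        return new ImmutableIpcOption(legacyFormat, v);
    }

    public static void main(String[] args) {
        ImmutableIpcOption base = new ImmutableIpcOption(false, 4);
        ImmutableIpcOption v5 = base.withMetadataVersion(5);
        System.out.println(base.metadataVersion() + " " + v5.metadataVersion()); // 4 5
    }
}
```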





[jira] [Created] (ARROW-10880) [Java] Support compressing RecordBatch IPC buffers by LZ4

2020-12-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-10880:


 Summary: [Java] Support compressing RecordBatch IPC buffers by LZ4
 Key: ARROW-10880
 URL: https://issues.apache.org/jira/browse/ARROW-10880
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Support compressing/decompressing RecordBatch IPC buffers by LZ4.





[jira] [Created] (ARROW-10749) [C++] Incorrect string format for Datum with the collection type

2020-11-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-10749:


 Summary: [C++] Incorrect string format for Datum with the 
collection type
 Key: ARROW-10749
 URL: https://issues.apache.org/jira/browse/ARROW-10749
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Liya Fan
Assignee: Liya Fan


The current format looks like {{Collection(...}}; the closing brace is 
omitted. 





[jira] [Created] (ARROW-10662) [Java] Avoid integer overflow for Json file reader

2020-11-19 Thread Liya Fan (Jira)
Liya Fan created ARROW-10662:


 Summary: [Java] Avoid integer overflow for Json file reader
 Key: ARROW-10662
 URL: https://issues.apache.org/jira/browse/ARROW-10662
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current implementation uses {{int}} to represent the buffer size. 
However, the buffer can be larger than Integer.MAX_VALUE, which leads to 
integer overflow and unexpected behavior. 
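The overflow can be demonstrated in isolation (illustrative arithmetic only, not the JSON reader's actual code):

```java
public class BufferSize {
    // Summing two sizes in int arithmetic wraps negative past 2^31 - 1.
    static int sizeAsInt(long a, long b) { return (int) a + (int) b; }

    static long sizeAsLong(long a, long b) { return a + b; }

    public static void main(String[] args) {
        long size = Integer.MAX_VALUE; // e.g. two buffers of ~2 GiB each
        System.out.println(sizeAsInt(size, size));  // -2 (overflow)
        System.out.println(sizeAsLong(size, size)); // 4294967294
    }
}
```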





[jira] [Created] (ARROW-10492) [Java][JDBC] Allow users to config the mapping between SQL types and Arrow types

2020-11-03 Thread Liya Fan (Jira)
Liya Fan created ARROW-10492:


 Summary: [Java][JDBC] Allow users to config the mapping between 
SQL types and Arrow types
 Key: ARROW-10492
 URL: https://issues.apache.org/jira/browse/ARROW-10492
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


According to the current implementation of JDBC adapter, the conversion between 
SQL types and Arrow types is hard-coded. This will cause some problems in 
practice:
 # The appropriate conversion may vary for different databases. For example, 
for SQL Server, type {{real}} corresponds to 4 byte floating point values 
([https://docs.microsoft.com/en-us/sql/t-sql/data-types/float-and-real-transact-sql?view=sql-server-ver15]),
 whereas for SQLite, {{real}} corresponds to 8 byte floating point values 
([https://www.sqlitetutorial.net/sqlite-data-types/]). If the mapping is not 
right, extra conversions will happen, which can impact performance severely. 
 # Our current implementation determines the type conversion solely based on 
the type ID. However, the appropriate conversion may also depend on other 
information, like precision and scale. For example, {{FLOAT(n)}} should 
correspond to 4 byte floating point values if n <= 24, and to 8 byte floating 
point values otherwise.  

To address the problems, we should allow users to customize the conversion 
between SQL and Arrow types. 
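A sketch of what a user-configurable mapping could look like (all names here are hypothetical, not the adapter's actual API):

```java
import java.sql.Types;
import java.util.function.Function;

public class TypeMapping {
    record TypeInfo(int jdbcType, int precision) {}

    // A default mapping keyed on more than the type ID: FLOAT(n) widens to
    // 8-byte floats when n > 24, matching the note above.
    static String defaultMapping(TypeInfo t) {
        if (t.jdbcType() == Types.FLOAT) {
            return t.precision() <= 24 ? "Float4" : "Float8";
        }
        if (t.jdbcType() == Types.REAL) {
            return "Float4"; // appropriate for SQL Server's real
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        System.out.println(defaultMapping(new TypeInfo(Types.FLOAT, 24))); // Float4
        System.out.println(defaultMapping(new TypeInfo(Types.FLOAT, 53))); // Float8

        // A SQLite user could override the default, mapping real to Float8:
        Function<TypeInfo, String> sqlite =
            t -> t.jdbcType() == Types.REAL ? "Float8" : defaultMapping(t);
        System.out.println(sqlite.apply(new TypeInfo(Types.REAL, 0))); // Float8
    }
}
```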





[jira] [Created] (ARROW-10388) [Java] Fix spark integration builds

2020-10-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-10388:


 Summary: [Java] Fix spark integration builds
 Key: ARROW-10388
 URL: https://issues.apache.org/jira/browse/ARROW-10388
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


As discussed in 
[https://github.com/apache/arrow/pull/8475#issuecomment-716377181], the 
integration build is failing because we have changed the constructor API. 

We need to restore the original constructor and make it deprecated. 





[jira] [Created] (ARROW-10294) [Java] Resolve problems of DecimalVector APIs on ArrowBufs

2020-10-12 Thread Liya Fan (Jira)
Liya Fan created ARROW-10294:


 Summary: [Java] Resolve problems of DecimalVector APIs on ArrowBufs
 Key: ARROW-10294
 URL: https://issues.apache.org/jira/browse/ARROW-10294
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Unlike other fixed width vectors, DecimalVectors have some APIs that directly 
manipulate an ArrowBuf (e.g. {{void set(int index, int isSet, int start, 
ArrowBuf buffer)}}).

After supporting 64-bit ArrowBufs, we need to adjust such APIs so that they 
work properly. 





[jira] [Created] (ARROW-10277) [C++] Support comparing scalars approximately

2020-10-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-10277:


 Summary: [C++] Support comparing scalars approximately
 Key: ARROW-10277
 URL: https://issues.apache.org/jira/browse/ARROW-10277
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Liya Fan
Assignee: Liya Fan


As discussed in 
[https://github.com/apache/arrow/pull/7748#discussion_r469997286], we need to 
compare scalars approximately in some scenarios. 





[jira] [Created] (ARROW-10107) [Doc] Provide Java development guide

2020-09-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-10107:


 Summary: [Doc] Provide Java development guide
 Key: ARROW-10107
 URL: https://issues.apache.org/jira/browse/ARROW-10107
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Documentation
Reporter: Liya Fan


We need a document to help developers with development related issues. 





[jira] [Created] (ARROW-10069) [Java] Support running Java benchmarks from command line

2020-09-22 Thread Liya Fan (Jira)
Liya Fan created ARROW-10069:


 Summary: [Java] Support running Java benchmarks from command line
 Key: ARROW-10069
 URL: https://issues.apache.org/jira/browse/ARROW-10069
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Liya Fan
Assignee: Liya Fan








[jira] [Created] (ARROW-9797) [Rust] AMD64 Conda Integration Tests is failing for the Master branch

2020-08-19 Thread Liya Fan (Jira)
Liya Fan created ARROW-9797:
---

 Summary: [Rust] AMD64 Conda Integration Tests is failing for the 
Master branch
 Key: ARROW-9797
 URL: https://issues.apache.org/jira/browse/ARROW-9797
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Liya Fan


The integration test is failing:

 

{noformat}

error[E0061]: this function takes 2 arguments but 3 arguments were supplied
 --> datafusion/src/optimizer/filter_push_down.rs:373:22
 |
373 | vec![aggregate_expr("SUM", col("b"), DataType::Int32)
 | ^^ -  --- supplied 3 arguments
 | |
 | expected 2 arguments
 | 
 ::: datafusion/src/logicalplan.rs:667:1
 |
667 | pub fn aggregate_expr(name: &str, expr: Expr) -> Expr {
 | - defined here

{noformat}

 

Rust folks, please take a look. 





[jira] [Created] (ARROW-9680) [Java] Support non-nullable vectors

2020-08-10 Thread Liya Fan (Jira)
Liya Fan created ARROW-9680:
---

 Summary: [Java] Support non-nullable vectors
 Key: ARROW-9680
 URL: https://issues.apache.org/jira/browse/ARROW-9680
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This issue was first discussed in the ML 
([https://lists.apache.org/thread.html/r480387ec9ec822f3ed30e9131109e43874a1c4d18af74ede1a7e41c5%40%3Cdev.arrow.apache.org%3E]),
 from which we have received some feedback.

We briefly restate it here:

 
1. Non-nullable vectors are widely used in practice. For example, in a database 
engine, a column can be declared as not null, so it cannot contain null values.
2. Non-nullable vectors have significant performance advantages over their 
nullable counterparts:
  1) the memory space of the validity buffer can be saved;
  2) manipulation of the validity buffer can be bypassed;
  3) some if-else branches can be replaced by sequential instructions (by the 
JIT compiler), leading to higher throughput for the CPU pipeline. 
 
We open this Jira to facilitate further discussion, and we may provide a 
sample PR, which will help us make a clearer decision. 
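As an illustration of point 3 above, here is a hypothetical standalone sketch (plain Java, not Arrow code): summing a nullable column costs a per-element validity check, while the non-nullable version is a straight-line loop that the JIT can more easily optimize.

```java
import java.util.BitSet;

public class NullableSumSketch {
    // Nullable case: every element costs a validity-bit check (a branch).
    static long sumNullable(int[] values, BitSet validity) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            if (validity.get(i)) { // branch on the validity buffer
                sum += values[i];
            }
        }
        return sum;
    }

    // Non-nullable case: no validity buffer, no branch; the loop is a
    // candidate for auto-vectorization by the JIT compiler.
    static long sumNonNullable(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }
}
```

The non-nullable variant also never touches a validity buffer, which is exactly the memory and manipulation saving described in points 1) and 2).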





[jira] [Created] (ARROW-9540) [WebSite] Java documents are hidden

2020-07-22 Thread Liya Fan (Jira)
Liya Fan created ARROW-9540:
---

 Summary: [WebSite] Java documents are hidden
 Key: ARROW-9540
 URL: https://issues.apache.org/jira/browse/ARROW-9540
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Liya Fan
 Attachments: image-2020-07-22-15-57-03-928.png

In our website (https://arrow.apache.org/docs/), when we choose "Java" from the 
left panel, it goes to the JavaDoc page, so other documents cannot be accessed. 

For other languages (e.g. Python), when we select the language, a menu appears 
so users can select other documents to read.

 !image-2020-07-22-15-57-03-928.png! 





[jira] [Created] (ARROW-9477) [C++] Fix test case TestSchemaMetadata.MetadataVersionForwardCompatibility

2020-07-15 Thread Liya Fan (Jira)
Liya Fan created ARROW-9477:
---

 Summary: [C++] Fix test case 
TestSchemaMetadata.MetadataVersionForwardCompatibility
 Key: ARROW-9477
 URL: https://issues.apache.org/jira/browse/ARROW-9477
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Liya Fan


Test case TestSchemaMetadata.MetadataVersionForwardCompatibility is failing on 
the master branch. 





[jira] [Created] (ARROW-9315) [Java] Fix the failure of testAllocationManagerType

2020-07-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-9315:
---

 Summary: [Java] Fix the failure of testAllocationManagerType
 Key: ARROW-9315
 URL: https://issues.apache.org/jira/browse/ARROW-9315
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The failure sometimes appears in the CI build. 





[jira] [Created] (ARROW-9273) [C++] Add crossbow job to capture build setup

2020-06-30 Thread Liya Fan (Jira)
Liya Fan created ARROW-9273:
---

 Summary: [C++] Add crossbow job to capture build setup
 Key: ARROW-9273
 URL: https://issues.apache.org/jira/browse/ARROW-9273
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Liya Fan


As discussed in 
https://github.com/apache/arrow/pull/7287#issuecomment-645432605, the CI jobs 
cannot capture some build problems. So we want a crossbow job to capture such 
problems. 





[jira] [Created] (ARROW-9169) [C++] Undefined reference when building flight code

2020-06-18 Thread Liya Fan (Jira)
Liya Fan created ARROW-9169:
---

 Summary: [C++] Undefined reference when building flight code
 Key: ARROW-9169
 URL: https://issues.apache.org/jira/browse/ARROW-9169
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Liya Fan


When linking flight-test-server with 
CMakeFiles/flight-test-server.dir/link.txt, I get the following errors:

../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::StrCat[abi:cxx11](absl::AlphaNum const&, absl::AlphaNum const&, 
absl::AlphaNum const&)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::strings_internal::CatPieces[abi:cxx11](std::initializer_list)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::numbers_internal::FastIntToBuffer(long, char*)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::numbers_internal::FastIntToBuffer(unsigned long, char*)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::str_format_internal::FormatPack[abi:cxx11](absl::str_format_internal::UntypedFormatSpecImpl,
 absl::Span)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::numbers_internal::SixDigitsToBuffer(double, char*)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::base_internal::ThrowStdOutOfRange(char const*)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::string_view::find(char, unsigned long) const'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::StrCat[abi:cxx11](absl::AlphaNum const&, absl::AlphaNum const&, 
absl::AlphaNum const&, absl::AlphaNum const&)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::FormatTime[abi:cxx11](absl::Time)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to `bool 
absl::str_format_internal::FormatArgImpl::Dispatch(absl::str_format_internal::FormatArgImpl::Data, 
absl::str_format_internal::FormatConversionSpecImpl, void*)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to `bool 
absl::str_format_internal::FormatArgImpl::Dispatch, std::allocator > 
>(absl::str_format_internal::FormatArgImpl::Data, 
absl::str_format_internal::FormatConversionSpecImpl, void*)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::numbers_internal::FastIntToBuffer(unsigned int, char*)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::EqualsIgnoreCase(absl::string_view, absl::string_view)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::StrCat[abi:cxx11](absl::AlphaNum const&, absl::AlphaNum const&)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::numbers_internal::FastIntToBuffer(int, char*)'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::optional_internal::throw_bad_optional_access()'
../../../debug/libarrow_flight.so.100.0.0: undefined reference to 
`absl::ByChar::Find(absl::string_view, unsigned long) const'

It seems absl was installed successfully by CMake. 





[jira] [Commented] (ARROW-8909) [Java] Out of order writes using setSafe

2020-06-04 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126371#comment-17126371
 ] 

Liya Fan commented on ARROW-8909:
-

[~saurabhm] Following your suggestion, we have made it clear in the document 
that variable-width vectors do not support out-of-order writes. Could you 
please check it (https://github.com/apache/arrow/pull/7354)?

> [Java] Out of order writes using setSafe
> 
>
> Key: ARROW-8909
> URL: https://issues.apache.org/jira/browse/ARROW-8909
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Saurabh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I noticed that calling setSafe on a VarCharVector with indices not in 
> increasing order causes the lastIndex to be set to the index in the last call 
> to setSafe.
> Is this a documented and expected behavior ?
> Sample code:
> {code:java}
> import java.util.Collections;
> import lombok.extern.slf4j.Slf4j;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
> import org.apache.arrow.vector.util.Text;
> @Slf4j
> public class ATest {
>   public static void main() {
> Schema schema = new 
> Schema(Collections.singletonList(Field.nullable("Data", new 
> ArrowType.Utf8())));
> try (VectorSchemaRoot vroot = VectorSchemaRoot.create(schema, new 
> RootAllocator())) {
>   VarCharVector vec = (VarCharVector) vroot.getVector("Data");
>   for (int i = 0; i < 10; i++) {
> vec.setSafe(i, new Text(Integer.toString(i) + "_mtest"));
>   }
>   vec.setSafe(7, new Text(Integer.toString(7) + "_new"));
>   log.info("Data at index 8 Before {}", vec.getObject(8));
>   vroot.setRowCount(10);
>   log.info("Data at index 8 After {}", vec.getObject(8));
>   log.info(vroot.contentToTSVString());
> }
>   }
> }
> {code}
>  
> If I don't set the index 7 after the loop, I get all the 0_mtest, 1_mtest, 
> ..., 9_mtest entries.
> If I set index 7 after the loop, I see 0_mtest, ..., 5_mtest, 6_mtext, 7_new,
>     Before the setRowCount, the data at index 8 is -> *st8_mtest*  ; index 9 
> is *9_mtest*
>    After the setRowCount, the data at index 8 is -> "" ; index  9 is ""
> With a text with more chars instead of 4 with _new, it keeps eating into the 
> data at the following indices.
>  





[jira] [Created] (ARROW-9010) [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-06-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-9010:
---

 Summary: [Java] Framework and interface changes for RecordBatch 
IPC buffer compression
 Key: ARROW-9010
 URL: https://issues.apache.org/jira/browse/ARROW-9010
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is the first sub-work item of ARROW-8672 (
[Java] Implement RecordBatch IPC buffer compression from ARROW-300). However, 
it does not involve any concrete compression algorithms. The purpose of this PR 
is to establish basic interfaces for data compression, and to change the IPC 
framework so that different compression algorithms can be plugged in smoothly. 





[jira] [Created] (ARROW-8973) [Java] Support batch value appending for large varchar/varbinary vectors

2020-05-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-8973:
---

 Summary: [Java] Support batch value appending for large 
varchar/varbinary vectors
 Key: ARROW-8973
 URL: https://issues.apache.org/jira/browse/ARROW-8973
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan


Support appending values in batch for LargeVarCharVector/LargeVarBinaryVector.





[jira] [Created] (ARROW-8972) [Java] Support range value comparison for large varchar/varbinary vectors

2020-05-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-8972:
---

 Summary: [Java] Support range value comparison for large 
varchar/varbinary vectors
 Key: ARROW-8972
 URL: https://issues.apache.org/jira/browse/ARROW-8972
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Support comparing a range of values for LargeVarCharVector and 
LargeVarBinaryVector.





[jira] [Created] (ARROW-8940) [Java] Fix the performance degradation of integration tests

2020-05-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-8940:
---

 Summary: [Java] Fix the performance degradation of integration 
tests
 Key: ARROW-8940
 URL: https://issues.apache.org/jira/browse/ARROW-8940
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In the past, we ran integration tests from main methods; recently, we changed 
this to run them via the failsafe plugin. 

This is a good change, but it also leads to significant performance 
degradation. In the past, it took about 10 seconds to run 
{{ITTestLargeVector#testLargeDecimalVector}}; now it takes more than half an 
hour. 

Our investigation shows that the problem is caused by repeated calls to 
{{HistoricalLog#recordEvent}}. This method is called only when 
{{BaseAllocator#DEBUG}} is enabled, and in a unit/integration test, the flag is 
enabled by default. 

We solve the problem with the following steps:
1. We set a system property to disable the {{BaseAllocator#DEBUG}} flag.
2. We change the logic so that the system property takes precedence over the 
{{AssertionUtil#isAssertionsEnabled}} method. 

This makes the integration tests as fast as before. 
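For example, assuming the debug flag is controlled by the {{arrow.memory.debug.allocator}} system property (the exact property name should be verified against the allocator source for your Arrow version), the flag could be disabled on the command line when running the integration tests:

```shell
# Hypothetical invocation: disable allocator debug logging for integration
# tests (the property name arrow.memory.debug.allocator is an assumption).
mvn -pl vector failsafe:integration-test \
    -Darrow.memory.debug.allocator=false
```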





[jira] [Commented] (ARROW-8909) [Java] Out of order writes using setSafe

2020-05-25 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116350#comment-17116350
 ] 

Liya Fan commented on ARROW-8909:
-

[~saurabhm] Thank you for reporting the problem.
I think the behavior is by design. For variable-width vectors, we do not 
support setting values in random order, as that might cause a severe 
performance penalty. 
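To see why out-of-order writes are problematic for this layout, consider a minimal standalone sketch (hypothetical code, not Arrow's implementation) of a variable-width vector: values live back-to-back in one data buffer, and an offsets array marks where each value ends. Rewriting an earlier index with a value of a different length overwrites the bytes of the values that follow it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical miniature of a var-width vector layout: one packed data
// buffer plus an offsets array (offsets[i] .. offsets[i + 1] delimit value i).
class MiniVarCharVector {
    private final byte[] data = new byte[64];
    private final int[] offsets = new int[17];

    void set(int index, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int start = offsets[index];
        // Writing always starts where the previous value ended, so a longer
        // rewrite of an earlier index clobbers the following values' bytes.
        System.arraycopy(bytes, 0, data, start, bytes.length);
        offsets[index + 1] = start + bytes.length;
    }

    String get(int index) {
        return new String(data, offsets[index],
                offsets[index + 1] - offsets[index], StandardCharsets.UTF_8);
    }
}
```

In this layout, sequential writes behave as expected, but rewriting index 0 with a longer value shifts its end offset past the start of value 1 and corrupts it; this mirrors the effect the reporter observed with the out-of-order {{setSafe(7, ...)}} call. Supporting random-order writes would require shifting all trailing bytes on every such write, hence the performance concern.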

> [Java] Out of order writes using setSafe
> 
>
> Key: ARROW-8909
> URL: https://issues.apache.org/jira/browse/ARROW-8909
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Saurabh
>Priority: Major
>
> I noticed that calling setSafe on a VarCharVector with indices not in 
> increasing order causes the lastIndex to be set to the index in the last call 
> to setSafe.
> Is this a documented and expected behavior ?
> Sample code:
> {code:java}
> import java.util.Collections;
> import lombok.extern.slf4j.Slf4j;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
> import org.apache.arrow.vector.util.Text;
> @Slf4j
> public class ATest {
>   public static void main() {
> Schema schema = new 
> Schema(Collections.singletonList(Field.nullable("Data", new 
> ArrowType.Utf8())));
> try (VectorSchemaRoot vroot = VectorSchemaRoot.create(schema, new 
> RootAllocator())) {
>   VarCharVector vec = (VarCharVector) vroot.getVector("Data");
>   for (int i = 0; i < 10; i++) {
> vec.setSafe(i, new Text(Integer.toString(i) + "_mtest"));
>   }
>   vec.setSafe(7, new Text(Integer.toString(7) + "_new"));
>   log.info("Data at index 8 Before {}", vec.getObject(8));
>   vroot.setRowCount(10);
>   log.info("Data at index 8 After {}", vec.getObject(8));
>   log.info(vroot.contentToTSVString());
> }
>   }
> }
> {code}
>  
> If I don't set the index 7 after the loop, I get all the 0_mtest, 1_mtest, 
> ..., 9_mtest entries.
> If I set index 7 after the loop, I see 0_mtest, ..., 5_mtest, 6_mtext, 7_new,
>     Before the setRowCount, the data at index 8 is -> *st8_mtest*  ; index 9 
> is *9_mtest*
>    After the setRowCount, the data at index 8 is -> "" ; index  9 is ""
> With a text with more chars instead of 4 with _new, it keeps eating into the 
> data at the following indices.
>  





[jira] [Commented] (ARROW-8402) [Java] Support ValidateFull methods in Java

2020-05-18 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110259#comment-17110259
 ] 

Liya Fan commented on ARROW-8402:
-

Oh sorry, I have already started working on this over the weekend. 

> [Java] Support ValidateFull methods in Java
> ---
>
> Key: ARROW-8402
> URL: https://issues.apache.org/jira/browse/ARROW-8402
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>
> We need to support ValidateFull methods in Java, just like we do in C++. 
> This is required by ARROW-5926.





[jira] [Commented] (ARROW-8803) [Java] Row count should be set before loading buffers in VectorLoader

2020-05-15 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108163#comment-17108163
 ] 

Liya Fan commented on ARROW-8803:
-

As you have indicated, {{root.setRowCount}} calls the {{setValueCount}} 
methods of the underlying vectors, and the {{setValueCount}} methods may 
involve allocations for those vectors. 

If we moved the {{root.setRowCount}} call to the front, it would lead to 
unnecessary vector allocations, as the underlying buffers are populated 
shortly afterwards.

In fact, we are working on supporting data compression in IPC scenarios 
(ARROW-8672). Hopefully it will solve your problem. 

> [Java] Row count should be set before loading buffers in VectorLoader
> -
>
> Key: ARROW-8803
> URL: https://issues.apache.org/jira/browse/ARROW-8803
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Rong Ma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Hi guys! I'm new to the community, and I've been using Arrow for some time. 
> In my use case, I need to read RecordBatch with *compressed* underlying 
> buffers using Java's IPC API, and I'm finally blocked by the VectorLoader's 
> "load" method. In this method,
> {quote}{{root.setRowCount(recordBatch.getLength());}}
> {quote}
> It not only set the rowCount for the root, but also set the valueCount for 
> the vectors the root holds, *which have already been set once when load 
> buffers.*
> It's not a bug... I know. But if I try to load some compressed buffers, I 
> will get the following exceptions:
> {quote}java.lang.IndexOutOfBoundsException: index: 0, length: 512 (expected: 
> range(0, 504))
>  at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:718)
>  at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:965)
>  at 
> org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:439)
>  at 
> org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:708)
>  at 
> org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:226)
>  at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
>  at 
> org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205)
>  at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:122)
> {quote}
> And I start to think that if it would be more make sense to call 
> root.setRowCount before loadbuffers?
> In root.setRowCount it also calls each vector's setValueCount, which I think 
> is unnecessary here since the vectors after calling loadbuffers are already 
> formed.
> Another existing piece of code upstream is similar to this change. 
> [link|https://github.com/apache/arrow/blob/ed1f771dccdde623ce85e212eccb2b573185c461/java/vector/src/main/java/org/apache/arrow/vector/ipc/JsonFileReader.java#L170-L178]





[jira] [Commented] (ARROW-8803) [Java] Row count should be set before loading buffers In VectorLoader

2020-05-14 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107826#comment-17107826
 ] 

Liya Fan commented on ARROW-8803:
-

[~rongma] Thanks for reporting this problem.
I am curious how you are compressing the buffers, as our framework does not 
support compression yet.

> [Java] Row count should be set before loading buffers In VectorLoader
> -
>
> Key: ARROW-8803
> URL: https://issues.apache.org/jira/browse/ARROW-8803
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Rong Ma
>Priority: Major
> Fix For: 1.0.0
>
>
> Hi guys! I'm new to the community, and I've been using Arrow for some time. 
> In my use case, I need to read RecordBatch with *compressed* underlying 
> buffers using Java's IPC API, and I'm finally blocked by the VectorLoader's 
> "load" method. In this method,
> {quote}{{root.setRowCount(recordBatch.getLength());}}
> {quote}
> It not only set the rowCount for the root, but also set the valueCount for 
> the vectors the root holds, *which have already been set once when load 
> buffers.*
> It's not a bug... I know. But if I try to load some compressed buffers, I 
> will get the following exceptions:
> {quote}java.lang.IndexOutOfBoundsException: index: 0, length: 512 (expected: 
> range(0, 504))
>  at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:718)
>  at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:965)
>  at 
> org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:439)
>  at 
> org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:708)
>  at 
> org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:226)
>  at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
>  at 
> org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205)
>  at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:122)
> {quote}
> And I start to think that if it would be more make sense to call 
> root.setRowCount before loadbuffers?
> In root.setRowCount it also calls each vector's setValueCount, which I think 
> is unnecessary here since the vectors after calling loadbuffers are already 
> formed.
> Another existing piece of code upstream is similar to this change. 
> [link|https://github.com/apache/arrow/blob/ed1f771dccdde623ce85e212eccb2b573185c461/java/vector/src/main/java/org/apache/arrow/vector/ipc/JsonFileReader.java#L170-L178]





[jira] [Created] (ARROW-8771) [C++] Add boost/process library to build support

2020-05-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-8771:
---

 Summary: [C++] Add boost/process library to build support
 Key: ARROW-8771
 URL: https://issues.apache.org/jira/browse/ARROW-8771
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Liya Fan


Some of our test source code requires the process.hpp file (and its dependent 
libraries). Our current build support does not include these files, causing 
build failures like:

fatal error: boost/process.hpp: No such file or directory





[jira] [Created] (ARROW-8761) [C++] Improve the performance of minmax kernel

2020-05-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-8761:
---

 Summary: [C++] Improve the performance of minmax kernel
 Key: ARROW-8761
 URL: https://issues.apache.org/jira/browse/ARROW-8761
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Liya Fan
Assignee: Liya Fan


We improve the performance of the minmax kernel with a simple idea: if the 
current value is smaller than the current min value, there is no need to 
compare it against the current max value, because it must also be smaller than 
the current max. 

This simple trick reduces the expected number of comparisons from 2n to 1.5n, 
which can be notable for large arrays. 
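The idea can be sketched in a few lines (illustrative Java, not the C++ kernel itself): the max comparison is skipped whenever the min branch is taken, whereas the naive version compares every element against both bounds.

```java
public class MinMaxSketch {
    // Returns {min, max} of a non-empty array with at most two comparisons
    // per element, and only one when the value beats the current min.
    static int[] minMax(int[] values) {
        int min = values[0];
        int max = values[0];
        for (int i = 1; i < values.length; i++) {
            int v = values[i];
            if (v < min) {
                min = v; // v < min <= max, so the max comparison is skipped
            } else if (v > max) {
                max = v;
            }
        }
        return new int[] {min, max};
    }
}
```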





[jira] [Assigned] (ARROW-8672) [Java] Implement RecordBatch IPC buffer compression from ARROW-300

2020-05-04 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan reassigned ARROW-8672:
---

Assignee: Liya Fan

> [Java] Implement RecordBatch IPC buffer compression from ARROW-300
> --
>
> Key: ARROW-8672
> URL: https://issues.apache.org/jira/browse/ARROW-8672
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Wes McKinney
>Assignee: Liya Fan
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Assigned] (ARROW-8665) [Java] DenseUnionWriter#setPosition fails with NullPointerException

2020-05-02 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan reassigned ARROW-8665:
---

Assignee: Liya Fan

> [Java] DenseUnionWriter#setPosition fails with NullPointerException
> ---
>
> Key: ARROW-8665
> URL: https://issues.apache.org/jira/browse/ARROW-8665
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.17.0
>Reporter: David Li
>Assignee: Liya Fan
>Priority: Major
>
> The writer always iterates through all BaseWriters, and an array of 128 
> BaseWriters is allocated. So if you do not have 128 typeIds and do not touch 
> all of them, setPosition will give you an exception.





[jira] [Commented] (ARROW-8665) [Java] DenseUnionWriter#setPosition fails with NullPointerException

2020-05-02 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098187#comment-17098187
 ] 

Liya Fan commented on ARROW-8665:
-

[~lidavidm] Sorry for the problem. I will provide a patch for it. 

> [Java] DenseUnionWriter#setPosition fails with NullPointerException
> ---
>
> Key: ARROW-8665
> URL: https://issues.apache.org/jira/browse/ARROW-8665
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.17.0
>Reporter: David Li
>Priority: Major
>
> The writer always iterates through all BaseWriters, and an array of 128 
> BaseWriters is allocated. So if you do not have 128 typeIds and do not touch 
> all of them, setPosition will give you an exception.





[jira] [Commented] (ARROW-8666) [Java] DenseUnionVector has no way to set offset/validity directly

2020-05-02 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098185#comment-17098185
 ] 

Liya Fan commented on ARROW-8666:
-

[~lidavidm] Thanks for opening this issue.
Could you please give more details about your scenario?
Currently, we have {{loadFieldBuffers}}; does it meet your needs?

> [Java] DenseUnionVector has no way to set offset/validity directly
> --
>
> Key: ARROW-8666
> URL: https://issues.apache.org/jira/browse/ARROW-8666
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.17.0
>Reporter: David Li
>Priority: Major
>
> You can set the type ID manually, but you cannot set the offset or validity 
> directly. Ideally, we'd have an API like Python that lets us build it 
> directly from constituent vectors and the offsets/type IDs.





[jira] [Commented] (ARROW-8672) [Java] Implement RecordBatch IPC buffer compression from ARROW-300

2020-05-02 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098183#comment-17098183
 ] 

Liya Fan commented on ARROW-8672:
-

[~wesm] This issue looks interesting. May I try to provide a patch for it?

> [Java] Implement RecordBatch IPC buffer compression from ARROW-300
> --
>
> Key: ARROW-8672
> URL: https://issues.apache.org/jira/browse/ARROW-8672
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (ARROW-8481) [Java] Provide an allocation manager based on Unsafe API

2020-04-16 Thread Liya Fan (Jira)
Liya Fan created ARROW-8481:
---

 Summary: [Java] Provide an allocation manager based on Unsafe API
 Key: ARROW-8481
 URL: https://issues.apache.org/jira/browse/ARROW-8481
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is in response to the discussion in 
https://github.com/apache/arrow/pull/6323#issuecomment-614195070

In this issue, we provide an allocation manager that is capable of allocating 
large (> 2 GB) buffers. In addition, it does not depend on the netty library, 
which aligns with the general trend of removing netty dependencies. In the 
future, we plan to make it the default allocation manager. 
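As a sketch of how an application might opt in, assuming the allocation manager is selected via the {{arrow.allocation.manager.type}} system property (an assumption; the actual selection mechanism should be checked in the arrow-memory sources for your version):

```shell
# Hypothetical startup flag selecting the Unsafe-based allocation manager
# (the property name arrow.allocation.manager.type is an assumption).
java -Darrow.allocation.manager.type=Unsafe -jar my-arrow-app.jar
```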





[jira] [Created] (ARROW-8468) [Document] Fix the incorrect null bits description

2020-04-15 Thread Liya Fan (Jira)
Liya Fan created ARROW-8468:
---

 Summary: [Document] Fix the incorrect null bits description
 Key: ARROW-8468
 URL: https://issues.apache.org/jira/browse/ARROW-8468
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation
Reporter: Liya Fan
Assignee: Liya Fan


The description of the null bits in arrays.rst is incorrect.





[jira] [Commented] (ARROW-7285) [C++] ensure C++ implementation meets clarified dictionary spec

2020-04-15 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083899#comment-17083899
 ] 

Liya Fan commented on ARROW-7285:
-

[~emkornfi...@gmail.com] If nobody else is working on this issue, may I try to 
provide a patch for it?

> [C++] ensure C++ implementation meets clarified dictionary spec
> ---
>
> Key: ARROW-7285
> URL: https://issues.apache.org/jira/browse/ARROW-7285
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: 1.0.0
>
>
> see parent issue.
>  
> CC [~tianchen92]





[jira] [Commented] (ARROW-5926) [Java] Test fuzzer inputs

2020-04-12 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082021#comment-17082021
 ] 

Liya Fan commented on ARROW-5926:
-

[~tianchen92] I see. Thank you. If the change is not too big, maybe there is no 
need to separate them. 

> [Java] Test fuzzer inputs
> -
>
> Key: ARROW-5926
> URL: https://issues.apache.org/jira/browse/ARROW-5926
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are developing a fuzzer-based corpus of malformed IPC inputs
> https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc
> The Java implementation should also test against these to verify that the 
> correct kind of exception is raised





[jira] [Commented] (ARROW-5926) [Java] Test fuzzer inputs

2020-04-11 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081300#comment-17081300
 ] 

Liya Fan commented on ARROW-5926:
-

[~emkornfi...@gmail.com] Sounds good. I have opened ARROW-8402 to track it. 

> [Java] Test fuzzer inputs
> -
>
> Key: ARROW-5926
> URL: https://issues.apache.org/jira/browse/ARROW-5926
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are developing a fuzzer-based corpus of malformed IPC inputs
> https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc
> The Java implementation should also test against these to verify that the 
> correct kind of exception is raised





[jira] [Created] (ARROW-8402) [Java] Support ValidateFull methods in Java

2020-04-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-8402:
---

 Summary: [Java] Support ValidateFull methods in Java
 Key: ARROW-8402
 URL: https://issues.apache.org/jira/browse/ARROW-8402
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need to support ValidateFull methods in Java, just like we do in C++. 
This is required by ARROW-5926.





[jira] [Assigned] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison

2020-04-09 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan reassigned ARROW-8392:
---

Assignee: Liya Fan

> [Java] Fix overflow related corner cases for vector value comparison
> 
>
> Key: ARROW-8392
> URL: https://issues.apache.org/jira/browse/ARROW-8392
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>
> 1. Fix corner cases related to overflow.
> 2. Provide test cases for the corner cases. 





[jira] [Commented] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison

2020-04-09 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080218#comment-17080218
 ] 

Liya Fan commented on ARROW-8392:
-

Hi Martin Janda, do you want to provide a patch for this issue?

> [Java] Fix overflow related corner cases for vector value comparison
> 
>
> Key: ARROW-8392
> URL: https://issues.apache.org/jira/browse/ARROW-8392
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> 1. Fix corner cases related to overflow.
> 2. Provide test cases for the corner cases. 





[jira] [Created] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison

2020-04-09 Thread Liya Fan (Jira)
Liya Fan created ARROW-8392:
---

 Summary: [Java] Fix overflow related corner cases for vector value 
comparison
 Key: ARROW-8392
 URL: https://issues.apache.org/jira/browse/ARROW-8392
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


1. Fix corner cases related to overflow.
2. Provide test cases for the corner cases. 
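Overflow in value comparison typically stems from subtraction-based compares. The issue does not spell out its exact corner cases, so the following is only a minimal sketch of the general bug and its fix, with illustrative names rather than Arrow's actual code:

```java
public class CompareSketch {
    // Buggy: subtraction overflows when the operands have opposite signs,
    // e.g. Integer.MIN_VALUE - 1 wraps around to Integer.MAX_VALUE.
    static int compareBuggy(int a, int b) {
        return a - b;
    }

    // Safe: branch-based comparison, equivalent to Integer.compare(a, b).
    static int compareSafe(int a, int b) {
        return (a < b) ? -1 : (a == b ? 0 : 1);
    }
}
```

With `a = Integer.MIN_VALUE` and `b = 1`, the buggy comparator reports the smaller value as larger, which is the kind of corner case a test suite for this issue would pin down.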





[jira] [Commented] (ARROW-8299) [C++] Reusable "optional ParallelFor" function for optional use of multithreading

2020-04-07 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077752#comment-17077752
 ] 

Liya Fan commented on ARROW-8299:
-

[~wesm] I notice the assignee is left empty. May I try to provide a patch for 
this issue?

> [C++] Reusable "optional ParallelFor" function for optional use of 
> multithreading
> -
>
> Key: ARROW-8299
> URL: https://issues.apache.org/jira/browse/ARROW-8299
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We often see code like
> {code}
> if (use_threads) {
>   return ::arrow::internal::ParallelFor(n, Func);
> } else {
>   for (size_t i = 0; i < n; ++i) {
>     RETURN_NOT_OK(Func(i));
>   }
>   return Status::OK();
> }
> {code}
> It might be nice to have a helper function to do this. It doesn't even need 
> to be an inline template, it could be a precompiled function accepting 
> {{std::function}}
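The pattern quoted above can indeed be folded into one helper. The issue targets C++, but the shape of the helper is language-independent; a hedged Java analog (illustrative names, not Arrow's API) might look like:

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

public class OptionalParallelFor {
    // Runs func over [0, n), either in parallel or serially, mirroring
    // the "if (use_threads) ParallelFor(...) else plain loop" pattern.
    static void run(boolean useThreads, int n, IntConsumer func) {
        IntStream range = IntStream.range(0, n);
        if (useThreads) {
            range.parallel().forEach(func);
        } else {
            range.forEach(func);
        }
    }
}
```

Call sites then shrink to a single `OptionalParallelFor.run(useThreads, n, func)` instead of repeating the branch everywhere.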





[jira] [Updated] (ARROW-8239) [Java] fix param checks in splitAndTransfer method

2020-03-31 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-8239:

Fix Version/s: 0.17.0

> [Java] fix param checks in splitAndTransfer method
> --
>
> Key: ARROW-8239
> URL: https://issues.apache.org/jira/browse/ARROW-8239
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-8230) [Java] Move Netty memory manager into a separate module

2020-03-30 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071382#comment-17071382
 ] 

Liya Fan commented on ARROW-8230:
-

[~rymurr] Sounds good to me. Please go ahead with your PR. 

> [Java] Move Netty memory manager into a separate module
> ---
>
> Key: ARROW-8230
> URL: https://issues.apache.org/jira/browse/ARROW-8230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Ryan Murray
>Priority: Major
>
> Move Netty memory manager into a separate module such that the basic 
> allocator does not depend on it.





[jira] [Commented] (ARROW-8230) [Java] Move Netty memory manager into a separate module

2020-03-30 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071018#comment-17071018
 ] 

Liya Fan commented on ARROW-8230:
-

[~rymurr] 
I think my understanding is close to yours. 

However, it may not be feasible to remove the netty dependency completely in a 
single PR, as some refactoring work is required before we are ready to remove 
it completely.

So I think in this issue, we need to create the new module, move some classes 
there, mark some as deprecated, and leave the rest as future work. 

> [Java] Move Netty memory manager into a separate module
> ---
>
> Key: ARROW-8230
> URL: https://issues.apache.org/jira/browse/ARROW-8230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Ryan Murray
>Priority: Major
>
> Move Netty memory manager into a separate module such that the basic 
> allocator does not depend on it.





[jira] [Comment Edited] (ARROW-4526) [Java] Remove Netty references from ArrowBuf and move Allocator out of vector package

2020-03-26 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067575#comment-17067575
 ] 

Liya Fan edited comment on ARROW-4526 at 3/26/20, 11:38 AM:


[~rymurr] Thanks for your time.
Of the 3 work items listed above, we have finished item 1 in ARROW-7905 and 
ARROW-7935. 
We are now working on item 3 (ARROW-8229, a PR will be submitted soon).

So if you are interested, maybe you can take item 2. 
I have created an issue (ARROW-8230), and please feel free to assign it to 
yourself, if you are interested :)


was (Author: fan_li_ya):
[~rymurr] Thanks for your time.
Of the 3 work items listed above, we have finished item 1 in ARROW-7905 and 
ARROW-7935. 
We are now working on item 3 (a PR will be submitted soon).

So if you are interested, maybe you can take item 2. 
I have created an issue (ARROW-8230), and please feel free to assign it to 
yourself, if you are interested :)

> [Java] Remove Netty references from ArrowBuf and move Allocator out of vector 
> package
> -
>
> Key: ARROW-4526
> URL: https://issues.apache.org/jira/browse/ARROW-4526
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Jacques Nadeau
>Assignee: Liya Fan
>Priority: Critical
> Fix For: 0.17.0
>
>
> Arrow currently has a hard dependency on Netty and exposes this in public 
> APIs. This shouldn't be the case. There could be many allocator 
> implementations with Netty as one possible option. We should remove hard 
> dependency between arrow-vector and Netty, instead creating a trivial 
> allocator. ArrowBuf should probably expose a <T> T unwrap(Class<T> clazz) 
> method instead to make inner providers available without a hard 
> reference. This should also include drastically reducing the number of 
> methods on ArrowBuf, as right now it includes every method from ByteBuf but 
> many of those are not very useful or appropriate.
> This work should come after we do the simpler ARROW-3191





[jira] [Commented] (ARROW-4526) [Java] Remove Netty references from ArrowBuf and move Allocator out of vector package

2020-03-26 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067575#comment-17067575
 ] 

Liya Fan commented on ARROW-4526:
-

[~rymurr] Thanks for your time.
Of the 3 work items listed above, we have finished item 1 in ARROW-7905 and 
ARROW-7935. 
We are now working on item 3 (a PR will be submitted soon).

So if you are interested, maybe you can take item 2. 
I have created an issue (ARROW-8230), and please feel free to assign it to 
yourself, if you are interested :)

> [Java] Remove Netty references from ArrowBuf and move Allocator out of vector 
> package
> -
>
> Key: ARROW-4526
> URL: https://issues.apache.org/jira/browse/ARROW-4526
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Jacques Nadeau
>Assignee: Liya Fan
>Priority: Critical
> Fix For: 0.17.0
>
>
> Arrow currently has a hard dependency on Netty and exposes this in public 
> APIs. This shouldn't be the case. There could be many allocator 
> implementations with Netty as one possible option. We should remove hard 
> dependency between arrow-vector and Netty, instead creating a trivial 
> allocator. ArrowBuf should probably expose a <T> T unwrap(Class<T> clazz) 
> method instead to make inner providers available without a hard 
> reference. This should also include drastically reducing the number of 
> methods on ArrowBuf, as right now it includes every method from ByteBuf but 
> many of those are not very useful or appropriate.
> This work should come after we do the simpler ARROW-3191





[jira] [Commented] (ARROW-8230) [Java] Move Netty memory manager into a separate module

2020-03-26 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067574#comment-17067574
 ] 

Liya Fan commented on ARROW-8230:
-

[~rymurr] Please feel free to assign it to yourself, if you are interested

> [Java] Move Netty memory manager into a separate module
> ---
>
> Key: ARROW-8230
> URL: https://issues.apache.org/jira/browse/ARROW-8230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> Move Netty memory manager into a separate module such that the basic 
> allocator does not depend on it.





[jira] [Created] (ARROW-8230) [Java] Move Netty memory manager into a separate module

2020-03-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-8230:
---

 Summary: [Java] Move Netty memory manager into a separate module
 Key: ARROW-8230
 URL: https://issues.apache.org/jira/browse/ARROW-8230
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan


Move Netty memory manager into a separate module such that the basic allocator 
does not depend on it.





[jira] [Created] (ARROW-8229) [Java] Move ArrowBuf into the Arrow package

2020-03-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-8229:
---

 Summary: [Java] Move ArrowBuf into the Arrow package
 Key: ARROW-8229
 URL: https://issues.apache.org/jira/browse/ARROW-8229
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


After ARROW-7505 and ARROW-7935 are done, we are ready to move ArrowBuf into 
Arrow's package, and make it independent of the Netty library.





[jira] [Created] (ARROW-8169) [Java] Improve the performance of JDBC adapter by allocating memory proactively

2020-03-19 Thread Liya Fan (Jira)
Liya Fan created ARROW-8169:
---

 Summary: [Java] Improve the performance of JDBC adapter by 
allocating memory proactively
 Key: ARROW-8169
 URL: https://issues.apache.org/jira/browse/ARROW-8169
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current implementation uses {{setSafe}} methods to dynamically allocate 
memory if necessary. For fixed width vectors (which are frequently used in 
JDBC), however, we can allocate memory proactively, since the vector size is 
known as a configuration parameter. So for fixed width vectors, we can use 
{{set}} methods instead.

This change leads to two benefits:
1. When processing each value, we no longer have to check vector capacity and 
reallocate memory if needed. This leads to better performance.
2. If we allow the memory to expand automatically (each time by 2x), the amount 
of memory usually ends up being more than necessary. By allocating memory by 
the configuration parameter, we allocate no more and no less. 
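The trade-off can be illustrated without Arrow's API: a writer that checks capacity and doubles its buffer on every value ({{setSafe}}-style) versus one that sizes the buffer once from the known row count ({{set}}-style). This is a simplified analog with illustrative names, not the adapter's actual code:

```java
import java.util.Arrays;

public class ProactiveAlloc {
    // "setSafe"-style: check capacity and double the buffer on every write.
    static int[] writeWithChecks(int rowCount) {
        int[] buf = new int[1];
        for (int i = 0; i < rowCount; i++) {
            if (i >= buf.length) buf = Arrays.copyOf(buf, buf.length * 2);
            buf[i] = i;
        }
        return buf; // capacity usually ends up larger than rowCount
    }

    // "set"-style: rowCount is known up front, so allocate exactly once.
    static int[] writePreallocated(int rowCount) {
        int[] buf = new int[rowCount];
        for (int i = 0; i < rowCount; i++) buf[i] = i;
        return buf; // capacity == rowCount, no per-value capacity checks
    }
}
```

The doubling strategy both pays a branch per value and overshoots the needed capacity, which is exactly the waste the issue describes avoiding.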

Benchmark results show notable performance improvements:

Before:

Benchmark                               Mode  Cnt    Score   Error  Units
JdbcAdapterBenchmarks.consumeBenchmark  avgt    5  521.700 ± 4.837  us/op

After:

Benchmark                               Mode  Cnt    Score   Error  Units
JdbcAdapterBenchmarks.consumeBenchmark  avgt    5  430.523 ± 9.932  us/op





[jira] [Created] (ARROW-8121) [Java] Enhance code style checking for Java code (add space after commas, semi-colons and type casts)

2020-03-14 Thread Liya Fan (Jira)
Liya Fan created ARROW-8121:
---

 Summary: [Java] Enhance code style checking for Java code (add 
space after commas, semi-colons and type casts)
 Key: ARROW-8121
 URL: https://issues.apache.org/jira/browse/ARROW-8121
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is in response to a discussion in 
https://github.com/apache/arrow/pull/6039#discussion_r375161992

We found the current style checking for Java code is not sufficient. So we want 
to enhance it in a series of "small" steps, in order to avoid having to change 
too many files at once.

In this issue, we add spaces after commas, semi-colons and type casts.





[jira] [Created] (ARROW-8108) [Java] Extract a common interface for dictionary encoders

2020-03-12 Thread Liya Fan (Jira)
Liya Fan created ARROW-8108:
---

 Summary: [Java] Extract a common interface for dictionary encoders
 Key: ARROW-8108
 URL: https://issues.apache.org/jira/browse/ARROW-8108
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In this issue, we extract a common interface from existing dictionary 
encoders. This can be useful in scenarios where the client does not care about 
the encoder implementation. 





[jira] [Updated] (ARROW-8009) [Java] Fix the hash code methods for BitVector

2020-03-05 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-8009:

Summary: [Java] Fix the hash code methods for BitVector  (was: [Java] Fix 
the hash code mehods for BitVector)

> [Java] Fix the hash code methods for BitVector
> --
>
> Key: ARROW-8009
> URL: https://issues.apache.org/jira/browse/ARROW-8009
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>
> The current hash code methods of BitVector are based on implementations in 
> BaseFixedWidthVector, which rely on the type width of the vector. 
> For BitVector, the type width is 0, so the underlying data is not actually 
> used when computing the hash code. That means, the hash code will always be 
> 0, no matter if the underlying data is null or not, and no matter if the 
> underlying bit is 0 or 1. 
> We fix this by overriding the methods in BitVector. 





[jira] [Created] (ARROW-8009) [Java] Fix the hash code mehods for BitVector

2020-03-05 Thread Liya Fan (Jira)
Liya Fan created ARROW-8009:
---

 Summary: [Java] Fix the hash code mehods for BitVector
 Key: ARROW-8009
 URL: https://issues.apache.org/jira/browse/ARROW-8009
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current hash code methods of BitVector are based on implementations in 
BaseFixedWidthVector, which rely on the type width of the vector. 
For BitVector, the type width is 0, so the underlying data is not actually used 
when computing the hash code. That means, the hash code will always be 0, no 
matter if the underlying data is null or not, and no matter if the underlying 
bit is 0 or 1. 

We fix this by overriding the methods in BitVector. 
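The failure mode can be reproduced with a plain-Java analog of a width-based hash, alongside the kind of bit-level override the fix describes (illustrative only, not BitVector's actual code):

```java
public class WidthHash {
    // Generic fixed-width hash: combine typeWidth bytes starting at
    // index * typeWidth. With typeWidth == 0 the loop never runs,
    // so every value hashes to the same constant 0.
    static int hash(byte[] data, int index, int typeWidth) {
        int h = 0;
        for (int i = 0; i < typeWidth; i++) {
            h = 31 * h + data[index * typeWidth + i];
        }
        return h;
    }

    // Bit-level override: read the single bit at the given index,
    // so values 0 and 1 hash to different codes.
    static int hashBit(byte[] data, int index) {
        int bit = (data[index >> 3] >> (index & 7)) & 1;
        return 31 + bit;
    }
}
```

The generic path returns 0 regardless of the stored bit, while the override distinguishes 0 from 1 — which is the behavior the BitVector fix restores.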





[jira] [Commented] (ARROW-7048) [Java] Support for combining multiple vectors under VectorSchemaRoot

2020-02-27 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046623#comment-17046623
 ] 

Liya Fan commented on ARROW-7048:
-

[~yogeshtewari] Sorry for the long wait. We have provided a PR for this issue. 
Would you please take a look, and check if it is what you want?

> [Java] Support for combining multiple vectors under VectorSchemaRoot
> 
>
> Key: ARROW-7048
> URL: https://issues.apache.org/jira/browse/ARROW-7048
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Yogesh Tewari
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
>  
> pyarrow.Table.combine_chunks provides a nice functionality of combining 
> multiple batch records under a single pyarrow.Table.
>  
> I am currently working on a downstream application which reads data from 
> BigQuery. BigQuery storage api supports data output in Arrow format but 
> streams data in many batches of size 1024 or less number of rows.
> It would be really nice to have Arrow Java api provide this functionality 
> under an abstraction like VectorSchemaRoot.
> After getting guidance from [~emkornfi...@gmail.com], I tried to write my own 
> implementation by copying data vector by vector using TransferPair's 
> copyValueSafe
> But, unless I am missing something obvious, it turns out it only copies one 
> value at a time. That means a lot of looping trying copyValueSafe millions of 
> rows from source vector index to target vector index. Ideally I would want to 
> concatenate/link the underlying buffers rather than copying one cell at a 
> time.
>  
> Eg, if I have :
> {code:java}
> List<VectorSchemaRoot> batchList = new ArrayList<>();
> try (ArrowStreamReader reader = new ArrowStreamReader(new 
> ByteArrayInputStream(out.toByteArray()), allocator)) {
> Schema schema = reader.getVectorSchemaRoot().getSchema();
> for (int i = 0; i < 5; i++) {
> // This will be loaded with new values on every call to loadNextBatch
> VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
> reader.loadNextBatch();
> batchList.add(readBatch);
> }
> }
> //VectorSchemaRoot.combineChunks(batchList, newVectorSchemaRoot);{code}
>  
> A method like VectorSchemaRoot.combineChunks(List<VectorSchemaRoot>)?
> I did read the VectorSchemaRoot discussion on 
> https://issues.apache.org/jira/browse/ARROW-6896 and am not sure if its the 
> right thing to use here.
>  
>  
> PS. Feel free to update the title of this feature request with more 
> appropriate wordings.
>  
> Cheers,
> Yogesh
>  
>  





[jira] [Commented] (ARROW-7746) [Java] Support large buffer for Flight

2020-02-27 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046431#comment-17046431
 ] 

Liya Fan commented on ARROW-7746:
-

It seems the PR for ARROW-7610 is already big enough. To make code reviewing 
easier, I have opened ARROW-7955 to track the support for file/stream IPC. 

> [Java] Support large buffer for Flight
> --
>
> Key: ARROW-7746
> URL: https://issues.apache.org/jira/browse/ARROW-7746
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> The motivation is described in 
> https://github.com/apache/arrow/pull/6323#issuecomment-580137629.
> When the size of the ArrowBuf exceeds 2GB, our Flight library does not work 
> due to integer overflow. 
> This is because internally, we have used some data structures which are based 
> on 32-bit integers. To resolve the problem, we must revise/replace the data 
> structures to make them support 64-bit integers. 
> As a concrete example, we can see that when the server sends data through 
> IPC, an org.apache.arrow.flight.ArrowMessage object is created, and is 
> wrapped as an InputStream through the `asInputStream` method. In this method, 
> we use data structures like java.io.ByteArrayOutputStream and 
> io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe 
> that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit 
> integers). 
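The core failure described above is a 64-bit length squeezed through a 32-bit field. A minimal Java illustration (hypothetical helper names, not the Flight code itself):

```java
public class LengthOverflow {
    // Casting a length above 2GB to int wraps around and becomes negative,
    // which is how 32-bit length fields break for large buffers.
    static int toIntLength(long length) {
        return (int) length;
    }

    // Guard a 32-bit API can use before accepting a 64-bit length.
    static boolean fitsInInt(long length) {
        return length >= 0 && length <= Integer.MAX_VALUE;
    }
}
```

A 3GB length cast to int comes out negative, so any size check or buffer index built on it misbehaves — hence the need to move the affected data structures to 64-bit lengths.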





[jira] [Created] (ARROW-7955) [Java] Support large buffer for file/stream IPC

2020-02-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-7955:
---

 Summary: [Java] Support large buffer for file/stream IPC
 Key: ARROW-7955
 URL: https://issues.apache.org/jira/browse/ARROW-7955
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


After supporting 64-bit ArrowBuf, we need to make file/stream IPC work.





[jira] [Updated] (ARROW-7746) [Java] Support large buffer for Flight

2020-02-27 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-7746:

Summary: [Java] Support large buffer for Flight  (was: [Java] Support large 
buffer for IPC)

> [Java] Support large buffer for Flight
> --
>
> Key: ARROW-7746
> URL: https://issues.apache.org/jira/browse/ARROW-7746
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> The motivation is described in 
> https://github.com/apache/arrow/pull/6323#issuecomment-580137629.
> When the size of the ArrowBuf exceeds 2GB, our Flight library does not work 
> due to integer overflow. 
> This is because internally, we have used some data structures which are based 
> on 32-bit integers. To resolve the problem, we must revise/replace the data 
> structures to make them support 64-bit integers. 
> As a concrete example, we can see that when the server sends data through 
> IPC, an org.apache.arrow.flight.ArrowMessage object is created, and is 
> wrapped as an InputStream through the `asInputStream` method. In this method, 
> we use data structures like java.io.ByteArrayOutputStream and 
> io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe 
> that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit 
> integers). 





[jira] [Commented] (ARROW-7746) [Java] Support large buffer for IPC

2020-02-27 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046316#comment-17046316
 ] 

Liya Fan commented on ARROW-7746:
-

[~emkornfi...@gmail.com] Sorry. There must be some misunderstanding here. I 
thought Flight was a necessary part of IPC. 

So I will change the title of this issue, and provide support for the remaining 
IPC issues (e.g. ArrowStreamWriter/ArrowStreamReader) in ARROW-7610.

> [Java] Support large buffer for IPC
> ---
>
> Key: ARROW-7746
> URL: https://issues.apache.org/jira/browse/ARROW-7746
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Liya Fan
>Priority: Major
>
> The motivation is described in 
> https://github.com/apache/arrow/pull/6323#issuecomment-580137629.
> When the size of the ArrowBuf exceeds 2GB, our Flight library does not work 
> due to integer overflow. 
> This is because internally, we have used some data structures which are based 
> on 32-bit integers. To resolve the problem, we must revise/replace the data 
> structures to make them support 64-bit integers. 
> As a concrete example, we can see that when the server sends data through 
> IPC, an org.apache.arrow.flight.ArrowMessage object is created, and is 
> wrapped as an InputStream through the `asInputStream` method. In this method, 
> we use data structures like java.io.ByteArrayOutputStream and 
> io.netty.buffer.ByteBuf, which are based on 32-bit integers (we can observe 
> that NettyArrowBuf#length and ByteArrayOutputStream#count are both 32-bit 
> integers). 





[jira] [Updated] (ARROW-7935) [Java] Remove Netty dependency for BufferAllocator and ReferenceManager

2020-02-25 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-7935:

Description: With previous work (ARROW-7329 and ARROW-7505), Netty based 
allocation is only one of the possible implementations. So we need to revise 
BufferAllocator and ReferenceManager, to make them general, and independent of 
Netty libraries.  (was: With previous work (ARROW-7329 and ARROW-7505), Netty 
based allocation is only one of the possible implementations. So we need to 
revise BufferAllocator and ReferenceManager, to make them general, and 
independent Netty libraries.)

> [Java] Remove Netty dependency for BufferAllocator and ReferenceManager
> ---
>
> Key: ARROW-7935
> URL: https://issues.apache.org/jira/browse/ARROW-7935
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>
> With previous work (ARROW-7329 and ARROW-7505), Netty based allocation is 
> only one of the possible implementations. So we need to revise 
> BufferAllocator and ReferenceManager, to make them general, and independent 
> of Netty libraries.





[jira] [Resolved] (ARROW-7505) [Java] Remove Netty dependency for ArrowBuf

2020-02-25 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan resolved ARROW-7505.
-
Resolution: Fixed

Fixed by https://github.com/apache/arrow/pull/6131

> [Java] Remove Netty dependency for ArrowBuf
> ---
>
> Key: ARROW-7505
> URL: https://issues.apache.org/jira/browse/ARROW-7505
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This is part of the first step of issue ARROW-4526. 
> In this step, we remove netty dependency for ArrowBuf, BufferAllocator and 
> ReferenceManager. 
> In this issue, we remove the dependency for ArrowBuf. 
> The task for BufferAllocator and ReferenceManager will not start until 
> ARROW-7329 is finished.





[jira] [Created] (ARROW-7935) [Java] Remove Netty dependency for BufferAllocator and ReferenceManager

2020-02-25 Thread Liya Fan (Jira)
Liya Fan created ARROW-7935:
---

 Summary: [Java] Remove Netty dependency for BufferAllocator and 
ReferenceManager
 Key: ARROW-7935
 URL: https://issues.apache.org/jira/browse/ARROW-7935
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


With previous work (ARROW-7329 and ARROW-7505), Netty based allocation is only 
one of the possible implementations. So we need to revise BufferAllocator and 
ReferenceManager, to make them general and independent of Netty libraries.





[jira] [Commented] (ARROW-7837) [Java] bug in BaseVariableWidthVector.copyFromSafe results with an index out of bounds exception

2020-02-17 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038355#comment-17038355
 ] 

Liya Fan commented on ARROW-7837:
-

> btw, what do you mean by non-consecutive slots? in our use-case the code 
> populating the vector simply skipped null slots instead of explicitly setting 
> them to nulls (probably relying on fillHoles) - is this what you mean?

Yes, this is what I mean.

> the handleSafe method is used by several other 'safe' methods so I think it 
> should be fixed (not copySafe which is just one of the call sites).

Maybe we should prepare a unit test for each case, and discuss it case by case.

> will you accept a pull request with a fix?

Sure. A PR with a fix would be great. 



> [Java] bug in BaseVariableWidthVector.copyFromSafe results with an index out 
> of bounds exception
> 
>
> Key: ARROW-7837
> URL: https://issues.apache.org/jira/browse/ARROW-7837
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.15.0
>Reporter: Eyal Farago
>Priority: Major
>
> There's a subtle bug in the copySafe method of BaseVariableWidthVector that 
> results with an index out of bounds exception.
> The issue is somewhere between the safeCopy and handleSafe methods,
> copySafe calls handleSafe in order to assure underlying buffers capacity 
> before appending a value to the vector; however, the handleSafe method falsely 
> assumes all 'holes' have been filled when checking the next write offset. As a 
> result it reads a stale offset (I believe it's 0 for freshly allocated 
> buffers but may be un-guaranteed when reusing a buffer) and fails to identify 
> the need to resize the values buffer.
>  
> the following (scala) test demonstrates the issue (by artificially shrinking 
> the values buffer). it was written after we've hit this in production:
> {code:java}
> test("try to reproduce Arrow issue"){
> val charVector = new VarCharVector("stam", Allocator.get)
> val srcCharVector = new VarCharVector("src", Allocator.get)
> srcCharVector.setSafe(0, Array.tabulate(20)(_.toByte))
> srcCharVector.setValueCount(2)
> for( i <- 0 until 4){
>   charVector.copyFromSafe(0, i, srcCharVector)
>   charVector.setValueCount(i + 1)
> }
> val valBuff = charVector.getBuffers(false)(2)
> valBuff.capacity(90)
> charVector.copyFromSafe(0, 14, srcCharVector)
> srcCharVector.close()
> charVector.close()
>   }
> {code}
>  this test fails with the following exception:
>  
> {code:java}
> index: 80, length: 20 (expected: range(0, 90))
> java.lang.IndexOutOfBoundsException: index: 80, length: 20 (expected: 
> range(0, 90))
>   at io.netty.buffer.ArrowBuf.getBytes(ArrowBuf.java:929)
>   at 
> org.apache.arrow.vector.BaseVariableWidthVector.copyFromSafe(BaseVariableWidthVector.java:1345)
>   at 
> com.datorama.pluto.arrow.ArroStreamSerializationTest.$anonfun$new$33(ArroStreamSerializationTest.scala:454)
>   at 
> com.datorama.pluto.arrow.ArroStreamSerializationTest$$Lambda$129.F78CFE20.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
>   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$Lambda$367.001B9220.apply(Unknown Source)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> com.datorama.pluto.arrow.ArroStreamSerializationTest.org$scalatest$BeforeAndAfterEachTestData$$super$runTest(ArroStreamSerializationTest.scala:32)
>   at 
> org.scalatest.BeforeAndAfterEachTestData.runTest(BeforeAndAfterEachTestData.scala:194)
>   at 
> org.scalatest.BeforeAndAfterEachTestData.runTest$(BeforeAndAfterEachTestData.scala:187)
>   at 
> com.datorama.pluto.arrow.ArroStreamSerializationTest.runTest(ArroStreamSerializationTest.scala:32)
>   at 
> 

[jira] [Commented] (ARROW-7837) [Java] bug in BaseVariableWidthVector.copyFromSafe results with an index out of bounds exception

2020-02-17 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038334#comment-17038334
 ] 

Liya Fan commented on ARROW-7837:
-

[~eyalfa] Thanks a lot for reporting this issue. I think it is a real bug, 
which can be triggered when the target vector is not written in consecutive 
slots. 

It should be resolved by calling the {{fillHoles}} method in the {{copySafe}} 
method. 
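The idea behind the {{fillHoles}} fix can be sketched as follows. This is an illustrative standalone sketch, not Arrow's actual implementation; the array layout and method signature are assumptions made for the example:

```java
// Illustrative sketch of the "fill holes" idea (not Arrow's actual code).
// A variable-width vector keeps an offsets array; slot i's data lives at
// [offsets[i], offsets[i+1]). If slots lastSet+1 .. index-1 were skipped,
// their offsets are stale (e.g. 0 in a fresh buffer), so a capacity check
// based on offsets[index] underestimates the space needed.
public class FillHolesSketch {
    static void fillHoles(int[] offsets, int lastSet, int index) {
        // Propagate the last valid end offset into every skipped slot so
        // that offsets[index] reflects the true write position.
        int lastEnd = offsets[lastSet + 1];
        for (int i = lastSet + 2; i <= index; i++) {
            offsets[i] = lastEnd;
        }
    }

    public static void main(String[] args) {
        int[] offsets = new int[8];          // freshly allocated: all zeros
        offsets[1] = 20;                      // slot 0 holds 20 bytes
        int lastSet = 0, writeIndex = 4;      // slots 1..3 were skipped
        fillHoles(offsets, lastSet, writeIndex);
        // The capacity check at writeIndex now sees the real start offset.
        System.out.println(offsets[writeIndex]); // prints 20, not a stale 0
    }
}
```

Calling such a method from copySafe before the capacity check would make handleSafe see the true next write offset instead of a stale one.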

> [Java] bug in BaseVariableWidthVector.copyFromSafe results with an index out 
> of bounds exception
> 
>
> Key: ARROW-7837
> URL: https://issues.apache.org/jira/browse/ARROW-7837
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.15.0
>Reporter: Eyal Farago
>Priority: Major
>
> There's a subtle bug in the copySafe method of BaseVariableWidthVector that 
> results in an IndexOutOfBoundsException.
> The issue lies in the interplay between the copySafe and handleSafe methods:
> copySafe calls handleSafe to ensure the underlying buffers have enough 
> capacity before appending a value to the vector. However, handleSafe falsely 
> assumes all 'holes' have been filled when checking the next write offset. As 
> a result, it reads a stale offset (I believe it's 0 for freshly allocated 
> buffers, but this may not be guaranteed when reusing a buffer) and fails to 
> identify the need to resize the values buffer.
>  
> The following (Scala) test demonstrates the issue (by artificially shrinking 
> the values buffer). It was written after we hit this in production:
> {code:java}
> test("try to reproduce Arrow issue"){
> val charVector = new VarCharVector("stam", Allocator.get)
> val srcCharVector = new VarCharVector("src", Allocator.get)
> srcCharVector.setSafe(0, Array.tabulate(20)(_.toByte))
> srcCharVector.setValueCount(2)
> for( i <- 0 until 4){
>   charVector.copyFromSafe(0, i, srcCharVector)
>   charVector.setValueCount(i + 1)
> }
> val valBuff = charVector.getBuffers(false)(2)
> valBuff.capacity(90)
> charVector.copyFromSafe(0, 14, srcCharVector)
> srcCharVector.close()
> charVector.close()
>   }
> {code}
>  this test fails with the following exception:
>  
> {code:java}
> index: 80, length: 20 (expected: range(0, 90))
> java.lang.IndexOutOfBoundsException: index: 80, length: 20 (expected: 
> range(0, 90))
>   at io.netty.buffer.ArrowBuf.getBytes(ArrowBuf.java:929)
>   at 
> org.apache.arrow.vector.BaseVariableWidthVector.copyFromSafe(BaseVariableWidthVector.java:1345)
>   at 
> com.datorama.pluto.arrow.ArroStreamSerializationTest.$anonfun$new$33(ArroStreamSerializationTest.scala:454)
>   at 
> com.datorama.pluto.arrow.ArroStreamSerializationTest$$Lambda$129.F78CFE20.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
>   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$Lambda$367.001B9220.apply(Unknown Source)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> com.datorama.pluto.arrow.ArroStreamSerializationTest.org$scalatest$BeforeAndAfterEachTestData$$super$runTest(ArroStreamSerializationTest.scala:32)
>   at 
> org.scalatest.BeforeAndAfterEachTestData.runTest(BeforeAndAfterEachTestData.scala:194)
>   at 
> org.scalatest.BeforeAndAfterEachTestData.runTest$(BeforeAndAfterEachTestData.scala:187)
>   at 
> com.datorama.pluto.arrow.ArroStreamSerializationTest.runTest(ArroStreamSerializationTest.scala:32)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$Lambda$358.1AAC0020.apply(Unknown Source)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at org.scalatest.SuperEngine$$Lambda$359.1AAC0820.apply(Unknown 
> Source)
>  

[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1

2020-02-16 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038112#comment-17038112
 ] 

Liya Fan commented on ARROW-4890:
-

Sure. [~emkornfi...@gmail.com] is right.
After [~emkornfi...@gmail.com] has finished the implementation of the 64-bit 
buffer, we have a few follow-up work items to do before we can claim that the 
2GB restriction is removed:

1. In ARROW-7610, we apply the 64-bit buffer to vector implementations and add 
integration tests. This work item is ongoing.
2. In ARROW-6111, we provide new vectors to support 64-bit buffers, as the 
current ones have an offset width of 4 bytes. This work item is ongoing. (We 
would appreciate it if anyone could provide some feedback/review comments for 
the PRs for 1 and 2.)
3. We need another work item to make everything work in IPC scenarios. This is 
tracked by ARROW-7746 and has not started yet. (We would appreciate it if 
anyone could provide a solution to this issue; otherwise, I will try to 
provide one later.)



> [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
> -
>
> Key: ARROW-4890
> URL: https://issues.apache.org/jira/browse/ARROW-4890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Cloudera cdh5.13.3
> Cloudera Spark 2.3.0.cloudera3
>Reporter: Abdeali Kothari
>Priority: Major
> Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png
>
>
> Creating this in the Arrow project, as the traceback seems to suggest this 
> is an issue in Arrow.
>  This is a continuation of the conversation on the mailing list: 
> https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E
> When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
> {noformat}
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 279, in load_stream
> for batch in reader:
>   File "pyarrow/ipc.pxi", line 265, in __iter__
>   File "pyarrow/ipc.pxi", line 281, in 
> pyarrow.lib._RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
> {noformat}
> as the size of the dataset I want to group on increases. Here is a 
> reproducible code snippet.
>  Note: My actual dataset is much larger, has many more unique IDs, and is a 
> valid use case where I cannot simplify this groupby in any way. I have 
> stripped out all the logic to make this example as simple as I could.
> {code:java}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> pdf1 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
>   columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in 
> range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", 
> "abcdefghijklmno"]],
>   columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in 
> range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
> return df
> df4 = df3
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> # df5.write.parquet('/tmp/temp.parquet', mode='overwrite')
> {code}
> I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per 
> executor too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5929) [Java] Define API for ExtensionVector whose data must be serialized prior to being sent via IPC

2020-02-12 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-5929.
---
Resolution: Later

> [Java] Define API for ExtensionVector whose data must be serialized prior to 
> being sent via IPC
> ---
>
> Key: ARROW-5929
> URL: https://issues.apache.org/jira/browse/ARROW-5929
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Wes McKinney
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> As discussed on the mailing list, a possible use case for 
> ExtensionVector involves having the Arrow buffers contain pointer-type values 
> referring to memory outside of the Arrow memory heap. In IPC, such vectors 
> would need to be serialized to a wholly Arrow-resident form, such as a 
> VarBinaryVector. We do not have an API to allow for this, so this JIRA 
> proposes to add new functions that can indicate to the IPC layer that an 
> ExtensionVector requires additional serialization to a native Arrow type (in 
> such a case, the extension type metadata would be discarded).





[jira] [Closed] (ARROW-6420) [Java] Improve the performance of UnionVector when getting underlying vectors

2020-02-12 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-6420.
---
Resolution: Later

> [Java] Improve the performance of UnionVector when getting underlying vectors
> -
>
> Key: ARROW-6420
> URL: https://issues.apache.org/jira/browse/ARROW-6420
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Getting the underlying vector is a frequent operation for UnionVector, which 
> relies on this operation to get/set data at each index.
> The current implementation is inefficient. In particular, it first gets the 
> minor type at the given index, and then compares it against all possible 
> minor types in a switch statement until a match is found.
> We improve the performance by storing the internal vectors in an array whose 
> index is the ordinal of the minor type. Given a minor type, its 
> corresponding underlying vector can then be obtained in O(1) time.
> It should be noted that this technique is also applicable to UnionReader and 
> UnionWriter, and support for UnionReader is already implemented.
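The optimization described above can be sketched as follows. The class and method names are illustrative placeholders, not Arrow's actual API; the point is replacing a per-call switch over minor types with an array indexed by the minor type's ordinal:

```java
// Sketch of O(1) child-vector lookup by minor-type ordinal (illustrative
// names, not Arrow's actual API). A switch over all minor types is O(k)
// in the number of types; indexing an array by ordinal is O(1).
public class ChildVectorLookup {
    enum MinorType { INT, BIGINT, VARCHAR, FLOAT8 }

    // One slot per minor type; populated when a child vector is added.
    private final Object[] childrenByOrdinal =
        new Object[MinorType.values().length];

    void setChild(MinorType type, Object vector) {
        childrenByOrdinal[type.ordinal()] = vector;
    }

    // O(1): no switch statement, just an array index.
    Object getChild(MinorType type) {
        return childrenByOrdinal[type.ordinal()];
    }

    public static void main(String[] args) {
        ChildVectorLookup lookup = new ChildVectorLookup();
        lookup.setChild(MinorType.INT, "intVector");
        System.out.println(lookup.getChild(MinorType.INT)); // prints intVector
    }
}
```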





[jira] [Closed] (ARROW-6374) [Java] Refactor the code for TimeXXVectors

2020-02-12 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-6374.
---
Resolution: Won't Fix

> [Java] Refactor the code for TimeXXVectors
> --
>
> Key: ARROW-6374
> URL: https://issues.apache.org/jira/browse/ARROW-6374
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This is based on the discussion in 
> https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E
>  
> The internals of TimeXXVectors are simply IntVector or BigIntVector, so 
> there is duplicated code for setting/getting int/long values.
>  
> We want to refactor the code by:
>  # pushing the get/set methods into the base class BaseFixedWidthVector, and 
> making them protected.
>  # having the APIs in TimeXXVectors reference the methods in the base class.
>  
> Note that this change not only reduces redundant code; it also centralizes 
> the logic for getting/setting int/long, making it easier to maintain and 
> change.
>  
> If it looks good, later we will make other integer-based vectors rely on the 
> base class implementations.
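The refactoring described above can be sketched like this. The class names are illustrative stand-ins, not Arrow's actual hierarchy; the point is that the shared int accessors live once in the base class, and each time vector delegates to them:

```java
// Sketch of centralizing get/set logic in a base class (illustrative
// names, not Arrow's actual classes).
abstract class BaseFixedWidthSketch {
    protected final int[] buffer = new int[16];

    // Shared int accessors, protected so only subclasses can use them.
    protected int getInt(int index) { return buffer[index]; }
    protected void setInt(int index, int value) { buffer[index] = value; }
}

class TimeSecSketch extends BaseFixedWidthSketch {
    public int get(int index) { return getInt(index); }            // delegates
    public void set(int index, int value) { setInt(index, value); }

    public static void main(String[] args) {
        TimeSecSketch v = new TimeSecSketch();
        v.set(0, 42);
        System.out.println(v.get(0)); // prints 42
    }
}

class TimeMilliSketch extends BaseFixedWidthSketch {
    public int get(int index) { return getInt(index); }            // same logic,
    public void set(int index, int value) { setInt(index, value); } // no copy
}
```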





[jira] [Closed] (ARROW-6307) [Java] Provide RLE vector

2020-02-12 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-6307.
---
Resolution: Later

> [Java] Provide RLE vector
> -
>
> Key: ARROW-6307
> URL: https://issues.apache.org/jira/browse/ARROW-6307
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> RLE (run length encoding) is a widely used encoding/decoding technique. 
> Compared with other encoding/decoding techniques, it is easier to work with 
> the encoded data.
>   
>  We want to provide an RLE vector implementation in Arrow. The design 
> details include:
>   
>  1. RleVector implements ValueVector.
>  2. The data structure of RleVector includes an inner vector, plus a buffer 
> storing the end indices of the runs.
>  3. We provide random access with time complexity O(log(n)), so it should 
> not be used frequently.
>  4. In the future, we will provide iterators to access the vector in 
> sequence.
>  5. RleVector does not support updates, but it supports appending.
>  6. In the future, we will provide an encoder/decoder to efficiently 
> transform between encoded and decoded vectors.
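The O(log(n)) random access in point 3 works by binary-searching the buffer of run end indices. A minimal sketch, with primitive arrays standing in for the inner vector and run-ends buffer (not Arrow's actual implementation):

```java
import java.util.Arrays;

// Sketch of O(log n) random access for an RLE vector: runEnds[i] is the
// exclusive end index of run i, and values[i] is that run's value, so the
// logical vector {7,7,7,9,9,4} is runEnds={3,5,6}, values={7,9,4}.
public class RleLookupSketch {
    static int valueAt(int[] runEnds, int[] values, int index) {
        // Find the first run whose (exclusive) end index exceeds `index`.
        int pos = Arrays.binarySearch(runEnds, index + 1);
        if (pos < 0) pos = -pos - 1; // insertion point on an inexact match
        return values[pos];
    }

    public static void main(String[] args) {
        int[] runEnds = {3, 5, 6};
        int[] values  = {7, 9, 4};
        System.out.println(valueAt(runEnds, values, 4)); // prints 9
    }
}
```

This is why sequential iteration (point 4) is preferred for bulk scans: an iterator can walk the runs once instead of paying the binary search per element.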





[jira] [Closed] (ARROW-5209) [Java] Add performance benchmarks from SQL workloads

2020-02-12 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-5209.
---
Resolution: Won't Fix

> [Java] Add performance benchmarks from SQL workloads
> 
>
> Key: ARROW-5209
> URL: https://issues.apache.org/jira/browse/ARROW-5209
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> To improve the performance of Arrow implementations, some performance 
> benchmarks must be set up first.
> In this issue, we want to provide some performance benchmarks extracted from 
> our SQL engine, which is going to be made open source soon. The workloads 
> were obtained by running the open SQL benchmark TPC-H.





[jira] [Closed] (ARROW-6245) [Java] Provide an interface for numeric vectors

2020-02-12 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-6245.
---
Resolution: Won't Fix

> [Java] Provide an interface for numeric vectors
> ---
>
> Key: ARROW-6245
> URL: https://issues.apache.org/jira/browse/ARROW-6245
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We want to provide an interface for all vectors with numeric types (small 
> int, float4, float8, etc.). This interface will make many operations on a 
> vector convenient, such as average, sum, and variance. With this interface, 
> the client code will be greatly simplified, with many branches/switches 
> removed.
>  
> The design is similar to BaseIntVector (the interface for all integer 
> vectors). We provide 3 methods for setting & getting numeric values:
>  setWithPossibleRounding
>  setSafeWithPossibleRounding
>  getValueAsDouble
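A sketch of what such an interface could look like. The three method names are taken from the description above; the interface name, the toy float4-style implementation, and everything else are illustrative assumptions, not Arrow's actual API:

```java
// Sketch of a numeric-vector interface (method names from the proposal
// above; all other names are illustrative).
interface NumericVectorSketch {
    void setWithPossibleRounding(int index, double value);
    void setSafeWithPossibleRounding(int index, double value);
    double getValueAsDouble(int index);
}

// Toy float4-style implementation: storing a double into a float slot may
// round, which is exactly what the method names advertise.
class Float4Sketch implements NumericVectorSketch {
    private final float[] data = new float[16];

    public void setWithPossibleRounding(int index, double value) {
        data[index] = (float) value; // narrowing cast may lose precision
    }

    public void setSafeWithPossibleRounding(int index, double value) {
        // A real "safe" variant would also grow the buffer; elided here.
        setWithPossibleRounding(index, value);
    }

    public double getValueAsDouble(int index) {
        return data[index];
    }

    public static void main(String[] args) {
        // Client code works through the interface: no per-type switch.
        NumericVectorSketch v = new Float4Sketch();
        v.setWithPossibleRounding(0, 1.5);
        System.out.println(v.getValueAsDouble(0)); // prints 1.5
    }
}
```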





[jira] [Commented] (ARROW-7808) [Java][Dataset] Implement Datasets Java API

2020-02-09 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033341#comment-17033341
 ] 

Liya Fan commented on ARROW-7808:
-

Thanks for opening this issue.
Personally, I would like to see these features added to the Java code base,
but I am not sure whether the community is willing to support them.

> [Java][Dataset] Implement Datasets Java API 
> 
>
> Key: ARROW-7808
> URL: https://issues.apache.org/jira/browse/ARROW-7808
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset, Java
>Reporter: Hongze Zhang
>Priority: Major
>  Labels: dataset
>
> Porting following C++ Datasets APIs to Java: 
> * DataSource 
> * DataSourceDiscovery 
> * DataFragment 
> * Dataset
> * Scanner 
> * ScanTask 
> * ScanOptions 





[jira] [Created] (ARROW-7746) [Java] Support large buffer for IPC

2020-02-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-7746:
---

 Summary: [Java] Support large buffer for IPC
 Key: ARROW-7746
 URL: https://issues.apache.org/jira/browse/ARROW-7746
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Liya Fan


The motivation is described in 
https://github.com/apache/arrow/pull/6323#issuecomment-580137629.

When the size of an ArrowBuf exceeds 2GB, our Flight library does not work, 
due to integer overflow.

This is because, internally, we use some data structures that are based 
on 32-bit integers. To resolve the problem, we must revise/replace these data 
structures to make them support 64-bit integers.

As a concrete example, when the server sends data through IPC, an 
org.apache.arrow.flight.ArrowMessage object is created and wrapped as an 
InputStream through the `asInputStream` method. In this method, we use data 
structures like java.io.ByteArrayOutputStream and io.netty.buffer.ByteBuf, 
which are based on 32-bit integers (note that NettyArrowBuf#length and 
ByteArrayOutputStream#count are both 32-bit integers).
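The overflow itself is easy to demonstrate in isolation. A minimal sketch (plain Java, independent of Arrow) of what happens when a buffer length beyond 2GB is squeezed into a 32-bit field:

```java
// Minimal illustration of the overflow described above: a length tracked
// as a 32-bit int wraps negative once the buffer passes Integer.MAX_VALUE
// (2GB - 1 byte), while a 64-bit long holds it correctly.
public class LengthOverflowSketch {
    public static void main(String[] args) {
        long bufferSize = 3L * 1024 * 1024 * 1024; // a hypothetical 3GB buffer
        int length32 = (int) bufferSize;           // what a 32-bit field sees
        System.out.println(length32 < 0);          // true: length went negative
        System.out.println(bufferSize);            // correct as a 64-bit long
    }
}
```

This is why the fix is not a single cast but a revision of every structure on the path (stream wrappers, length fields) to use 64-bit lengths.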





[jira] [Commented] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2020-02-02 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028700#comment-17028700
 ] 

Liya Fan commented on ARROW-6111:
-

As a follow up for ARROW-7610, we need these vectors for cases when the buffer 
size exceeds 2GB. 

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Liya Fan
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Assigned] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2020-02-02 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan reassigned ARROW-6111:
---

Assignee: Liya Fan

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Liya Fan
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Closed] (ARROW-6566) Implement VarChar in Scala

2020-01-30 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-6566.
---
Resolution: Cannot Reproduce

> Implement VarChar in Scala
> --
>
> Key: ARROW-6566
> URL: https://issues.apache.org/jira/browse/ARROW-6566
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Boris V.Kuznetsov
>Priority: Major
>
> Hello,
> I'm trying to write and read a zio.Chunk of strings, which is essentially an 
> array of strings.
> My implementation fails the test; how should I fix it?
> [Writer|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/main/scala/zio/serdes/arrow/ArrowUtils.scala#L48]
>  code
> [Reader|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/main/scala/zio/serdes/arrow/ArrowUtils.scala#L108]
>  code
> [Test|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/test/scala/arrow/Base.scala#L115]
>  code
> Any help, links and advice are highly appreciated
> Thank you!




