[jira] [Created] (ARROW-8870) [Java] Make sure Netty Allocator has correct behavior with empty ArrowBuf

2020-05-20 Thread Ji Liu (Jira)
Ji Liu created ARROW-8870:
-

 Summary: [Java] Make sure Netty Allocator has correct behavior 
with empty ArrowBuf
 Key: ARROW-8870
 URL: https://issues.apache.org/jira/browse/ARROW-8870
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Add a test that ensures the Netty Allocator returns a byte buffer that behaves 
as empty when a user allocates a zero-byte buffer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8305) [Java] Add ExtensionType support for visitor API

2020-04-01 Thread Ji Liu (Jira)
Ji Liu created ARROW-8305:
-

 Summary: [Java] Add ExtensionType support for visitor API
 Key: ARROW-8305
 URL: https://issues.apache.org/jira/browse/ARROW-8305
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


We introduced a visitor API for comparing vectors, ranges, and types in 
ARROW-6211, but it does not support {{ExtensionType}} yet.





[jira] [Created] (ARROW-8171) Consider pre-allocating memory for fix-width vector in Avro adapter iterator

2020-03-20 Thread Ji Liu (Jira)
Ji Liu created ARROW-8171:
-

 Summary: Consider pre-allocating memory for fix-width vector in 
Avro adapter iterator
 Key: ARROW-8171
 URL: https://issues.apache.org/jira/browse/ARROW-8171
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu








[jira] [Created] (ARROW-8020) [Java] Implement vector validate functionality

2020-03-06 Thread Ji Liu (Jira)
Ji Liu created ARROW-8020:
-

 Summary: [Java] Implement vector validate functionality 
 Key: ARROW-8020
 URL: https://issues.apache.org/jira/browse/ARROW-8020
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The C++ implementation already has array validation functionality, but the Java 
implementation has nothing similar.

This issue is about implementing this functionality in Java.





[jira] [Created] (ARROW-8019) [Java] Implement vector diff functionality

2020-03-05 Thread Ji Liu (Jira)
Ji Liu created ARROW-8019:
-

 Summary: [Java] Implement vector diff functionality 
 Key: ARROW-8019
 URL: https://issues.apache.org/jira/browse/ARROW-8019
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The C++ implementation already has array diff functionality, used for equality 
checks and testing, which makes it easy to see differences between Arrays and 
reduces debugging time. It would be good to have something similar in Java for 
better testing facilities.





[jira] [Created] (ARROW-7713) [Java] TestLeak was put at the wrong location

2020-01-28 Thread Ji Liu (Jira)
Ji Liu created ARROW-7713:
-

 Summary: [Java] TestLeak was put at the wrong location
 Key: ARROW-7713
 URL: https://issues.apache.org/jira/browse/ARROW-7713
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


It seems {{TestLeak.java}} was put in the wrong place; we should move it into 
{{flight-core}}.





[jira] [Created] (ARROW-7546) [Java] Use new implementation to concat vectors values in batch

2020-01-10 Thread Ji Liu (Jira)
Ji Liu created ARROW-7546:
-

 Summary: [Java] Use new implementation to concat vectors values in 
batch
 Key: ARROW-7546
 URL: https://issues.apache.org/jira/browse/ARROW-7546
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per discussion https://github.com/apache/arrow/pull/5945#discussion_r365108806.

In ARROW-7284, we wrote a simple method to concatenate vectors. However, 
ARROW-7073 is about concatenating vector values efficiently. After that PR is 
merged, we should use the new implementation in {{ArrowReader}}.





[jira] [Created] (ARROW-7539) [Java] FieldVector getFieldBuffers API should not set reader/writer indices

2020-01-09 Thread Ji Liu (Jira)
Ji Liu created ARROW-7539:
-

 Summary: [Java] FieldVector getFieldBuffers API should not set 
reader/writer indices
 Key: ARROW-7539
 URL: https://issues.apache.org/jira/browse/ARROW-7539
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per discussion 
[https://github.com/apache/arrow/pull/6133#discussion_r364906302].

Setting reader/writer indices in {{getFieldBuffers}} is wrong. 
To clarify, {{getFieldBuffers}} is distinct from {{getBuffers}}. The former 
is for getting access to the underlying data for higher-performance 
algorithms. The latter is for sending the data over the wire. It seems we have 
mixed up the two.

Currently, {{VectorUnloader}} uses {{getFieldBuffers}} to create an 
{{ArrowRecordBatch}}, which is why {{getFieldBuffers}} sets the writer/reader 
indices. We should use {{getBuffers}} there instead.





[jira] [Created] (ARROW-7490) [Java] Avro converter should convert attributes and props to FieldType metadata

2020-01-01 Thread Ji Liu (Jira)
Ji Liu created ARROW-7490:
-

 Summary: [Java] Avro converter should convert attributes and props 
to FieldType metadata
 Key: ARROW-7490
 URL: https://issues.apache.org/jira/browse/ARROW-7490
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, the Avro converter uses some attributes, such as “name” and “size”, 
when creating vectors; the others are discarded.

Named types like Record, Enum and Fixed may have attributes like “doc” and 
“aliases”, which should be kept in metadata for potential further use.

In addition, properties are not converted properly in some cases.





[jira] [Created] (ARROW-7472) [Java] Fix some incorrect behavior in UnionListWriter

2019-12-24 Thread Ji Liu (Jira)
Ji Liu created ARROW-7472:
-

 Summary: [Java] Fix some incorrect behavior in UnionListWriter
 Key: ARROW-7472
 URL: https://issues.apache.org/jira/browse/ARROW-7472
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the {{getField}} and {{close}} APIs of {{UnionListWriter}} and 
{{UnionFixedSizeListWriter}} seem incorrect.





[jira] [Created] (ARROW-7467) [Java] ComplexCopier does incorrect copy for Map nullable info

2019-12-23 Thread Ji Liu (Jira)
Ji Liu created ARROW-7467:
-

 Summary: [Java] ComplexCopier does incorrect copy for Map nullable 
info
 Key: ARROW-7467
 URL: https://issues.apache.org/jira/browse/ARROW-7467
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The {{MapVector}} and its 'value' vector are nullable, and its {{structVector}} 
and 'key' vector are non-nullable.

However, the {{MapVector}} generated by {{ComplexCopier}} marks all fields as 
nullable, which is not correct.





[jira] [Created] (ARROW-7425) [Java] PromotableWriter support writing FixedSizeList type data

2019-12-18 Thread Ji Liu (Jira)
Ji Liu created ARROW-7425:
-

 Summary: [Java] PromotableWriter support writing FixedSizeList 
type data
 Key: ARROW-7425
 URL: https://issues.apache.org/jira/browse/ARROW-7425
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


We have introduced writer API for {{FixedSizeListVector}} via ARROW-6079, but 
{{PromotableWriter}}’s support for it is incomplete.

For example, using {{UnionListWriter}} we could simply write {{List}} 
type data, but for {{List}} or {{FixedSizeList}} 
it doesn’t work.

This issue is about enhancing the {{PromotableWriter}} support for the 
{{FixedSizeList}} type and adding tests to verify the cases mentioned above.





[jira] [Created] (ARROW-7406) [Java] NonNullableStructVector#hashCode should pass hasher to child vectors

2019-12-16 Thread Ji Liu (Jira)
Ji Liu created ARROW-7406:
-

 Summary: [Java] NonNullableStructVector#hashCode should pass 
hasher to child vectors
 Key: ARROW-7406
 URL: https://issues.apache.org/jira/browse/ARROW-7406
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


ARROW-6866 introduced a bug that makes the hasher parameter of 
hashCode(int index, {{ArrowBufHasher}} hasher) useless: the child vectors 
calculate their hashCode using the default hasher, which is not correct. 

This issue should be fixed by passing the hasher to the child vectors when 
calculating hashCode.
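
The fix can be sketched with simplified stand-ins rather than Arrow's real classes. {{Hasher}}, {{ChildVector}} and {{StructLike}} below are hypothetical names invented for illustration; the only point is that the parent forwards the caller's hasher to each child instead of letting the children fall back on a default.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a pluggable hasher (not Arrow's ArrowBufHasher).
interface Hasher {
    int hash(int value);
}

// Hypothetical child vector holding int values.
final class ChildVector {
    private final int[] values;

    ChildVector(int... values) {
        this.values = values;
    }

    // Hash one element with the hasher chosen by the caller.
    int hashCode(int index, Hasher hasher) {
        return hasher.hash(values[index]);
    }
}

// Hypothetical struct-like parent that combines its children's hashes.
final class StructLike {
    private final List<ChildVector> children;

    StructLike(ChildVector... children) {
        this.children = Arrays.asList(children);
    }

    int hashCode(int index, Hasher hasher) {
        int hash = 0;
        for (ChildVector child : children) {
            // Forward the caller's hasher; do not substitute a default one.
            hash = 31 * hash + child.hashCode(index, hasher);
        }
        return hash;
    }
}
```

With this shape, two callers that pass different hashers get different results for the same row, which is what makes the parameter meaningful.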





[jira] [Created] (ARROW-7405) [Java] ListVector isEmpty API is incorrect

2019-12-16 Thread Ji Liu (Jira)
Ji Liu created ARROW-7405:
-

 Summary: [Java] ListVector isEmpty API is incorrect
 Key: ARROW-7405
 URL: https://issues.apache.org/jira/browse/ARROW-7405
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the {{isEmpty}} API always returns false in 
{{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not override 
this method.

This leads to incorrect results. For example, a {{ListVector}} with data 
[1,2], null, [], [5,6] should return [false, false, true, false] from this API, 
but it currently returns [false, false, false, false].
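
A minimal sketch of the expected behavior, modeled on plain arrays rather than on Arrow's {{ListVector}}. The {{ListEmptiness}} helper, the offsets layout and the validity array are assumptions made for illustration; treating a null list as not empty follows the expected result quoted in this issue.

```java
// Hypothetical helper over plain arrays; it is not Arrow's ListVector.
final class ListEmptiness {
    private ListEmptiness() {}

    // offsets has one more entry than there are lists: list i spans
    // [offsets[i], offsets[i + 1]). valid[i] is false for a null list.
    static boolean isEmpty(int[] offsets, boolean[] valid, int index) {
        if (!valid[index]) {
            // A null list is reported as not empty, matching the expected
            // result given in this issue.
            return false;
        }
        return offsets[index] == offsets[index + 1];
    }
}
```

For the data [1,2], null, [], [5,6], the assumed layout is offsets {0, 2, 2, 2, 4} with validity {true, false, true, true}.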





[jira] [Created] (ARROW-7264) [Java] RangeEqualsVisitor type check is not correct

2019-11-26 Thread Ji Liu (Jira)
Ji Liu created ARROW-7264:
-

 Summary: [Java] RangeEqualsVisitor type check is not correct
 Key: ARROW-7264
 URL: https://issues.apache.org/jira/browse/ARROW-7264
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 0.15.1
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{RangeEqualsVisitor}} generally checks the type only once and keeps 
the result to avoid repeated type checking, see
{code:java}
typeCompareResult = 
left.getField().getType().equals(right.getField().getType());
{code}
This only compares {{ArrowType}}, which may cause unexpected behavior for 
complex types. For example, {{List}} and {{List}} would be considered equal 
types because their child fields are not considered.

We should compare Field here instead. To make this more extensible, we use 
{{TypeEqualsVisitor}} to compare Field, so that one can choose whether to 
check names or metadata.
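
The difference between the shallow and the deep check can be sketched with a heavily simplified field model. {{SimpleField}} and its string type names are hypothetical and are not Arrow's {{Field}} or {{ArrowType}}; the child types used in the usage note are chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, heavily simplified field model; not Arrow's Field class.
final class SimpleField {
    final String name;
    final String type;
    final List<SimpleField> children;

    SimpleField(String name, String type, SimpleField... children) {
        this.name = name;
        this.type = type;
        this.children = Arrays.asList(children);
    }

    // Shallow check: looks at the top-level type only, so it cannot tell
    // two list fields with different child types apart.
    static boolean sameTopLevelType(SimpleField a, SimpleField b) {
        return a.type.equals(b.type);
    }

    // Deep check: also walks the children, optionally comparing names.
    static boolean sameField(SimpleField a, SimpleField b, boolean checkNames) {
        if (!a.type.equals(b.type)) {
            return false;
        }
        if (checkNames && !a.name.equals(b.name)) {
            return false;
        }
        if (a.children.size() != b.children.size()) {
            return false;
        }
        for (int i = 0; i < a.children.size(); i++) {
            if (!sameField(a.children.get(i), b.children.get(i), checkNames)) {
                return false;
            }
        }
        return true;
    }
}
```

Two list fields whose children are "Int" and "Utf8" pass the shallow check but fail the deep one, and the {{checkNames}} flag shows how a caller could opt in or out of name comparison.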

 

Also provide a test for ListVector to validate this change.





[jira] [Created] (ARROW-7259) [Java] Support subfield encoder use different hasher

2019-11-25 Thread Ji Liu (Jira)
Ji Liu created ARROW-7259:
-

 Summary: [Java] Support subfield encoder use different hasher
 Key: ARROW-7259
 URL: https://issues.apache.org/jira/browse/ARROW-7259
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{ListSubFieldEncoder/StructSubFieldEncoder}} use default hasher for 
calculating hashCode.

This issue enables them to use different hasher or even user-defined hasher for 
their own use cases just like {{DictionaryEncoder}} does.





[jira] [Created] (ARROW-7026) [Java] Remove assertions in MessageSerializer/vector/writer/reader

2019-10-30 Thread Ji Liu (Jira)
Ji Liu created ARROW-7026:
-

 Summary: [Java] Remove assertions in 
MessageSerializer/vector/writer/reader
 Key: ARROW-7026
 URL: https://issues.apache.org/jira/browse/ARROW-7026
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently assertions exist in many classes like 
{{MessageSerializer/JsonReader/JsonWriter/ListVector}} etc.

i. If assertions are not enabled through JVM arguments, these checks are 
skipped, which can lead to potential problems.

ii. Failed assertions produce Java errors, which are not caught by traditional 
catch clauses.

To fix this, use {{Preconditions}} instead.
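
Point ii can be illustrated with plain JDK types. The {{Checks}} class below is a hypothetical stand-in that mimics a precondition check; it is not the {{Preconditions}} class used in the code base. A failed assertion is simulated by throwing {{AssertionError}} directly, so the sketch behaves the same whether or not assertions are enabled.

```java
// Hypothetical helper that mimics a precondition check; a stand-in only.
final class Checks {
    private Checks() {}

    // Always evaluated, regardless of JVM flags.
    static void checkArgument(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    // Reports which kind of catch clause ends up handling the failure.
    static String handledBy(Runnable action) {
        try {
            action.run();
            return "nothing thrown";
        } catch (Exception e) {
            // IllegalArgumentException lands here.
            return "catch (Exception)";
        } catch (Error e) {
            // AssertionError is an Error, so catch (Exception) misses it.
            return "catch (Error)";
        }
    }
}
```

A caller that only has a {{catch (Exception e)}} clause would handle the failed precondition but not the failed assertion.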





[jira] [Created] (ARROW-7021) [Java] UnionFixedSizeListWriter decimal type should check writer index

2019-10-29 Thread Ji Liu (Jira)
Ji Liu created ARROW-7021:
-

 Summary: [Java] UnionFixedSizeListWriter decimal type should check 
writer index
 Key: ARROW-7021
 URL: https://issues.apache.org/jira/browse/ARROW-7021
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


{{UnionFixedSizeListWriter}} should check the writer index for the decimal 
type (just as for other types) to ensure the values written do not exceed 
listSize.

Otherwise, the writer may quietly continue to write data into its underlying 
vector even when writer.idx() > listSize * index.





[jira] [Created] (ARROW-6912) [Java] Extract a common base class for avro converter consumers

2019-10-17 Thread Ji Liu (Jira)
Ji Liu created ARROW-6912:
-

 Summary: [Java] Extract a common base class for avro converter 
consumers
 Key: ARROW-6912
 URL: https://issues.apache.org/jira/browse/ARROW-6912
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the Avro converter consumers share some common variables and methods, 
and this duplication could be eliminated by extracting a common base class.





[jira] [Created] (ARROW-6898) [Java] Fix potential memory leak in ArrowWriter and several test classes

2019-10-16 Thread Ji Liu (Jira)
Ji Liu created ARROW-6898:
-

 Summary: [Java] Fix potential memory leak in ArrowWriter and 
several test classes
 Key: ARROW-6898
 URL: https://issues.apache.org/jira/browse/ARROW-6898
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


ARROW-6040 fixed the problem that dictionary entries were required in IPC 
streams even when empty; dictionaries are now written only when there is at 
least one batch. As a result, if we write an empty stream and invoke 
ArrowWriter#close, the dictionaries are not closed, which leads to a memory 
leak (they are closed only after the write operation). This is really hard to 
debug. The problem was found by 
{{TestArrowReaderWriter#testEmptyStreamInStreamingIPC}} when I tried to close 
the allocator after the test. 

In addition, several test classes have potential memory leaks because they do 
not close the allocator/vector/buf etc.





[jira] [Created] (ARROW-6889) [Java] ComplexCopier enable FixedSizeList type & fix RangeEqualsVisitor StackOverFlow

2019-10-15 Thread Ji Liu (Jira)
Ji Liu created ARROW-6889:
-

 Summary: [Java] ComplexCopier enable FixedSizeList type & fix 
RangeEqualsVisitor StackOverFlow
 Key: ARROW-6889
 URL: https://issues.apache.org/jira/browse/ARROW-6889
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


i. Enable {{ComplexCopier}} to copy {{FixedSizeListVector}} values, and add 
related tests

ii. Fix the stack overflow in {{RangeEqualsVisitor#compareFixedSizeListVectors}}





[jira] [Created] (ARROW-6871) [Java] Enhance TransferPair related parameters check and tests

2019-10-13 Thread Ji Liu (Jira)
Ji Liu created ARROW-6871:
-

 Summary: [Java] Enhance TransferPair related parameters check and 
tests
 Key: ARROW-6871
 URL: https://issues.apache.org/jira/browse/ARROW-6871
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


{{TransferPair}} related parameter checks in different classes have potential 
problems:

i. {{copyValueSafe}} does not check the from index; if from > valueCount, no 
error is shown.

ii. {{splitAndTransfer}} has no index checks in classes like 
{{VarcharVector}}.

iii. The {{splitAndTransfer}} index check in classes like UnionVector is not 
correct (Preconditions.checkArgument(startIndex + length <= valueCount)); it 
should check the parameters separately.

iv. Some assert usages should be replaced with {{Preconditions}}.

v. More unit tests should be added to cover corner cases.
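
Item iii can be made concrete with a small sketch. {{RangeChecks}} is a hypothetical helper, and the exact bounds it enforces are an assumption about what "check params separately" could mean; it is not the check that was eventually committed.

```java
// Hypothetical helper illustrating the two styles of range check.
final class RangeChecks {
    private RangeChecks() {}

    // Combined check as quoted in item iii: a negative startIndex, or an
    // int overflow in the sum, slips through.
    static boolean combinedCheck(int startIndex, int length, int valueCount) {
        return startIndex + length <= valueCount;
    }

    // Separate checks: each parameter is validated on its own, and the
    // subtraction form avoids overflowing the sum.
    static void checkSeparately(int startIndex, int length, int valueCount) {
        if (startIndex < 0 || startIndex > valueCount) {
            throw new IllegalArgumentException("invalid startIndex: " + startIndex);
        }
        if (length < 0 || length > valueCount - startIndex) {
            throw new IllegalArgumentException("invalid length: " + length);
        }
    }
}
```

For example, startIndex = -1 with length = 3 and valueCount = 5 passes the combined check but is rejected by the separate checks.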





[jira] [Created] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode

2019-10-11 Thread Ji Liu (Jira)
Ji Liu created ARROW-6853:
-

 Summary: [Java] Support vector and dictionary encoder use 
different hasher for calculating hashCode
 Key: ARROW-6853
 URL: https://issues.apache.org/jira/browse/ARROW-6853
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The hasher interface was introduced in ARROW-5898 and now has two different 
implementations ({{MurmurHasher}} and {{SimpleHasher}}); there could be more 
in the future.

Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use 
{{SimpleHasher}} for calculating hashCode. This issue enables them to use a 
different hasher, or even a user-defined hasher, for their own use cases.





[jira] [Created] (ARROW-6850) [Java] Jdbc converter support Null type

2019-10-11 Thread Ji Liu (Jira)
Ji Liu created ARROW-6850:
-

 Summary: [Java] Jdbc converter support Null type
 Key: ARROW-6850
 URL: https://issues.apache.org/jira/browse/ARROW-6850
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


java.sql.Types.NULL is not supported yet, since until now there was no 
NullVector in the Java code.

This could be implemented after ARROW-1638 (IPC roundtrip for null type) is 
merged.





[jira] [Created] (ARROW-6721) [JAVA] Avro adapter benchmark only runs once in JMH

2019-09-27 Thread Ji Liu (Jira)
Ji Liu created ARROW-6721:
-

 Summary: [JAVA] Avro adapter benchmark only runs once in JMH
 Key: ARROW-6721
 URL: https://issues.apache.org/jira/browse/ARROW-6721
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The current {{AvroAdapterBenchmark}} actually runs only once during JMH 
evaluation, since the decoder is consumed by the first invocation and 
follow-up invocations return immediately.

To solve this, we use {{BinaryDecoder}} explicitly in the benchmark and reset 
its inner stream at the start of each invocation of the test method.
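
The effect can be reproduced with a plain JDK stream. This is only an analogy: {{ReplayableInput}} is a hypothetical class built on {{java.io.ByteArrayInputStream}}, not on Avro's {{BinaryDecoder}} or on the actual benchmark code.

```java
import java.io.ByteArrayInputStream;

// Stand-in for a benchmark body: counts the bytes it manages to read.
final class ReplayableInput {
    private final ByteArrayInputStream stream;

    ReplayableInput(byte[] data) {
        this.stream = new ByteArrayInputStream(data);
    }

    // Without a reset, only the first call sees any data.
    int consumeWithoutReset() {
        return drain();
    }

    // Rewind first, so every call does the same amount of work.
    int consumeWithReset() {
        stream.reset(); // no mark was set, so this rewinds to position 0
        return drain();
    }

    private int drain() {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }
}
```

A benchmark method that behaves like {{consumeWithoutReset}} measures real work only on its first invocation, which is the problem described above.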





[jira] [Created] (ARROW-6710) [Java] Add JDBC adapter test to cover cases which contains some null values

2019-09-26 Thread Ji Liu (Jira)
Ji Liu created ARROW-6710:
-

 Summary: [Java] Add JDBC adapter test to cover cases which 
contains some null values
 Key: ARROW-6710
 URL: https://issues.apache.org/jira/browse/ARROW-6710
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The current JDBC adapter tests only cover the cases where the values are all 
non-null or all null.

However, the case where a ResultSet has some null values is not covered 
(ARROW-6709).





[jira] [Created] (ARROW-6662) [Java] Implement equals/approxEquals API for VectorSchemaRoot

2019-09-22 Thread Ji Liu (Jira)
Ji Liu created ARROW-6662:
-

 Summary: [Java] Implement equals/approxEquals API for 
VectorSchemaRoot
 Key: ARROW-6662
 URL: https://issues.apache.org/jira/browse/ARROW-6662
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


With the newly added visitor APIs (ARROW-6211), we can implement 
equals/approxEquals for VectorSchemaRoot.





[jira] [Created] (ARROW-6661) [Java] Implement APIs like slice to enhance VectorSchemaRoot

2019-09-22 Thread Ji Liu (Jira)
Ji Liu created ARROW-6661:
-

 Summary: [Java] Implement APIs like slice to enhance 
VectorSchemaRoot
 Key: ARROW-6661
 URL: https://issues.apache.org/jira/browse/ARROW-6661
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the Java implementation has no APIs like slice for record batches, 
as C++/Python have.

This issue is about implementing slice/getVector/addVector/removeVector.





[jira] [Created] (ARROW-6600) [Java] Implement dictionary-encoded subfields for Union type

2019-09-18 Thread Ji Liu (Jira)
Ji Liu created ARROW-6600:
-

 Summary: [Java] Implement dictionary-encoded subfields for Union 
type
 Key: ARROW-6600
 URL: https://issues.apache.org/jira/browse/ARROW-6600
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Implement dictionary-encoded subfields for {{Union}} type. Each child vector 
could be encodable or not.

 

Meanwhile, extract common logic into {{DictionaryEncoder}} and refactor 
List subfield encoding to keep it consistent with the {{Struct/Union}} types.

 





[jira] [Created] (ARROW-6472) [Java] ValueVector#accept may has potential cast exception

2019-09-05 Thread Ji Liu (Jira)
Ji Liu created ARROW-6472:
-

 Summary: [Java] ValueVector#accept may has potential cast exception
 Key: ARROW-6472
 URL: https://issues.apache.org/jira/browse/ARROW-6472
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per discussion 
[https://github.com/apache/arrow/pull/5195#issuecomment-528425302]

We may use API this way:
{code:java}
RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
vector3.accept(visitor, range){code}
If vector1/vector2 are, say, {{StructVector}}s and vector3 is an {{IntVector}}, 
things can go bad: we'll use {{compareBaseFixedWidthVectors()}} and do 
wrong type-casts for vector1/vector2.





[jira] [Created] (ARROW-6464) [Java] Refactor FixedSizeListVector#splitAndTransfer with slice API

2019-09-05 Thread Ji Liu (Jira)
Ji Liu created ARROW-6464:
-

 Summary: [Java] Refactor FixedSizeListVector#splitAndTransfer with 
slice API
 Key: ARROW-6464
 URL: https://issues.apache.org/jira/browse/ARROW-6464
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{FixedSizeListVector#splitAndTransfer}} actually uses 
{{copyValueSafe}}, which copies memory; we should use the slice API instead.

Meanwhile, {{splitAndTransfer}} in all classes should perform the index check 
at the beginning.





[jira] [Created] (ARROW-6460) [Java] Add unit test for large avro data

2019-09-04 Thread Ji Liu (Jira)
Ji Liu created ARROW-6460:
-

 Summary: [Java] Add unit test for large avro data
 Key: ARROW-6460
 URL: https://issues.apache.org/jira/browse/ARROW-6460
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


To avoid OOM, we implemented an iterator API in ARROW-6220.

This issue is about adding tests with a large fake data set (say 6MM rows in 
JDBC adapter test) to ensure that no OOMs occur.

 





[jira] [Created] (ARROW-6452) [Java] Override ValueVector toString() method

2019-09-03 Thread Ji Liu (Jira)
Ji Liu created ARROW-6452:
-

 Summary: [Java] Override ValueVector toString() method
 Key: ARROW-6452
 URL: https://issues.apache.org/jira/browse/ARROW-6452
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently the cpp {{Array#ToString}} returns a human-readable string 
like:

[
  1,
  2,
  3
]

But the Java {{ValueVector}} does not implement {{toString()}} this way yet.
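
One possible shape for such output, sketched on a plain int array. {{VectorFormat}} is a hypothetical helper that only imitates the layout shown above; it says nothing about how a real {{ValueVector#toString}} would handle nulls or nested types.

```java
// Hypothetical formatter; it only imitates the layout shown above.
final class VectorFormat {
    private VectorFormat() {}

    static String format(int[] values) {
        StringBuilder sb = new StringBuilder("[\n");
        for (int i = 0; i < values.length; i++) {
            sb.append("  ").append(values[i]);
            if (i < values.length - 1) {
                sb.append(','); // no trailing comma after the last value
            }
            sb.append('\n');
        }
        return sb.append(']').toString();
    }
}
```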





[jira] [Created] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type

2019-08-30 Thread Ji Liu (Jira)
Ji Liu created ARROW-6401:
-

 Summary: [Java] Implement dictionary-encoded subfields for Struct 
type
 Key: ARROW-6401
 URL: https://issues.apache.org/jira/browse/ARROW-6401
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Implement dictionary-encoded subfields for Struct type.

Each child vector will have a dictionary; the dictionary vector is of struct 
type and holds all dictionaries.





[jira] [Created] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type

2019-08-26 Thread Ji Liu (Jira)
Ji Liu created ARROW-6356:
-

 Summary: [Java] Avro adapter implement Enum type and nested Record 
type
 Key: ARROW-6356
 URL: https://issues.apache.org/jira/browse/ARROW-6356
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


Implement conversion of the Avro {{Enum}} type.

Convert the nested Avro {{Record}} type to an Arrow {{StructVector}}.





[jira] [Created] (ARROW-6311) [Java] Make ApproxEqualsVisitor accept DiffFunction to make it more flexible

2019-08-21 Thread Ji Liu (Jira)
Ji Liu created ARROW-6311:
-

 Summary: [Java] Make ApproxEqualsVisitor accept DiffFunction to 
make it more flexible
 Key: ARROW-6311
 URL: https://issues.apache.org/jira/browse/ARROW-6311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{ApproxEqualsVisitor}} accepts a single epsilon for both float and 
double comparison, and the difference calculation is always {{Math.abs}}(f1-f2).

For some cases like {{Validator}} this is not very suitable, because:

i. it has different epsilon values for float/double

ii. its difference function is not Math.abs(f1-f2)

To resolve these, make this visitor accept both float/double epsilons and diff 
functions.
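
A minimal sketch of a pluggable difference function. The {{DiffFunction}} name echoes the issue title, but the signature here is an assumption, as are {{ApproxCompare}}, the "at most epsilon" comparison and the relative variant. For brevity the sketch covers double only, whereas the issue also asks for a separate float epsilon.

```java
// Hypothetical interface; the signature is an assumption for illustration.
interface DiffFunction {
    boolean approxEquals(double a, double b, double epsilon);
}

final class ApproxCompare {
    private ApproxCompare() {}

    // Absolute difference, in the spirit of Math.abs(f1 - f2).
    static final DiffFunction ABSOLUTE =
        (a, b, eps) -> Math.abs(a - b) <= eps;

    // Relative difference, one example of an alternative a caller might need.
    static final DiffFunction RELATIVE =
        (a, b, eps) -> Math.abs(a - b) <= eps * Math.max(Math.abs(a), Math.abs(b));

    // Compares two ranges element by element with the supplied function.
    static boolean rangeApproxEquals(double[] left, double[] right,
                                     double epsilon, DiffFunction diff) {
        if (left.length != right.length) {
            return false;
        }
        for (int i = 0; i < left.length; i++) {
            if (!diff.approxEquals(left[i], right[i], epsilon)) {
                return false;
            }
        }
        return true;
    }
}
```

With values around 1000 that differ by 0.5 and an epsilon of 0.001, the absolute function reports a mismatch while the relative one accepts the pair, which shows why a single hard-coded formula does not fit every caller.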





[jira] [Created] (ARROW-6308) [Java] Support write interleaved dictionaries and batches in IPC stream

2019-08-21 Thread Ji Liu (Jira)
Ji Liu created ARROW-6308:
-

 Summary: [Java] Support write interleaved dictionaries and batches 
in IPC stream
 Key: ARROW-6308
 URL: https://issues.apache.org/jira/browse/ARROW-6308
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per the discussions in the following threads, and as described in the 
spec ([http://arrow.apache.org/docs/format/IPC.html#streaming-format]), 
as long as a record batch doesn't reference a dictionary, they can be 
interleaved.

[https://github.com/apache/arrow/pull/4960]

[https://github.com/apache/arrow/pull/5146]

Currently it is possible to parse interleaved dictionaries and batches (via 
ARROW-6040), but it is impossible to write data in this format.

This issue records the problem; the work should be done after a discussion on 
the mailing list.





[jira] [Created] (ARROW-6289) [Java] Add empty() in UnionVector to create instance

2019-08-18 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6289:
-

 Summary: [Java] Add empty() in UnionVector to create instance
 Key: ARROW-6289
 URL: https://issues.apache.org/jira/browse/ARROW-6289
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently all complex type vectors except {{UnionVector}} have an {{empty()}} 
API to create an instance.





[jira] [Created] (ARROW-6288) [Java] Implement TypeEqualsVisitor comparing vector type equals considering names and metadata

2019-08-18 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6288:
-

 Summary: [Java] Implement TypeEqualsVisitor comparing vector type 
equals considering names and metadata
 Key: ARROW-6288
 URL: https://issues.apache.org/jira/browse/ARROW-6288
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently when we compare ranges/vectors for equality, we first compare the 
vector {{Field}} with its equals method, which makes it hard to specify whether 
to compare names or metadata.

Implementing a {{TypeEqualsVisitor}} will make type comparisons more flexible, 
as the cpp implementation does: 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc#L712]





[jira] [Created] (ARROW-6265) [Java] Avro adapter implement Array/Map/Fixed type

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6265:
-

 Summary: [Java] Avro adapter implement Array/Map/Fixed type
 Key: ARROW-6265
 URL: https://issues.apache.org/jira/browse/ARROW-6265
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Support Array/Map/Fixed type in avro adapter.





[jira] [Created] (ARROW-6250) [Java] Implement ApproxEqualsVisitor comparing approx for floating point

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6250:
-

 Summary: [Java] Implement ApproxEqualsVisitor comparing approx for 
floating point
 Key: ARROW-6250
 URL: https://issues.apache.org/jira/browse/ARROW-6250
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


We have already implemented {{RangeEqualsVisitor/VectorEqualsVisitor}} for 
comparing ranges/vectors.

ARROW-6211 was created to make {{ValueVector}} work with a generic visitor.

We should also implement {{ApproxEqualsVisitor}} to compare floating point 
values, just as cpp does:

[https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc]





[jira] [Created] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6249:
-

 Summary: [Java] Remove useless class ByteArrayWrapper
 Key: ARROW-6249
 URL: https://issues.apache.org/jira/browse/ARROW-6249
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


This class was introduced in the encoding part to compare byte[] values for 
equality.

Since we now compare values/vectors for equality through the visitor API newly 
added by ARROW-6022 instead of comparing {{getObject}} results, this class is no 
longer of use.





[jira] [Created] (ARROW-6234) [Java] ListVector hashCode() is not correct

2019-08-14 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6234:
-

 Summary: [Java] ListVector hashCode() is not correct
 Key: ARROW-6234
 URL: https://issues.apache.org/jira/browse/ARROW-6234
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The current implementation is not correct:
{code:java}
for (int i = start; i < end; i++) {
  hash = 31 * vector.hashCode(i);
}
{code}
Should be something like:
{code:java}
hash = 31 * hash + vector.hashCode(i);{code}
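
The difference between the two loops can be seen on plain int hashes. {{ListHash}} is a hypothetical helper, and the starting value of 0 is an assumption since the snippets above do not show how {{hash}} is initialized.

```java
// Hypothetical helper comparing the two loops on plain int hashes.
final class ListHash {
    private ListHash() {}

    // Buggy form: each iteration overwrites the accumulator, so only the
    // last element contributes to the result.
    static int overwriting(int[] elementHashes) {
        int hash = 0;
        for (int h : elementHashes) {
            hash = 31 * h;
        }
        return hash;
    }

    // Fixed form: each iteration folds the previous value in.
    static int accumulating(int[] elementHashes) {
        int hash = 0;
        for (int h : elementHashes) {
            hash = 31 * hash + h;
        }
        return hash;
    }
}
```

With the buggy loop, any two lists that end in the same element hash to the same value, for example {1, 2, 3} and {9, 9, 3}.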





[jira] [Created] (ARROW-6218) [Java] Add UINT type test in integration to avoid potential overflow

2019-08-12 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6218:
-

 Summary: [Java] Add UINT type test in integration to avoid 
potential overflow
 Key: ARROW-6218
 URL: https://issues.apache.org/jira/browse/ARROW-6218
 Project: Apache Arrow
  Issue Type: Test
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As per the discussion in [https://github.com/apache/arrow/pull/5002]:

For UINT types, the integration test widens the data type when writing/reading 
JSON data (i.e. Long->BigInteger, Int->Long) to avoid potential overflow.

For example, for UINT8 the write side and read side look like this:

 
{code:java}
case UINT8:
  generator.writeNumber(UInt8Vector.getNoOverflow(buffer, index));
  break;{code}
 
{code:java}
BigInteger value = parser.getBigIntegerValue();
buf.writeLong(value.longValue());
{code}
We should add a test to guard against overflow in this data transfer process.
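
The round trip such a test would exercise can be sketched with plain JDK types. The class and method names below are hypothetical and only mirror the idea of widening on write and narrowing on read; they are not the integration test code.

```java
import java.math.BigInteger;

/** Hypothetical illustration of an unsigned 64-bit round trip through BigInteger. */
final class UnsignedRoundTripSketch {
  private static final BigInteger TWO_POW_64 = BigInteger.ONE.shiftLeft(64);

  /** Write side: widen the raw 64-bit pattern to a non-negative BigInteger. */
  static BigInteger toUnsigned(long raw) {
    BigInteger value = BigInteger.valueOf(raw);
    return raw >= 0 ? value : value.add(TWO_POW_64);
  }

  /** Read side: narrow back to long; longValue() keeps the low 64 bits. */
  static long fromUnsigned(BigInteger value) {
    return value.longValue();
  }
}
```

A test along these lines would write values above Long.MAX_VALUE (for example the bit pattern of -1L, which is 18446744073709551615 when read as unsigned) and check that the same bit pattern comes back.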



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6200) [Java] Method getBufferSizeFor in BaseRepeatedValueVector/ListVector not correct

2019-08-11 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6200:
-

 Summary: [Java] Method getBufferSizeFor in 
BaseRepeatedValueVector/ListVector not correct
 Key: ARROW-6200
 URL: https://issues.apache.org/jira/browse/ARROW-6200
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, {{getBufferSizeFor}} in {{BaseRepeatedValueVector}} is implemented as 
below:
{code:java}
if (valueCount == 0) {
  return 0;
}

return ((valueCount + 1) * OFFSET_WIDTH) + vector.getBufferSizeFor(valueCount);
{code}
Here {{vector.getBufferSizeFor(valueCount)}} does not seem right, because 
{{valueCount}} is the number of lists rather than the number of inner values. 
It should be:

{code:java}
int innerVectorValueCount = offsetBuffer.getInt(valueCount * OFFSET_WIDTH);

vector.getBufferSizeFor(innerVectorValueCount)
{code}
{{ListVector}} has the same problem.
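
A simplified model of the corrected calculation, using a plain int array for the offsets and a fixed element width for the inner vector. The names are hypothetical and validity buffers are ignored, so this only shows where the inner value count comes from.

```java
/** Hypothetical model of the buffer-size calculation for a list-like vector. */
final class ListBufferSizeSketch {
  static final int OFFSET_WIDTH = 4; // bytes per offset entry

  /**
   * offsets holds at least valueCount + 1 entries; offsets[i] is the index of the
   * first inner element of list i. elementWidth is the inner vector's bytes per value.
   */
  static int bufferSizeFor(int[] offsets, int valueCount, int elementWidth) {
    if (valueCount == 0) {
      return 0;
    }
    // The number of inner values is the end offset of the last list,
    // not the number of lists.
    int innerValueCount = offsets[valueCount];
    return (valueCount + 1) * OFFSET_WIDTH + innerValueCount * elementWidth;
  }
}
```

For offsets {0, 2, 5, 9}, three lists hold nine inner values, so the data part is sized for nine elements rather than three.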



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6199) [Java] Avro adapter avoid potential resource leak.

2019-08-11 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6199:
-

 Summary: [Java] Avro adapter avoid potential resource leak.
 Key: ARROW-6199
 URL: https://issues.apache.org/jira/browse/ARROW-6199
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, the Avro consumer interface has no close API, which may cause resource 
leaks, for example in {{AvroBytesConsumer#cacheBuffer}}.

To resolve this, make the consumer interface extend {{AutoCloseable}} and create 
{{CompositeAvroConsumer}} to encompass the consume and close logic. 
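
The shape of this proposal could look roughly like the sketch below. All type names here are stand-ins invented for the example (the real consumers work on Avro decoders and Arrow vectors); the point is only that the composite forwards both consume and close to every delegate.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a consumer interface that can release resources. */
interface ClosableConsumer extends AutoCloseable {
  void consume(String value);

  @Override
  void close(); // narrowed so callers need not handle checked exceptions
}

/** Toy consumer holding a buffer that must be released on close. */
final class BufferingConsumer implements ClosableConsumer {
  final List<String> buffer = new ArrayList<>();
  boolean closed = false;

  @Override
  public void consume(String value) {
    buffer.add(value);
  }

  @Override
  public void close() {
    buffer.clear();
    closed = true;
  }
}

/** Forwards consume and close to every delegate. */
final class CompositeConsumerSketch implements ClosableConsumer {
  private final List<ClosableConsumer> delegates;

  CompositeConsumerSketch(List<ClosableConsumer> delegates) {
    this.delegates = delegates;
  }

  @Override
  public void consume(String value) {
    for (ClosableConsumer delegate : delegates) {
      delegate.consume(value);
    }
  }

  @Override
  public void close() {
    for (ClosableConsumer delegate : delegates) {
      delegate.close();
    }
  }
}
```

Because the composite is itself AutoCloseable, callers can wrap it in try-with-resources and every delegate is closed when the block exits.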



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6194) [Java] Make DictionaryEncoder non-static making it easy to extend and reuse

2019-08-10 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6194:
-

 Summary: [Java] Make DictionaryEncoder non-static making it easy 
to extend and reuse
 Key: ARROW-6194
 URL: https://issues.apache.org/jira/browse/ARROW-6194
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As discussed in [https://github.com/apache/arrow/pull/4994].

The current static DictionaryEncoder is limited in terms of extension and reuse.

Slightly change the APIs and migrate the static methods to an object-based approach.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API

2019-08-08 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6175:
-

 Summary: [Java] Fix MapVector#getMinorType and extend 
AbstractContainerVector addOrGet complex vector API
 Key: ARROW-6175
 URL: https://issues.apache.org/jira/browse/ARROW-6175
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


i. Currently {{MapVector}} inherits {{getMinorType}} from {{ListVector}}, so 
{{MapVector#getMinorType}} returns the wrong {{MinorType}}.

ii. {{AbstractContainerVector}} now only has {{addOrGetList}}, 
{{addOrGetUnion}} and {{addOrGetStruct}}, which do not cover all complex types, 
such as {{MapVector}} and {{FixedSizeListVector}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors

2019-08-07 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6160:
-

 Summary: [Java] AbstractStructVector#getPrimitiveVectors fails to 
work with complex child vectors
 Key: ARROW-6160
 URL: https://issues.apache.org/jira/browse/ARROW-6160
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct-type 
child vectors are traversed recursively to collect their primitive vectors; other 
complex types such as {{ListVector}} and {{UnionVector}} are treated as primitive 
types and returned directly.

For example, for Struct(List(Int), Struct(Int, Varchar)), {{getPrimitiveVectors}} 
should return {{[IntVector, IntVector, VarCharVector]}} instead of 
{{[ListVector, IntVector, VarCharVector]}}.
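
The expected behavior is a plain recursive flatten over any vector that has children. Here is a sketch over a hypothetical tree type; {{TypeNode}} and the method names are invented for illustration and are not Arrow classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical tree node: a type name plus zero or more children. */
final class TypeNode {
  final String name;
  final List<TypeNode> children;

  TypeNode(String name, TypeNode... children) {
    this.name = name;
    this.children = Arrays.asList(children);
  }

  boolean isPrimitive() {
    return children.isEmpty();
  }
}

final class PrimitiveCollectorSketch {
  /** Returns the names of all leaf (primitive) nodes in depth-first order. */
  static List<String> primitives(TypeNode root) {
    List<String> out = new ArrayList<>();
    collect(root, out);
    return out;
  }

  private static void collect(TypeNode node, List<String> out) {
    if (node.isPrimitive()) {
      out.add(node.name);
      return;
    }
    // Recurse into every complex node, not only into structs.
    for (TypeNode child : node.children) {
      collect(child, out);
    }
  }
}
```

Applied to the example above, the tree Struct(List(Int), Struct(Int, Varchar)) flattens to Int, Int, Varchar.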



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly

2019-08-06 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6145:
-

 Summary: [Java] UnionVector created by MinorType#getNewVector 
could not keep field type info properly
 Key: ARROW-6145
 URL: https://issues.apache.org/jira/browse/ARROW-6145
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


While working on other items, I found that a {{UnionVector}} created by 
{{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} does not 
keep field type info properly. For example, if we set metadata on a {{Field}} in 
the schema, we cannot get it back through {{UnionVector#getField}}.

This is mainly because {{MinorType.Union.getNewVector}} does not pass the 
{{FieldType}} to the vector, and {{UnionVector#getField}} creates a new {{Field}}, 
which causes the inconsistency.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions

2019-08-02 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6118:
-

 Summary: [Java] Replace google Preconditions with Arrow 
Preconditions
 Key: ARROW-6118
 URL: https://issues.apache.org/jira/browse/ARROW-6118
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Most places in the Java code now use {{org.apache.arrow.util.Preconditions}}, but 
some places still use {{com.google.common.base.Preconditions}}.

Replace the Google Preconditions and remove duplicated checks along the way.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6097) [Java] Avro adapter implement unions type

2019-08-01 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6097:
-

 Summary: [Java] Avro adapter implement unions type
 Key: ARROW-6097
 URL: https://issues.apache.org/jira/browse/ARROW-6097
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


Support converting union types such as ["string"], ["string", "int"] and the nullable 
["string", "int", "null"].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6083) [Java] Refactor Jdbc adapter consume logic

2019-07-31 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6083:
-

 Summary: [Java] Refactor Jdbc adapter consume logic
 Key: ARROW-6083
 URL: https://issues.apache.org/jira/browse/ARROW-6083
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


The Jdbc adapter reads from a {{ResultSet}} like this:

while (rs.next()) {
  for (int i = 1; i <= columnCount; i++) {
    jdbcToFieldVector(
        rs,
        i,
        rs.getMetaData().getColumnType(i),
        rowCount,
        root.getVector(rsmd.getColumnName(i)),
        config);
  }
  rowCount++;
}

{{jdbcToFieldVector}} contains a large switch-case; that is to say, for every 
single value from the ResultSet we have to evaluate many conditions.

I think we could optimize this by using consumers/delegates, as the Avro adapter does.
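
The idea can be sketched without JDBC: resolve one consumer per column up front, so the type switch runs once per column instead of once per value. Everything below (the interface, the type strings, the row representation) is hypothetical and only illustrates the structure.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnConsumerSketch {
  /** Hypothetical per-column consumer: converts one raw value. */
  interface ColumnConsumer {
    Object consume(Object raw);
  }

  /** The type switch is evaluated here, once per column. */
  static ColumnConsumer consumerFor(String columnType) {
    switch (columnType) {
      case "INT":
        return raw -> ((Number) raw).intValue();
      case "VARCHAR":
        return raw -> String.valueOf(raw);
      default:
        throw new IllegalArgumentException("Unsupported type: " + columnType);
    }
  }

  /** Converts all rows; the inner loop only calls the pre-resolved consumers. */
  static List<List<Object>> convert(String[] columnTypes, Object[][] rows) {
    ColumnConsumer[] consumers = new ColumnConsumer[columnTypes.length];
    for (int i = 0; i < columnTypes.length; i++) {
      consumers[i] = consumerFor(columnTypes[i]);
    }
    List<List<Object>> out = new ArrayList<>();
    for (Object[] row : rows) {
      List<Object> converted = new ArrayList<>();
      for (int i = 0; i < row.length; i++) {
        converted.add(consumers[i].consume(row[i]));
      }
      out.add(converted);
    }
    return out;
  }
}
```

In the real adapter each consumer would presumably hold its target vector and read directly from the ResultSet, but the per-value path would be equally free of type dispatch.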



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6079) [Java] Implement/test UnionFixedSizeListWriter for FixedSizeListVector

2019-07-31 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6079:
-

 Summary: [Java] Implement/test UnionFixedSizeListWriter for 
FixedSizeListVector
 Key: ARROW-6079
 URL: https://issues.apache.org/jira/browse/ARROW-6079
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Now we have two list vectors: {{ListVector}} and {{FixedSizeListVector}}.

{{ListVector}} already has UnionListWriter for writing data; 
{{FixedSizeListVector}} does not have an equivalent yet, so it seems the only way 
for users to write data is to get the inner vector and set values manually.

Implementing a writer for {{FixedSizeListVector}} would be useful in some cases.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6078) [Java] Implement dictionary-encoded subfields for List type

2019-07-30 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6078:
-

 Summary: [Java] Implement dictionary-encoded subfields for List 
type
 Key: ARROW-6078
 URL: https://issues.apache.org/jira/browse/ARROW-6078
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


For example, a List of int type (valueCount = 5) has data like below:

10, 20

10, 20

30, 40, 50

30, 40, 50

10, 20

could be encoded to:

0, 1

0, 1

2, 3, 4

2, 3, 4

0, 1

with list type dictionary

10, 20, 30, 40, 50

or

10,

20,

30,

40,

50
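
The encoding in the example above can be reproduced with plain collections. This is a sketch under the assumption that the dictionary is built from the inner values in first-seen order; the class name is made up and no Arrow vectors are involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical encoder that dictionary-encodes the inner values of each list. */
final class ListSubfieldEncoderSketch {
  /** Dictionary values in first-seen order; a value's position is its code. */
  final List<Integer> dictionary = new ArrayList<>();
  private final Map<Integer, Integer> codes = new HashMap<>();

  /** Replaces every inner value with its dictionary code, keeping list shapes. */
  List<List<Integer>> encode(List<List<Integer>> lists) {
    List<List<Integer>> encoded = new ArrayList<>();
    for (List<Integer> list : lists) {
      List<Integer> encodedList = new ArrayList<>();
      for (Integer value : list) {
        Integer code = codes.get(value);
        if (code == null) {
          code = dictionary.size();
          dictionary.add(value);
          codes.put(value, code);
        }
        encodedList.add(code);
      }
      encoded.add(encodedList);
    }
    return encoded;
  }
}
```

Feeding in the five lists from the example yields the encoded lists shown above and the dictionary 10, 20, 30, 40, 50.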

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6035) [Java] Avro adapter support convert nullable value

2019-07-25 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6035:
-

 Summary: [Java] Avro adapter support convert nullable value
 Key: ARROW-6035
 URL: https://issues.apache.org/jira/browse/ARROW-6035
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


A specific kind of Avro union type (one with two types, one of which is the null 
type) can be converted to a nullable Arrow vector.

For instance, ["null", "string"] could be represented by a VarCharVector that 
can hold null values.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6022) [Java] Support equals API in ValueVector to compare two vectors equal

2019-07-24 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6022:
-

 Summary: [Java] Support equals API in ValueVector to compare two 
vectors equal
 Key: ARROW-6022
 URL: https://issues.apache.org/jira/browse/ARROW-6022
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


This feature is useful in some cases.

In ARROW-1184, {{Dictionary#equals}} does not work due to the lack of this API.

Moreover, we have already implemented 
{{equals(int index, ValueVector target, int targetIndex)}}, so the newly added API could reuse it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6020) [Java] Refactor ByteFunctionHelper#hash with new added ArrowBufHasher

2019-07-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6020:
-

 Summary: [Java] Refactor ByteFunctionHelper#hash with new added 
ArrowBufHasher
 Key: ARROW-6020
 URL: https://issues.apache.org/jira/browse/ARROW-6020
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Some logic in these two classes is similar. We should replace the 
ByteFunctionHelper#hash logic with ArrowBufHasher, since it provides a murmur hash 
algorithm that could reduce hash collisions.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6019) [Java] Port Jdbc and Avro adapter to new directory

2019-07-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6019:
-

 Summary: [Java] Port Jdbc and Avro adapter to new directory 
 Key: ARROW-6019
 URL: https://issues.apache.org/jira/browse/ARROW-6019
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As discussed on the mailing list, adapters are different from native readers.

This issue is used to track the following items:

i. create a new "contrib" directory and move the Jdbc/Avro adapters to it.

ii. provide more description.

iii. change the ORC reader's structure to a "converter"

cc [~emkornfi...@gmail.com]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5997) [Java] Support dictionary encoding for Union type

2019-07-22 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5997:
-

 Summary: [Java] Support dictionary encoding for Union type
 Key: ARROW-5997
 URL: https://issues.apache.org/jira/browse/ARROW-5997
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Ji Liu
Assignee: Ji Liu


Union is now the only type not supported by dictionary encoding.

Over the last several weeks we did some refactoring of the encoding code, and now 
it is time to support the Union type.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5988) [Java] Avro adapter implement simple Record type

2019-07-19 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5988:
-

 Summary: [Java] Avro adapter implement simple Record type 
 Key: ARROW-5988
 URL: https://issues.apache.org/jira/browse/ARROW-5988
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu


1. implement a simple Record type which only contains primitive types

2. add a ByteBuffer cache in the String/Bytes consumers to reduce object creation. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5986) [Java] Code cleanup for dictionary encoding

2019-07-18 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5986:
-

 Summary: [Java] Code cleanup for dictionary encoding
 Key: ARROW-5986
 URL: https://issues.apache.org/jira/browse/ARROW-5986
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


In the last few weeks, we did some refactoring of dictionary encoding.

Since the newly designed hash table for {{DictionaryEncoder}} and the {{hashCode}} & 
{{equals}} APIs in {{ValueVector}} have already been checked in, some classes are no 
longer used, such as {{DictionaryEncodingHashTable}}, {{BaseBinaryVector}} and the related 
benchmarks & UTs.

Fortunately, these changes did not make it into version 0.14, which makes it possible 
to remove them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5968) [Java] Remove duplicate Preconditions check in JDBC adapter

2019-07-17 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5968:
-

 Summary: [Java] Remove duplicate Preconditions check in JDBC 
adapter
 Key: ARROW-5968
 URL: https://issues.apache.org/jira/browse/ARROW-5968
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Some Preconditions checks are duplicated in {{JdbcToArrow#sqlToArrow}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5967) [Java] DateUtility#timeZoneList is not correct

2019-07-17 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5967:
-

 Summary: [Java] DateUtility#timeZoneList is not correct
 Key: ARROW-5967
 URL: https://issues.apache.org/jira/browse/ARROW-5967
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{timeZoneList}} in {{DateUtility}} comes from Joda time.

Since we have replaced Joda time with Java time in ARROW-2015, this should also 
be changed.

{{TimeStampXXTZVectors}} have a timezone member which does not seem to be used now, 
and their {{getObject}} returns Long (unlike {{TimeStampXXVectors}}, whose 
{{getObject}} returns {{LocalDateTime}}). Should it return {{LocalDateTime}} in its 
timezone?

Would it be reasonable to do the following:
 # replace the Joda {{timeZoneList}} with a Java {{timeZoneList}} in {{DateUtility}}
 # add a method like {{getLocalDateTimeFromEpochMilli(long epochMillis, String timezone)}} in {{DateUtility}}
 # possibly make {{TimeStampXXTZVectors}} return {{LocalDateTime}} (not sure about this one)
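
For the second item, a minimal sketch with java.time could look like this. The method name and signature follow the proposal above; the enclosing class name is a placeholder and the body is only one possible implementation.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

/** Placeholder class; the proposal would put this method into DateUtility. */
final class DateUtilitySketch {
  /** Converts epoch milliseconds to the local date-time in the given zone ID. */
  static LocalDateTime getLocalDateTimeFromEpochMilli(long epochMillis, String timezone) {
    return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneId.of(timezone));
  }
}
```

For instance, epoch 0 maps to 1970-01-01T00:00 in "UTC" and to 1970-01-01T09:00 in "Asia/Tokyo".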

cc [~emkornfi...@gmail.com]  [~bryanc]  [~siddteotia]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5909) [Java] Optimize ByteFunctionHelpers equals & compare logic

2019-07-10 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5909:
-

 Summary: [Java] Optimize ByteFunctionHelpers equals & compare logic
 Key: ARROW-5909
 URL: https://issues.apache.org/jira/browse/ARROW-5909
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Currently it first compares Long values, and then, once fewer than 8 bytes are 
left, it compares Byte values.

Add logic to compare Int values when 4 < length < 8.
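
The chunked comparison can be sketched over byte arrays with java.nio.ByteBuffer. This is an illustration, not the ByteFunctionHelpers code: the names are hypothetical, and the sketch takes the int step whenever at least 4 bytes remain, which is an assumption about the exact boundary.

```java
import java.nio.ByteBuffer;

/** Hypothetical equality check that compares 8, then 4, then 1 byte at a time. */
final class ChunkedEqualsSketch {
  static boolean equal(byte[] left, byte[] right) {
    if (left.length != right.length) {
      return false;
    }
    ByteBuffer l = ByteBuffer.wrap(left);
    ByteBuffer r = ByteBuffer.wrap(right);
    int remaining = left.length;
    // Compare 8 bytes at a time as long values.
    while (remaining >= 8) {
      if (l.getLong() != r.getLong()) {
        return false;
      }
      remaining -= 8;
    }
    // New step: one int comparison covers 4 of the remaining bytes.
    if (remaining >= 4) {
      if (l.getInt() != r.getInt()) {
        return false;
      }
      remaining -= 4;
    }
    // Compare whatever is left byte by byte.
    while (remaining > 0) {
      if (l.get() != r.get()) {
        return false;
      }
      remaining--;
    }
    return true;
  }
}
```

With the int step, a 7-byte tail needs one int comparison and three byte comparisons instead of seven byte comparisons.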

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5902) [Java] Implement HashTable for dictionary encoding

2019-07-10 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5902:
-

 Summary: [Java] Implement HashTable for dictionary encoding
 Key: ARROW-5902
 URL: https://issues.apache.org/jira/browse/ARROW-5902
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Ji Liu
Assignee: Ji Liu


As discussed in [https://github.com/apache/arrow/pull/4792]

Implement a hash table that only stores the hash & index, and add an equality-check 
function to the ValueVector API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5883) [Java] Support Dictionary Encoding for List type

2019-07-09 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5883:
-

 Summary: [Java] Support Dictionary Encoding for List type
 Key: ARROW-5883
 URL: https://issues.apache.org/jira/browse/ARROW-5883
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


As described in 
[http://arrow.apache.org/docs/format/Layout.html#dictionary-encoding], List 
type encoding should be supported.

Now ListVector getObject returns an ArrayList implementation, whose equals and 
hashCode are already overridden, so it can be used directly as a 
HashMap key in DictionaryEncoder. Since we won't change the Dictionary data, using a 
mutable key doesn't seem to matter.
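
A small sketch of why this works: java.util lists compare and hash by content, so a list read from the data finds the equal list in the dictionary. The class and method names are hypothetical and plain collections stand in for the vectors.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup of whole lists in a dictionary of lists. */
final class ListKeySketch {
  /** Returns, for each value, the index of the equal list in the dictionary. */
  static int[] encode(List<List<Integer>> dictionary, List<List<Integer>> values) {
    Map<List<Integer>, Integer> lookup = new HashMap<>();
    for (int i = 0; i < dictionary.size(); i++) {
      lookup.put(dictionary.get(i), i);
    }
    int[] indices = new int[values.size()];
    for (int i = 0; i < values.size(); i++) {
      Integer index = lookup.get(values.get(i));
      if (index == null) {
        throw new IllegalArgumentException("Value not in dictionary: " + values.get(i));
      }
      indices[i] = index;
    }
    return indices;
  }
}
```

The lookup stays valid only as long as the key lists are not modified after insertion, which matches the assumption that dictionary data does not change.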



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5861) [Java] Initial implement to convert Avro record with primitive types

2019-07-05 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5861:
-

 Summary: [Java] Initial implement to convert Avro record with 
primitive types
 Key: ARROW-5861
 URL: https://issues.apache.org/jira/browse/ARROW-5861
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5846) [Java] Create Avro adapter module and add dependencies

2019-07-04 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5846:
-

 Summary: [Java] Create Avro adapter module and add dependencies
 Key: ARROW-5846
 URL: https://issues.apache.org/jira/browse/ARROW-5846
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records

2019-07-04 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5845:
-

 Summary: [Java] Implement converter between Arrow record batches 
and Avro records
 Key: ARROW-5845
 URL: https://issues.apache.org/jira/browse/ARROW-5845
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5835) [Java] Support Dictionary Encoding for binary type

2019-07-03 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5835:
-

 Summary: [Java] Support Dictionary Encoding for binary type
 Key: ARROW-5835
 URL: https://issues.apache.org/jira/browse/ARROW-5835
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


This is not implemented yet because a byte array cannot be used as a HashMap key 
(arrays do not override equals and hashCode).

One possible way is to wrap byte arrays in an object that implements equals and 
hashCode.
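
A minimal wrapper of this kind, as a sketch; the class name is hypothetical, and the array is not copied, so callers must not mutate it while it serves as a key.

```java
import java.util.Arrays;

/** Hypothetical wrapper giving byte[] content-based equals and hashCode. */
final class ByteArrayKey {
  private final byte[] data; // not copied; must not be mutated while used as a key

  ByteArrayKey(byte[] data) {
    this.data = data;
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof ByteArrayKey && Arrays.equals(data, ((ByteArrayKey) other).data);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(data);
  }
}
```

With this wrapper, two keys built from separate arrays with the same content resolve to the same HashMap entry, which raw byte[] keys do not.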



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5834) [Java] Apply new hash map in DictionaryEncoder

2019-07-03 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5834:
-

 Summary: [Java] Apply new hash map in DictionaryEncoder
 Key: ARROW-5834
 URL: https://issues.apache.org/jira/browse/ARROW-5834
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Ji Liu
Assignee: Ji Liu


Follow-up of [ARROW-5814|https://issues.apache.org/jira/browse/ARROW-5814].

Apply new hash map in DictionaryEncoder to make it work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5821) [Java] Support compact fixed-width vectors

2019-07-02 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5821:
-

 Summary: [Java] Support compact fixed-width vectors
 Key: ARROW-5821
 URL: https://issues.apache.org/jira/browse/ARROW-5821
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Ji Liu
Assignee: Ji Liu


In the shuffle stage of some applications, FixedWidthVectors may hold very little 
non-null data.
In this case, serializing the vectors directly is not a good choice. Generally, we can 
compact the vector so that it only holds non-null values, and create a BitVector 
that tracks the indices of the non-null values so that the vector can be deserialized 
properly.
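
The compaction idea can be sketched with java.util.BitSet standing in for the BitVector and an int array standing in for the data buffer. The class and its members are invented for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical compact form of a nullable int column. */
final class CompactedInts {
  final int valueCount;  // logical length, including nulls
  final BitSet validity; // bit i set means slot i is non-null
  final int[] values;    // non-null values only, in slot order

  private CompactedInts(int valueCount, BitSet validity, int[] values) {
    this.valueCount = valueCount;
    this.validity = validity;
    this.values = values;
  }

  /** Keeps only the non-null values and records their slots in the bit set. */
  static CompactedInts compact(Integer[] sparse) {
    BitSet validity = new BitSet(sparse.length);
    int[] values = new int[sparse.length];
    int count = 0;
    for (int i = 0; i < sparse.length; i++) {
      if (sparse[i] != null) {
        validity.set(i);
        values[count++] = sparse[i];
      }
    }
    return new CompactedInts(sparse.length, validity, Arrays.copyOf(values, count));
  }

  /** Restores the original layout, with nulls in the unset slots. */
  Integer[] expand() {
    Integer[] out = new Integer[valueCount];
    int next = 0;
    for (int i = validity.nextSetBit(0); i >= 0; i = validity.nextSetBit(i + 1)) {
      out[i] = values[next++];
    }
    return out;
  }
}
```

Only the value count, the bit set and the packed values would need to be serialized; the receiver rebuilds the full vector with the expand step.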



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5814) [Java] Implement a HashMap for DictionaryEncoder

2019-07-01 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5814:
-

 Summary: [Java] Implement a  HashMap for 
DictionaryEncoder
 Key: ARROW-5814
 URL: https://issues.apache.org/jira/browse/ARROW-5814
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


As a follow-up of 
[ARROW-5726|https://issues.apache.org/jira/browse/ARROW-5726], implement a 
Map for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt  5  31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt  5  15549.902 ± 771.647 ns/op



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5812) [Java] Refactor method name and param type in BaseIntVector

2019-06-30 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5812:
-

 Summary: [Java] Refactor method name and param type in 
BaseIntVector
 Key: ARROW-5812
 URL: https://issues.apache.org/jira/browse/ARROW-5812
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Change the method to _void setWithPossibleTruncate(int index, long value);_ for better 
generality.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5726) [Java] Implement a common interface for int vectors

2019-06-25 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5726:
-

 Summary: [Java] Implement a common interface for int vectors
 Key: ARROW-5726
 URL: https://issues.apache.org/jira/browse/ARROW-5726
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently _DictionaryEncoder#encode_ uses reflection to look up the set method 
and then set values. 

Setting values by reflection is not efficient, and the code structure is not elegant, 
for example:

_Method setter = null;_
_for (Class c : Arrays.asList(int.class, long.class)) {_
 _try {_
 _setter = indices.getClass().getMethod("setSafe", int.class, c);_
 _break;_
 _} catch (NoSuchMethodException e) {_
 _// ignore_
 _}_
_}_

Implementing a common interface for int vectors, so that the set method can be called 
directly to set values, seems a good choice.
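
A sketch of what such an interface could look like, using array-backed toy vectors. The interface name, the method names and the backing classes are assumptions made for this example.

```java
/** Hypothetical common interface for integer-backed vectors. */
interface IntLikeVector {
  /** Stores the value, truncating it to the vector's width if necessary. */
  void setWithPossibleTruncate(int index, long value);

  long getValueAsLong(int index);
}

/** Toy 8-bit vector. */
final class ByteBackedVector implements IntLikeVector {
  private final byte[] data;

  ByteBackedVector(int capacity) {
    data = new byte[capacity];
  }

  @Override
  public void setWithPossibleTruncate(int index, long value) {
    data[index] = (byte) value;
  }

  @Override
  public long getValueAsLong(int index) {
    return data[index];
  }
}

/** Toy 32-bit vector. */
final class IntBackedVector implements IntLikeVector {
  private final int[] data;

  IntBackedVector(int capacity) {
    data = new int[capacity];
  }

  @Override
  public void setWithPossibleTruncate(int index, long value) {
    data[index] = (int) value;
  }

  @Override
  public long getValueAsLong(int index) {
    return data[index];
  }
}

final class IndexWriterSketch {
  /** Writes indices through the interface, with no reflection involved. */
  static void writeIndices(IntLikeVector indices, long[] values) {
    for (int i = 0; i < values.length; i++) {
      indices.setWithPossibleTruncate(i, values[i]);
    }
  }
}
```

The encoder then depends only on the interface, and each vector type decides how to narrow the long value to its own width.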



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5706) [Java] Remove type conversion in getValidityBufferValueCapacity

2019-06-24 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5706:
-

 Summary: [Java] Remove type conversion in 
getValidityBufferValueCapacity
 Key: ARROW-5706
 URL: https://issues.apache.org/jira/browse/ARROW-5706
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


The current implementation of getValidityBufferValueCapacity is:

(int) (validityBuffer.capacity() * 8L)

There seems to be no need to convert the value to Long and then back to Int; 
just replace it with:

validityBuffer.capacity() * 8

VariableWidthVectorBenchmarks#getValueCapacity shows the performance:

Before:

avgt 5 5.731 ± 0.160 ns/op

After:

avgt 5 5.124 ± 0.125 ns/op



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5705) [Java] Optimize BaseValueVector#computeCombinedBufferSize logic

2019-06-24 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5705:
-

 Summary: [Java] Optimize BaseValueVector#computeCombinedBufferSize 
logic
 Key: ARROW-5705
 URL: https://issues.apache.org/jira/browse/ARROW-5705
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently, BaseValueVector#computeCombinedBufferSize computes the validity buffer 
size as follows:

_roundUp8(getValidityBufferSizeFromCount(valueCount))_

which can be expanded to 

_((((valueCount + 7) >> 3) + 7) / 8) * 8_

There seems to be no need to compute bufferSize first; the expression above could be 
replaced with:

_(valueCount + 63) / 64 * 8_

In this way, the performance of _computeCombinedBufferSize_ would be improved. 
Performance test:

Before:

BaseValueVectorBenchmarks.testComputeCombinedBufferSize avgt 5 4083.180 ± 180.363 ns/op

After:

BaseValueVectorBenchmarks.testComputeCombinedBufferSize avgt 5 3808.635 ± 162.347 ns/op
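
The two expressions can be checked against each other with a short stand-alone sketch; the class and method names are made up, and valueCount is assumed to be non-negative and small enough that the additions do not overflow.

```java
/** Hypothetical side-by-side of the two validity buffer size formulas. */
final class ValidityBufferSizeSketch {
  /** Bytes for valueCount bits, then rounded up to a multiple of 8. */
  static int twoStep(int valueCount) {
    return ((((valueCount + 7) >> 3) + 7) / 8) * 8;
  }

  /** Same result in one step: number of 64-bit words times 8 bytes. */
  static int oneStep(int valueCount) {
    return (valueCount + 63) / 64 * 8;
  }
}
```

Both compute the smallest multiple of 8 bytes that holds valueCount bits, since rounding up to whole bytes and then to groups of 8 bytes is the same as rounding up to whole 64-bit words.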

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5672) [Java] Refactor redundant method modifier

2019-06-21 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5672:
-

 Summary: [Java] Refactor redundant method modifier
 Key: ARROW-5672
 URL: https://issues.apache.org/jira/browse/ARROW-5672
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Ji Liu
Assignee: Ji Liu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5587) Add more maven style check for Java code

2019-06-13 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5587:
-

 Summary: Add more maven style check for Java code
 Key: ARROW-5587
 URL: https://issues.apache.org/jira/browse/ARROW-5587
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Add more Maven style checks for Java code, such as unused imports, redundant 
modifiers, etc. This will improve code quality.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5584) Add import for link reference in FieldReader javadoc

2019-06-13 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5584:
-

 Summary: Add import for link reference in FieldReader javadoc
 Key: ARROW-5584
 URL: https://issues.apache.org/jira/browse/ARROW-5584
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Ji Liu
Assignee: Ji Liu


The link reference to ValueVector in the FieldReader javadoc has no corresponding import.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5435) IntervalYearVector#getObject should return Period with both year and month

2019-05-29 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5435:
-

 Summary: IntervalYearVector#getObject should return Period with 
both year and month
 Key: ARROW-5435
 URL: https://issues.apache.org/jira/browse/ARROW-5435
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Ji Liu
Assignee: Ji Liu


IntervalYearVector#getObject currently returns a Period with only the months field 
set. However, this vector stores an interval as a total number of months (e.g. 2 years 
and 3 months is stored as 27), so it should return a Period with both years and 
months set. 

For the example above, it now returns Period(27 months); I think it 
should return Period(2 years, 3 months).
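
With java.time.Period this conversion is short. A sketch, assuming the stored value is the total number of months; the class and method names are placeholders.

```java
import java.time.Period;

/** Placeholder class showing how total months can become years plus months. */
final class IntervalYearSketch {
  static Period fromTotalMonths(int totalMonths) {
    // normalized() moves whole groups of 12 months into the years field.
    return Period.ofMonths(totalMonths).normalized();
  }
}
```

For 27 months this gives a Period of 2 years and 3 months, whereas Period.ofMonths(27) alone reports 0 years and 27 months.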



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5420) Implement or remove getCurrentSizeInBytes in VariableWidthVector

2019-05-27 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5420:
-

 Summary: Implement or remove getCurrentSizeInBytes in 
VariableWidthVector
 Key: ARROW-5420
 URL: https://issues.apache.org/jira/browse/ARROW-5420
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Now VariableWidthVector#getCurrentSizeInBytes doesn't seem to have been 
implemented. We should implement it or just remove it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5259) Add option for ValueVector to allocate buffers with actual size

2019-05-04 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5259:
-

 Summary: Add option for ValueVector to allocate buffers with 
actual size
 Key: ARROW-5259
 URL: https://issues.apache.org/jira/browse/ARROW-5259
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Ji Liu
Assignee: Ji Liu


Currently, _BaseValueVector#computeCombinedBufferSize_ calculates the 
buffer size from _valueCount_ and _typeWidth_ and then allocates 
memory for dataBuffer and validityBuffer. However, it usually allocates more memory 
than is actually needed, because it calls 
_BaseAllocator.nextPowerOfTwo(bufferSize)_.

For example, an IntVector with valueCount = 1025 will allocate buffers of size 8192, 
almost double what is actually needed. So in some cases 
there is enough memory for actual use, but an OOM is thrown when the allocation 
is rounded up to the next power of 2. I think this problem is entirely 
avoidable.

Is it feasible to add an option for ValueVector to allocate the actual buffer size, 
rather than rounding it up to the next power of 2, to reduce memory allocation?
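
The IntVector numbers above can be reproduced with a small sketch. The rounding of both buffers to a multiple of 8 bytes is an assumption about the layout, and the helper names are made up; only the power-of-two effect is the point.

```java
/** Hypothetical recomputation of the requested and the allocated buffer size. */
final class PowerOfTwoSketch {
  /** Smallest power of two that is at least value (value assumed positive). */
  static int nextPowerOfTwo(int value) {
    int highest = Integer.highestOneBit(value);
    return highest == value ? value : highest << 1;
  }

  static int roundUp8(int size) {
    return (size + 7) / 8 * 8;
  }

  /** Assumed layout: data bytes plus validity bytes, each rounded up to 8. */
  static int requestedBytes(int valueCount, int typeWidth) {
    int dataBytes = roundUp8(valueCount * typeWidth);
    int validityBytes = roundUp8((valueCount + 7) / 8);
    return dataBytes + validityBytes;
  }
}
```

Under these assumptions, 1025 four-byte values need 4240 bytes, which the power-of-two rounding turns into 8192.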



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5225) [Java] Improve performance of BaseValueVector#getValidityBufferSizeFromCount

2019-04-27 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5225:
-

 Summary: [Java] Improve performance of 
BaseValueVector#getValidityBufferSizeFromCount
 Key: ARROW-5225
 URL: https://issues.apache.org/jira/browse/ARROW-5225
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Currently _BaseValueVector#getValidityBufferSizeFromCount_ and 
_BitVectorHelper#getValidityBufferSize_ use _Math.ceil_ to calculate the size, 
which is not efficient (there is a lot of unnecessary logic in 
_StrictMath#floorOrCeil_). Since valueCount is never less than 0, we could simply 
replace _Math.ceil_ with the following code:

_return valueCount % 8 > 0 ? valueCount / 8 + 1 : valueCount / 8_;
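
The equivalence of the two approaches for non-negative counts can be checked with a stand-alone sketch; the class and method names are invented for this example.

```java
/** Hypothetical comparison of floating-point and integer ceiling division by 8. */
final class CeilDivSketch {
  static int withMathCeil(int valueCount) {
    return (int) Math.ceil(valueCount / 8.0);
  }

  static int withIntegerMath(int valueCount) {
    return valueCount % 8 > 0 ? valueCount / 8 + 1 : valueCount / 8;
  }
}
```

For example, 9 bits need 2 bytes under both variants, and the integer version avoids the conversion to double entirely.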



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector

2019-04-27 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5224:
-

 Summary: [Java] Add APIs for supporting directly 
serialize/deserialize ValueVector
 Key: ARROW-5224
 URL: https://issues.apache.org/jira/browse/ARROW-5224
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


There is no API in MessageSerializer to directly serialize/deserialize a 
ValueVector. This feature would be useful for users who only use ValueVectors rather 
than ArrowRecordBatch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5207) [Java] add APIs to support vector

2019-04-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5207:
-

 Summary: [Java] add APIs to support vector 
 Key: ARROW-5207
 URL: https://issues.apache.org/jira/browse/ARROW-5207
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu


In some scenarios we would like to reuse a ValueVector to reduce creation 
overhead. This is very common in the shuffle stage: there is no need to create a 
ValueVector or realloc its buffers every time. Suppose the recordCount of the 
ValueVector and the capacity of its buffers are written to the stream; when we 
deserialize it, we can simply decide whether a realloc is needed based on 
dataLength.

My proposal is to add APIs to ValueVector to handle this logic; otherwise 
users who want to reuse vectors have to implement it themselves, which is not 
user-friendly. 

If you agree with this, I would like to take this ticket. Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5206) [JAVA]Add APIs in MessageSerializer to directly serialize/deserialize ArrowBuf

2019-04-23 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5206:
-

 Summary: [JAVA]Add APIs in MessageSerializer to directly 
serialize/deserialize ArrowBuf
 Key: ARROW-5206
 URL: https://issues.apache.org/jira/browse/ARROW-5206
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu


It seems there are no APIs to directly write an ArrowBuf to an OutputStream or read an 
ArrowBuf from an InputStream. Such APIs may be helpful when users use Vectors 
directly instead of RecordBatch; in that case, APIs to 
serialize/deserialize dataBuffer/validityBuffer/offsetBuffer are necessary.

I would like to work on this and make it my first contribution to Arrow. What 
do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)