[jira] [Created] (ARROW-8870) [Java] Make sure Netty Allocator has correct behavior with empty ArrowBuf
Ji Liu created ARROW-8870: - Summary: [Java] Make sure Netty Allocator has correct behavior with empty ArrowBuf Key: ARROW-8870 URL: https://issues.apache.org/jira/browse/ARROW-8870 Project: Apache Arrow Issue Type: Task Components: Java Reporter: Ji Liu Assignee: Ji Liu Include a test which ensures that the Netty Allocator returns an empty-behaving byte buffer when users allocate a zero byte buffer -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8305) [Java] Add ExtensionType support for visitor API
Ji Liu created ARROW-8305: - Summary: [Java] Add ExtensionType support for visitor API Key: ARROW-8305 URL: https://issues.apache.org/jira/browse/ARROW-8305 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu We introduced a visitor API for comparing vectors/ranges/types in ARROW-6211, but it does not support {{ExtensionType}} yet.
[jira] [Created] (ARROW-8171) Consider pre-allocating memory for fixed-width vector in Avro adapter iterator
Ji Liu created ARROW-8171: - Summary: Consider pre-allocating memory for fixed-width vector in Avro adapter iterator Key: ARROW-8171 URL: https://issues.apache.org/jira/browse/ARROW-8171 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu
[jira] [Created] (ARROW-8020) [Java] Implement vector validate functionality
Ji Liu created ARROW-8020: - Summary: [Java] Implement vector validate functionality Key: ARROW-8020 URL: https://issues.apache.org/jira/browse/ARROW-8020 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu On the C++ side, we already have array validation functionality, but there is no equivalent on the Java side. This issue is to implement it.
[jira] [Created] (ARROW-8019) [Java] Implement vector diff functionality
Ji Liu created ARROW-8019: - Summary: [Java] Implement vector diff functionality Key: ARROW-8019 URL: https://issues.apache.org/jira/browse/ARROW-8019 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu On the C++ side, we already have array diff functionality for equality checks and testing, which makes it easy to see differences between Arrays and reduces debugging time. It would be good to do something similar on the Java side to improve its testing facilities.
[jira] [Created] (ARROW-7713) [Java] TestLeak was put at the wrong location
Ji Liu created ARROW-7713: - Summary: [Java] TestLeak was put at the wrong location Key: ARROW-7713 URL: https://issues.apache.org/jira/browse/ARROW-7713 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu It seems {{TestLeak.java}} was put in the wrong place; we should move it into {{flight-core}}.
[jira] [Created] (ARROW-7546) [Java] Use new implementation to concat vectors values in batch
Ji Liu created ARROW-7546: - Summary: [Java] Use new implementation to concat vectors values in batch Key: ARROW-7546 URL: https://issues.apache.org/jira/browse/ARROW-7546 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Per the discussion at https://github.com/apache/arrow/pull/5945#discussion_r365108806. In ARROW-7284, we wrote a simple method to concatenate vectors. However, ARROW-7073 is about concatenating vector values efficiently; after that PR is merged, we should use the new implementation in {{ArrowReader}}.
[jira] [Created] (ARROW-7539) [Java] FieldVector getFieldBuffers API should not set reader/writer indices
Ji Liu created ARROW-7539: - Summary: [Java] FieldVector getFieldBuffers API should not set reader/writer indices Key: ARROW-7539 URL: https://issues.apache.org/jira/browse/ARROW-7539 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Per the discussion at [https://github.com/apache/arrow/pull/6133#discussion_r364906302]. The fact that we have reader/writer settings in {{getFieldBuffers}} is wrong. To clarify, {{getFieldBuffers}} is distinct from {{getBuffers}}. The former should be for getting access to underlying data for higher-performance algorithms. The latter is for sending the data over the wire. It seems we've mixed up the use of both. Currently {{VectorUnloader}} uses {{getFieldBuffers}} to create {{ArrowRecordBatch}}, which is why we keep setting writer/reader indices in {{getFieldBuffers}}; it should use {{getBuffers}} instead.
[jira] [Created] (ARROW-7490) [Java] Avro converter should convert attributes and props to FieldType metadata
Ji Liu created ARROW-7490: - Summary: [Java] Avro converter should convert attributes and props to FieldType metadata Key: ARROW-7490 URL: https://issues.apache.org/jira/browse/ARROW-7490 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently in the Avro converter, some attributes such as “name” and “size” are used when creating vectors, while others are discarded. Named types like Record, Enum and Fixed may have attributes like “doc” and “aliases”, which should be kept in the metadata for potential further use. Besides, properties are also not converted properly in some cases.
[jira] [Created] (ARROW-7472) [Java] Fix some incorrect behavior in UnionListWriter
Ji Liu created ARROW-7472: - Summary: [Java] Fix some incorrect behavior in UnionListWriter Key: ARROW-7472 URL: https://issues.apache.org/jira/browse/ARROW-7472 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently the {{UnionListWriter/UnionFixedSizeListWriter}} {{getField/close}} APIs seem incorrect.
[jira] [Created] (ARROW-7467) [Java] ComplexCopier does incorrect copy for Map nullable info
Ji Liu created ARROW-7467: - Summary: [Java] ComplexCopier does incorrect copy for Map nullable info Key: ARROW-7467 URL: https://issues.apache.org/jira/browse/ARROW-7467 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu The {{MapVector}} and its 'value' vector are nullable, while its {{structVector}} and 'key' vector are non-nullable. However, the {{MapVector}} generated by {{ComplexCopier}} has all fields nullable, which is not correct.
[jira] [Created] (ARROW-7425) [Java] PromotableWriter support writing FixedSizeList type data
Ji Liu created ARROW-7425: - Summary: [Java] PromotableWriter support writing FixedSizeList type data Key: ARROW-7425 URL: https://issues.apache.org/jira/browse/ARROW-7425 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu We introduced a writer API for {{FixedSizeListVector}} via ARROW-6079, but {{PromotableWriter}}’s support for it is incomplete. For example, using {{UnionListWriter}} we could simply write {{List}} type data, but for {{List}} or {{FixedSizeList}} it doesn’t work. This issue is to enhance the {{PromotableWriter}} support for the {{FixedSizeList}} type and to add tests that verify the cases mentioned above.
[jira] [Created] (ARROW-7406) [Java] NonNullableStructVector#hashCode should pass hasher to child vectors
Ji Liu created ARROW-7406: - Summary: [Java] NonNullableStructVector#hashCode should pass hasher to child vectors Key: ARROW-7406 URL: https://issues.apache.org/jira/browse/ARROW-7406 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu This bug was introduced by ARROW-6866: the hasher parameter of hashCode(int index, {{ArrowBufHasher}} hasher) is ignored, so the child vectors calculate their hashCode with the default hasher, which is not correct. It should be fixed by passing the hasher to the child vectors when calculating hashCode.
[jira] [Created] (ARROW-7405) [Java] ListVector isEmpty API is incorrect
Ji Liu created ARROW-7405: - Summary: [Java] ListVector isEmpty API is incorrect Key: ARROW-7405 URL: https://issues.apache.org/jira/browse/ARROW-7405 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently the {{isEmpty}} API always returns false in {{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not override this method. This leads to incorrect results. For example, a {{ListVector}} with data [1,2], null, [], [5,6] should yield [false, false, true, false] with this API, but it currently returns [false, false, false, false].
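The expected per-slot results can be illustrated with a minimal sketch, assuming a list column is represented by a validity array plus an offsets array. The class and method names here are hypothetical and are not Arrow's API; following the expectation stated in this issue, a null slot is reported as not empty.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a list vector: validity flags plus offsets into a child array. */
final class ListIsEmptySketch {
  private final boolean[] valid; // valid[i] == false means slot i is null
  private final int[] offsets;   // slot i spans child indices [offsets[i], offsets[i + 1])

  ListIsEmptySketch(boolean[] valid, int[] offsets) {
    if (offsets.length != valid.length + 1) {
      throw new IllegalArgumentException("offsets must have valid.length + 1 entries");
    }
    this.valid = valid.clone();
    this.offsets = offsets.clone();
  }

  /** A slot is empty only if it is non-null and its start and end offsets coincide. */
  boolean isEmpty(int index) {
    if (!valid[index]) {
      return false; // assumption taken from the issue text: a null slot is not "empty"
    }
    return offsets[index] == offsets[index + 1];
  }

  /** Evaluates isEmpty for every slot. */
  boolean[] isEmptyAll() {
    boolean[] result = new boolean[valid.length];
    for (int i = 0; i < valid.length; i++) {
      result[i] = isEmpty(i);
    }
    return result;
  }

  public static void main(String[] args) {
    // Slots: [1,2], null, [], [5,6]
    ListIsEmptySketch vector = new ListIsEmptySketch(
        new boolean[] {true, false, true, true},
        new int[] {0, 2, 2, 2, 4});
    System.out.println(Arrays.toString(vector.isEmptyAll()));
  }
}
```

The point of the sketch is only that emptiness has to be derived from the offsets of each slot rather than returned as a constant.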
[jira] [Created] (ARROW-7264) [Java] RangeEqualsVisitor type check is not correct
Ji Liu created ARROW-7264: - Summary: [Java] RangeEqualsVisitor type check is not correct Key: ARROW-7264 URL: https://issues.apache.org/jira/browse/ARROW-7264 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 0.15.1 Reporter: Ji Liu Assignee: Ji Liu Currently {{RangeEqualsVisitor}} generally checks the type only once and keeps the result to avoid repeated type checking, see {code:java} typeCompareResult = left.getField().getType().equals(right.getField().getType()); {code} This only compares {{ArrowType}}, which may cause unexpected behavior for complex types. For example, {{List}} and {{List}} would be considered type-equal because their child fields are not taken into account. We should compare the Field here instead. To make this more extensible, we use {{TypeEqualsVisitor}} to compare the Field, so that one can choose whether to check names or metadata as well. Also provide a test for ListVector to validate this change.
[jira] [Created] (ARROW-7259) [Java] Support subfield encoder use different hasher
Ji Liu created ARROW-7259: - Summary: [Java] Support subfield encoder use different hasher Key: ARROW-7259 URL: https://issues.apache.org/jira/browse/ARROW-7259 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently {{ListSubFieldEncoder/StructSubFieldEncoder}} use the default hasher for calculating hashCode. This issue enables them to use a different hasher, or even a user-defined hasher, for their own use cases, just like {{DictionaryEncoder}} does.
[jira] [Created] (ARROW-7026) [Java] Remove assertions in MessageSerializer/vector/writer/reader
Ji Liu created ARROW-7026: - Summary: [Java] Remove assertions in MessageSerializer/vector/writer/reader Key: ARROW-7026 URL: https://issues.apache.org/jira/browse/ARROW-7026 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently assertions exist in many classes like {{MessageSerializer/JsonReader/JsonWriter/ListVector}} etc. i. If the JVM arguments that enable assertions are not specified, these checks will be skipped, which can lead to problems. ii. Java errors produced by failed assertions are not caught by traditional catch clauses. To fix this, use {{Preconditions}} instead.
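The difference between the two styles of check can be sketched as follows. The helper is a hypothetical stand-in written in the spirit of a Preconditions utility; it is not Arrow's actual class, and the caller method is invented for illustration.

```java
/** Hypothetical helper in the spirit of a Preconditions utility; not Arrow's actual class. */
final class PreconditionsSketch {
  private PreconditionsSketch() {}

  /** Always evaluated, unlike an assert statement, which is skipped unless the JVM runs with -ea. */
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  /** Illustrative caller: validates a length before it is used. */
  static int requireNonNegativeLength(int length) {
    checkArgument(length >= 0, "length must be non-negative, got " + length);
    return length;
  }

  public static void main(String[] args) {
    try {
      requireNonNegativeLength(-1);
    } catch (IllegalArgumentException e) {
      // A failed assert would raise AssertionError, which is an Error and is
      // therefore not caught by a catch (Exception e) clause.
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Because the check throws an ordinary runtime exception, it behaves the same whether or not assertions are enabled.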
[jira] [Created] (ARROW-7021) [Java] UnionFixedSizeListWriter decimal type should check writer index
Ji Liu created ARROW-7021: - Summary: [Java] UnionFixedSizeListWriter decimal type should check writer index Key: ARROW-7021 URL: https://issues.apache.org/jira/browse/ARROW-7021 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu {{UnionFixedSizeListWriter}} should check the writer index for the decimal type (just as for other types) to ensure that the values written do not exceed listSize. Otherwise, the writer may quietly continue to write data into its underlying vector even when writer.idx() > listSize * index.
[jira] [Created] (ARROW-6912) [Java] Extract a common base class for avro converter consumers
Ji Liu created ARROW-6912: - Summary: [Java] Extract a common base class for avro converter consumers Key: ARROW-6912 URL: https://issues.apache.org/jira/browse/ARROW-6912 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently the Avro converter consumers share some common variables and methods; this duplication could be eliminated by extracting a common base class.
[jira] [Created] (ARROW-6898) [Java] Fix potential memory leak in ArrowWriter and several test classes
Ji Liu created ARROW-6898: - Summary: [Java] Fix potential memory leak in ArrowWriter and several test classes Key: ARROW-6898 URL: https://issues.apache.org/jira/browse/ARROW-6898 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu ARROW-6040 fixed the problem that dictionary entries were required in IPC streams even when empty; dictionaries are now written only when there is at least one batch. As a result, if we write an empty stream and invoke ArrowWriter#close, the dictionaries are never closed, which leaks memory (they are only closed after the write operation). This is really hard to debug; the problem was found by {{TestArrowReaderWriter#testEmptyStreamInStreamingIPC}} when I tried to close the allocator after the test. Besides, several test classes have potential memory leaks because they do not close the allocator/vector/buf etc.
[jira] [Created] (ARROW-6889) [Java] ComplexCopier enable FixedSizeList type & fix RangeEqualsVisitor StackOverFlow
Ji Liu created ARROW-6889: - Summary: [Java] ComplexCopier enable FixedSizeList type & fix RangeEqualsVisitor StackOverFlow Key: ARROW-6889 URL: https://issues.apache.org/jira/browse/ARROW-6889 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu i. Enable {{ComplexCopier}} to copy {{FixedSizeListVector}} values and add related tests. ii. Fix the StackOverFlow in {{RangeEqualsVisitor#compareFixedSizeListVectors}}.
[jira] [Created] (ARROW-6871) [Java] Enhance TransferPair related parameters check and tests
Ji Liu created ARROW-6871: - Summary: [Java] Enhance TransferPair related parameters check and tests Key: ARROW-6871 URL: https://issues.apache.org/jira/browse/ARROW-6871 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu {{TransferPair}} related parameter checks in different classes have potential problems: i. {{copyValueSafe}} does not check the from index; if from > valueCount, no error is shown. ii. {{splitAndTransfer}} has no index checks in classes like {{VarcharVector}}. iii. The {{splitAndTransfer}} index check in classes like UnionVector is not correct (Preconditions.checkArgument(startIndex + length <= valueCount)); the parameters should be checked separately. iv. Some assert usages should be replaced with {{Preconditions}}. v. More unit tests should be added to cover corner cases.
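Point iii can be made concrete with a small sketch of why a single combined check on startIndex + length is not enough. The class, method names and exact bounds below are assumptions chosen for illustration, not the checks that were eventually committed.

```java
/** Hypothetical range validation for a split operation; names and bounds are illustrative. */
final class SplitRangeCheckSketch {
  private SplitRangeCheckSketch() {}

  /** The combined condition quoted in the issue, returned as a boolean for comparison. */
  static boolean combinedCheckOnly(int startIndex, int length, int valueCount) {
    return startIndex + length <= valueCount;
  }

  /** Checks each parameter separately and avoids int overflow in the range test. */
  static void checkSplitRange(int startIndex, int length, int valueCount) {
    if (startIndex < 0 || startIndex > valueCount) {
      throw new IllegalArgumentException(
          "startIndex " + startIndex + " is outside [0, " + valueCount + "]");
    }
    // valueCount - startIndex cannot overflow because 0 <= startIndex <= valueCount here.
    if (length < 0 || length > valueCount - startIndex) {
      throw new IllegalArgumentException(
          "length " + length + " is outside [0, " + (valueCount - startIndex) + "]");
    }
  }

  public static void main(String[] args) {
    // A negative start index slips through the combined check.
    System.out.println(combinedCheckOnly(-2, 3, 5));
    // So does a sum that overflows int.
    System.out.println(combinedCheckOnly(Integer.MAX_VALUE, 1, 5));
    try {
      checkSplitRange(-2, 3, 5);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Checking the length against valueCount - startIndex instead of adding the two values keeps the arithmetic within int range.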
[jira] [Created] (ARROW-6853) [Java] Support vector and dictionary encoder use different hasher for calculating hashCode
Ji Liu created ARROW-6853: - Summary: [Java] Support vector and dictionary encoder use different hasher for calculating hashCode Key: ARROW-6853 URL: https://issues.apache.org/jira/browse/ARROW-6853 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu The hasher interface was introduced in ARROW-5898 and now has two implementations ({{MurmurHasher}} and {{SimpleHasher}}); there could be more in the future. Currently {{ValueVector#hashCode}} and {{DictionaryHashTable}} only use {{SimpleHasher}} for calculating hashCode. This issue enables them to use a different hasher, or even a user-defined hasher, for their own use cases.
[jira] [Created] (ARROW-6850) [Java] Jdbc converter support Null type
Ji Liu created ARROW-6850: - Summary: [Java] Jdbc converter support Null type Key: ARROW-6850 URL: https://issues.apache.org/jira/browse/ARROW-6850 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu java.sql.Types.NULL is not supported yet because there was no NullVector in the Java code before. It can be implemented after ARROW-1638 (IPC roundtrip for null type) is merged.
[jira] [Created] (ARROW-6721) [JAVA] Avro adapter benchmark only runs once in JMH
Ji Liu created ARROW-6721: - Summary: [JAVA] Avro adapter benchmark only runs once in JMH Key: ARROW-6721 URL: https://issues.apache.org/jira/browse/ARROW-6721 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu The current {{AvroAdapterBenchmark}} actually runs only once during JMH evaluation, because the decoder is consumed by the first invocation and subsequent invocations return immediately. To solve this, we use {{BinaryDecoder}} explicitly in the benchmark and reset its inner stream at the start of each invocation of the test method.
[jira] [Created] (ARROW-6710) [Java] Add JDBC adapter test to cover cases which contains some null values
Ji Liu created ARROW-6710: - Summary: [Java] Add JDBC adapter test to cover cases which contains some null values Key: ARROW-6710 URL: https://issues.apache.org/jira/browse/ARROW-6710 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu The current JDBC adapter tests only cover the cases where values are all non-null or all null. Cases where the ResultSet has some null values are not covered (ARROW-6709).
[jira] [Created] (ARROW-6662) [Java] Implement equals/approxEquals API for VectorSchemaRoot
Ji Liu created ARROW-6662: - Summary: [Java] Implement equals/approxEquals API for VectorSchemaRoot Key: ARROW-6662 URL: https://issues.apache.org/jira/browse/ARROW-6662 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu With the newly added visitor APIs (ARROW-6211), we can now implement equals/approxEquals for VectorSchemaRoot.
[jira] [Created] (ARROW-6661) [Java] Implement APIs like slice to enhance VectorSchemaRoot
Ji Liu created ARROW-6661: - Summary: [Java] Implement APIs like slice to enhance VectorSchemaRoot Key: ARROW-6661 URL: https://issues.apache.org/jira/browse/ARROW-6661 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently the Java implementation has no APIs like slice for record batches, unlike C++/Python. This issue is to implement slice/getVector/addVector/removeVector.
[jira] [Created] (ARROW-6600) [Java] Implement dictionary-encoded subfields for Union type
Ji Liu created ARROW-6600: - Summary: [Java] Implement dictionary-encoded subfields for Union type Key: ARROW-6600 URL: https://issues.apache.org/jira/browse/ARROW-6600 Project: Apache Arrow Issue Type: Sub-task Components: Java Reporter: Ji Liu Assignee: Ji Liu Implement dictionary-encoded subfields for the {{Union}} type. Each child vector may or may not be encodable. Meanwhile, extract common logic into {{DictionaryEncoder}} and refactor List subfield encoding to keep it consistent with the {{Struct/Union}} types.
[jira] [Created] (ARROW-6472) [Java] ValueVector#accept may have potential cast exception
Ji Liu created ARROW-6472: - Summary: [Java] ValueVector#accept may have potential cast exception Key: ARROW-6472 URL: https://issues.apache.org/jira/browse/ARROW-6472 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Per the discussion at [https://github.com/apache/arrow/pull/5195#issuecomment-528425302]. We may use the API this way: {code:java} RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2); vector3.accept(visitor, range){code} If vector1/vector2 are, say, {{StructVector}}s and vector3 is an {{IntVector}}, things can go bad: we'll use {{compareBaseFixedWidthVectors()}} and do wrong type-casts for vector1/vector2.
[jira] [Created] (ARROW-6464) [Java] Refactor FixedSizeListVector#splitAndTransfer with slice API
Ji Liu created ARROW-6464: - Summary: [Java] Refactor FixedSizeListVector#splitAndTransfer with slice API Key: ARROW-6464 URL: https://issues.apache.org/jira/browse/ARROW-6464 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently {{FixedSizeListVector#splitAndTransfer}} actually uses {{copyValueSafe}}, which copies memory; we should use the slice API instead. Meanwhile, {{splitAndTransfer}} in all classes should perform the index check at the beginning.
[jira] [Created] (ARROW-6460) [Java] Add unit test for large avro data
Ji Liu created ARROW-6460: - Summary: [Java] Add unit test for large avro data Key: ARROW-6460 URL: https://issues.apache.org/jira/browse/ARROW-6460 Project: Apache Arrow Issue Type: Sub-task Components: Java Reporter: Ji Liu Assignee: Ji Liu To avoid OOM, we implemented an iterator API in ARROW-6220. This issue is to add tests with a large fake data set (say 6MM rows in the JDBC adapter test) and ensure that no OOMs occur.
[jira] [Created] (ARROW-6452) [Java] Override ValueVector toString() method
Ji Liu created ARROW-6452: - Summary: [Java] Override ValueVector toString() method Key: ARROW-6452 URL: https://issues.apache.org/jira/browse/ARROW-6452 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently the cpp code {{Array#ToString}} returns a human-readable string like: [ 1, 2, 3 ] But the Java {{ValueVector}} does not implement {{toString()}} this way yet.
[jira] [Created] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type
Ji Liu created ARROW-6401: - Summary: [Java] Implement dictionary-encoded subfields for Struct type Key: ARROW-6401 URL: https://issues.apache.org/jira/browse/ARROW-6401 Project: Apache Arrow Issue Type: Sub-task Components: Java Reporter: Ji Liu Assignee: Ji Liu Implement dictionary-encoded subfields for the Struct type. Each child vector will have a dictionary; the dictionary vector is of struct type and holds all dictionaries.
[jira] [Created] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type
Ji Liu created ARROW-6356: - Summary: [Java] Avro adapter implement Enum type and nested Record type Key: ARROW-6356 URL: https://issues.apache.org/jira/browse/ARROW-6356 Project: Apache Arrow Issue Type: Sub-task Reporter: Ji Liu Assignee: Ji Liu Implement conversion of the Avro {{Enum}} type. Convert the nested Avro {{Record}} type to an Arrow {{StructVector}}.
[jira] [Created] (ARROW-6311) [Java] Make ApproxEqualsVisitor accept DiffFunction to make it more flexible
Ji Liu created ARROW-6311: - Summary: [Java] Make ApproxEqualsVisitor accept DiffFunction to make it more flexible Key: ARROW-6311 URL: https://issues.apache.org/jira/browse/ARROW-6311 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently {{ApproxEqualsVisitor}} accepts a single epsilon for both float and double comparisons, and the difference calculation is always {{Math.abs}}(f1-f2). For some cases like {{Validator}} this is not very suitable because: i. it has different epsilon values for float/double ii. its difference function is not Math.abs(f1-f2) To resolve these, make this visitor accept both float/double epsilons and diff functions.
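The proposal can be sketched as a comparator that takes separate epsilons and pluggable difference functions. All names below (the interfaces, the constants and the class itself) are hypothetical and only illustrate the idea; they are not the API of {{ApproxEqualsVisitor}}.

```java
/** Hypothetical approximate comparator with separate epsilons and pluggable diff functions. */
final class ApproxEqualsSketch {
  @FunctionalInterface
  interface FloatDiff {
    boolean approxEquals(float a, float b, float epsilon);
  }

  @FunctionalInterface
  interface DoubleDiff {
    boolean approxEquals(double a, double b, double epsilon);
  }

  /** Absolute difference, i.e. the Math.abs(f1 - f2) behavior described in the issue. */
  static final FloatDiff ABS_FLOAT = (a, b, eps) -> Math.abs(a - b) <= eps;
  static final DoubleDiff ABS_DOUBLE = (a, b, eps) -> Math.abs(a - b) <= eps;
  /** One possible alternative: difference relative to the larger magnitude. */
  static final DoubleDiff RELATIVE_DOUBLE =
      (a, b, eps) -> Math.abs(a - b) <= eps * Math.max(Math.abs(a), Math.abs(b));

  private final float floatEpsilon;
  private final double doubleEpsilon;
  private final FloatDiff floatDiff;
  private final DoubleDiff doubleDiff;

  ApproxEqualsSketch(
      float floatEpsilon, double doubleEpsilon, FloatDiff floatDiff, DoubleDiff doubleDiff) {
    this.floatEpsilon = floatEpsilon;
    this.doubleEpsilon = doubleEpsilon;
    this.floatDiff = floatDiff;
    this.doubleDiff = doubleDiff;
  }

  boolean floatsEqual(float[] left, float[] right) {
    if (left.length != right.length) {
      return false;
    }
    for (int i = 0; i < left.length; i++) {
      if (!floatDiff.approxEquals(left[i], right[i], floatEpsilon)) {
        return false;
      }
    }
    return true;
  }

  boolean doublesEqual(double[] left, double[] right) {
    if (left.length != right.length) {
      return false;
    }
    for (int i = 0; i < left.length; i++) {
      if (!doubleDiff.approxEquals(left[i], right[i], doubleEpsilon)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    ApproxEqualsSketch relative =
        new ApproxEqualsSketch(0.01f, 1e-3, ABS_FLOAT, RELATIVE_DOUBLE);
    System.out.println(relative.doublesEqual(new double[] {1000.0}, new double[] {1000.5}));
  }
}
```

Passing the functions in at construction time lets a caller such as a validator choose its own tolerance rule without subclassing.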
[jira] [Created] (ARROW-6308) [Java] Support write interleaved dictionaries and batches in IPC stream
Ji Liu created ARROW-6308: - Summary: [Java] Support write interleaved dictionaries and batches in IPC stream Key: ARROW-6308 URL: https://issues.apache.org/jira/browse/ARROW-6308 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Per the discussions in the following threads, and as the spec ([http://arrow.apache.org/docs/format/IPC.html#streaming-format]) describes, as long as a record batch doesn't reference a dictionary, they can be interleaved. [https://github.com/apache/arrow/pull/4960] [https://github.com/apache/arrow/pull/5146] Currently it is possible to parse interleaved dictionaries and batches (via ARROW-6040), but it’s impossible to write data in this format. This issue records the problem; it should be addressed after a mailing list discussion.
[jira] [Created] (ARROW-6289) [Java] Add empty() in UnionVector to create instance
Ji Liu created ARROW-6289: - Summary: [Java] Add empty() in UnionVector to create instance Key: ARROW-6289 URL: https://issues.apache.org/jira/browse/ARROW-6289 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently all complex type vectors except {{UnionVector}} have an {{empty}}() API to create an instance.
[jira] [Created] (ARROW-6288) [Java] Implement TypeEqualsVisitor comparing vector type equals considering names and metadata
Ji Liu created ARROW-6288: - Summary: [Java] Implement TypeEqualsVisitor comparing vector type equals considering names and metadata Key: ARROW-6288 URL: https://issues.apache.org/jira/browse/ARROW-6288 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently when we compare ranges/vectors for equality, we first compare the vector {{Field}} using its equals method, which makes it hard to specify whether names or metadata should be compared. Implementing a {{TypeEqualsVisitor}} will make type comparisons more flexible, like the cpp implementation does [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc#L712]
[jira] [Created] (ARROW-6265) [Java] Avro adapter implement Array/Map/Fixed type
Ji Liu created ARROW-6265: - Summary: [Java] Avro adapter implement Array/Map/Fixed type Key: ARROW-6265 URL: https://issues.apache.org/jira/browse/ARROW-6265 Project: Apache Arrow Issue Type: Sub-task Components: Java Reporter: Ji Liu Assignee: Ji Liu Support Array/Map/Fixed type in avro adapter.
[jira] [Created] (ARROW-6250) [Java] Implement ApproxEqualsVisitor comparing approx for floating point
Ji Liu created ARROW-6250: - Summary: [Java] Implement ApproxEqualsVisitor comparing approx for floating point Key: ARROW-6250 URL: https://issues.apache.org/jira/browse/ARROW-6250 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu We have already implemented {{RangeEqualsVisitor/VectorEqualsVisitor}} for comparing ranges/vectors, and ARROW-6211 was created to make {{ValueVector}} work with a generic visitor. We should also implement {{ApproxEqualsVisitor}} to compare floating point values, just like cpp does [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc]
[jira] [Created] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper
Ji Liu created ARROW-6249: - Summary: [Java] Remove useless class ByteArrayWrapper Key: ARROW-6249 URL: https://issues.apache.org/jira/browse/ARROW-6249 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu This class was introduced into the encoding part to compare byte[] values for equality. Since we now compare values/vectors for equality with the visitor API newly added by ARROW-6022 instead of comparing {{getObject}} results, this class is no longer needed.
[jira] [Created] (ARROW-6234) [Java] ListVector hashCode() is not correct
Ji Liu created ARROW-6234: - Summary: [Java] ListVector hashCode() is not correct Key: ARROW-6234 URL: https://issues.apache.org/jira/browse/ARROW-6234 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu The current implementation is not correct: {code:java} for (int i = start; i < end; i++) { hash = 31 * vector.hashCode(i); } {code} It should be something like: {code:java} hash = 31 * hash + vector.hashCode(i);{code}
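The effect of the missing accumulation can be seen in a self-contained sketch that works on plain arrays of element hashes. The class and method names are hypothetical, and the initial value of 0 is an assumption, since the issue does not show how {{hash}} is initialized.

```java
/** Hypothetical illustration of the two loops from the issue, applied to element hashes. */
final class ListHashSketch {
  private ListHashSketch() {}

  /** Mirrors the buggy loop: every iteration overwrites the hash, so only the last element counts. */
  static int brokenHash(int[] elementHashes) {
    int hash = 0; // assumed initial value
    for (int elementHash : elementHashes) {
      hash = 31 * elementHash;
    }
    return hash;
  }

  /** Mirrors the proposed fix: the running hash is folded into every step. */
  static int fixedHash(int[] elementHashes) {
    int hash = 0; // assumed initial value
    for (int elementHash : elementHashes) {
      hash = 31 * hash + elementHash;
    }
    return hash;
  }

  public static void main(String[] args) {
    int[] a = {1, 2, 3};
    int[] b = {9, 9, 3};
    // Same last element, so the broken variant cannot tell the two lists apart.
    System.out.println(brokenHash(a) == brokenHash(b));
    System.out.println(fixedHash(a) == fixedHash(b));
  }
}
```

With the broken loop, any two lists that end in the same element collide, which is what makes the original implementation a poor hash for list values.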
[jira] [Created] (ARROW-6218) [Java] Add UINT type test in integration to avoid potential overflow
Ji Liu created ARROW-6218: - Summary: [Java] Add UINT type test in integration to avoid potential overflow Key: ARROW-6218 URL: https://issues.apache.org/jira/browse/ARROW-6218 Project: Apache Arrow Issue Type: Test Components: Java Reporter: Ji Liu Assignee: Ji Liu As per the discussion at [https://github.com/apache/arrow/pull/5002]. For UINT types, when writing/reading JSON data in the integration test, the data type is widened (i.e. Long->BigInteger, Int->Long) to avoid potential overflow. For example, for UINT8 the write side and read side code look like this: {code:java} case UINT8: generator.writeNumber(UInt8Vector.getNoOverflow(buffer, index)); break;{code} {code:java} BigInteger value = parser.getBigIntegerValue(); buf.writeLong(value.longValue()); {code} We should add a test to guard against overflow in the data transfer process.
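The widening described above can be reproduced with JDK classes alone. This is a minimal sketch using hypothetical helper names; it does not use Arrow's vectors or the integration test's JSON code, only the standard unsigned conversion methods.

```java
import java.math.BigInteger;

/** Hypothetical helpers showing how unsigned values are widened so they do not overflow. */
final class UnsignedWideningSketch {
  private UnsignedWideningSketch() {}

  /** Int -> Long: interprets the 32 raw bits as an unsigned value. */
  static long uint32ToLong(int rawBits) {
    return Integer.toUnsignedLong(rawBits);
  }

  /** Long -> BigInteger: interprets the 64 raw bits as an unsigned value. */
  static BigInteger uint64ToBigInteger(long rawBits) {
    return new BigInteger(Long.toUnsignedString(rawBits));
  }

  /** Read side: BigInteger.longValue() keeps the low 64 bits, restoring the raw bit pattern. */
  static long bigIntegerToUint64Bits(BigInteger value) {
    return value.longValue();
  }

  public static void main(String[] args) {
    long maxUint64Bits = -1L; // all 64 bits set, the largest unsigned 64-bit value
    BigInteger widened = uint64ToBigInteger(maxUint64Bits);
    System.out.println(widened);
    System.out.println(bigIntegerToUint64Bits(widened) == maxUint64Bits);
    System.out.println(uint32ToLong(-1));
  }
}
```

A round trip through the widened type and back should leave the raw bits unchanged, which is the property such a test would assert for boundary values.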
[jira] [Created] (ARROW-6200) [Java] Method getBufferSizeFor in BaseRepeatedValueVector/ListVector not correct
Ji Liu created ARROW-6200: - Summary: [Java] Method getBufferSizeFor in BaseRepeatedValueVector/ListVector not correct Key: ARROW-6200 URL: https://issues.apache.org/jira/browse/ARROW-6200 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently, {{getBufferSizeFor}} in {{BaseRepeatedValueVector}} is implemented as below: {code:java} if (valueCount == 0) { return 0; } return ((valueCount + 1) * OFFSET_WIDTH) + vector.getBufferSizeFor(valueCount); {code} Here vector.getBufferSizeFor(valueCount) does not seem right; it should be {code:java} int innerVectorValueCount = offsetBuffer.getInt(valueCount * OFFSET_WIDTH); vector.getBufferSizeFor(innerVectorValueCount) {code} ListVector has the same problem.
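The difference between the two computations can be checked with a small sketch over a plain offsets array. It assumes a child of fixed 4-byte elements and ignores validity buffers; the class, constants and method names are hypothetical stand-ins rather than Arrow's implementation.

```java
/** Hypothetical size calculation for a list column backed by an offsets array. */
final class ListBufferSizeSketch {
  static final int OFFSET_WIDTH = 4;  // bytes per offset entry
  static final int ELEMENT_WIDTH = 4; // assumption: fixed-width int child, validity ignored

  private ListBufferSizeSketch() {}

  /** Stand-in for the child vector's getBufferSizeFor. */
  static int innerBufferSizeFor(int innerValueCount) {
    return innerValueCount * ELEMENT_WIDTH;
  }

  /** Mirrors the current code: sizes the child by the number of lists. */
  static int naiveBufferSizeFor(int[] offsets, int valueCount) {
    if (valueCount == 0) {
      return 0;
    }
    return (valueCount + 1) * OFFSET_WIDTH + innerBufferSizeFor(valueCount);
  }

  /** Mirrors the proposal: sizes the child by the end offset of the last list. */
  static int correctedBufferSizeFor(int[] offsets, int valueCount) {
    if (valueCount == 0) {
      return 0;
    }
    int innerValueCount = offsets[valueCount];
    return (valueCount + 1) * OFFSET_WIDTH + innerBufferSizeFor(innerValueCount);
  }

  public static void main(String[] args) {
    // Two lists, [1,2,3] and [4,5], so the child holds five elements.
    int[] offsets = {0, 3, 5};
    System.out.println(naiveBufferSizeFor(offsets, 2));
    System.out.println(correctedBufferSizeFor(offsets, 2));
  }
}
```

The naive version undercounts whenever the lists hold more child elements than there are lists, and overcounts when most lists are empty.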
[jira] [Created] (ARROW-6199) [Java] Avro adapter avoid potential resource leak.
Ji Liu created ARROW-6199: - Summary: [Java] Avro adapter avoid potential resource leak. Key: ARROW-6199 URL: https://issues.apache.org/jira/browse/ARROW-6199 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently, the avro consumer interface has no close API, which may cause resource leaks such as {{AvroBytesConsumer#cacheBuffer}}. To resolve this, make the consumer extend {{AutoCloseable}} and create {{CompositeAvroConsumer}} to encompass the consume and close logic.
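The shape of that design can be sketched with plain JDK types. Everything below is hypothetical: the consumer interface, the buffering consumer and the composite are illustrative stand-ins and not the classes of the Avro adapter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical composite that forwards values to child consumers and closes all of them. */
final class CompositeConsumerSketch implements AutoCloseable {
  /** Hypothetical consumer; close() is narrowed so that it throws no checked exception. */
  interface Consumer extends AutoCloseable {
    void consume(int value);

    @Override
    void close();
  }

  /** Consumer that owns a growable cache, standing in for a resource that must be released. */
  static final class BufferingConsumer implements Consumer {
    private int[] cache = new int[16];
    private int count;
    private boolean closed;

    @Override
    public void consume(int value) {
      if (closed) {
        throw new IllegalStateException("consumer is closed");
      }
      if (count == cache.length) {
        cache = Arrays.copyOf(cache, count * 2);
      }
      cache[count++] = value;
    }

    @Override
    public void close() {
      cache = null; // release the cached storage
      closed = true;
    }

    boolean isClosed() {
      return closed;
    }

    int count() {
      return count;
    }
  }

  private final List<Consumer> consumers;

  CompositeConsumerSketch(List<? extends Consumer> consumers) {
    this.consumers = new ArrayList<>(consumers);
  }

  /** Feeds one row: value i goes to consumer i. */
  void consume(int[] row) {
    if (row.length != consumers.size()) {
      throw new IllegalArgumentException("row width does not match consumer count");
    }
    for (int i = 0; i < row.length; i++) {
      consumers.get(i).consume(row[i]);
    }
  }

  @Override
  public void close() {
    for (Consumer consumer : consumers) {
      consumer.close();
    }
  }

  public static void main(String[] args) {
    BufferingConsumer first = new BufferingConsumer();
    BufferingConsumer second = new BufferingConsumer();
    try (CompositeConsumerSketch composite =
        new CompositeConsumerSketch(Arrays.asList(first, second))) {
      composite.consume(new int[] {1, 2});
    }
    System.out.println(first.isClosed() && second.isClosed());
  }
}
```

Putting close on the composite means a caller needs only one try-with-resources block to release every child consumer.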
[jira] [Created] (ARROW-6194) [Java] Make DictionaryEncoder non-static making it easy to extend and reuse
Ji Liu created ARROW-6194: - Summary: [Java] Make DictionaryEncoder non-static making it easy to extend and reuse Key: ARROW-6194 URL: https://issues.apache.org/jira/browse/ARROW-6194 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu As discussed in [https://github.com/apache/arrow/pull/4994], the current static DictionaryEncoder has some limitations for extension and reuse. Slightly change the APIs and migrate the static methods to an object-based approach.
[jira] [Created] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API
Ji Liu created ARROW-6175: - Summary: [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API Key: ARROW-6175 URL: https://issues.apache.org/jira/browse/ARROW-6175 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu i. Currently {{MapVector}} extends {{ListVector}} and its {{getMinorType}} returns the wrong {{MinorType}}. ii. {{AbstractContainerVector}} currently has only {{addOrGetList}}, {{addOrGetUnion}} and {{addOrGetStruct}}, which do not cover all complex types, such as {{MapVector}} and {{FixedSizeListVector}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors
Ji Liu created ARROW-6160: - Summary: [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors Key: ARROW-6160 URL: https://issues.apache.org/jira/browse/ARROW-6160 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct-type child vectors are traversed recursively to collect primitive vectors; other complex types such as {{ListVector}} and {{UnionVector}} are treated as primitive and returned directly. For example, for Struct(List(Int), Struct(Int, Varchar)), {{getPrimitiveVectors}} should return {{[IntVector, IntVector, VarCharVector]}} instead of {{[ListVector, IntVector, VarCharVector]}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly
Ji Liu created ARROW-6145: - Summary: [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly Key: ARROW-6145 URL: https://issues.apache.org/jira/browse/ARROW-6145 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu While working on other items, I found that a {{UnionVector}} created by {{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} does not keep field type info properly. For example, if we set metadata on a {{Field}} in the schema, we cannot get it back through {{UnionVector#getField}}. This is mainly because {{MinorType.Union.getNewVector}} does not pass the {{FieldType}} to the vector, and {{UnionVector#getField}} creates a new {{Field}}, which causes the inconsistency. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions
Ji Liu created ARROW-6118: - Summary: [Java] Replace google Preconditions with Arrow Preconditions Key: ARROW-6118 URL: https://issues.apache.org/jira/browse/ARROW-6118 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu In the Java code, most places use {{org.apache.arrow.util.Preconditions}}, but some places still use {{com.google.common.base.Preconditions}}. Remove the Google Preconditions and, at the same time, remove duplicated checks. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6097) [Java] Avro adapter implement unions type
Ji Liu created ARROW-6097: - Summary: [Java] Avro adapter implement unions type Key: ARROW-6097 URL: https://issues.apache.org/jira/browse/ARROW-6097 Project: Apache Arrow Issue Type: Sub-task Reporter: Ji Liu Assignee: Ji Liu Support converting union types like ["string"], ["string", "int"] and nullable ["string", "int", "null"] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6083) [Java] Refactor Jdbc adapter consume logic
Ji Liu created ARROW-6083: - Summary: [Java] Refactor Jdbc adapter consume logic Key: ARROW-6083 URL: https://issues.apache.org/jira/browse/ARROW-6083 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu The Jdbc adapter reads from a {{ResultSet}} like this: while (rs.next()) { for (int i = 1; i <= columnCount; i++) { jdbcToFieldVector( rs, i, rs.getMetaData().getColumnType(i), rowCount, root.getVector(rsmd.getColumnName(i)), config); } rowCount++; } {{jdbcToFieldVector}} contains a large switch-case; that is to say, for every single value from the ResultSet we have to evaluate many conditions. I think we could optimize this using a consumer/delegate approach like the Avro adapter. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
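The consumer/delegate idea from ARROW-6083 can be sketched without JDBC. Everything here is hypothetical and only for illustration: ColumnConsumer, the two implementations and consumeAll are not Arrow APIs, and rows are plain Object arrays rather than a ResultSet. The point is that the type dispatch runs once per column, not once per value.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnConsumerSketch {
    /** Hypothetical per-column consumer: resolved once, then invoked for every row. */
    interface ColumnConsumer {
        void consume(Object value);
    }

    static final class IntSumConsumer implements ColumnConsumer {
        long sum;

        @Override
        public void consume(Object value) {
            sum += (Integer) value;
        }
    }

    static final class StringJoinConsumer implements ColumnConsumer {
        final StringBuilder joined = new StringBuilder();

        @Override
        public void consume(Object value) {
            joined.append((String) value);
        }
    }

    // The type dispatch (the former switch-case) happens here, once per column.
    static ColumnConsumer consumerFor(Class<?> columnType) {
        if (columnType == Integer.class) {
            return new IntSumConsumer();
        }
        if (columnType == String.class) {
            return new StringJoinConsumer();
        }
        throw new IllegalArgumentException("unsupported column type: " + columnType);
    }

    static List<ColumnConsumer> consumeAll(Class<?>[] columnTypes, Object[][] rows) {
        List<ColumnConsumer> consumers = new ArrayList<>();
        for (Class<?> type : columnTypes) {
            consumers.add(consumerFor(type));
        }
        // The per-row loop only delegates; it no longer inspects column types.
        for (Object[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                consumers.get(i).consume(row[i]);
            }
        }
        return consumers;
    }
}
```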
[jira] [Created] (ARROW-6079) [Java] Implement/test UnionFixedSizeListWriter for FixedSizeListVector
Ji Liu created ARROW-6079: - Summary: [Java] Implement/test UnionFixedSizeListWriter for FixedSizeListVector Key: ARROW-6079 URL: https://issues.apache.org/jira/browse/ARROW-6079 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu Now we have two list vectors: {{ListVector}} and {{FixedSizeListVector}}. {{ListVector}} has already implemented UnionListWriter for writing data, however, {{FixedSizeListVector}} doesn't have this yet and seems the only way for users to write data is getting inner vector and set value manually. Implement a writer for {{FixedSizeListVector}} is useful in some cases. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6078) [Java] Implement dictionary-encoded subfields for List type
Ji Liu created ARROW-6078: - Summary: [Java] Implement dictionary-encoded subfields for List type Key: ARROW-6078 URL: https://issues.apache.org/jira/browse/ARROW-6078 Project: Apache Arrow Issue Type: Sub-task Reporter: Ji Liu Assignee: Ji Liu For example, int type List (valueCount = 5) has data like below: 10, 20 10, 20 30, 40, 50 30, 40, 50 10, 20 could be encoded to: 0, 1 0, 1 2, 3, 4 2, 3, 4 0, 1 with list type dictionary 10, 20, 30, 40, 50 or 10, 20, 30, 40, 50 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
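A minimal sketch of the sub-field encoding in the ARROW-6078 example, using java.util collections instead of Arrow vectors. The class and method names are hypothetical. Each element of each list is replaced by its index in a flat dictionary that is built in first-seen order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ListSubfieldDictionarySketch {
    // Encodes every element of every list against a shared dictionary.
    // The dictionary maps value -> index and grows in first-seen order.
    static List<List<Integer>> encode(List<List<Integer>> lists, Map<Integer, Integer> dictionary) {
        List<List<Integer>> encoded = new ArrayList<>();
        for (List<Integer> list : lists) {
            List<Integer> indices = new ArrayList<>();
            for (Integer value : list) {
                Integer index = dictionary.get(value);
                if (index == null) {
                    index = dictionary.size();
                    dictionary.put(value, index);
                }
                indices.add(index);
            }
            encoded.add(indices);
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<List<Integer>> lists = new ArrayList<>();
        lists.add(java.util.Arrays.asList(10, 20));
        lists.add(java.util.Arrays.asList(30, 40, 50));
        Map<Integer, Integer> dictionary = new LinkedHashMap<>();
        System.out.println(encode(lists, dictionary) + " with dictionary " + dictionary.keySet());
    }
}
```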
[jira] [Created] (ARROW-6035) [Java] Avro adapter support convert nullable value
Ji Liu created ARROW-6035: - Summary: [Java] Avro adapter support convert nullable value Key: ARROW-6035 URL: https://issues.apache.org/jira/browse/ARROW-6035 Project: Apache Arrow Issue Type: Sub-task Reporter: Ji Liu Assignee: Ji Liu A specific kind of Avro union type (two types, one of which is the null type) could be converted to a nullable Arrow vector. For instance, ["null", "string"] could be represented by a VarCharVector that may hold null values. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6022) [Java] Support equals API in ValueVector to compare two vectors equal
Ji Liu created ARROW-6022: - Summary: [Java] Support equals API in ValueVector to compare two vectors equal Key: ARROW-6022 URL: https://issues.apache.org/jira/browse/ARROW-6022 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu In some cases this feature is useful. In ARROW-1184, {{Dictionary#equals}} does not work due to the lack of this API. Moreover, we have already implemented {{equals(int index, ValueVector target, int targetIndex)}}, so the newly added API could reuse it. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6020) [Java] Refactor ByteFunctionHelper#hash with new added ArrowBufHasher
Ji Liu created ARROW-6020: - Summary: [Java] Refactor ByteFunctionHelper#hash with new added ArrowBufHasher Key: ARROW-6020 URL: https://issues.apache.org/jira/browse/ARROW-6020 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Some logic in these two classes is similar. We should replace the ByteFunctionHelper#hash logic with ArrowBufHasher, since it has a murmur hash algorithm which could reduce hash collisions. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6019) [Java] Port Jdbc and Avro adapter to new directory
Ji Liu created ARROW-6019: - Summary: [Java] Port Jdbc and Avro adapter to new directory Key: ARROW-6019 URL: https://issues.apache.org/jira/browse/ARROW-6019 Project: Apache Arrow Issue Type: Task Components: Java Reporter: Ji Liu Assignee: Ji Liu As discussed on the mailing list, adapters are different from native readers. This issue tracks the following items: i. create a new "contrib" directory and move the Jdbc/Avro adapters to it. ii. provide more description. iii. change the orc readers structure to "converter". cc [~emkornfi...@gmail.com] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5997) [Java] Support dictionary encoding for Union type
Ji Liu created ARROW-5997: - Summary: [Java] Support dictionary encoding for Union type Key: ARROW-5997 URL: https://issues.apache.org/jira/browse/ARROW-5997 Project: Apache Arrow Issue Type: New Feature Reporter: Ji Liu Assignee: Ji Liu Now only Union type is not supported in dictionary encoding. In the last several weeks, we did some refactor for encoding and now it's time to support Union type. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5988) [Java] Avro adapter implement simple Record type
Ji Liu created ARROW-5988: - Summary: [Java] Avro adapter implement simple Record type Key: ARROW-5988 URL: https://issues.apache.org/jira/browse/ARROW-5988 Project: Apache Arrow Issue Type: Sub-task Reporter: Ji Liu Assignee: Ji Liu 1. implement simple Record type which only contains primitive types 2. add ByteBuffer cache in String/Bytes consumer to reduce object creation. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5986) [Java] Code cleanup for dictionary encoding
Ji Liu created ARROW-5986: - Summary: [Java] Code cleanup for dictionary encoding Key: ARROW-5986 URL: https://issues.apache.org/jira/browse/ARROW-5986 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu In the last few weeks, we did some refactoring in dictionary encoding. Since the newly designed hash table for {{DictionaryEncoder}} and the {{hashCode}} & {{equals}} APIs in {{ValueVector}} are already checked in, some classes are no longer used, like {{DictionaryEncodingHashTable}}, {{BaseBinaryVector}} and the related benchmarks & UT. Fortunately, these changes did not make it into version 0.14, which makes it possible to remove them. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5968) [Java] Remove duplicate Preconditions check in JDBC adapter
Ji Liu created ARROW-5968: - Summary: [Java] Remove duplicate Preconditions check in JDBC adapter Key: ARROW-5968 URL: https://issues.apache.org/jira/browse/ARROW-5968 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu Some Preconditions check are duplicate in {{JdbcToArrow#sqlToArrow}} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5967) [Java] DateUtility#timeZoneList is not correct
Ji Liu created ARROW-5967: - Summary: [Java] DateUtility#timeZoneList is not correct Key: ARROW-5967 URL: https://issues.apache.org/jira/browse/ARROW-5967 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Currently {{timeZoneList}} in {{DateUtility}} is based on Joda time. Since we replaced Joda time with Java time in ARROW-2015, this should also be changed. {{TimeStampXXTZVectors}} have a timezone member which seems to be unused, and their {{getObject}} returns Long (unlike {{TimeStampXXVectors}}, where it returns {{LocalDateTime}}). Should it return {{LocalDateTime}} with its timezone? Is it reasonable to do the following: # replace Joda {{timezoneList}} with Java {{timezoneList}} in {{DateUtility}} # add a method like {{getLocalDateTimeFromEpochMilli(long epochMillis, String timezone)}} in DateUtility # Not sure: make {{TimeStampXXTZVectors}} return {{LocalDateTime}}? cc [~emkornfi...@gmail.com] [~bryanc] [~siddteotia] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5909) [Java] Optimize ByteFunctionHelpers equals & compare logic
Ji Liu created ARROW-5909: - Summary: [Java] Optimize ByteFunctionHelpers equals & compare logic Key: ARROW-5909 URL: https://issues.apache.org/jira/browse/ARROW-5909 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Currently it first compares Long values, and then, once the remaining length < 8, it compares Byte values. Add logic to compare Int values when 4 < length < 8. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
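The chunked comparison described in ARROW-5909 can be sketched with java.nio.ByteBuffer in place of ArrowBuf. The class and method names are hypothetical, and this is not the ByteFunctionHelpers code. One assumption differs slightly from the issue text: the int step is taken whenever at least four bytes remain.

```java
import java.nio.ByteBuffer;

final class ChunkedEqualsSketch {
    // Compares the first `length` bytes of two buffers:
    // 8 bytes at a time, then at most one 4-byte step, then single bytes.
    static boolean equal(ByteBuffer left, ByteBuffer right, int length) {
        int pos = 0;
        while (length - pos >= 8) {
            if (left.getLong(pos) != right.getLong(pos)) {
                return false;
            }
            pos += 8;
        }
        if (length - pos >= 4) {
            if (left.getInt(pos) != right.getInt(pos)) {
                return false;
            }
            pos += 4;
        }
        while (pos < length) {
            if (left.get(pos) != right.get(pos)) {
                return false;
            }
            pos++;
        }
        return true;
    }
}
```

For a 15-byte input this performs one long comparison, one int comparison and three byte comparisons, instead of one long comparison followed by seven byte comparisons.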
[jira] [Created] (ARROW-5902) [Java] Implement HashTable for dictionary encoding
Ji Liu created ARROW-5902: - Summary: [Java] Implement HashTable for dictionary encoding Key: ARROW-5902 URL: https://issues.apache.org/jira/browse/ARROW-5902 Project: Apache Arrow Issue Type: New Feature Reporter: Ji Liu Assignee: Ji Liu As discussed in [https://github.com/apache/arrow/pull/4792] Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5883) [Java] Support Dictionary Encoding for List type
Ji Liu created ARROW-5883: - Summary: [Java] Support Dictionary Encoding for List type Key: ARROW-5883 URL: https://issues.apache.org/jira/browse/ARROW-5883 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu As described in [http://arrow.apache.org/docs/format/Layout.html#dictionary-encoding], List type encoding should be supported. Currently ListVector getObject returns an ArrayList implementation, whose equals and hashCode are already overridden, so it can be used directly as a HashMap key in DictionaryEncoder. Since we won't change the Dictionary data, using a mutable key doesn't seem to matter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5861) [Java] Initial implement to convert Avro record with primitive types
Ji Liu created ARROW-5861: - Summary: [Java] Initial implement to convert Avro record with primitive types Key: ARROW-5861 URL: https://issues.apache.org/jira/browse/ARROW-5861 Project: Apache Arrow Issue Type: Sub-task Components: Java Reporter: Ji Liu Assignee: Ji Liu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5846) [Java] Create Avro adapter module and add dependencies
Ji Liu created ARROW-5846: - Summary: [Java] Create Avro adapter module and add dependencies Key: ARROW-5846 URL: https://issues.apache.org/jira/browse/ARROW-5846 Project: Apache Arrow Issue Type: Sub-task Reporter: Ji Liu Assignee: Ji Liu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records
Ji Liu created ARROW-5845: - Summary: [Java] Implement converter between Arrow record batches and Avro records Key: ARROW-5845 URL: https://issues.apache.org/jira/browse/ARROW-5845 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5835) [Java] Support Dictionary Encoding for binary type
Ji Liu created ARROW-5835: - Summary: [Java] Support Dictionary Encoding for binary type Key: ARROW-5835 URL: https://issues.apache.org/jira/browse/ARROW-5835 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu This is not implemented yet because a byte array cannot be used as a HashMap key: arrays use identity-based equals and hashCode. One possible way is to wrap the byte array in an object that implements equals and hashCode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
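A minimal sketch of the wrapping approach suggested in ARROW-5835. ByteArrayKey and encode are hypothetical names, and a plain java.util.HashMap stands in for whatever map the encoder uses. The wrapper delegates to Arrays.equals and Arrays.hashCode so that arrays with equal content map to the same dictionary index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class ByteArrayKeySketch {
    /** Hypothetical wrapper giving byte[] content-based equals and hashCode. */
    static final class ByteArrayKey {
        private final byte[] bytes;

        ByteArrayKey(byte[] bytes) {
            this.bytes = bytes.clone(); // defensive copy keeps the key immutable
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof ByteArrayKey && Arrays.equals(bytes, ((ByteArrayKey) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    // Assigns each distinct byte sequence an index in first-seen order.
    static int[] encode(byte[][] values) {
        Map<ByteArrayKey, Integer> dictionary = new HashMap<>();
        int[] indices = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            ByteArrayKey key = new ByteArrayKey(values[i]);
            Integer index = dictionary.get(key);
            if (index == null) {
                index = dictionary.size();
                dictionary.put(key, index);
            }
            indices[i] = index;
        }
        return indices;
    }
}
```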
[jira] [Created] (ARROW-5834) [Java] Apply new hash map in DictionaryEncoder
Ji Liu created ARROW-5834: - Summary: [Java] Apply new hash map in DictionaryEncoder Key: ARROW-5834 URL: https://issues.apache.org/jira/browse/ARROW-5834 Project: Apache Arrow Issue Type: New Feature Reporter: Ji Liu Assignee: Ji Liu Follow-up of [ARROW-5814|https://issues.apache.org/jira/browse/ARROW-5814]. Apply new hash map in DictionaryEncoder to make it work. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5821) [Java] Support compact fixed-width vectors
Ji Liu created ARROW-5821: - Summary: [Java] Support compact fixed-width vectors Key: ARROW-5821 URL: https://issues.apache.org/jira/browse/ARROW-5821 Project: Apache Arrow Issue Type: New Feature Reporter: Ji Liu Assignee: Ji Liu In the shuffle stage of some applications, FixedWidthVectors may hold very little non-null data. In this case, serializing the vectors directly is not a good choice. Generally we can compact the vector so that it only holds non-null values, and create a BitVector to track the indices of the non-null values so that it can be deserialized properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5814) [Java] Implement a HashMap for DictionaryEncoder
Ji Liu created ARROW-5814: - Summary: [Java] Implement a HashMap for DictionaryEncoder Key: ARROW-5814 URL: https://issues.apache.org/jira/browse/ARROW-5814 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu As a follow-up of [ARROW-5726|https://issues.apache.org/jira/browse/ARROW-5726]. Implement a Map for DictionaryEncoder to reduce boxing/unboxing operations. Benchmark: DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5812) [Java] Refactor method name and param type in BaseIntVector
Ji Liu created ARROW-5812: - Summary: [Java] Refactor method name and param type in BaseIntVector Key: ARROW-5812 URL: https://issues.apache.org/jira/browse/ARROW-5812 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Change to void _setWithPossibleTruncate(int index, long value);_ for better generality. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5726) [Java] Implement a common interface for int vectors
Ji Liu created ARROW-5726: - Summary: [Java] Implement a common interface for int vectors Key: ARROW-5726 URL: https://issues.apache.org/jira/browse/ARROW-5726 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently _DictionaryEncoder#encode_ uses reflection to look up the set method and then set values. Setting values by reflection is not efficient, and the code structure is not elegant, for example: _Method setter = null;_ _for (Class c : Arrays.asList(int.class, long.class)) {_ _try {_ _setter = indices.getClass().getMethod("setSafe", int.class, c);_ _break;_ _} catch (NoSuchMethodException e) {_ _// ignore_ _}_ _}_ Implementing a common interface for int vectors, so that the set method can be called directly, seems a good choice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5706) [Java] Remove type conversion in getValidityBufferValueCapacity
Ji Liu created ARROW-5706: - Summary: [Java] Remove type conversion in getValidityBufferValueCapacity Key: ARROW-5706 URL: https://issues.apache.org/jira/browse/ARROW-5706 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu Now implementation of getValidityBufferValueCapacity is: (int) (validityBuffer.capacity() * 8L) Seems no need to convert it to Long then convert it back to Int, just replace with: validityBuffer.capacity() * 8 VariableWidthVectorBenchmarks#getValueCapacity shows the performance: Before: avgt 5 5.731 ± 0.160 ns/op After: avgt 5 5.124 ± 0.125 ns/op -- This message was sent by Atlassian JIRA (v7.6.3#76005)
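A small check of the claim in ARROW-5706, assuming capacity() returns an int as the cast in the issue suggests. The class and method names are hypothetical. Both expressions keep only the low 32 bits of the product, so they agree for every int input, including values where the multiplication overflows.

```java
final class ValidityCapacitySketch {
    // Original form: widen to long, multiply, then narrow back to int.
    static int viaLong(int capacityBytes) {
        return (int) (capacityBytes * 8L);
    }

    // Proposed form: plain int multiplication.
    static int viaInt(int capacityBytes) {
        return capacityBytes * 8;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 1024, 1 << 28, Integer.MAX_VALUE};
        for (int capacity : samples) {
            System.out.println(capacity + ": " + (viaLong(capacity) == viaInt(capacity)));
        }
    }
}
```

Because the narrowing cast discards the same high bits that int multiplication drops, removing the conversion does not change the result. It also does not add overflow protection that the original lacked.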
[jira] [Created] (ARROW-5705) [Java] Optimize BaseValueVector#computeCombinedBufferSize logic
Ji Liu created ARROW-5705: - Summary: [Java] Optimize BaseValueVector#computeCombinedBufferSize logic Key: ARROW-5705 URL: https://issues.apache.org/jira/browse/ARROW-5705 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu Currently BaseValueVector#computeCombinedBufferSize computes the validity buffer size as follows: _roundUp8(getValidityBufferSizeFromCount(valueCount))_ which can be expanded to _((((valueCount + 7) >> 3) + 7) / 8) * 8_ It seems there is no need to compute bufferSize first, and the expression above could be replaced with: _(valueCount + 63) / 64 * 8_ In this way, the performance of _computeCombinedBufferSize_ would be improved. Performance test: Before: BaseValueVectorBenchmarks.testComputeCombinedBufferSize avgt 5 4083.180 ± 180.363 ns/op After: BaseValueVectorBenchmarks.testComputeCombinedBufferSize avgt 5 3808.635 ± 162.347 ns/op -- This message was sent by Atlassian JIRA (v7.6.3#76005)
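The equivalence claimed in ARROW-5705 can be checked with a short sketch. The class and method names are hypothetical; twoStep mirrors the expanded expression and oneStep the proposed replacement. Both compute ceil(valueCount / 64) * 8 for non-negative counts, because rounding up to whole bytes and then to multiples of eight bytes is the same as rounding up to whole 64-bit words.

```java
final class ValiditySizeSketch {
    // Bytes needed for valueCount bits, then rounded up to a multiple of 8 bytes.
    static int twoStep(int valueCount) {
        int bytes = (valueCount + 7) >> 3;
        return ((bytes + 7) / 8) * 8;
    }

    // Single-step form proposed in the issue.
    static int oneStep(int valueCount) {
        return (valueCount + 63) / 64 * 8;
    }

    public static void main(String[] args) {
        for (int valueCount : new int[] {0, 1, 64, 65, 1025}) {
            System.out.println(valueCount + " -> " + twoStep(valueCount) + " / " + oneStep(valueCount));
        }
    }
}
```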
[jira] [Created] (ARROW-5672) [Java] Refactor redundant method modifier
Ji Liu created ARROW-5672: - Summary: [Java] Refactor redundant method modifier Key: ARROW-5672 URL: https://issues.apache.org/jira/browse/ARROW-5672 Project: Apache Arrow Issue Type: Sub-task Reporter: Ji Liu Assignee: Ji Liu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5587) Add more maven style check for Java code
Ji Liu created ARROW-5587: - Summary: Add more maven style check for Java code Key: ARROW-5587 URL: https://issues.apache.org/jira/browse/ARROW-5587 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Add more maven style check for java code, such as unused imports, redundant modifier, etc. In this way, the quality of code will be improved. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5584) Add import for link reference in FieldReader javadoc
Ji Liu created ARROW-5584: - Summary: Add import for link reference in FieldReader javadoc Key: ARROW-5584 URL: https://issues.apache.org/jira/browse/ARROW-5584 Project: Apache Arrow Issue Type: Bug Reporter: Ji Liu Assignee: Ji Liu Link reference(ValueVector) in FieldReader javadoc has no import. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5435) IntervalYearVector#getObject should return Period with both year and month
Ji Liu created ARROW-5435: - Summary: IntervalYearVector#getObject should return Period with both year and month Key: ARROW-5435 URL: https://issues.apache.org/jira/browse/ARROW-5435 Project: Apache Arrow Issue Type: Bug Reporter: Ji Liu Assignee: Ji Liu IntervalYearVector#getObject currently returns a Period with only the months field set. However, this vector stores an interval of years and months as a total number of months (e.g. 2 years and 3 months is stored as 27), so it should return a Period with both years and months. For the example above, it currently returns Period(27 months); I think it should return Period(2 years, 3 months). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
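The conversion requested in ARROW-5435 can be sketched with java.time.Period. The class and method names are hypothetical, and this is not the IntervalYearVector code. It splits a stored total-month count into years and months.

```java
import java.time.Period;

final class IntervalYearSketch {
    // Splits a total number of months into a Period with years and months.
    static Period fromTotalMonths(int totalMonths) {
        return Period.of(totalMonths / 12, totalMonths % 12, 0);
    }

    public static void main(String[] args) {
        System.out.println(Period.ofMonths(27)); // months only
        System.out.println(fromTotalMonths(27)); // years and months
    }
}
```

Period.ofMonths(27).normalized() gives the same result for this input, since normalized() moves whole multiples of 12 months into the years field.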
[jira] [Created] (ARROW-5420) Implement or remove getCurrentSizeInBytes in VariableWidthVector
Ji Liu created ARROW-5420: - Summary: Implement or remove getCurrentSizeInBytes in VariableWidthVector Key: ARROW-5420 URL: https://issues.apache.org/jira/browse/ARROW-5420 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Now VariableWidthVector#getCurrentSizeInBytes doesn't seem to have been implemented. We should implement it or just remove it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5259) Add option for ValueVector to allocate buffers with actual size
Ji Liu created ARROW-5259: - Summary: Add option for ValueVector to allocate buffers with actual size Key: ARROW-5259 URL: https://issues.apache.org/jira/browse/ARROW-5259 Project: Apache Arrow Issue Type: Wish Reporter: Ji Liu Assignee: Ji Liu Currently _BaseValueVector#computeCombinedBufferSize_ calculates the buffer size with _valueCount_ and _typeWidth_ as inputs and then allocates memory for dataBuffer and validityBuffer. However, it allocates more memory than is actually needed whenever the size is not already a power of two, because of the call to _BaseAllocator.nextPowerOfTwo(bufferSize)_. For example, IntVector will allocate buffers with size 8192 with valueCount = 1025, so memory usage is almost double what is actually needed. So in some cases there is enough memory for actual use, but an OOM is thrown when the allocation is rounded up to the next power of 2; I think this problem is avoidable. Is it feasible to add an option for ValueVector to allocate the actual buffer size rather than the next power of 2, to reduce memory allocation? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
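The rounding effect described in ARROW-5259 can be reproduced with a few lines. The nextPowerOfTwo below is a hypothetical stand-in written for this sketch, not the BaseAllocator implementation, and only the data buffer of a 4-byte-wide vector is considered.

```java
final class PowerOfTwoAllocationSketch {
    // Smallest power of two that is >= size (size assumed non-negative).
    static long nextPowerOfTwo(long size) {
        long highest = Long.highestOneBit(size);
        return highest == size ? size : highest << 1;
    }

    public static void main(String[] args) {
        long needed = 1025L * 4; // 1025 ints of 4 bytes each = 4100 bytes
        long allocated = nextPowerOfTwo(needed);
        System.out.println(needed + " bytes needed, " + allocated + " bytes allocated");
    }
}
```

Crossing the 1024-value boundary by a single value moves the request from 4096 to 8192 bytes, which is the near-doubling the issue describes.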
[jira] [Created] (ARROW-5225) [Java] Improve performance of BaseValueVector#getValidityBufferSizeFromCount
Ji Liu created ARROW-5225: - Summary: [Java] Improve performance of BaseValueVector#getValidityBufferSizeFromCount Key: ARROW-5225 URL: https://issues.apache.org/jira/browse/ARROW-5225 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Now in _BaseValueVector#getValidityBufferSizeFromCount_ and _BitVectorHelper#getValidityBufferSize_, it uses _Math.ceil_ to calculate size which is not efficient (lots of unnecessary logic in _StrictMath#floorOrCeil_) . Since the valueCount is always not less than 0, we could simply replace _Math.ceil_ with the following code: _return valueCount % 8 > 0 ? valueCount / 8 + 1 : valueCount / 8_; -- This message was sent by Atlassian JIRA (v7.6.3#76005)
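A sketch comparing the Math.ceil form with the integer-only replacement from ARROW-5225. The class and method names are hypothetical. The shift variant is an additional equivalent form and is not taken from the issue; it assumes valueCount is non-negative and small enough that valueCount + 7 does not overflow.

```java
final class ValidityBufferSizeSketch {
    // Floating-point form that the issue wants to replace.
    static int viaCeil(int valueCount) {
        return (int) Math.ceil(valueCount / 8.0);
    }

    // Integer-only form proposed in the issue.
    static int viaTernary(int valueCount) {
        return valueCount % 8 > 0 ? valueCount / 8 + 1 : valueCount / 8;
    }

    // Equivalent branch-free form (an assumption of this sketch, not from the issue).
    static int viaShift(int valueCount) {
        return (valueCount + 7) >> 3;
    }

    public static void main(String[] args) {
        for (int valueCount : new int[] {0, 1, 8, 9, 1025}) {
            System.out.println(valueCount + " -> " + viaTernary(valueCount));
        }
    }
}
```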
[jira] [Created] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector
Ji Liu created ARROW-5224: - Summary: [Java] Add APIs for supporting directly serialize/deserialize ValueVector Key: ARROW-5224 URL: https://issues.apache.org/jira/browse/ARROW-5224 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu There is no API in MessageSerializer to directly serialize/deserialize a ValueVector. This feature is useful for users who only use ValueVectors rather than ArrowRecordBatch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5207) [Java] add APIs to support vector
Ji Liu created ARROW-5207: - Summary: [Java] add APIs to support vector Key: ARROW-5207 URL: https://issues.apache.org/jira/browse/ARROW-5207 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu In some scenarios we would like a ValueVector to be reused to reduce creation overhead. This is very common in the shuffle stage: there is no need to create a ValueVector or realloc buffers every time. Suppose the recordCount of the ValueVector and the capacity of its buffers are written to the stream; when we deserialize it, we can simply decide from dataLength whether a realloc is needed. My proposal is to add APIs to ValueVector to handle this logic; otherwise users who want to reuse vectors have to implement it themselves, which is not user-friendly. If you agree with this, I would like to take this ticket. Thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5206) [JAVA]Add APIs in MessageSerializer to directly serialize/deserialize ArrowBuf
Ji Liu created ARROW-5206: - Summary: [JAVA]Add APIs in MessageSerializer to directly serialize/deserialize ArrowBuf Key: ARROW-5206 URL: https://issues.apache.org/jira/browse/ARROW-5206 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu It seems there are no APIs to directly write an ArrowBuf to an OutputStream or read an ArrowBuf from an InputStream. These APIs may be helpful when users work with Vectors directly instead of RecordBatch; in that case, APIs to serialize/deserialize dataBuffer/validityBuffer/offsetBuffer are necessary. I would like to work on this and make it my first contribution to Arrow. What do you think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)