[jira] [Resolved] (ARROW-4095) [C++] Implement optimizations for dictionary unification where dictionaries are prefixes of the unified dictionary

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-4095.

Resolution: Fixed

Issue resolved by pull request 5230
[https://github.com/apache/arrow/pull/5230]

> [C++] Implement optimizations for dictionary unification where dictionaries 
> are prefixes of the unified dictionary
> --
>
> Key: ARROW-4095
> URL: https://issues.apache.org/jira/browse/ARROW-4095
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the event that the unified dictionary contains other dictionaries as 
> prefixes (e.g. as the result of delta dictionaries), we can avoid memory 
> allocation and index transposition.
> See discussion at 
> https://github.com/apache/arrow/pull/3165#discussion_r243020982



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6031) [Java] Support iterating a vector by ArrowBufPointer

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6031.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 4950
[https://github.com/apache/arrow/pull/4950]

> [Java] Support iterating a vector by ArrowBufPointer
> 
>
> Key: ARROW-6031
> URL: https://issues.apache.org/jira/browse/ARROW-6031
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Provide the functionality to traverse a vector (fixed-width or variable-width) 
> with an iterator. This is convenient for scenarios where vector elements are 
> accessed in sequence.
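
Below is a hedged sketch of the idea behind this feature, not the exact iterator utility added by the pull request: walk a fixed-width vector element by element while reusing a single ArrowBufPointer, so elements can be hashed or compared without materializing Java objects. The manual offset arithmetic and class name are illustrative; ArrowBufPointer, IntVector, and the allocator calls are standard Arrow Java APIs.

{code:java}
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.memory.util.ArrowBufPointer;
import org.apache.arrow.vector.IntVector;

public class ArrowBufPointerIterationSketch {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector vector = new IntVector("ints", allocator)) {
      vector.allocateNew(3);
      for (int i = 0; i < 3; i++) {
        vector.setSafe(i, i * 10);
      }
      vector.setValueCount(3);

      // Reuse a single pointer object while walking the data buffer element by
      // element, so no per-element Java object is materialized.
      ArrowBufPointer pointer = new ArrowBufPointer();
      for (int i = 0; i < vector.getValueCount(); i++) {
        pointer.set(vector.getDataBuffer(), i * IntVector.TYPE_WIDTH, IntVector.TYPE_WIDTH);
        // The pointer can now be hashed or compared against a pointer into another vector.
      }
    }
  }
}
{code}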



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6247) [Java] Provide a common interface for float4 and float8 vectors

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6247.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5132
[https://github.com/apache/arrow/pull/5132]

> [Java] Provide a common interface for float4 and float8 vectors
> ---
>
> Key: ARROW-6247
> URL: https://issues.apache.org/jira/browse/ARROW-6247
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We want to provide an interface for floating point vectors (float4 & float8). 
> This interface will make many operations on a vector more convenient. With 
> this interface, the client code will be greatly simplified, with many 
> branches/switches removed.
>  
> The design is similar to BaseIntVector (the interface for all integer 
> vectors). We provide 3 methods for setting & getting floating point values:
>  setWithPossibleTruncate
>  setSafeWithPossibleTruncate
>  getValueAsDouble
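
A minimal sketch of what such a common interface looks like, based only on the three method names listed above; everything else (the interface name used here, the comments) is illustrative and the interface actually shipped in Arrow may differ in details.

{code:java}
// Hedged sketch of a common floating-point vector interface.
public interface FloatingPointVectorSketch {
  // Set a value, narrowing double to float for float4 vectors if necessary.
  void setWithPossibleTruncate(int index, double value);

  // Same as above, but grows the underlying buffers first if the index is out of capacity.
  void setSafeWithPossibleTruncate(int index, double value);

  // Read a float4 or float8 value uniformly as a double.
  double getValueAsDouble(int index);
}
{code}

With something like this in place, client code can operate on float4 and float8 vectors through one type instead of branching on the concrete vector class.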



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6099) [JAVA] Has the ability to not using slf4j logging framework

2019-08-30 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6099.

Resolution: Won't Fix

Closing for now; more discussion on the mailing list might be warranted, and we 
can reopen if needed.

> [JAVA] Has the ability to not using slf4j logging framework
> ---
>
> Key: ARROW-6099
> URL: https://issues.apache.org/jira/browse/ARROW-6099
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Haowei Yu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently, the Java library calls the slf4j API directly, and there is no 
> abstraction layer. This means users need to install slf4j as a requirement 
> even if they don't use slf4j at all. 
>  
> It would be best to change the slf4j dependency scope to "provided" and log 
> content only if an slf4j jar file is available at runtime.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4668) [C++] Support GCP BigQuery Storage API

2019-08-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1691#comment-1691
 ] 

Micah Kornfield commented on ARROW-4668:


Wes is correct.  I'll also add that either this (or even a higher-level wrapper 
around BQ) or Flight would make a good test case for the Dataset APIs, to make 
sure they are generic enough.  I won't be getting to this anytime soon, so I'm 
going to unassign it from myself.  I have some sample code on my work computer 
that I will also try to share to show how the API can be accessed in a simple 
scenario.

> [C++] Support GCP BigQuery Storage API
> --
>
> Key: ARROW-4668
> URL: https://issues.apache.org/jira/browse/ARROW-4668
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: filesystem
> Fix For: 1.0.0
>
>
> Docs: [https://cloud.google.com/bigquery/docs/reference/storage/] 
> Need to investigate the best way to do this; maybe just see if we can build 
> our client on GCP (once a protobuf definition is published to 
> https://github.com/googleapis/googleapis/tree/master/google)?
>  
> This will serve as a parent issue, and sub-issues will be added for subtasks 
> if necessary.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6099) [JAVA] Has the ability to not using slf4j logging framework

2019-08-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920034#comment-16920034
 ] 

Micah Kornfield commented on ARROW-6099:


See the discussion on the PR; [~jacq...@dremio.com] vetoed the patch.

> [JAVA] Has the ability to not using slf4j logging framework
> ---
>
> Key: ARROW-6099
> URL: https://issues.apache.org/jira/browse/ARROW-6099
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Haowei Yu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently, the Java library calls the slf4j API directly, and there is no 
> abstraction layer. This means users need to install slf4j as a requirement 
> even if they don't use slf4j at all. 
>  
> It would be best to change the slf4j dependency scope to "provided" and log 
> content only if an slf4j jar file is available at runtime.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6352) [Java] Add implementation of DenseUnionVector.

2019-08-25 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6352:
--

 Summary: [Java] Add implementation of DenseUnionVector.
 Key: ARROW-6352
 URL: https://issues.apache.org/jira/browse/ARROW-6352
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield


Today only sparse unions are supported.  We should have a dense union vector 
implementation that conforms to the IPC protocol (the current sparse union 
vector doesn't do this, and there are other JIRAs covering making these 
compatible).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6335) [Java] Improve the performance of DictionaryHashTable

2019-08-25 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6335.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5178
[https://github.com/apache/arrow/pull/5178]

> [Java] Improve the performance of DictionaryHashTable
> -
>
> Key: ARROW-6335
> URL: https://issues.apache.org/jira/browse/ARROW-6335
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When comparing two entries in the dictionary hash table, it is more efficient 
> to compare the indices directly, rather than using Objects.equals, because they 
> are both ints.
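
The gain comes from avoiding autoboxing. A minimal illustration of the difference (not the actual DictionaryHashTable code; the class and method names here are purely for the example):

{code:java}
final class DictionaryIndexComparisonSketch {
  static boolean slowEquals(int left, int right) {
    // Boxes both ints into Integer objects before comparing.
    return java.util.Objects.equals(left, right);
  }

  static boolean fastEquals(int left, int right) {
    // Direct primitive comparison: no allocation, trivially inlined by the JIT.
    return left == right;
  }
}
{code}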



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type

2019-08-29 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918339#comment-16918339
 ] 

Micah Kornfield commented on ARROW-6356:


We will need to add an interface that can update a DictionaryProvider as 
well.

> [Java] Avro adapter implement Enum type and nested Record type
> --
>
> Key: ARROW-6356
> URL: https://issues.apache.org/jira/browse/ARROW-6356
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>
> Implement for converting avro {{Enum}} type.
> Convert nested avro {{Record}} type to Arrow {{StructVector}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-22 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913863#comment-16913863
 ] 

Micah Kornfield commented on ARROW-5691:


Given the current organization of the code base and [~xhochy]'s comment above, 
I think we should put the core logic of reading files under the adapter 
folders (where ORC is currently located), then consume that from datasets.  I 
don't have a good mental model of the current .so dependencies to offer a 
meaningful opinion on that. 

 

 

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6330) [C++] Include missing headers in api.h

2019-08-22 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-6330:
--

Assignee: Micah Kornfield

> [C++] Include missing headers in api.h
> --
>
> Key: ARROW-6330
> URL: https://issues.apache.org/jira/browse/ARROW-6330
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Minor
>
> I think result.h and array/concatenate.h should be included as they export 
> public symbols.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6330) [C++] Include missing headers in api.h

2019-08-22 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6330:
--

 Summary: [C++] Include missing headers in api.h
 Key: ARROW-6330
 URL: https://issues.apache.org/jira/browse/ARROW-6330
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Micah Kornfield


I think result.h and array/concatenate.h should be included as they export 
public symbols.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6331) [Java] Incorporate ErrorProne into the java build

2019-08-22 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6331:
--

 Summary: [Java] Incorporate ErrorProne into the java build
 Key: ARROW-6331
 URL: https://issues.apache.org/jira/browse/ARROW-6331
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Java
Reporter: Micah Kornfield


Using [https://github.com/google/error-prone] seems like it would be a good 
idea to automatically catch more errors.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6296) [Java] Cleanup JDBC interfaces and eliminate one memcopy for binary/varchar fields

2019-09-04 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6296.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5152
[https://github.com/apache/arrow/pull/5152]

> [Java] Cleanup JDBC interfaces and eliminate one memcopy for binary/varchar 
> fields
> --
>
> Key: ARROW-6296
> URL: https://issues.apache.org/jira/browse/ARROW-6296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> * If we use direct setting of fields, we can avoid the extra temporary buffer 
> and memcpy by setting bytes directly.
>  * We should overwrite existing vectors in consumers before returning 
> results, to avoid the possibility of closing vectors in use (or alternatively 
> make sure we retain the underlying buffers).
>  * Try to eliminate some of the state in load() by moving initialization to 
> the constructor.
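
For the first bullet, a hedged sketch of what "setting bytes directly" means for a binary column consumer. The helper class here is hypothetical and stands in for the real JDBC consumer classes; the ResultSet and VarBinaryVector calls are standard JDBC and Arrow Java APIs.

{code:java}
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.arrow.vector.VarBinaryVector;

final class BinaryConsumerSketch {
  static void consume(ResultSet resultSet, int columnIndex,
                      VarBinaryVector vector, int rowIndex) throws SQLException {
    byte[] bytes = resultSet.getBytes(columnIndex);
    if (bytes == null) {
      vector.setNull(rowIndex);
    } else {
      // Single copy into the vector's data buffer; no extra temporary buffer or memcpy.
      vector.setSafe(rowIndex, bytes, 0, bytes.length);
    }
  }
}
{code}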



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6417) [C++][Parquet] Non-dictionary BinaryArray reads from Parquet format have slowed down since 0.11.x

2019-09-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923534#comment-16923534
 ] 

Micah Kornfield commented on ARROW-6417:


For SafeLoadAs, you could try changing the implementation to dereference 
instead of memcpy, which should be equivalent to the old code (assuming it is 
getting inlined correctly).  IIRC, we saw very comparable numbers for the 
existing parquet benchmarks when I made those changes. 

> [C++][Parquet] Non-dictionary BinaryArray reads from Parquet format have 
> slowed down since 0.11.x
> -
>
> Key: ARROW-6417
> URL: https://issues.apache.org/jira/browse/ARROW-6417
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20190903_parquet_benchmark.py, 
> 20190903_parquet_read_perf.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In doing some benchmarking, I have found that binary reads seem to have become 
> slower between Arrow 0.11.1 and the master branch. It would be a good idea to do 
> some basic profiling to see where we might improve our memory allocation 
> strategy (or whatever the bottleneck turns out to be).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-09-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6111:
---
Fix Version/s: 1.0.0

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6110) [Java] Support LargeList Type and add integration test with C++

2019-09-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6110:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [Java] Support LargeList Type and add integration test with C++
> ---
>
> Key: ARROW-6110
> URL: https://issues.apache.org/jira/browse/ARROW-6110
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-09-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6111:
---
Priority: Major  (was: Blocker)

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-09-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6111:
---
Priority: Blocker  (was: Major)

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-09-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6111:
---
Fix Version/s: (was: 0.15.0)

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5125) [Python] Cannot roundtrip extreme dates through pyarrow

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-5125:
---
Labels: pull-request-available windows  (was: parquet 
pull-request-available windows)

> [Python] Cannot roundtrip extreme dates through pyarrow
> ---
>
> Key: ARROW-5125
> URL: https://issues.apache.org/jira/browse/ARROW-5125
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 
> 2019, 22:22:05)
>Reporter: Max Bolingbroke
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available, windows
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> You can roundtrip many dates through a pyarrow array:
>  
> {noformat}
> >>> pa.array([datetime.date(1980, 1, 1)], type=pa.date32())[0]
> datetime.date(1980, 1, 1){noformat}
>  
> But (on Windows at least), not extreme ones:
>  
> {noformat}
> >>> pa.array([datetime.date(1960, 1, 1)], type=pa.date32())[0]
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
>  File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
> OSError: [Errno 22] Invalid argument
> >>> pa.array([datetime.date(3200, 1, 1)], type=pa.date32())[0]
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
>  File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
> {noformat}
> This is because datetime.utcfromtimestamp and datetime.timestamp fail on 
> these dates, but it seems we should be able to avoid invoking these functions 
> entirely when deserializing dates. Ideally we would be able to roundtrip 
> these as datetimes too, of course, but it's less clear that this will be 
> easy. For some context on this see [https://bugs.python.org/issue29097].
> This may be related to ARROW-3176 and ARROW-4746



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6473) [Format] Clarify dictionary encoding edge cases

2019-09-05 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6473:
--

 Summary: [Format] Clarify dictionary encoding edge cases
 Key: ARROW-6473
 URL: https://issues.apache.org/jira/browse/ARROW-6473
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Format
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Several recent threads on the mailing list:

1. Edge cases for all-null columns and interleaved dictionaries.

2. Semantics of non-delta dictionaries (and their relation to the file format).

3. A proposal for a forward-compatible enum so dictionaries can be represented as 
other types besides a "flat" vector.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6474) Provide mechanism for python to write out old format

2019-09-05 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6474:
--

 Summary: Provide mechanism for python to write out old format
 Key: ARROW-6474
 URL: https://issues.apache.org/jira/browse/ARROW-6474
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Micah Kornfield
 Fix For: 0.15.0


I think this needs to be an environment variable, so it can be made to work 
with old versions of the Java library in the pyspark integration.

 

 [~bryanc] can you check if this captures the requirements?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-4746) [C++/Python] PyDataTime_Date wrongly casted to PyDataTime_DateTime

2019-09-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-4746:
--

Assignee: Micah Kornfield

> [C++/Python] PyDataTime_Date wrongly casted to PyDataTime_DateTime
> --
>
> Key: ARROW-4746
> URL: https://issues.apache.org/jira/browse/ARROW-4746
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Uwe L. Korn
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pypy
> Fix For: 0.15.0
>
>
> As mentioned in 
> https://bitbucket.org/pypy/pypy/issues/2842/running-pyarrow-on-pypy-segfaults#comment-50670536,
>  we currently access a {{PyDataTime_Date}} object with a 
> {{PyDataTime_DateTime}} cast in {{PyDateTime_DATE_GET_SECOND}} in our code in 
> two instances. While CPython is able to deal with this wrong usage, PyPy is 
> not able to do so. We should separate the path here into one that deals with 
> dates and another that deals with datetimes.
> Reproducible code:
> {code:java}
> pa.array([datetime.date(2018, 5, 10)], type=pa.date64()){code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6547) [C++] valgrind errors in arrow-ipc-read-write-test

2019-09-16 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931092#comment-16931092
 ] 

Micah Kornfield commented on ARROW-6547:


It sounds like the diff-test is potentially worth fixing; I'll see what I can 
do for this next release.  I'm also interested to see how these would compare 
with what MSAN ([https://clang.llvm.org/docs/MemorySanitizer.html]) returns; 
unfortunately it's not clear to me whether "that can be recompiled from source, 
including all dependent libraries" is actually a requirement or if it can be 
limited to the Arrow libs.

> [C++] valgrind errors in arrow-ipc-read-write-test
> --
>
> Key: ARROW-6547
> URL: https://issues.apache.org/jira/browse/ARROW-6547
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Not sure when these crept in, but I encountered them when looking into a 
> segfault in a build today:
> https://gist.github.com/wesm/b388dda4f0e2e38a8aa77dfc9bd91914



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6401) [Java] Implement dictionary-encoded subfields for Struct type

2019-09-16 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6401.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5243
[https://github.com/apache/arrow/pull/5243]

> [Java] Implement dictionary-encoded subfields for Struct type
> -
>
> Key: ARROW-6401
> URL: https://issues.apache.org/jira/browse/ARROW-6401
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Implement dictionary-encoded subfields for the Struct type.
> Each child vector will have a dictionary; the dictionary vector is of struct 
> type and holds all the dictionaries.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6458) [Java] Remove value boxing/unboxing for ApproxEqualsVisitor

2019-09-16 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6458.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5304
[https://github.com/apache/arrow/pull/5304]

> [Java] Remove value boxing/unboxing for ApproxEqualsVisitor
> ---
>
> Key: ARROW-6458
> URL: https://issues.apache.org/jira/browse/ARROW-6458
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> As discussed in 
> https://github.com/apache/arrow/pull/5195#issuecomment-526157961, there are 
> some problems with the current way of comparing floating point vectors, which 
> we solve in this PR:
> 1. There are if statements/duplicated members in ApproxEqualsVisitor, making 
> the code redundant and less clear.
> 2. The comparison of float4 and float8 is based on the wrapper objects Float 
> and Double, which may incur a performance penalty.
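
The second point is about autoboxing. A minimal, illustrative comparison (not the visitor code itself; names and the epsilon handling are assumptions for the example):

{code:java}
final class ApproxEqualsSketch {
  // Boxed: each call unboxes Float wrappers (and can throw NPE on null inputs).
  static boolean approxEqualsBoxed(Float left, Float right, float epsilon) {
    return Math.abs(left - right) <= epsilon;
  }

  // Primitive: no allocation or unboxing, friendlier to the JIT.
  static boolean approxEqualsPrimitive(float left, float right, float epsilon) {
    return Math.abs(left - right) <= epsilon;
  }
}
{code}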



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6315) [Java] Make change to ensure flatbuffer reads are aligned

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6315.

Resolution: Fixed

Issue resolved by pull request 5229
[https://github.com/apache/arrow/pull/5229]

> [Java] Make change to ensure flatbuffer reads are aligned 
> --
>
> Key: ARROW-6315
> URL: https://issues.apache.org/jira/browse/ARROW-6315
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> See parent bug for details on requirements.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6356.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5305
[https://github.com/apache/arrow/pull/5305]

> [Java] Avro adapter implement Enum type and nested Record type
> --
>
> Key: ARROW-6356
> URL: https://issues.apache.org/jira/browse/ARROW-6356
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Implement for converting avro {{Enum}} type.
> Convert nested avro {{Record}} type to Arrow {{StructVector}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-6220) [Java] Add API to avro adapter to limit number of rows returned at a time.

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield closed ARROW-6220.
--
Resolution: Fixed

> [Java] Add API to avro adapter to limit number of rows returned at a time.
> --
>
> Key: ARROW-6220
> URL: https://issues.apache.org/jira/browse/ARROW-6220
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: avro, pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> We can either let clients iterate or, ideally, provide an iterator interface.  
> This is important for large Avro data and was also discussed as something 
> readers/adapters should have.
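
A hedged usage sketch of what an iterator-style adapter API enables: the caller pulls one bounded batch at a time instead of materializing the whole Avro input. The class name, factory method, and batch-size parameter below are hypothetical stand-ins, not the API that was merged; VectorSchemaRoot is the real Arrow Java class.

{code:java}
// Illustrative only: AvroToArrowIteratorSketch, create(), and targetBatchSize are
// hypothetical names standing in for the real adapter API.
try (AvroToArrowIteratorSketch iterator =
         AvroToArrowIteratorSketch.create(avroSchema, decoder, /* targetBatchSize= */ 1024)) {
  while (iterator.hasNext()) {
    try (VectorSchemaRoot batch = iterator.next()) {
      process(batch);  // each batch holds at most targetBatchSize rows
    }
  }
}
{code}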



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6356:
---
Component/s: Java

> [Java] Avro adapter implement Enum type and nested Record type
> --
>
> Key: ARROW-6356
> URL: https://issues.apache.org/jira/browse/ARROW-6356
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Implement for converting avro {{Enum}} type.
> Convert nested avro {{Record}} type to Arrow {{StructVector}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6460) [Java] Add unit test for large avro data

2019-09-06 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924713#comment-16924713
 ] 

Micah Kornfield commented on ARROW-6460:


As part of this, can we also add a performance test to get a baseline number?

> [Java] Add unit test for large avro data
> 
>
> Key: ARROW-6460
> URL: https://issues.apache.org/jira/browse/ARROW-6460
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>
> To avoid OOM, we have implemented an iterator API in ARROW-6220.
> This issue is about adding tests with a large fake data set (say 6MM rows, as in 
> the JDBC adapter test) and ensuring no OOMs occur.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6484) [Java] Enable create indexType for DictionaryEncoding according to dictionary value count

2019-09-09 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6484.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5321
[https://github.com/apache/arrow/pull/5321]

> [Java] Enable create indexType for DictionaryEncoding according to dictionary 
> value count
> -
>
> Key: ARROW-6484
> URL: https://issues.apache.org/jira/browse/ARROW-6484
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, when creating a {{DictionaryEncoding}}, we need to specify indexType, 
> and it uses Int(32, true) as the default if this value is null.
> Actually, when the dictionary valueCount is small, we should use 
> Int(8, true)/Int(16, true) instead to reduce memory allocation.
> This issue is about providing an API for creating indexType according to 
> valueCount and applying it to the Avro adapter for the enum type.
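
A hedged sketch of the kind of helper this describes; the class and method names are illustrative, while ArrowType.Int is the real Arrow Java type class.

{code:java}
import org.apache.arrow.vector.types.pojo.ArrowType;

final class IndexTypeSketch {
  // Pick the narrowest signed integer index type that can address the dictionary.
  static ArrowType.Int indexTypeForValueCount(long valueCount) {
    if (valueCount <= Byte.MAX_VALUE) {
      return new ArrowType.Int(8, true);    // byte indices
    } else if (valueCount <= Short.MAX_VALUE) {
      return new ArrowType.Int(16, true);   // short indices
    } else {
      return new ArrowType.Int(32, true);   // current default
    }
  }
}
{code}

For a small enum dictionary (say 100 values) this yields Int(8, true), cutting the index column to a quarter of the default Int(32, true) footprint.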



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-1669) [C++] Consider adding Abseil (Google C++11 standard library extensions) to toolchain

2019-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-1669.

Resolution: Won't Fix

Closing for now; we can revisit individual functionality as necessary.

> [C++] Consider adding Abseil (Google C++11 standard library extensions) to 
> toolchain
> 
>
> Key: ARROW-1669
> URL: https://issues.apache.org/jira/browse/ARROW-1669
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Google has released a library of C++11-compliant extensions to the STL that 
> may help make a lot of Arrow code simpler:
> https://github.com/abseil/abseil-cpp/
> This code is not header-only and so would require some effort to add to the 
> toolchain at the moment since it only supports the Bazel build system



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1669) [C++] Consider adding Abseil (Google C++11 standard library extensions) to toolchain

2019-09-17 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932034#comment-16932034
 ] 

Micah Kornfield commented on ARROW-1669:


I think individual backports make sense for the time being.  Ideally, we're 
getting closer to the point where we can use a more modern C++ standard, which 
makes absl less valuable.  My other concern, which looking at the code isn't a 
concern yet, is that if absl ever open-sources Google's Status object, it would 
make our code more complicated.

> [C++] Consider adding Abseil (Google C++11 standard library extensions) to 
> toolchain
> 
>
> Key: ARROW-1669
> URL: https://issues.apache.org/jira/browse/ARROW-1669
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Google has released a library of C++11-compliant extensions to the STL that 
> may help make a lot of Arrow code simpler:
> https://github.com/abseil/abseil-cpp/
> This code is not header-only and so would require some effort to add to the 
> toolchain at the moment since it only supports the Bazel build system



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4018) [C++] RLE decoder may not big-endian compatible

2019-09-17 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932031#comment-16932031
 ] 

Micah Kornfield commented on ARROW-4018:


Hmm, I think we've gone back and forth on endianness support.  I know when the 
project started I thought it was important because at the time it seemed like 
Spark was intending to support both (I don't know if it still does).

Are we actually clean in terms of endianness in other places?  I would need to 
investigate further, but it sounds strange to be slicing a long the way Coverity 
describes; have you looked to see if this is intended?

> [C++] RLE decoder may not big-endian compatible
> ---
>
> Key: ARROW-4018
> URL: https://issues.apache.org/jira/browse/ARROW-4018
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> This issue was found by Coverity. The {{RleDecoder::NextCounts}} method has 
> the following code to fetch the repeated literal in repeated runs:
> {code:c++}
> bool result =
>     bit_reader_.GetAligned<T>(static_cast<int>(BitUtil::CeilDiv(bit_width_, 8)),
>                               reinterpret_cast<T*>(&current_value_));
> {code}
> Coverity says this:
> bq. Pointer "&current_value_" points to an object whose effective type 
> is "unsigned long long" (64 bits, unsigned) but is dereferenced as a narrower 
> "unsigned int" (32 bits, unsigned). This may lead to unexpected results 
> depending on machine endianness.
> bq. 
> In addition, it's not obvious whether {{current_value_}} also needs 
> byte-swapping (presumably, at least in the Parquet file format, it's supposed 
> to be stored in little-endian format in the RLE bitstream).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6366) [Java] Make field vectors final explicitly

2019-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6366.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5204
[https://github.com/apache/arrow/pull/5204]

> [Java] Make field vectors final explicitly
> --
>
> Key: ARROW-6366
> URL: https://issues.apache.org/jira/browse/ARROW-6366
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> According to the discussion in 
> [https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E,]
>  field vectors should not be extended, so they should be made final 
> explicitly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6594) [Java] Support logical type encodings from Avro

2019-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6594:
---
Description: 
Avro supports some logical types that overlap with Arrow logical types 
([http://avro.apache.org/docs/current/spec.html#Logical+Types]).

 

For the ones that overlap, we should use the appropriate Arrow Logical type 
array instead of the raw values.

 

  was:
It has been posited that the Decoder object (and on-heap work in general) is 
potentially slow for decoding.

 

The scope of this Jira is to add a new method that, instead of consuming from a 
Decoder, consumes directly from a ByteBuffer.  In order to do this, there need to 
be utility classes for zig-zag decoding from a ByteBuffer (one might already exist 
in Avro).

 

This is essentially rewriting the logic in the decoder to work directly against a 
ByteBuffer and then measuring whether there is a meaningful performance impact.

 

 


> [Java] Support logical type encodings from Avro
> ---
>
> Key: ARROW-6594
> URL: https://issues.apache.org/jira/browse/ARROW-6594
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: avro
>
> Avro supports some logical types that overlap with Arrow logical types 
> ([http://avro.apache.org/docs/current/spec.html#Logical+Types]).
>  
> For the ones that overlap, we should use the appropriate Arrow Logical type 
> array instead of the raw values.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records

2019-09-17 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932063#comment-16932063
 ] 

Micah Kornfield commented on ARROW-5845:


Thanks [~tianchen92].  I think there is still probably room for improvement in 
functionality and performance.  If you are still interested in doing work in 
this area, I can create a new set of JIRAs.

> [Java] Implement converter between Arrow record batches and Avro records
> 
>
> Key: ARROW-5845
> URL: https://issues.apache.org/jira/browse/ARROW-5845
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
> Fix For: 0.15.0
>
>
> It would be useful for applications which need to convert Avro data to Arrow 
> data.
> This is an adapter which converts data with an existing API (like the JDBC 
> adapter) rather than a native reader (like ORC).
> We implement this functionality through the Avro Java project, receiving 
> parameters like Avro's Decoder/Schema/DatumReader and returning a 
> VectorSchemaRoot. For each data type we have a consumer class as below to get 
> Avro data and write it into a vector, avoiding boxing/unboxing (e.g. 
> GenericRecord#get returns Object):
> {code:java}
> public class AvroIntConsumer implements Consumer {
>   private final IntWriter writer;
>
>   public AvroIntConsumer(IntVector vector) {
>     this.writer = new IntWriterImpl(vector);
>   }
>
>   @Override
>   public void consume(Decoder decoder) throws IOException {
>     writer.writeInt(decoder.readInt());
>     writer.setPosition(writer.getPosition() + 1);
>   }
> }
> {code}
> We intend to support primitive and complex types (null values represented 
> via a union type with a null type); size limits and field selection could be 
> optional for users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6593) [Java] Experiment with performance difference of avoiding the use of Avro Decoder

2019-09-17 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6593:
--

 Summary: [Java] Experiment with performance difference of avoiding 
the use of Avro Decoder
 Key: ARROW-6593
 URL: https://issues.apache.org/jira/browse/ARROW-6593
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Micah Kornfield


Users should be able to pass in a set of fields they wish to decode from Avro 
and the converter should avoid creating Vectors in the returned 
ArrowSchemaRoot.  This would ideally support nested columns so if there was:

 

Struct A {

    int B;

    int C;

} 

 

The user could choose to read only A.B, only A.C, or both.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6595) [Java] Avro - Experiment with "compiled" consumer delegates for performance.

2019-09-18 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6595:
---
Summary: [Java] Avro - Experiment with "compiled" consumer delegates for 
performance.  (was: [Java] Avro - Experiment consumer compilation.)

> [Java] Avro - Experiment with "compiled" consumer delegates for performance.
> 
>
> Key: ARROW-6595
> URL: https://issues.apache.org/jira/browse/ARROW-6595
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: avro
>
> All consumers that rely on delegates (e.g. struct, composite, list, union, 
> ...) require megamorphic lookups which can't be inlined well by the JIT.  
>  
> We should verify the performance difference between a hand-coded consumer and 
> an existing delegate consumer,
>  
> i.e. something like:
>  
> void consume(Decoder d) {
>   ((IntConsumer)delegate).consume(d);
> }
>  
> compared to the existing implementation.  It is expected we will see a decent 
> amount of performance improvement from this approach.  If we do, we should 
> add an option to the converter to generate new custom classes on the fly that 
> mimic the hand-coded option.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6566) Implement VarChar in Scala

2019-09-18 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932144#comment-16932144
 ] 

Micah Kornfield commented on ARROW-6566:


It would be useful if you clarified how this is failing.  One thing that springs 
to mind is that you potentially want to use 
[setSafe|https://arrow.apache.org/docs/java/org/apache/arrow/vector/BaseVariableWidthVector.html#setSafe-int-byte:A-]
 instead of set.
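
For context, a minimal sketch of the difference using standard Arrow Java calls: setSafe() grows the validity/offset/data buffers as needed, while a plain set() does not grow buffers and can fail when writing past the allocated capacity. The class name and the example value are illustrative.

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;

public class SetSafeSketch {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VarCharVector vector = new VarCharVector("strings", allocator)) {
      vector.allocateNew();
      byte[] value = "hello".getBytes(StandardCharsets.UTF_8);
      // setSafe reallocates the underlying buffers if index 0 is beyond capacity;
      // set() assumes the capacity was already allocated.
      vector.setSafe(0, value);
      vector.setValueCount(1);
    }
  }
}
{code}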

> Implement VarChar in Scala
> --
>
> Key: ARROW-6566
> URL: https://issues.apache.org/jira/browse/ARROW-6566
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Boris V.Kuznetsov
>Priority: Major
>
> Hello
> I'm trying to write and read a zio.Chunk of strings, which is essentially an 
> array of strings.
> My implementation fails the test; how should I fix that?
> [Writer|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/main/scala/zio/serdes/arrow/ArrowUtils.scala#L48]
>  code
> [Reader|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/main/scala/zio/serdes/arrow/ArrowUtils.scala#L108]
>  code
> [Test|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/test/scala/arrow/Base.scala#L115]
>  code
> Any help, links and advice are highly appreciated
> Thank you!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6595) [Java] Avro - Experiment consumer compilation.

2019-09-17 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6595:
--

 Summary: [Java] Avro - Experiment consumer compilation.
 Key: ARROW-6595
 URL: https://issues.apache.org/jira/browse/ARROW-6595
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Micah Kornfield


Avro supports some logical types that overlap with Arrow logical types 
([http://avro.apache.org/docs/current/spec.html#Logical+Types]).

 

For the ones that overlap, we should use the appropriate Arrow Logical type 
array instead of the raw values.

 

it potentially makes sense to break this down further into sub-tasks for each 
logical type.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6594) [Java] Support logical type encodings from Avro

2019-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6594:
---
Description: 
Avro supports some logical types that overlap with Arrow logical types 
([http://avro.apache.org/docs/current/spec.html#Logical+Types]).

 

For the ones that overlap, we should use the appropriate Arrow Logical type 
array instead of the raw values.

 

it potentially makes sense to break this down further into sub-tasks for each 
logical type.

 

  was:
Avro supports some logical types that overlap with Arrow logical types 
([http://avro.apache.org/docs/current/spec.html#Logical+Types]).

 

For the ones that overlap, we should use the appropriate Arrow Logical type 
array instead of the raw values.

 


> [Java] Support logical type encodings from Avro
> ---
>
> Key: ARROW-6594
> URL: https://issues.apache.org/jira/browse/ARROW-6594
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: avro
>
> Avro supports some logical types that overlap with Arrow logical types 
> ([http://avro.apache.org/docs/current/spec.html#Logical+Types]).
>  
> For the ones that overlap, we should use the appropriate Arrow Logical type 
> array instead of the raw values.
>  
> it potentially makes sense to break this down further into sub-tasks for each 
> logical type.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6460) [Java] Add benchmark and large fake data UT for avro adapter

2019-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6460.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5317
[https://github.com/apache/arrow/pull/5317]

> [Java] Add benchmark and large fake data UT for avro adapter
> 
>
> Key: ARROW-6460
> URL: https://issues.apache.org/jira/browse/ARROW-6460
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> To avoid OOM, we have implemented an iterator API in ARROW-6220.
> This issue is about to add tests with a large fake data set (say 6MM rows, as in 
> the JDBC adapter test) and ensuring no OOMs occur.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6595) [Java] Avro - Experiment consumer compilation.

2019-09-18 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6595:
---
Description: 
All consumers that rely on delegates (e.g. struct, composite, list, union, ...) 
require megamorphic lookups which can't be inlined well by the JIT.  

 

We should verify the performance difference between a hand-coded consumer and an 
existing delegate consumer,

 

i.e. something like:

 

void consume(Decoder d) {

  ((IntConsumer)delegate).consume(d);

}

 

compared to the existing implementation.  It is expected we will see a decent 
amount of performance improvement from this approach.  If we do, we should add 
an option to the converter to generate new custom classes on the fly that mimic 
the hand-coded option.

 

  was:
Avro supports some logical types that overlap with Arrow logical types 
([http://avro.apache.org/docs/current/spec.html#Logical+Types]).

 

For the ones that overlap, we should use the appropriate Arrow Logical type 
array instead of the raw values.

 

it potentially makes sense to break this down further into sub-tasks for each 
logical type.

 


> [Java] Avro - Experiment consumer compilation.
> --
>
> Key: ARROW-6595
> URL: https://issues.apache.org/jira/browse/ARROW-6595
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: avro
>
> All consumers that rely on delegates (e.g. struct, composite, list, union, 
> ...) require megamorphic lookups which can't be inlined well by the JIT.  
>  
> We should verify the performance difference between a hand-coded consumer and 
> an existing delegate consumer,
>  
> i.e. something like:
>  
> void consume(Decoder d) {
>   ((IntConsumer)delegate).consume(d);
> }
>  
> compared to the existing implementation.  It is expected we will see a decent 
> amount of performance improvement from this approach.  If we do, we should 
> add an option to the converter to generate new custom classes on the fly that 
> mimic the hand-coded option.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-5917) [Java] Redesign the dictionary encoder

2019-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5917.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 4994
[https://github.com/apache/arrow/pull/4994]

> [Java] Redesign the dictionary encoder
> --
>
> Key: ARROW-5917
> URL: https://issues.apache.org/jira/browse/ARROW-5917
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> The current dictionary encoder implementation 
> (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance 
> overhead, which prevents it from being useful in practice:
>  # There are repeated conversions between Java objects and bytes (e.g. 
> vector.getObject(i)).
>  # Unnecessary memory copy (the vector data must be copied to the hash table).
>  # The hash table cannot be reused for encoding multiple vectors (other data 
> structure & results cannot be reused either).
>  # The output vector should not be created/managed by the encoder (just like 
> in the out-of-place sorter)
>  # The hash table requires that the hashCode & equals methods be implemented 
> appropriately, but this is not guaranteed.
> We plan to implement a new one in the algorithm module, and gradually 
> deprecate the current one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6592) [Java] Add support for skipping decoding of columns/field in Avro converter

2019-09-17 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6592:
--

 Summary: [Java] Add support for skipping decoding of columns/field 
in Avro converter
 Key: ARROW-6592
 URL: https://issues.apache.org/jira/browse/ARROW-6592
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Micah Kornfield


Users should be able to pass in a set of fields they wish to decode from Avro 
and the converter should avoid creating Vectors in the returned 
ArrowSchemaRoot.  This would ideally support nested columns so if there was:

 

Struct A {

    int B;

    int C;

} 

 

The user could choose to read only A.B, only A.C, or both.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6594) [Java] Support logical type encodings from Avro

2019-09-17 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6594:
--

 Summary: [Java] Support logical type encodings from Avro
 Key: ARROW-6594
 URL: https://issues.apache.org/jira/browse/ARROW-6594
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Micah Kornfield


It has been posited that the Decoder object (and on-heap work in general) is 
potentially slow for decoding.

 

The scope of this Jira is to add a new method that, instead of consuming from a 
Decoder, consumes directly from a ByteBuffer.  In order to do this, there need to 
be utility classes for zig-zag decoding from a ByteBuffer (one might already exist 
in Avro).

 

This is essentially rewriting the logic in the decoder to work directly against a 
ByteBuffer and then measuring whether there is a meaningful performance impact.
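
For reference, zig-zag varint decoding (the wire encoding Avro uses for int/long values) is straightforward to do directly against a ByteBuffer. A self-contained illustrative utility, independent of Avro's Decoder, might look like the following; the class and method names are assumptions for the example.

{code:java}
import java.nio.ByteBuffer;

final class ZigZagSketch {
  // Decode one zig-zag varint-encoded int starting at the buffer's current position.
  static int readZigZagInt(ByteBuffer buffer) {
    int raw = 0;
    int shift = 0;
    int b;
    do {
      b = buffer.get() & 0xff;          // read one byte, unsigned
      raw |= (b & 0x7f) << shift;       // accumulate the low 7 bits
      shift += 7;
    } while ((b & 0x80) != 0);          // high bit set means more bytes follow
    return (raw >>> 1) ^ -(raw & 1);    // undo the zig-zag mapping
  }
}
{code}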

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6593) [Java] Experiment with performance difference of avoiding the use of Avro Decoder

2019-09-17 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6593:
---
Description: 
It has been posited that the Decoder object (and on-heap work in general) is 
potentially slow for decoding.

 

The scope of this Jira is to add a new method that, instead of consuming from 
Decoder, consumes directly from a ByteBuffer.  In order to do this, there need 
to be utility classes for zig-zag decoding from a ByteBuffer (one might already 
exist in Avro).

 

This is essentially rewriting the decoder logic to work directly against a 
ByteBuffer and then measuring whether there is a meaningful performance impact.

 

 

  was:
Users should be able to pass in a set of fields they wish to decode from Avro, 
and the converter should avoid creating Vectors for the skipped fields in the 
returned ArrowSchemaRoot.  This would ideally support nested columns, so if 
there was:

 

Struct A {
    int B;
    int C;
}

 

The user could choose to read only A.B, only A.C, or both.


> [Java] Experiment with performance difference of avoiding the use of Avro 
> Decoder
> -
>
> Key: ARROW-6593
> URL: https://issues.apache.org/jira/browse/ARROW-6593
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: avro
>
> It has been posited that the Decoder object (and on-heap work in general) is 
> potentially slow for decoding.
>  
> The scope of this Jira is to add a new method that, instead of consuming from 
> Decoder, consumes directly from a ByteBuffer.  In order to do this, there 
> need to be utility classes for zig-zag decoding from a ByteBuffer (one might 
> already exist in Avro).
>  
> This is essentially rewriting the decoder logic to work directly against a 
> ByteBuffer and then measuring whether there is a meaningful performance impact.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1175) [Java] Implement/test dictionary-encoded subfields

2019-07-30 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896441#comment-16896441
 ] 

Micah Kornfield commented on ARROW-1175:


SGTM, in general, I think if issues are unassigned they are up for anyone to 
take on.

> [Java] Implement/test dictionary-encoded subfields
> --
>
> Key: ARROW-1175
> URL: https://issues.apache.org/jira/browse/ARROW-1175
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We do not have any tests about types like:
> {code}
> List
> {code}
> cc [~julienledem] [~elahrvivaz]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5998) [Java] Open a document to track the API changes

2019-07-30 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5998.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4918
[https://github.com/apache/arrow/pull/4918]

> [Java] Open a document to track the API changes
> ---
>
> Key: ARROW-5998
> URL: https://issues.apache.org/jira/browse/ARROW-5998
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We need a document to track the API behavior changes, so as not forget about 
> them for the next release.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6075) [FlightRPC] Handle uncaught exceptions in middleware

2019-07-30 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6075:
---
Description: For some discussion on the java side see 
[https://github.com/apache/arrow/pull/4916]

> [FlightRPC] Handle uncaught exceptions in middleware
> 
>
> Key: ARROW-6075
> URL: https://issues.apache.org/jira/browse/ARROW-6075
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: lidavidm
>Priority: Major
>
> For some discussion on the java side see 
> [https://github.com/apache/arrow/pull/4916]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5996) [Java] Avoid resource leak in flight service

2019-07-30 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5996.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4916
[https://github.com/apache/arrow/pull/4916]

> [Java] Avoid resource leak in flight service
> 
>
> Key: ARROW-5996
> URL: https://issues.apache.org/jira/browse/ARROW-5996
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> # In FlightService#doPutCustom, the flight stream must be closed, even if an 
> exception is thrown during the call to responseObserver.onError.
>  # The exception that occurs during the call to acceptPut should not be 
> swallowed.
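
A minimal, self-contained sketch of the close-even-on-error pattern described by 
the two items above, using plain-Java stand-ins rather than the actual Flight 
types (this is not the FlightService code itself):

{code:java}
public class CloseOnErrorSketch {
  interface ErrorObserver { void onError(Throwable t); }

  static void handlePut(AutoCloseable stream, ErrorObserver observer, Runnable acceptPut)
      throws Exception {
    try {
      acceptPut.run();            // may throw
    } catch (RuntimeException e) {
      observer.onError(e);        // reporting the error may itself throw ...
      throw e;                    // ... and the original exception is not swallowed
    } finally {
      stream.close();             // the stream is released in every case
    }
  }

  public static void main(String[] args) throws Exception {
    handlePut(() -> System.out.println("stream closed"),
              t -> System.out.println("reported: " + t),
              () -> System.out.println("acceptPut ran"));
  }
}
{code}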



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6073) [C++] Decimal128Builder is not reset in Finish()

2019-07-30 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6073.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4970
[https://github.com/apache/arrow/pull/4970]

> [C++] Decimal128Builder is not reset in Finish()
> 
>
> Key: ARROW-6073
> URL: https://issues.apache.org/jira/browse/ARROW-6073
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Kenneth Jung
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Repro:
> {code:java|title=decimal128-builder_test.cc|borderStyle=solid}
> TEST(ArrowDecimal128BuilderTest, TestResetAfterFinish) {
>   auto type = std::make_shared<::arrow::Decimal128Type>(4, 4);
>   auto builder = std::make_shared<::arrow::Decimal128Builder>(type);
>   std::shared_ptr<::arrow::Array> out;
>   ASSERT_OK(builder->Append("1"));
>   ASSERT_OK(builder->Finish(&out));
>   ASSERT_OK(builder->Append("2"));
>   ASSERT_OK(builder->Finish(&out));
>   ASSERT_EQ(out->length(), 1);
> }
> {code}
> Output:
> {{  Expected equality of these values:}}
> {{    out->length()}}
> {{    Which is: 2}}
> {{    1}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6073) [C++] Decimal128Builder is not reset in Finish()

2019-07-30 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896658#comment-16896658
 ] 

Micah Kornfield commented on ARROW-6073:


[~pitrou] [~wesmckinn] could one of you add Ken as a contributor and assign 
the Jira to him?

> [C++] Decimal128Builder is not reset in Finish()
> 
>
> Key: ARROW-6073
> URL: https://issues.apache.org/jira/browse/ARROW-6073
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Kenneth Jung
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Repro:
> {code:java|title=decimal128-builder_test.cc|borderStyle=solid}
> TEST(ArrowDecimal128BuilderTest, TestResetAfterFinish) {
>   auto type = std::make_shared<::arrow::Decimal128Type>(4, 4);
>   auto builder = std::make_shared<::arrow::Decimal128Builder>(type);
>   std::shared_ptr<::arrow::Array> out;
>   ASSERT_OK(builder->Append("1"));
>   ASSERT_OK(builder->Finish(&out));
>   ASSERT_OK(builder->Append("2"));
>   ASSERT_OK(builder->Finish(&out));
>   ASSERT_EQ(out->length(), 1);
> }
> {code}
> Output:
> {{  Expected equality of these values:}}
> {{    out->length()}}
> {{    Which is: 2}}
> {{    1}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-1875) Write 64-bit ints as strings in integration test JSON files

2019-07-31 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897719#comment-16897719
 ] 

Micah Kornfield commented on ARROW-1875:


I created: [https://github.com/apache/arrow/tree/integration_ints_as_strings]

> Write 64-bit ints as strings in integration test JSON files
> ---
>
> Key: ARROW-1875
> URL: https://issues.apache.org/jira/browse/ARROW-1875
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Integration, JavaScript
>Reporter: Brian Hulette
>Priority: Minor
> Fix For: 1.0.0
>
>
> Javascript can't handle 64-bit integers natively, so writing them as strings 
> in the JSON would make implementing the integration tests a lot simpler.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-1875) Write 64-bit ints as strings in integration test JSON files

2019-07-31 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897539#comment-16897539
 ] 

Micah Kornfield commented on ARROW-1875:


We would also need to update the read path, and coordinate with contributors 
from the other implementations.

> Write 64-bit ints as strings in integration test JSON files
> ---
>
> Key: ARROW-1875
> URL: https://issues.apache.org/jira/browse/ARROW-1875
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Integration, JavaScript
>Reporter: Brian Hulette
>Priority: Minor
> Fix For: 1.0.0
>
>
> Javascript can't handle 64-bit integers natively, so writing them as strings 
> in the JSON would make implementing the integration tests a lot simpler.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-1875) Write 64-bit ints as strings in integration test JSON files

2019-07-31 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897538#comment-16897538
 ] 

Micah Kornfield commented on ARROW-1875:


I would need to double check the code, but yes I believe it is in 
JsonFileWriter.  The idea is that JavaScript can parse the longs more reliably 
as JSON strings than as JSON numbers.
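
A small self-contained illustration of why (assuming the JavaScript side stores 
numbers as IEEE-754 doubles, which cannot represent every 64-bit integer):

{code:java}
public class LongJsonPrecision {
  public static void main(String[] args) {
    long big = 9007199254740993L;           // 2^53 + 1: exact as a Java long
    double asJsNumber = (double) big;       // JavaScript numbers are IEEE-754 doubles
    System.out.println(big);                // 9007199254740993
    System.out.println((long) asJsNumber);  // 9007199254740992 -- the last digit is lost
    // Emitting the JSON string "9007199254740993" instead avoids the loss.
  }
}
{code}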

> Write 64-bit ints as strings in integration test JSON files
> ---
>
> Key: ARROW-1875
> URL: https://issues.apache.org/jira/browse/ARROW-1875
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Integration, JavaScript
>Reporter: Brian Hulette
>Priority: Minor
> Fix For: 1.0.0
>
>
> Javascript can't handle 64-bit integers natively, so writing them as strings 
> in the JSON would make implementing the integration tests a lot simpler.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5439) [Java] Utilize stream EOS in File format

2019-07-31 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5439.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4966
[https://github.com/apache/arrow/pull/4966]

> [Java] Utilize stream EOS in File format
> 
>
> Key: ARROW-5439
> URL: https://issues.apache.org/jira/browse/ARROW-5439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: John Muehlhausen
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-1561) [C++] Kernel implementations for "isin" (set containment)

2019-07-31 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-1561.

   Resolution: Fixed
Fix Version/s: (was: 1.0.0)
   0.15.0

Issue resolved by pull request 4235
[https://github.com/apache/arrow/pull/4235]

> [C++] Kernel implementations for "isin" (set containment)
> -
>
> Key: ARROW-1561
> URL: https://issues.apache.org/jira/browse/ARROW-1561
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 21h 20m
>  Remaining Estimate: 0h
>
> isin determines whether each element in the left array is contained in the 
> values in the right array. This function must handle the case where the right 
> array has nulls (so that null in the left array will return true)
> {code}
> isin(['a', 'b', null], ['a', 'c'])
> returns [true, false, null]
> isin(['a', 'b', null], ['a', 'c', null])
> returns [true, false, true]
> {code}
> May need an option to return false for null instead of null



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-1561) [C++] Kernel implementations for "isin" (set containment)

2019-07-31 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-1561:
--

Assignee: Preeti Suman

> [C++] Kernel implementations for "isin" (set containment)
> -
>
> Key: ARROW-1561
> URL: https://issues.apache.org/jira/browse/ARROW-1561
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Preeti Suman
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 21h 20m
>  Remaining Estimate: 0h
>
> isin determines whether each element in the left array is contained in the 
> values in the right array. This function must handle the case where the right 
> array has nulls (so that null in the left array will return true)
> {code}
> isin(['a', 'b', null], ['a', 'c'])
> returns [true, false, null]
> isin(['a', 'b', null], ['a', 'c', null])
> returns [true, false, true]
> {code}
> May need an option to return false for null instead of null



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6035) [Java] Avro adapter support convert nullable value

2019-07-31 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6035:
---
Component/s: Java

> [Java] Avro adapter support convert nullable value
> --
>
> Key: ARROW-6035
> URL: https://issues.apache.org/jira/browse/ARROW-6035
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> A specific Avro union type (one with two branches, one of which is the null 
> type) can be converted to a nullable Arrow vector.
> For instance, ["null", "string"] can be represented by a VarCharVector that 
> can hold null values.
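
A minimal sketch of the target representation using the standard VarCharVector 
API (illustrative only, not the adapter code itself):

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;

public class NullableUnionSketch {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator();
         VarCharVector vector = new VarCharVector("s", allocator)) {
      vector.allocateNew(2);
      vector.setSafe(0, "hello".getBytes(StandardCharsets.UTF_8)); // the "string" branch
      vector.setNull(1);                                           // the "null" branch
      vector.setValueCount(2);
      System.out.println(vector.getObject(0) + ", isNull(1)=" + vector.isNull(1));
    }
  }
}
{code}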



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6035) [Java] Avro adapter support convert nullable value

2019-07-31 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6035.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4943
[https://github.com/apache/arrow/pull/4943]

> [Java] Avro adapter support convert nullable value
> --
>
> Key: ARROW-6035
> URL: https://issues.apache.org/jira/browse/ARROW-6035
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> A specific Avro union type (one with two branches, one of which is the null 
> type) can be converted to a nullable Arrow vector.
> For instance, ["null", "string"] can be represented by a VarCharVector that 
> can hold null values.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-4810) [Format][C++] Add "LargeList" type with 64-bit offsets

2019-07-31 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-4810.

   Resolution: Fixed
Fix Version/s: (was: 1.0.0)
   0.15.0

Issue resolved by pull request 4969
[https://github.com/apache/arrow/pull/4969]

> [Format][C++] Add "LargeList" type with 64-bit offsets
> --
>
> Key: ARROW-4810
> URL: https://issues.apache.org/jira/browse/ARROW-4810
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> Mentioned in https://github.com/apache/arrow/issues/3845



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6020) [Java] Refactor ByteFunctionHelper#hash with new added ArrowBufHasher

2019-07-31 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6020.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4938
[https://github.com/apache/arrow/pull/4938]

> [Java] Refactor ByteFunctionHelper#hash with new added ArrowBufHasher
> -
>
> Key: ARROW-6020
> URL: https://issues.apache.org/jira/browse/ARROW-6020
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Some logic in these two classes are similar, should replace 
> ByteFunctionHelper#hash logic with ArrowBufHasher since it has murmur hash 
> algorithm which could avoid hash collision.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898533#comment-16898533
 ] 

Micah Kornfield commented on ARROW-6111:


[~tianchen92] Please hold off on this until ARROW-6112 is either approved or we 
decide we don't want to do it.  I was thinking of doing this myself, since I've 
already done most of the work for LargeList and the changes are similar, but we 
can figure out the division of work after the details of ARROW-6112 are worked out.  
ARROW-750 has the details on the new types.

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Blocker
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6110) [Java] Support LargeList Type and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6110:
--

 Summary: [Java] Support LargeList Type and add integration test 
with C++
 Key: ARROW-6110
 URL: https://issues.apache.org/jira/browse/ARROW-6110
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Micah Kornfield
 Fix For: 0.15.0






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6111:
--

 Summary: [Java] Support LargeVarChar and LargeBinary types and add 
integration test with C++
 Key: ARROW-6111
 URL: https://issues.apache.org/jira/browse/ARROW-6111
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Micah Kornfield
 Fix For: 0.15.0






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-6110) [Java] Support LargeList Type and add integration test with C++

2019-08-01 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-6110:
--

Assignee: Micah Kornfield

> [Java] Support LargeList Type and add integration test with C++
> ---
>
> Key: ARROW-6110
> URL: https://issues.apache.org/jira/browse/ARROW-6110
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Blocker
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-01 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-6112:
--

 Summary: [Java] Update APIs to support 64-bit address space
 Key: ARROW-6112
 URL: https://issues.apache.org/jira/browse/ARROW-6112
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield
Assignee: Micah Kornfield


The arrow spec allows for 64 bit address range for buffers (and arrays) we 
should support this at the API level in Java even if the current Netty backing 
buffers don't support it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6154) [Rust] Too many open files (os error 24)

2019-08-06 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6154:
---
Summary: [Rust] Too many open files (os error 24)  (was: Too many open 
files (os error 24))

> [Rust] Too many open files (os error 24)
> 
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
>
> Used the Rust parquet-read binary to read a deeply nested parquet file and got 
> the stack trace below. Unfortunately I won't be able to upload the file.
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource::new
>    8:  as 
> parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9:  as 
> parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16:  core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-08 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903265#comment-16903265
 ] 

Micah Kornfield commented on ARROW-6179:


How would the two options be chosen?

> [C++] ExtensionType subclass for "unknown" types?
> -
>
> Key: ARROW-6179
> URL: https://issues.apache.org/jira/browse/ARROW-6179
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++, when receiving IPC with extension type metadata for a type that is 
> unknown (the name is not registered), we currently fall back to returning the 
> "raw" storage array. The custom metadata (extension name and metadata) is 
> still available in the Field metadata.
> Alternatively, we could also have a generic {{ExtensionType}} class that can 
> hold such "unknown" extension type (eg {{UnknowExtensionType}} or 
> {{GenericExtensionType}}), keeping the extension name and metadata in the 
> Array's type. 
> This could be a single class where several instances can be created given a 
> storage type, extension name and optionally extension metadata. It would be a 
> way to have an unregistered extension type.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments

2019-08-08 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6155.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5028
[https://github.com/apache/arrow/pull/5028]

> [Java] Extract a super interface for vectors whose elements reside in 
> continuous memory segments
> 
>
> Key: ARROW-6155
> URL: https://issues.apache.org/jira/browse/ARROW-6155
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Vectors whose data elements reside in continuous memory segments should 
> implement a common super interface. This will avoid unnecessary code 
> branches.
> For now, such vectors include fixed-width vectors and variable-width vectors. 
> In the future, more vectors can be included.
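
For illustration only, one hypothetical shape such a super interface could take 
(the names below are made up and this is not the interface that was merged):

{code:java}
public interface ContiguousElementVector {
  /** Byte offset where element {@code index} starts in the data buffer. */
  long getElementStartOffset(int index);

  /** Byte offset one past the end of element {@code index} in the data buffer. */
  long getElementEndOffset(int index);
}
{code}

With an interface like this, code that only needs the byte range of each element 
can treat fixed-width and variable-width vectors uniformly.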



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments

2019-08-08 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6155:
---
Component/s: Java

> [Java] Extract a super interface for vectors whose elements reside in 
> continuous memory segments
> 
>
> Key: ARROW-6155
> URL: https://issues.apache.org/jira/browse/ARROW-6155
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Vectors whose data elements reside in continuous memory segments should 
> implement a common super interface. This will avoid unnecessary code 
> branches.
> For now, such vectors include fixed-width vectors and variable-width vectors. 
> In the future, more vectors can be included.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-09 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903949#comment-16903949
 ] 

Micah Kornfield commented on ARROW-6179:


Ok, personally I would like to keep the current behavior at least as the 
default.  One example of relying on unregistered extension types: the BQ storage 
read API uses them to mark fields that don't have a one-to-one correspondence 
with built-in Arrow types (geography and datetime).  In the future someone could 
choose to write custom extension types, but in the meantime these fields don't 
require special handling and flow through without any problem when converting 
to pandas.

> [C++] ExtensionType subclass for "unknown" types?
> -
>
> Key: ARROW-6179
> URL: https://issues.apache.org/jira/browse/ARROW-6179
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++, when receiving IPC with extension type metadata for a type that is 
> unknown (the name is not registered), we currently fall back to returning the 
> "raw" storage array. The custom metadata (extension name and metadata) is 
> still available in the Field metadata.
> Alternatively, we could also have a generic {{ExtensionType}} class that can 
> hold such "unknown" extension type (eg {{UnknowExtensionType}} or 
> {{GenericExtensionType}}), keeping the extension name and metadata in the 
> Array's type. 
> This could be a single class where several instances can be created given a 
> storage type, extension name and optionally extension metadata. It would be a 
> way to have an unregistered extension type.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API

2019-08-09 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6175.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5043
[https://github.com/apache/arrow/pull/5043]

> [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet 
> complex vector API
> 
>
> Key: ARROW-6175
> URL: https://issues.apache.org/jira/browse/ARROW-6175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> i. Currently {{MapVector}} extends {{ListVector}}, so {{MapVector#getMinorType}} 
> returns the wrong {{MinorType}}.
> ii. {{AbstractContainerVector}} currently only has {{addOrGetList}}, 
> {{addOrGetUnion}} and {{addOrGetStruct}}, which do not cover all complex types, 
> such as {{MapVector}} and {{FixedSizeListVector}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6093) [Java] reduce branches in algo for first match in VectorRangeSearcher

2019-08-09 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6093.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5011
[https://github.com/apache/arrow/pull/5011]

> [Java] reduce branches in algo for first match in VectorRangeSearcher
> -
>
> Key: ARROW-6093
> URL: https://issues.apache.org/jira/browse/ARROW-6093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This is a follow up Jira for the improvement suggested by [~fsaintjacques] in 
> the PR for 
> [https://github.com/apache/arrow/pull/4925]
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-07-17 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887610#comment-16887610
 ] 

Micah Kornfield commented on ARROW-3772:


"I'm looking at this. This is not a small project – the assumption that values 
are fully materialized is pretty deeply baked into the library. We also have to 
deal with the "fallback" case where a column chunk starts out dictionary 
encoded and switches mid-stream because the dictionary got too big"

I don't have context on how we originally decided to designate an entire column 
as dictionary encoded, rather than deciding per chunk/record batch column, but 
it seems like this might be another use-case where the proposal on 
encoding/compression might make things easier to code (i.e. specify dictionary 
encoding only on SparseRecordBatches where it makes sense, and fall back to 
dense encoding where it no longer makes sense).

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow 
> DictionaryArray
> -
>
> Key: ARROW-3772
> URL: https://issues.apache.org/jira/browse/ARROW-3772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Stav Nir
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Dictionary data is very common in parquet, in the current implementation 
> parquet-cpp decodes dictionary encoded data always before creating a plain 
> arrow array. This process is wasteful since we could use arrow's 
> DictionaryArray directly and achieve several benefits:
>  # Smaller memory footprint - both in the decoding process and in the 
> resulting arrow table - especially when the dict values are large
>  # Better decoding performance - mostly as a result of the first bullet - 
> less memory fetches and less allocations.
> I think those benefits could achieve significant improvements in runtime.
> My direction for the implementation is to read the indices (through the 
> DictionaryDecoder, after the RLE decoding) and values separately into 2 
> arrays and create a DictionaryArray using them.
> There are some questions to discuss:
>  # Should this be the default behavior for dictionary encoded data
>  # Should it be controlled with a parameter in the API
>  # What should be the policy in case some of the chunks are dictionary 
> encoded and some are not.
> I started implementing this but would like to hear your opinions.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5976) [C++] RETURN_IF_ERROR(ctx) should be namespaced

2019-07-18 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5976:
--

 Summary: [C++] RETURN_IF_ERROR(ctx) should be namespaced
 Key: ARROW-5976
 URL: https://issues.apache.org/jira/browse/ARROW-5976
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield
Assignee: Micah Kornfield
 Fix For: 1.0.0


RETURN_IF_ERROR is a common macro; it shouldn't be exposed in a header file 
without namespacing it to Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5976) [C++] RETURN_IF_ERROR(ctx) should be namespaced

2019-07-18 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-5976:
---
Component/s: C++

> [C++] RETURN_IF_ERROR(ctx) should be namespaced
> ---
>
> Key: ARROW-5976
> URL: https://issues.apache.org/jira/browse/ARROW-5976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Minor
> Fix For: 1.0.0
>
>
> RETURN_IF_ERROR is a common macro; it shouldn't be exposed in a header file 
> without namespacing it to Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5968) [Java] Remove duplicate Preconditions check in JDBC adapter

2019-07-18 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5968.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4896
[https://github.com/apache/arrow/pull/4896]

> [Java] Remove duplicate Preconditions check in JDBC adapter
> ---
>
> Key: ARROW-5968
> URL: https://issues.apache.org/jira/browse/ARROW-5968
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Some Preconditions checks are duplicated in {{JdbcToArrow#sqlToArrow}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5861) [Java] Initial implement to convert Avro record with primitive types

2019-07-18 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5861.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4812
[https://github.com/apache/arrow/pull/4812]

> [Java] Initial implement to convert Avro record with primitive types
> 
>
> Key: ARROW-5861
> URL: https://issues.apache.org/jira/browse/ARROW-5861
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5902) [Java] Implement hash table and equals & hashCode API for dictionary encoding

2019-07-18 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5902.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4846
[https://github.com/apache/arrow/pull/4846]

> [Java] Implement hash table and equals & hashCode API for dictionary encoding
> -
>
> Key: ARROW-5902
> URL: https://issues.apache.org/jira/browse/ARROW-5902
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> As discussed in [https://github.com/apache/arrow/pull/4792]
> Implement a hash table that stores only the hash & index; meanwhile, add an 
> equality-check function to the ValueVector API.
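
A minimal sketch of the "store only hash & index, ask the vector for equality" 
idea (illustrative only, not the table that was merged):

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

public class HashIndexTableSketch {
  // hash -> indices of values with that hash; the values themselves stay in the vector
  private final Map<Integer, List<Integer>> buckets = new HashMap<>();

  /** Returns the index of a previously stored equal value, or -1 after storing {@code index}. */
  int getOrInsert(int hash, int index, BiPredicate<Integer, Integer> valuesEqualAt) {
    List<Integer> bucket = buckets.computeIfAbsent(hash, h -> new ArrayList<>());
    for (int stored : bucket) {
      if (valuesEqualAt.test(stored, index)) {
        return stored;
      }
    }
    bucket.add(index);
    return -1;
  }

  public static void main(String[] args) {
    int[] values = {7, 8, 7};   // stand-in for values living in a vector
    HashIndexTableSketch table = new HashIndexTableSketch();
    BiPredicate<Integer, Integer> eq = (a, b) -> values[a] == values[b];
    System.out.println(table.getOrInsert(Integer.hashCode(values[0]), 0, eq)); // -1: new entry
    System.out.println(table.getOrInsert(Integer.hashCode(values[1]), 1, eq)); // -1: new entry
    System.out.println(table.getOrInsert(Integer.hashCode(values[2]), 2, eq)); // 0: equals index 0
  }
}
{code}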



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5902) [Java] Implement hash table and equals & hashCode API for dictionary encoding

2019-07-18 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-5902:
---
Component/s: Java

> [Java] Implement hash table and equals & hashCode API for dictionary encoding
> -
>
> Key: ARROW-5902
> URL: https://issues.apache.org/jira/browse/ARROW-5902
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> As discussed in [https://github.com/apache/arrow/pull/4792]
> Implement a hash table that stores only the hash & index; meanwhile, add an 
> equality-check function to the ValueVector API.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5990) [Python] RowGroupMetaData.column misses bounds check

2019-07-19 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5990.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4911
[https://github.com/apache/arrow/pull/4911]

> [Python] RowGroupMetaData.column misses bounds check
> 
>
> Key: ARROW-5990
> URL: https://issues.apache.org/jira/browse/ARROW-5990
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {{RowGroupMetaData.column}} currently does not check for negative or too 
> large positive indices, leading to a potential interpreter crash.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5986) [Java] Code cleanup for dictionary encoding

2019-07-19 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5986.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4909
[https://github.com/apache/arrow/pull/4909]

> [Java] Code cleanup for dictionary encoding
> ---
>
> Key: ARROW-5986
> URL: https://issues.apache.org/jira/browse/ARROW-5986
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the last few weeks, we did some refactoring of dictionary encoding.
> Since the newly designed hash table for {{DictionaryEncoder}} and the 
> {{hashCode}} & {{equals}} API in {{ValueVector}} are already checked in, some 
> classes such as {{DictionaryEncodingHashTable}} and {{BaseBinaryVector}}, and 
> the related benchmarks & unit tests, are no longer used.
> Fortunately, these changes did not make it into version 0.14, which makes it 
> possible to remove them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5973) [Java] Variable width vectors' get methods should return null when the underlying data is null

2019-07-19 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5973.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4901
[https://github.com/apache/arrow/pull/4901]

> [Java] Variable width vectors' get methods should return null when the 
> underlying data is null
> --
>
> Key: ARROW-5973
> URL: https://issues.apache.org/jira/browse/ARROW-5973
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> For variable-width vectors (VarCharVector and VarBinaryVector), when the 
> validity bit is not set, it means the underlying data is null, so the get 
> method should return null.
> However, the current implementation throws an IllegalStateException when 
> NULL_CHECKING_ENABLED is set, or returns an empty array when the flag is 
> clear.
> Maybe the purpose of this design is to be consistent with fixed-width 
> vectors. However, the scenario is different: fixed-width vectors (e.g. 
> IntVector) throw an IllegalStateException, simply because the primitive types 
> are non-nullable.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5920) [Java] Support sort & compare for all variable width vectors

2019-07-19 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5920.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4860
[https://github.com/apache/arrow/pull/4860]

> [Java] Support sort & compare for all variable width vectors
> 
>
> Key: ARROW-5920
> URL: https://issues.apache.org/jira/browse/ARROW-5920
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> All types of variable-width vector can reuse the same comparator for sorting 
> & searching.
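
A minimal sketch of such a shared comparator: comparing the raw bytes of two 
elements works identically for VarCharVector, VarBinaryVector, etc. (illustrative 
only; a real implementation would likely compare in place on the underlying 
buffers instead of copying with get()):

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;

public class VariableWidthComparatorSketch {
  // Lexicographic, unsigned byte-wise comparison of two elements of the vector.
  static int compare(VarCharVector vector, int leftIndex, int rightIndex) {
    byte[] left = vector.get(leftIndex);
    byte[] right = vector.get(rightIndex);
    int len = Math.min(left.length, right.length);
    for (int i = 0; i < len; i++) {
      int cmp = Byte.toUnsignedInt(left[i]) - Byte.toUnsignedInt(right[i]);
      if (cmp != 0) {
        return cmp;
      }
    }
    return left.length - right.length;     // shorter prefix sorts first
  }

  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator();
         VarCharVector v = new VarCharVector("v", allocator)) {
      v.allocateNew(2);
      v.setSafe(0, "apple".getBytes(StandardCharsets.UTF_8));
      v.setSafe(1, "banana".getBytes(StandardCharsets.UTF_8));
      v.setValueCount(2);
      System.out.println(compare(v, 0, 1) < 0);  // true: "apple" < "banana"
    }
  }
}
{code}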



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5918) [Java] Add get to BaseIntVector interface

2019-07-19 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5918.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4859
[https://github.com/apache/arrow/pull/4859]

> [Java] Add get to BaseIntVector interface
> -
>
> Key: ARROW-5918
> URL: https://issues.apache.org/jira/browse/ARROW-5918
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> 1. The set method should not use long as a parameter. It is hardly ever the 
> case that there are more than 2^32 distinct values in a dictionary. If that 
> really happens, it probably means we should not have used a dictionary in the 
> first place. 
> 2. In addition to the get method, there should also be a set method. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5988) [Java] Avro adapter implement simple Record type

2019-07-23 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-5988:
---
Component/s: Java

> [Java] Avro adapter implement simple Record type 
> -
>
> Key: ARROW-5988
> URL: https://issues.apache.org/jira/browse/ARROW-5988
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> 1. Implement the simple Record type which only contains primitive types.
> 2. Add a ByteBuffer cache in the String/Bytes consumers to reduce object 
> creation. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5988) [Java] Avro adapter implement simple Record type

2019-07-23 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5988.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4910
[https://github.com/apache/arrow/pull/4910]

> [Java] Avro adapter implement simple Record type 
> -
>
> Key: ARROW-5988
> URL: https://issues.apache.org/jira/browse/ARROW-5988
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> 1. Implement the simple Record type which only contains primitive types.
> 2. Add a ByteBuffer cache in the String/Bytes consumers to reduce object 
> creation. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6019) [Java] Port Jdbc and Avro adapter to new directory

2019-07-23 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891593#comment-16891593
 ] 

Micah Kornfield commented on ARROW-6019:


Please hold off doing this until we reach consensus on the mailing list.

> [Java] Port Jdbc and Avro adapter to new directory 
> ---
>
> Key: ARROW-6019
> URL: https://issues.apache.org/jira/browse/ARROW-6019
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>
> As discussed on the mailing list, adapters are different from native readers.
> This issue is used to track the following tasks:
> i. create a new “contrib” directory and move the Jdbc/Avro adapters to it.
> ii. provide more description.
> iii. change the orc readers structure to “converter”.
> cc [~emkornfi...@gmail.com]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5898) [Java] Provide functionality to efficiently compute hash code for arbitrary memory segment

2019-07-23 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5898.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4844
[https://github.com/apache/arrow/pull/4844]

> [Java] Provide functionality to efficiently compute hash code for arbitrary 
> memory segment
> --
>
> Key: ARROW-5898
> URL: https://issues.apache.org/jira/browse/ARROW-5898
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> This issue adds functionality to efficiently compute the hash code for a 
> consecutive memory region. This functionality is important in practical 
> scenarios because it helps:
>  * Avoid unnecessary memory copies.
>  * Avoid repeated conversions between Java objects & Arrow buffers. 
> Since the algorithm for calculating the hash code has significant performance 
> implications, we need to design an interface so that different algorithms can 
> be easily introduced as plug-ins.
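
Illustrative only (this is not the interface the PR adds): the core idea is to 
hash a buffer region in place, so no intermediate byte[] copy or Java object is 
created; a real implementation would read wider words and plug in a stronger 
algorithm such as murmur behind an interface.

{code:java}
import java.nio.ByteBuffer;

public class RegionHashSketch {
  static int hashRegion(ByteBuffer buf, int offset, int length) {
    int h = 0;
    for (int i = 0; i < length; i++) {
      h = 31 * h + buf.get(offset + i);   // absolute get: no copy, buffer position untouched
    }
    return h;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.wrap("hello world".getBytes());
    System.out.println(hashRegion(buf, 0, 5));   // hash of the bytes of "hello"
  }
}
{code}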



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6008) [Release] Don't parallelize the bintray upload script

2019-07-23 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6008.

Resolution: Fixed

Issue resolved by pull request 4929
[https://github.com/apache/arrow/pull/4929]

> [Release] Don't parallelize the bintray upload script
> -
>
> Key: ARROW-6008
> URL: https://issues.apache.org/jira/browse/ARROW-6008
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: binary-upload.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It was spawning a lot of docker containers, and resulted in fragile uploads.
> Patch provided by [~kou] is attached.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5997) [Java] Support dictionary encoding for Union type

2019-07-23 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-5997:
---
Component/s: Java

> [Java] Support dictionary encoding for Union type
> -
>
> Key: ARROW-5997
> URL: https://issues.apache.org/jira/browse/ARROW-5997
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now only the Union type is not supported in dictionary encoding.
> In the last several weeks we did some refactoring of the encoding code, and 
> now it's time to support the Union type.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5997) [Java] Support dictionary encoding for Union type

2019-07-23 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5997.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4917
[https://github.com/apache/arrow/pull/4917]

> [Java] Support dictionary encoding for Union type
> -
>
> Key: ARROW-5997
> URL: https://issues.apache.org/jira/browse/ARROW-5997
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now only the Union type is not supported in dictionary encoding.
> In the last several weeks we did some refactoring of the encoding code, and 
> now it's time to support the Union type.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5999) [C++] Required header files missing when built with -DARROW_DATASET=OFF

2019-07-23 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5999.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4920
[https://github.com/apache/arrow/pull/4920]

> [C++] Required header files missing when built with -DARROW_DATASET=OFF
> ---
>
> Key: ARROW-5999
> URL: https://issues.apache.org/jira/browse/ARROW-5999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Steven Fackler
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  
> {noformat}
> In file included from /opt/arrow/include/arrow/type_fwd.h:23:0,
>  from /opt/arrow/include/arrow/type.h:29,
>  from /opt/arrow/include/arrow/array.h:32,
>  from /opt/arrow/include/arrow/api.h:23,
>  from src/bindings.cc:1:
> /opt/arrow/include/arrow/util/iterator.h:20:10: fatal error: 
> arrow/dataset/visibility.h: No such file or directory
>  #include "arrow/dataset/visibility.h"
>   ^~~~{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5835) [Java] Support Dictionary Encoding for binary type

2019-07-15 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5835.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4792
[https://github.com/apache/arrow/pull/4792]

> [Java] Support Dictionary Encoding for binary type
> --
>
> Key: ARROW-5835
> URL: https://issues.apache.org/jira/browse/ARROW-5835
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> This is currently not implemented because a byte array cannot be used as a 
> HashMap key.
> One possible approach is to wrap the byte arrays in an object that implements 
> equals and hashCode.
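
A minimal sketch of that wrapping idea (illustrative only; BytesKey is a made-up 
name, not the code that was merged): arrays use identity equals/hashCode, so a 
thin wrapper delegating to Arrays.equals/Arrays.hashCode makes them usable as 
HashMap keys.

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BytesKeySketch {
  static final class BytesKey {
    private final byte[] bytes;
    BytesKey(byte[] bytes) { this.bytes = bytes; }
    @Override public boolean equals(Object o) {
      return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
    }
    @Override public int hashCode() { return Arrays.hashCode(bytes); }
  }

  public static void main(String[] args) {
    Map<BytesKey, Integer> dictionary = new HashMap<>();
    dictionary.put(new BytesKey(new byte[] {1, 2}), 0);
    System.out.println(dictionary.get(new BytesKey(new byte[] {1, 2})));  // 0: value equality works
  }
}
{code}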



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5884) [Java] Fix the get method of StructVector

2019-07-15 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5884.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4831
[https://github.com/apache/arrow/pull/4831]

> [Java] Fix the get method of StructVector
> -
>
> Key: ARROW-5884
> URL: https://issues.apache.org/jira/browse/ARROW-5884
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When the data at the specified location is null, there is no need to call the 
> method from super to set the reader:
>  holder.isSet = isSet(index);
>  super.get(index, holder);



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

