[jira] [Updated] (ARROW-6024) [Java] Provide more hash algorithms
[ https://issues.apache.org/jira/browse/ARROW-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated ARROW-6024: --- Description: Provide more hash algorithms to choose from for different scenarios. In particular, we provide the following hash algorithms:
* Simple hasher: A hasher that calculates the hash code of integers as is and does not perform any finalization, so the computation is extremely efficient, but the quality of the produced hash codes may not be good.
* Murmur finalizing hasher: Finalizes the hash code with the Murmur hashing algorithm. Details of the algorithm can be found in [https://en.wikipedia.org/wiki/MurmurHash]. Murmur hashing is computationally expensive, as it involves several integer multiplications. However, the produced hash codes have good quality in the sense that they are uniformly distributed in the universe.

was: Provide more hash algorithms to choose from for different scenarios. In particular, we provide the following hash algorithms:
* Simple hasher: A hasher that calculates the hash code of integers as is and does not perform any finalization, so the computation is extremely efficient, but the quality of the produced hash codes may not be good.
* Murmur finalizing hasher: Finalizes the hash code with the Murmur hashing algorithm. Details of the algorithm can be found in https://en.wikipedia.org/wiki/MurmurHash. Murmur hashing is computationally expensive, as it involves several integer multiplications. However, the produced hash codes have good quality in the sense that they are uniformly distributed in the universe.
* Jenkins finalizing hasher: Finalizes the hash code with Bob Jenkins' algorithm. Details of this algorithm can be found in http://www.burtleburtle.net/bob/hash/integer.html. Jenkins hashing is less computationally expensive than Murmur hashing, as it involves no integer multiplication. However, the produced hash codes also have good quality in the sense that they are uniformly distributed in the universe.
* Non-negative hasher: A wrapper for another hasher that makes the generated hash code non-negative. This can be useful for scenarios like hash tables.

> [Java] Provide more hash algorithms > > > Key: ARROW-6024 > URL: https://issues.apache.org/jira/browse/ARROW-6024 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 8h 20m > Remaining Estimate: 0h > > Provide more hash algorithms to choose from for different scenarios. In > particular, we provide the following hash algorithms: > * Simple hasher: A hasher that calculates the hash code of integers as is > and does not perform any finalization, so the computation is extremely > efficient, but the quality of the produced hash codes may not be good. > * Murmur finalizing hasher: Finalizes the hash code with the Murmur hashing > algorithm. Details of the algorithm can be found in > [https://en.wikipedia.org/wiki/MurmurHash]. Murmur hashing is computationally > expensive, as it involves several integer multiplications. However, the > produced hash codes have good quality in the sense that they are uniformly > distributed in the universe. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
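The hashers described in this issue can be sketched in a few lines. The following Python sketch is illustrative only (the function names are invented here, not the Arrow Java API): it shows the simple hasher, the 32-bit Murmur3 finalization mix from the Wikipedia article, and the non-negative wrapper.

```python
def murmur_fmix32(h):
    """32-bit Murmur3 finalization mix: xor-shifts and integer
    multiplications spread the input bits so the resulting hash
    codes are uniformly distributed."""
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def simple_hash(x):
    """Simple hasher: the integer value itself, no finalization."""
    return x & 0xFFFFFFFF

def non_negative(hash_fn):
    """Wrapper that clears the sign bit so another hasher's output
    can be used directly, e.g. as a hash-table bucket index."""
    def wrapped(x):
        return hash_fn(x) & 0x7FFFFFFF
    return wrapped
```

For example, `non_negative(murmur_fmix32)` always yields values in `[0, 2**31)`, which is the property that makes the wrapper useful for hash tables.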
[jira] [Resolved] (ARROW-6024) [Java] Provide more hash algorithms
[ https://issues.apache.org/jira/browse/ARROW-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6024. Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4934 [https://github.com/apache/arrow/pull/4934] > [Java] Provide more hash algorithms > > > Key: ARROW-6024 > URL: https://issues.apache.org/jira/browse/ARROW-6024 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 8h 20m > Remaining Estimate: 0h > > Provide more hash algorithms to choose from for different scenarios. In > particular, we provide the following hash algorithms: > * Simple hasher: A hasher that calculates the hash code of integers as is > and does not perform any finalization, so the computation is extremely > efficient, but the quality of the produced hash codes may not be good. > * Murmur finalizing hasher: Finalizes the hash code with the Murmur hashing > algorithm. Details of the algorithm can be found in > https://en.wikipedia.org/wiki/MurmurHash. Murmur hashing is computationally > expensive, as it involves several integer multiplications. However, the > produced hash codes have good quality in the sense that they are uniformly > distributed in the universe. > * Jenkins finalizing hasher: Finalizes the hash code with Bob Jenkins' > algorithm. Details of this algorithm can be found in > http://www.burtleburtle.net/bob/hash/integer.html. Jenkins hashing is less > computationally expensive than Murmur hashing, as it involves no integer > multiplication. However, the produced hash codes also have good quality in > the sense that they are uniformly distributed in the universe. > * Non-negative hasher: A wrapper for another hasher that makes the generated > hash code non-negative. This can be useful for scenarios like hash tables. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6193) [GLib] Add missing require in test
[ https://issues.apache.org/jira/browse/ARROW-6193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6193: -- Labels: pull-request-available (was: ) > [GLib] Add missing require in test > -- > > Key: ARROW-6193 > URL: https://issues.apache.org/jira/browse/ARROW-6193 > Project: Apache Arrow > Issue Type: Test > Components: GLib >Reporter: Sutou Kouhei >Assignee: Sutou Kouhei >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6193) [GLib] Add missing require in test
Sutou Kouhei created ARROW-6193: --- Summary: [GLib] Add missing require in test Key: ARROW-6193 URL: https://issues.apache.org/jira/browse/ARROW-6193 Project: Apache Arrow Issue Type: Test Components: GLib Reporter: Sutou Kouhei Assignee: Sutou Kouhei -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6188) [GLib] Add garrow_array_is_in()
[ https://issues.apache.org/jira/browse/ARROW-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-6188. - Resolution: Fixed Issue resolved by pull request 5047 [https://github.com/apache/arrow/pull/5047] > [GLib] Add garrow_array_is_in() > --- > > Key: ARROW-6188 > URL: https://issues.apache.org/jira/browse/ARROW-6188 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6192) [GLib] Use the same SO version as C++
[ https://issues.apache.org/jira/browse/ARROW-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6192: -- Labels: pull-request-available (was: ) > [GLib] Use the same SO version as C++ > - > > Key: ARROW-6192 > URL: https://issues.apache.org/jira/browse/ARROW-6192 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Sutou Kouhei >Assignee: Sutou Kouhei >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6192) [GLib] Use the same SO version as C++
Sutou Kouhei created ARROW-6192: --- Summary: [GLib] Use the same SO version as C++ Key: ARROW-6192 URL: https://issues.apache.org/jira/browse/ARROW-6192 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Sutou Kouhei Assignee: Sutou Kouhei -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6093) [Java] reduce branches in algo for first match in VectorRangeSearcher
[ https://issues.apache.org/jira/browse/ARROW-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6093. Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5011 [https://github.com/apache/arrow/pull/5011] > [Java] reduce branches in algo for first match in VectorRangeSearcher > - > > Key: ARROW-6093 > URL: https://issues.apache.org/jira/browse/ARROW-6093 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Pindikura Ravindra >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > This is a follow up Jira for the improvement suggested by [~fsaintjacques] in > the PR for > [https://github.com/apache/arrow/pull/4925] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
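The improvement this issue tracks is about removing a per-iteration equality branch when searching for the first match. A classic way to do that is the lower-bound formulation of binary search, which keeps a single comparison per iteration and defers the equality test until after the loop. The Python below is an illustration of that idea, not the actual VectorRangeSearcher code:

```python
def first_match(values, target):
    """Lower-bound binary search over a sorted list: one comparison
    per iteration; equality is checked only once, after the loop."""
    lo, hi = 0, len(values)
    while lo < hi:
        mid = (lo + hi) // 2
        if values[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    # lo is now the first index whose value is >= target
    if lo < len(values) and values[lo] == target:
        return lo
    return -1
```

Compared with a textbook three-way binary search, this version never tests for equality inside the loop, which is the kind of branch reduction the PR comment suggested.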
[jira] [Resolved] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API
[ https://issues.apache.org/jira/browse/ARROW-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6175. Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5043 [https://github.com/apache/arrow/pull/5043] > [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet > complex vector API > > > Key: ARROW-6175 > URL: https://issues.apache.org/jira/browse/ARROW-6175 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h > Remaining Estimate: 0h > > i. Currently {{MapVector}} extends {{ListVector}}, so {{MapVector#getMinorType}} > returns the wrong {{MinorType}}. > ii. {{AbstractContainerVector}} currently only has {{addOrGetList}}, > {{addOrGetUnion}}, and {{addOrGetStruct}}, which do not support all complex types > such as {{MapVector}} and {{FixedSizeListVector}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6188) [GLib] Add garrow_array_is_in()
[ https://issues.apache.org/jira/browse/ARROW-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei updated ARROW-6188: Summary: [GLib] Add garrow_array_is_in() (was: [GLib] Add garrow_array_isin()) > [GLib] Add garrow_array_is_in() > --- > > Key: ARROW-6188 > URL: https://issues.apache.org/jira/browse/ARROW-6188 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6186) [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package
[ https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei updated ARROW-6186: Component/s: Packaging > [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev > debian package > --- > > Key: ARROW-6186 > URL: https://issues.apache.org/jira/browse/ARROW-6186 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma, Packaging >Affects Versions: 0.14.1 >Reporter: Wannes G >Assignee: Sutou Kouhei >Priority: Major > Labels: debian, packaging > > See > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install] > Issue is still present on latest master branch, the debian install script is > correct: > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install] > The first line is missing from the ubuntu install script causing no headers > to be installed when apt-get is used to install libplasma-dev. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-6186) [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package
[ https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei reassigned ARROW-6186: --- Assignee: Sutou Kouhei > [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian > package > > > Key: ARROW-6186 > URL: https://issues.apache.org/jira/browse/ARROW-6186 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma >Affects Versions: 0.14.1 >Reporter: Wannes G >Assignee: Sutou Kouhei >Priority: Major > Labels: debian, packaging > > See > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install] > Issue is still present on latest master branch, the debian install script is > correct: > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install] > The first line is missing from the ubuntu install script causing no headers > to be installed when apt-get is used to install libplasma-dev. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6186) [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package
[ https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei updated ARROW-6186: Summary: [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package (was: [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package) > [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev > debian package > --- > > Key: ARROW-6186 > URL: https://issues.apache.org/jira/browse/ARROW-6186 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma >Affects Versions: 0.14.1 >Reporter: Wannes G >Assignee: Sutou Kouhei >Priority: Major > Labels: debian, packaging > > See > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install] > Issue is still present on latest master branch, the debian install script is > correct: > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install] > The first line is missing from the ubuntu install script causing no headers > to be installed when apt-get is used to install libplasma-dev. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904082#comment-16904082 ] Andrey Krivonogov edited comment on ARROW-6058 at 8/9/19 8:34 PM: -- Hi [~wesmckinn], I have experienced the same issue as [~sid88in]. I also managed to reproduce it with synthetic data: {code:java} import numpy as np import pyarrow as pa import pyarrow.parquet as pq import s3fs table = pa.Table.from_arrays([pa.array(np.arange(3 * 10 ** 7), type=pa.int64())], ['col']) path = 's3://bucket/path/0.parquet' fs = s3fs.S3FileSystem() pq.write_table(table, path, filesystem=fs, row_group_size=10 ** 7) table_read = pq.read_table(path, filesystem=fs){code} This snippet raises a similar error: {code:java} ArrowIOError: Unexpected end of stream: Page was smaller (241959) than expected (524605) {code} The problem seemed to be related to the s3fs version. Package versions I have: {code:java} python 3.6.7 packages installed with conda (via conda-forge) boto3==1.9.204 botocore==1.12.204 numpy==1.16.2 pyarrow==0.14.1 {code} The error was raised with {code:java} s3fs==0.3.3{code} but everything worked fine with {code:java} s3fs==0.2.2 {code} Thank you in advance for your help! was (Author: krivonogov): Hi [~wesmckinn], I have experienced the same issue as [~sid88in]. I also managed to reproduce it with synthetic data: {code:java} import numpy as np import pyarrow as pa import pyarrow.parquet as pq import s3fs table = pa.Table.from_arrays([pa.array(np.arange(3 * 10 ** 7), type=pa.int64())], ['col']) path = 's3://bucket/path/0.parquet' fs = s3fs.S3FileSystem() pq.write_table(table, path, filesystem=fs, row_group_size=10 ** 7) table_read = pq.read_table(path, filesystem=fs){code} This snippet raises a similar error: {code:java} ArrowIOError: Unexpected end of stream: Page was smaller (241959) than expected (524605) {code} The problem seemed to be related to the s3fs version.
Package versions I have: {code:java} python 3.6.7 packages installed with conda (via conda-forge) boto3==1.9.204 botocore==1.12.204 numpy==1.16.2 pyarrow==0.14.1 {code} The error was raised with {code:java} s3fs==0.3.3{code} but everything worked fine with {code:java} s3fs==0.2.2 {code} Thank you in advance for your help! > [Python][Parquet] Failure when reading Parquet file from S3 > > > Key: ARROW-6058 > URL: https://issues.apache.org/jira/browse/ARROW-6058 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Siddharth >Priority: Major > Labels: parquet > > I am reading parquet data from S3 and get an ArrowIOError. > Size of the data: 32 part files 90 MB each (3GB approx) > Number of records: Approx 100M > Code Snippet: > {code:java} > from s3fs import S3FileSystem > import pyarrow.parquet as pq > s3 = S3FileSystem() > dataset = pq.ParquetDataset("s3://location", filesystem=s3) > df = dataset.read_pandas().to_pandas() > {code} > Stack Trace: > {code:java} > df = dataset.read_pandas().to_pandas() > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1113, in read_pandas > return self.read(use_pandas_metadata=True, **kwargs) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1085, in read > use_pandas_metadata=use_pandas_metadata) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, > in read > table = reader.read(**options) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, > in read > use_threads=use_threads) > File "pyarrow/_parquet.pyx", line 1086, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) > than expected (263929) > {code} > > *Note: Same code works on relatively smaller dataset (approx < 50M records)* > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6191) [C++] buffer size default value will throw an error
[ https://issues.apache.org/jira/browse/ARROW-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zherui Cao updated ARROW-6191: -- Summary: [C++] buffer size default value will throw an error (was: [c++] buffer size default value will throw an error) > [C++] buffer size default value will throw an error > --- > > Key: ARROW-6191 > URL: https://issues.apache.org/jira/browse/ARROW-6191 > Project: Apache Arrow > Issue Type: Bug >Reporter: Zherui Cao >Priority: Major > > [https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L40] > this sets the default size to 0, > but in > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L259] > it prevents the buffer size from being 0. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6191) [c++] buffer size default value will throw an error
[ https://issues.apache.org/jira/browse/ARROW-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zherui Cao updated ARROW-6191: -- Summary: [c++] buffer size default value will throw an error (was: Arrow error: Invalid: Buffer size should be positive) > [c++] buffer size default value will throw an error > --- > > Key: ARROW-6191 > URL: https://issues.apache.org/jira/browse/ARROW-6191 > Project: Apache Arrow > Issue Type: Bug >Reporter: Zherui Cao >Priority: Major > > [https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L40] > this sets the default size to 0, > but in > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L259] > it prevents the buffer size from being 0. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6191) Arrow error: Invalid: Buffer size should be positive
Zherui Cao created ARROW-6191: - Summary: Arrow error: Invalid: Buffer size should be positive Key: ARROW-6191 URL: https://issues.apache.org/jira/browse/ARROW-6191 Project: Apache Arrow Issue Type: Bug Reporter: Zherui Cao [https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L40] this sets the default size to 0, but in [https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L259] it prevents the buffer size from being 0. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
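The conflict described here can be shown in miniature: one component supplies a default size of 0 while another rejects non-positive sizes at construction time. The Python below is a hypothetical sketch of the two sides of the mismatch, not the actual Arrow C++ code:

```python
# Stand-in for the default taken from the parquet properties header.
DEFAULT_BUFFER_SIZE = 0

class BufferedInputStream:
    """Stand-in for the guard in arrow/io/buffered.cc: the buffer
    size must be strictly positive or construction fails."""
    def __init__(self, raw, buffer_size=DEFAULT_BUFFER_SIZE):
        if buffer_size <= 0:
            raise ValueError("Buffer size should be positive")
        self.raw = raw
        self.buffer_size = buffer_size
```

Constructing with the default immediately raises, which is the error the reporter hit; either the default must become positive or the consumer must substitute its own size.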
[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904082#comment-16904082 ] Andrey Krivonogov commented on ARROW-6058: -- Hi [~wesmckinn], I have experienced the same issue as [~sid88in]. I also managed to reproduce it with synthetic data: {code:java} import numpy as np import pyarrow as pa import pyarrow.parquet as pq import s3fs table = pa.Table.from_arrays([pa.array(np.arange(3 * 10 ** 7), type=pa.int64())], ['col']) path = 's3://bucket/path/0.parquet' fs = s3fs.S3FileSystem() pq.write_table(table, path, filesystem=fs, row_group_size=10 ** 7) table_read = pq.read_table(path, filesystem=fs){code} This snippet raises a similar error: {code:java} ArrowIOError: Unexpected end of stream: Page was smaller (241959) than expected (524605) {code} The problem seemed to be related to the s3fs version. Package versions I have: {code:java} python 3.6.7 packages installed with conda (via conda-forge) boto3==1.9.204 botocore==1.12.204 numpy==1.16.2 pyarrow==0.14.1 {code} The error was raised with {code:java} s3fs==0.3.3{code} but everything worked fine with {code:java} s3fs==0.2.2 {code} Thank you in advance for your help! > [Python][Parquet] Failure when reading Parquet file from S3 > > > Key: ARROW-6058 > URL: https://issues.apache.org/jira/browse/ARROW-6058 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Siddharth >Priority: Major > Labels: parquet > > I am reading parquet data from S3 and get an ArrowIOError. 
> Size of the data: 32 part files 90 MB each (3GB approx) > Number of records: Approx 100M > Code Snippet: > {code:java} > from s3fs import S3FileSystem > import pyarrow.parquet as pq > s3 = S3FileSystem() > dataset = pq.ParquetDataset("s3://location", filesystem=s3) > df = dataset.read_pandas().to_pandas() > {code} > Stack Trace: > {code:java} > df = dataset.read_pandas().to_pandas() > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1113, in read_pandas > return self.read(use_pandas_metadata=True, **kwargs) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1085, in read > use_pandas_metadata=use_pandas_metadata) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, > in read > table = reader.read(**options) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, > in read > use_threads=use_threads) > File "pyarrow/_parquet.pyx", line 1086, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) > than expected (263929) > {code} > > *Note: Same code works on relatively smaller dataset (approx < 50M records)* > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6190) [C++] Define and declare functions regardless of NDEBUG
[ https://issues.apache.org/jira/browse/ARROW-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6190: -- Labels: pull-request-available (was: ) > [C++] Define and declare functions regardless of NDEBUG > --- > > Key: ARROW-6190 > URL: https://issues.apache.org/jira/browse/ARROW-6190 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Omer Ozarslan >Priority: Minor > Labels: pull-request-available > > NDEBUG is not shipped in linker flags, so I got a linker error with release > build on FixedSizeBinaryBuilder::UnsafeAppend(util::string_view value) call, > since it makes a call to CheckValueSize. > This is somewhat a follow-up of ARROW-2313. I took the same path by removing > NDEBUG ifdefs around CheckValueSize definition and declaration. > I applied the same fix to CheckUTF8Initialized as well after grepping the > source code for "#ifndef NDEBUG" and figured out it has the same issue. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6190) [C++] Define and declare functions regardless of NDEBUG
[ https://issues.apache.org/jira/browse/ARROW-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904070#comment-16904070 ] Omer Ozarslan commented on ARROW-6190: -- Submitted PR on https://github.com/apache/arrow/pull/5049. > [C++] Define and declare functions regardless of NDEBUG > --- > > Key: ARROW-6190 > URL: https://issues.apache.org/jira/browse/ARROW-6190 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Omer Ozarslan >Priority: Minor > > NDEBUG is not shipped in linker flags, so I got a linker error with release > build on FixedSizeBinaryBuilder::UnsafeAppend(util::string_view value) call, > since it makes a call to CheckValueSize. > This is somewhat a follow-up of ARROW-2313. I took the same path by removing > NDEBUG ifdefs around CheckValueSize definition and declaration. > I applied the same fix to CheckUTF8Initialized as well after grepping the > source code for "#ifndef NDEBUG" and figured out it has the same issue. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6190) [C++] Define and declare functions regardless of NDEBUG
[ https://issues.apache.org/jira/browse/ARROW-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omer Ozarslan updated ARROW-6190: - Summary: [C++] Define and declare functions regardless of NDEBUG (was: Define and declare functions regardless of NDEBUG) > [C++] Define and declare functions regardless of NDEBUG > --- > > Key: ARROW-6190 > URL: https://issues.apache.org/jira/browse/ARROW-6190 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Omer Ozarslan >Priority: Minor > > NDEBUG is not shipped in linker flags, so I got a linker error with release > build on FixedSizeBinaryBuilder::UnsafeAppend(util::string_view value) call, > since it makes a call to CheckValueSize. > This is somewhat a follow-up of ARROW-2313. I took the same path by removing > NDEBUG ifdefs around CheckValueSize definition and declaration. > I applied the same fix to CheckUTF8Initialized as well after grepping the > source code for "#ifndef NDEBUG" and figured out it has the same issue. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6190) Define and declare functions regardless of NDEBUG
Omer Ozarslan created ARROW-6190: Summary: Define and declare functions regardless of NDEBUG Key: ARROW-6190 URL: https://issues.apache.org/jira/browse/ARROW-6190 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Omer Ozarslan NDEBUG is not shipped in linker flags, so I got a linker error with a release build on the FixedSizeBinaryBuilder::UnsafeAppend(util::string_view value) call, since it makes a call to CheckValueSize. This is somewhat of a follow-up to ARROW-2313. I took the same path by removing the NDEBUG ifdefs around the CheckValueSize definition and declaration. I applied the same fix to CheckUTF8Initialized as well after grepping the source code for "#ifndef NDEBUG" and finding that it has the same issue. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6189) [Rust] Plain encoded boolean column chunks limited to 2048 values
Simon Jones created ARROW-6189: -- Summary: [Rust] Plain encoded boolean column chunks limited to 2048 values Key: ARROW-6189 URL: https://issues.apache.org/jira/browse/ARROW-6189 Project: Apache Arrow Issue Type: Bug Components: Rust Affects Versions: 0.14.1 Reporter: Simon Jones encoding::PlainEncoder::new creates a BitWriter with 256 bytes of storage, which limits the data page size that can be used. I suggest that in {{impl Encoder for PlainEncoder}} the return value of put_value be tested and the BitWriter flushed and cleared whenever it runs out of space. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
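The suggested fix (check the return value of put_value, and flush and clear the BitWriter when it runs out of space) can be sketched as follows. The class names mirror the Rust types, but this Python is a hypothetical illustration, not the parquet-rs code:

```python
class BitWriter:
    """Fixed-capacity bit writer; put_value returns False when full,
    mirroring the 256-byte writer mentioned in the issue."""
    def __init__(self, capacity_bytes=256):
        self.capacity_bits = capacity_bytes * 8
        self.bits = []

    def put_value(self, bit):
        if len(self.bits) >= self.capacity_bits:
            return False  # out of space: the caller must flush
        self.bits.append(1 if bit else 0)
        return True

    def flush(self):
        """Pack buffered bits into bytes (LSB first) and clear."""
        out = bytearray((len(self.bits) + 7) // 8)
        for i, b in enumerate(self.bits):
            out[i // 8] |= b << (i % 8)
        self.bits.clear()
        return bytes(out)

class PlainBooleanEncoder:
    """Encoder that flushes the writer instead of silently stopping
    at 2048 values (256 bytes * 8 bits)."""
    def __init__(self):
        self.writer = BitWriter(256)
        self.encoded = bytearray()

    def put(self, value):
        if not self.writer.put_value(value):
            # Writer full: flush its (byte-aligned) contents and retry.
            self.encoded += self.writer.flush()
            self.writer.put_value(value)

    def finish(self):
        self.encoded += self.writer.flush()
        return bytes(self.encoded)
```

Because the writer only fills up at an exact multiple of 8 bits, the intermediate flushes stay byte-aligned; only the final flush may emit a partial byte.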
[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet
[ https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904020#comment-16904020 ] Wes McKinney commented on ARROW-3246: - Making some progress on this. It's a can of worms because of the interplay between the ColumnWriter, Encoder, and Statistics types. > [Python][Parquet] direct reading/writing of pandas categoricals in parquet > -- > > Key: ARROW-3246 > URL: https://issues.apache.org/jira/browse/ARROW-3246 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Assignee: Wes McKinney >Priority: Minor > Labels: parquet > Fix For: 1.0.0 > > > Parquet supports "dictionary encoding" of column data in a manner very > similar to the concept of Categoricals in pandas. It is natural to use this > encoding for a column which originated as a categorical. Conversely, when > loading, if the file metadata says that a given column came from a pandas (or > arrow) categorical, then we can trust that the whole of the column is > dictionary-encoded and load the data directly into a categorical column, > rather than expanding the labels upon load and recategorising later. > If the data does not have the pandas metadata, then the guarantee cannot > hold, and we cannot assume either that the whole column is dictionary encoded > or that the labels are the same throughout. In this case, the current > behaviour is fine. > > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
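For background, Parquet dictionary encoding and pandas categoricals share the same representation: integer codes indexing into a table of unique labels. A minimal, library-free sketch of the encode/decode pair (illustrative only, not the ColumnWriter/Encoder internals discussed above):

```python
def dictionary_encode(values):
    """Map each value to an integer code; unique labels are kept in
    first-appearance order, like categorical categories."""
    categories, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(categories)
            categories.append(v)
        codes.append(index[v])
    return codes, categories

def dictionary_decode(codes, categories):
    """Expand codes back to labels: what a reader without the pandas
    metadata has to do before recategorising later."""
    return [categories[c] for c in codes]
```

Loading directly into a categorical means keeping the (codes, categories) pair as-is, skipping the decode/recategorise round trip entirely.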
[jira] [Updated] (ARROW-6186) [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package
[ https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6186: Summary: [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package (was: Plasma headers not included for ubuntu-xenial libplasma-dev debian package) > [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian > package > > > Key: ARROW-6186 > URL: https://issues.apache.org/jira/browse/ARROW-6186 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma >Affects Versions: 0.14.1 >Reporter: Wannes G >Priority: Major > > See > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install] > Issue is still present on latest master branch, the debian install script is > correct: > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install] > The first line is missing from the ubuntu install script causing no headers > to be installed when apt-get is used to install libplasma-dev. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6186) [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package
[ https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6186: Labels: debian packaging (was: ) > [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian > package > > > Key: ARROW-6186 > URL: https://issues.apache.org/jira/browse/ARROW-6186 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma >Affects Versions: 0.14.1 >Reporter: Wannes G >Priority: Major > Labels: debian, packaging > > See > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install] > Issue is still present on latest master branch, the debian install script is > correct: > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install] > The first line is missing from the ubuntu install script causing no headers > to be installed when apt-get is used to install libplasma-dev. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?
[ https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903969#comment-16903969 ] Joris Van den Bossche commented on ARROW-6179: -- The BigQuery usage of this, is that open source code? (To familiarize myself with an application of the extension types.) You mean that you use the extension type key (ARROW:extension:name) in the metadata without it being an actual extension type? Certainly, if we were to create such a generic extension array, I think it should work in more places in Arrow than is currently the case (e.g. I opened issues to fall back to the storage type when converting to pandas or to Parquet). > [C++] ExtensionType subclass for "unknown" types? > - > > Key: ARROW-6179 > URL: https://issues.apache.org/jira/browse/ARROW-6179 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Joris Van den Bossche >Priority: Major > > In C++, when receiving IPC with extension type metadata for a type that is > unknown (the name is not registered), we currently fall back to returning the > "raw" storage array. The custom metadata (extension name and metadata) is > still available in the Field metadata. > Alternatively, we could also have a generic {{ExtensionType}} class that can > hold such an "unknown" extension type (e.g. {{UnknownExtensionType}} or > {{GenericExtensionType}}), keeping the extension name and metadata in the > Array's type. > This could be a single class where several instances can be created given a > storage type, an extension name and, optionally, extension metadata. It would be a > way to have an unregistered extension type. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?
[ https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903949#comment-16903949 ] Micah Kornfield commented on ARROW-6179: OK, personally I would like to keep the current behavior, at least as the default. One example of using unregistered extension types is the BQ storage read API, which uses them to mark fields that don't have a one-to-one correspondence with built-in Arrow types (geography and datetime). In the future someone could choose to write custom extension types, but in the meantime they don't require special handling and flow through without any problem when converting to pandas. > [C++] ExtensionType subclass for "unknown" types? > - > > Key: ARROW-6179 > URL: https://issues.apache.org/jira/browse/ARROW-6179 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Joris Van den Bossche >Priority: Major > > In C++, when receiving IPC with extension type metadata for a type that is > unknown (the name is not registered), we currently fall back to returning the > "raw" storage array. The custom metadata (extension name and metadata) is > still available in the Field metadata. > Alternatively, we could also have a generic {{ExtensionType}} class that can > hold such an "unknown" extension type (e.g. {{UnknownExtensionType}} or > {{GenericExtensionType}}), keeping the extension name and metadata in the > Array's type. > This could be a single class where several instances can be created given a > storage type, an extension name and, optionally, extension metadata. It would be a > way to have an unregistered extension type. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?
[ https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903935#comment-16903935 ] Joris Van den Bossche commented on ARROW-6179: -- I suppose, if we go for this, it would replace the automatic fallback. And then a user can still get the storage array as a fallback themselves? Although, I see that there is a PR adding {{IpcOptions}} for writing, so if needed, there might also be such options for reading. To be honest, I don't have a good enough idea of the potential use cases of the ExtensionType mechanism in C++ to really assess whether it would be generally useful to keep the array in a generic extension array or rather fall back directly to the storage array. I was thinking that for Python usage, this might be useful to be able to send an extension type defined from Python without needing to register a specific subclass in C++. > [C++] ExtensionType subclass for "unknown" types? > - > > Key: ARROW-6179 > URL: https://issues.apache.org/jira/browse/ARROW-6179 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Joris Van den Bossche >Priority: Major > > In C++, when receiving IPC with extension type metadata for a type that is > unknown (the name is not registered), we currently fall back to returning the > "raw" storage array. The custom metadata (extension name and metadata) is > still available in the Field metadata. > Alternatively, we could also have a generic {{ExtensionType}} class that can > hold such an "unknown" extension type (e.g. {{UnknownExtensionType}} or > {{GenericExtensionType}}), keeping the extension name and metadata in the > Array's type. > This could be a single class where several instances can be created given a > storage type, an extension name and, optionally, extension metadata. It would be a > way to have an unregistered extension type. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
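The generic "unknown extension" holder discussed in these comments can be illustrated with a small sketch. This is plain Java pseudocode of the design idea only, not Arrow's C++ (or Java) extension-type API; every class and method name here is hypothetical:

```java
import java.util.List;

// Hypothetical sketch of a generic holder for an unregistered extension type:
// it keeps the storage values together with the extension name and metadata,
// instead of silently dropping down to the raw storage array on read.
// None of these names exist in Arrow; they only illustrate the design.
public class UnknownExtensionSketch {
    private final String extensionName;     // e.g. the ARROW:extension:name value
    private final String extensionMetadata; // serialized type parameters, if any
    private final List<Object> storage;     // stand-in for the storage array

    public UnknownExtensionSketch(String name, String metadata, List<Object> storage) {
        this.extensionName = name;
        this.extensionMetadata = metadata;
        this.storage = storage;
    }

    public String extensionName() { return extensionName; }

    public String extensionMetadata() { return extensionMetadata; }

    // The explicit fallback: a caller can still ask for the storage values,
    // which is what the current automatic behavior returns unconditionally.
    public List<Object> storageFallback() { return storage; }
}
```

One instance per (storage type, name, metadata) triple would give the "single class, several instances" behavior the ticket describes.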
[jira] [Resolved] (ARROW-6117) [Java] Fix the set method of FixedSizeBinaryVector
[ https://issues.apache.org/jira/browse/ARROW-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-6117. --- Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4995 [https://github.com/apache/arrow/pull/4995] > [Java] Fix the set method of FixedSizeBinaryVector > -- > > Key: ARROW-6117 > URL: https://issues.apache.org/jira/browse/ARROW-6117 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > For the set method, if the parameter is null, it should clear the validity > bit. However, the current implementation throws a NullPointerException. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
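The fix described in ARROW-6117 (a null argument to set should clear the validity bit rather than throw a NullPointerException) follows the usual validity-bitmap pattern. A minimal sketch in plain Java, assuming a boolean array in place of Arrow's validity buffer; the class and its fields are illustrative stand-ins, not the FixedSizeBinaryVector implementation:

```java
// Sketch of the null-handling pattern: a null value clears the slot's
// validity flag instead of dereferencing the null argument.
// Illustrative only; Arrow uses packed validity bits, not boolean[].
public class FixedSizeBinarySketch {
    private final byte[] data;
    private final boolean[] validity; // one validity flag per slot
    private final int byteWidth;

    public FixedSizeBinarySketch(int slots, int byteWidth) {
        this.data = new byte[slots * byteWidth];
        this.validity = new boolean[slots];
        this.byteWidth = byteWidth;
    }

    public void set(int index, byte[] value) {
        if (value == null) {
            validity[index] = false; // clear validity bit; no NPE
            return;
        }
        System.arraycopy(value, 0, data, index * byteWidth, byteWidth);
        validity[index] = true;
    }

    public boolean isNull(int index) {
        return !validity[index];
    }
}
```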
[jira] [Updated] (ARROW-6188) [GLib] Add garrow_array_isin()
[ https://issues.apache.org/jira/browse/ARROW-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6188: -- Labels: pull-request-available (was: ) > [GLib] Add garrow_array_isin() > -- > > Key: ARROW-6188 > URL: https://issues.apache.org/jira/browse/ARROW-6188 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6188) [GLib] Add garrow_array_isin()
Yosuke Shiro created ARROW-6188: --- Summary: [GLib] Add garrow_array_isin() Key: ARROW-6188 URL: https://issues.apache.org/jira/browse/ARROW-6188 Project: Apache Arrow Issue Type: New Feature Components: GLib Reporter: Yosuke Shiro Assignee: Yosuke Shiro Fix For: 0.15.0 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6137) [C++][Gandiva] Change output format of castVARCHAR(timestamp) in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-6137. --- Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5014 [https://github.com/apache/arrow/pull/5014] > [C++][Gandiva] Change output format of castVARCHAR(timestamp) in Gandiva > > > Key: ARROW-6137 > URL: https://issues.apache.org/jira/browse/ARROW-6137 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > Format timestamp to yyyy-MM-dd hh:mm:ss.sss -- This message was sent by Atlassian JIRA (v7.6.14#76016)
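The target layout can be reproduced with standard Java time formatting. A short sketch, assuming the ticket's pattern carries the usual yyyy-MM-dd date prefix (the year letters appear to have been swallowed by JIRA's text rendering) with millisecond precision; this illustrates the format only and is not the Gandiva C++ code:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TimestampFormatSketch {
    // Assumed target pattern from the ticket; Java spells milliseconds "SSS"
    // where the ticket writes "sss".
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Render a timestamp in the castVARCHAR(timestamp) output layout.
    static String format(LocalDateTime ts) {
        return FMT.format(ts);
    }
}
```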
[jira] [Created] (ARROW-6187) [C++] fallback to storage type when writing ExtensionType to Parquet
Joris Van den Bossche created ARROW-6187: Summary: [C++] fallback to storage type when writing ExtensionType to Parquet Key: ARROW-6187 URL: https://issues.apache.org/jira/browse/ARROW-6187 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Writing a table that contains an ExtensionType array to a parquet file is not yet implemented. It currently raises "ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: extension" (for a PyExtensionType in this case). I think minimal support can consist of writing the storage type / array. We also might want to save the extension name and metadata in the parquet FileMetadata. Later on, this could potentially be used to restore the extension type when reading. This is related to other issues that need to save the arrow schema (categorical: ARROW-5480, time zones: ARROW-5888). Only in this case, we probably want to store the serialised type in addition to the schema (which only has the extension type's name). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero
[ https://issues.apache.org/jira/browse/ARROW-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar Desabandu updated ARROW-6162: --- Component/s: C++ - Gandiva > [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len > parameter is zero > --- > > Key: ARROW-6162 > URL: https://issues.apache.org/jira/browse/ARROW-6162 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Do not truncate string if length parameter is 0 in castVARCHAR_utf8_int64 > function. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero
[ https://issues.apache.org/jira/browse/ARROW-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6162: -- Labels: pull-request-available (was: ) > [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len > parameter is zero > --- > > Key: ARROW-6162 > URL: https://issues.apache.org/jira/browse/ARROW-6162 > Project: Apache Arrow > Issue Type: Task >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > Do not truncate string if length parameter is 0 in castVARCHAR_utf8_int64 > function. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero
[ https://issues.apache.org/jira/browse/ARROW-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar Desabandu resolved ARROW-6162. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 5040 [https://github.com/apache/arrow/pull/5040] > [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len > parameter is zero > --- > > Key: ARROW-6162 > URL: https://issues.apache.org/jira/browse/ARROW-6162 > Project: Apache Arrow > Issue Type: Task >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Fix For: 1.0.0 > > > Do not truncate string if length parameter is 0 in castVARCHAR_utf8_int64 > function. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
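The behavior resolved in ARROW-6162 (a length parameter of zero means "do not truncate" rather than "truncate to the empty string") can be sketched in a few lines. This is a conceptual Java illustration of the semantics described in the ticket, not Gandiva's castVARCHAR_utf8_int64 implementation, and the function name is hypothetical:

```java
public class CastVarcharSketch {
    // Sketch of the castVARCHAR semantics described in the ticket:
    // outLen == 0 passes the input through unchanged; any positive
    // outLen truncates to at most that many characters.
    static String castVarchar(String input, long outLen) {
        if (outLen == 0) {
            return input; // no truncation when the length parameter is 0
        }
        int len = (int) Math.min(outLen, input.length());
        return input.substring(0, len);
    }
}
```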
[jira] [Resolved] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly
[ https://issues.apache.org/jira/browse/ARROW-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar Desabandu resolved ARROW-6145. Resolution: Fixed Fix Version/s: 1.0.0 Issue resolved by pull request 5023 [https://github.com/apache/arrow/pull/5023] > [Java] UnionVector created by MinorType#getNewVector could not keep field > type info properly > > > Key: ARROW-6145 > URL: https://issues.apache.org/jira/browse/ARROW-6145 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > While working on other items, I found that a {{UnionVector}} created by > {{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} could > not keep field type info properly. For example, if we set metadata in a > {{Field}} in the schema, we could not get it back via {{UnionVector#getField}}. > This is mainly because {{MinorType.Union.getNewVector}} did not pass the > {{FieldType}} to the vector, and {{UnionVector#getField}} creates a new {{Field}}, > which causes an inconsistency. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6186) Plasma headers not included for ubuntu-xenial libplasma-dev debian package
Wannes G created ARROW-6186: --- Summary: Plasma headers not included for ubuntu-xenial libplasma-dev debian package Key: ARROW-6186 URL: https://issues.apache.org/jira/browse/ARROW-6186 Project: Apache Arrow Issue Type: Bug Components: C++ - Plasma Affects Versions: 0.14.1 Reporter: Wannes G See [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install] The issue is still present on the latest master branch; the Debian install script is correct: [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install] The first line is missing from the Ubuntu install script, causing no headers to be installed when apt-get is used to install libplasma-dev. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6069) [Rust] [Parquet] Implement Converter to convert record reader to arrow primitive array.
[ https://issues.apache.org/jira/browse/ARROW-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-6069. --- Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4997 [https://github.com/apache/arrow/pull/4997] > [Rust] [Parquet] Implement Converter to convert record reader to arrow > primitive array. > --- > > Key: ARROW-6069 > URL: https://issues.apache.org/jira/browse/ARROW-6069 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5638) [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are enabled
[ https://issues.apache.org/jira/browse/ARROW-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hatem Helal reassigned ARROW-5638: -- Assignee: Hatem Helal > [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are > enabled > - > > Key: ARROW-5638 > URL: https://issues.apache.org/jira/browse/ARROW-5638 > Project: Apache Arrow > Issue Type: Bug >Reporter: Hatem Helal >Assignee: Hatem Helal >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > See comment with error here: > https://github.com/apache/arrow/pull/4596#issuecomment-502954709 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5638) [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are enabled
[ https://issues.apache.org/jira/browse/ARROW-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5638: -- Labels: pull-request-available (was: ) > [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are > enabled > - > > Key: ARROW-5638 > URL: https://issues.apache.org/jira/browse/ARROW-5638 > Project: Apache Arrow > Issue Type: Bug >Reporter: Hatem Helal >Priority: Minor > Labels: pull-request-available > > See comment with error here: > https://github.com/apache/arrow/pull/4596#issuecomment-502954709 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6185) [Java] Provide hash table based dictionary builder
Liya Fan created ARROW-6185: --- Summary: [Java] Provide hash table based dictionary builder Key: ARROW-6185 URL: https://issues.apache.org/jira/browse/ARROW-6185 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan This is related to ARROW-5862. We provide another type of dictionary builder, based on a hash table. Compared with a search-based dictionary encoder, a hash-table-based encoder processes each new element in O(1) expected time, but requires extra memory space. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
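The trade-off described in ARROW-6185 (O(1) expected time per element at the cost of the extra memory held by the table) is the classic hash-map dictionary-building pattern, sketched here in plain Java. The class is an illustration of the idea, not the Arrow Java dictionary API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of a hash-table based dictionary builder: each new
// element is looked up (and inserted on a miss) in O(1) expected time,
// whereas a search-based builder would scan or binary-search the
// dictionary. The map is the extra memory cost the ticket mentions.
public class HashDictionaryBuilderSketch {
    private final Map<String, Integer> index = new HashMap<>();
    private final List<String> dictionary = new ArrayList<>();

    /** Returns the dictionary id of value, adding it if unseen. */
    public int encode(String value) {
        Integer id = index.get(value);
        if (id != null) {
            return id; // already in the dictionary
        }
        int newId = dictionary.size();
        index.put(value, newId);
        dictionary.add(value);
        return newId;
    }

    public List<String> dictionary() {
        return dictionary;
    }
}
```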
[jira] [Created] (ARROW-6184) [Java] Provide hash table based dictionary encoder
Liya Fan created ARROW-6184: --- Summary: [Java] Provide hash table based dictionary encoder Key: ARROW-6184 URL: https://issues.apache.org/jira/browse/ARROW-6184 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan This is the second part of ARROW-5917. We provide a sort-based encoder, as well as a hash-table-based encoder, to solve the problems with the current dictionary encoder. In particular, we solve the following problems with the current encoder: # There are repeated conversions between Java objects and bytes (e.g. vector.getObject(i)). # There is unnecessary memory copying (the vector data must be copied into the hash table). # The hash table cannot be reused for encoding multiple vectors (other data structures & results cannot be reused either). # The output vector should not be created/managed by the encoder (just like in the out-of-place sorter). # The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed. -- This message was sent by Atlassian JIRA (v7.6.14#76016)