[jira] [Updated] (ARROW-6024) [Java] Provide more hash algorithms

2019-08-09 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6024:
---
Description: 
Provide more hash algorithms to choose from for different scenarios. In 
particular, we provide the following hash algorithms:
 * Simple hasher: a hasher that calculates the hash code of integers as is and 
performs no finalization. The computation is extremely efficient, but the 
quality of the produced hash codes may be poor.

 * Murmur finalizing hasher: finalizes the hash code with the Murmur hashing 
algorithm. Details of the algorithm can be found at 
[https://en.wikipedia.org/wiki/MurmurHash]. Murmur hashing is computationally 
expensive, as it involves several integer multiplications, but the produced 
hash codes are of good quality in the sense that they are uniformly 
distributed over the universe.

  was:
Provide more hash algorithms to choose for different scenarios. In particular, 
we provide the following hash algorithms:

* Simple hasher: A hasher that calculates the hash code of integers as is, and 
do not perform any finalization. So the computation is extremely efficient, but 
the quality of the produced hash code may not be good.

* Murmur finalizing hasher: Finalize the hash code by the Murmur hashing 
algorithm. Details of the algorithm can be found in 
https://en.wikipedia.org/wiki/MurmurHash. Murmur hashing is computational 
expensive, as it involves several integer multiplications. However, the 
produced hash codes have good quality in the sense that they are uniformly 
distributed in the universe.

* Jenkins finalizing hasher: Finalize the hash code by Bob Jenkins' algorithm. 
Details of this algorithm can be found in 
http://www.burtleburtle.net/bob/hash/integer.html. Jenkins hashing is less 
computational expensive than Murmur hashing, as it involves no integer 
multiplication. However, the produced hash codes also have good quality in the 
sense that they are uniformly distributed in the universe.

* Non-negative hasher: Wrapper for another hasher, to make the generated hash 
code non-negative. This can be useful for scenarios like hash table.
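As a rough illustration of the hashers described above, here is a minimal 
Python sketch (my own, not the Arrow Java implementation) of a Murmur-style 
finalizer, a multiplication-free Jenkins-style finalizer, and a non-negative 
wrapper; the constants are the published ones from the linked references:

```python
MASK32 = 0xFFFFFFFF

def murmur_fmix32(h):
    """Murmur3 32-bit finalizer: xor-shift/multiply avalanche steps."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h

def jenkins_final32(h):
    """One of Bob Jenkins' integer hashes: shifts, xors and adds only,
    no integer multiplication."""
    h &= MASK32
    h = (h + 0x7ED55D16 + (h << 12)) & MASK32
    h = (h ^ 0xC761C23C) ^ (h >> 19)
    h = (h + 0x165667B1 + (h << 5)) & MASK32
    h = ((h + 0xD3A2646C) ^ (h << 9)) & MASK32
    h = (h + 0xFD7046C5 + (h << 3)) & MASK32
    h = (h ^ 0xB55A4F09) ^ (h >> 16)
    return h & MASK32

def non_negative(hasher):
    """Wrap a hasher so the result is non-negative (clears the bit that
    would be the sign bit of a 32-bit Java int), e.g. for hash tables."""
    def wrapped(x):
        return hasher(x) & 0x7FFFFFFF
    return wrapped
```

The simple hasher corresponds to using the integer value itself with no 
finalizer at all; the wrappers above only differ in how they scramble that 
value afterwards.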


> [Java] Provide more hash algorithms 
> 
>
> Key: ARROW-6024
> URL: https://issues.apache.org/jira/browse/ARROW-6024
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> Provide more hash algorithms to choose for different scenarios. In 
> particular, we provide the following hash algorithms:
>  * Simple hasher: A hasher that calculates the hash code of integers as is, 
> and do not perform any finalization. So the computation is extremely 
> efficient, but the quality of the produced hash code may not be good.
>  * Murmur finalizing hasher: Finalize the hash code by the Murmur hashing 
> algorithm. Details of the algorithm can be found in 
> [https://en.wikipedia.org/wiki/MurmurHash]. Murmur hashing is computational 
> expensive, as it involves several integer multiplications. However, the 
> produced hash codes have good quality in the sense that they are uniformly 
> distributed in the universe.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6024) [Java] Provide more hash algorithms

2019-08-09 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6024.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4934
[https://github.com/apache/arrow/pull/4934]

> [Java] Provide more hash algorithms 
> 
>
> Key: ARROW-6024
> URL: https://issues.apache.org/jira/browse/ARROW-6024
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> Provide more hash algorithms to choose for different scenarios. In 
> particular, we provide the following hash algorithms:
> * Simple hasher: A hasher that calculates the hash code of integers as is, 
> and do not perform any finalization. So the computation is extremely 
> efficient, but the quality of the produced hash code may not be good.
> * Murmur finalizing hasher: Finalize the hash code by the Murmur hashing 
> algorithm. Details of the algorithm can be found in 
> https://en.wikipedia.org/wiki/MurmurHash. Murmur hashing is computational 
> expensive, as it involves several integer multiplications. However, the 
> produced hash codes have good quality in the sense that they are uniformly 
> distributed in the universe.
> * Jenkins finalizing hasher: Finalize the hash code by Bob Jenkins' 
> algorithm. Details of this algorithm can be found in 
> http://www.burtleburtle.net/bob/hash/integer.html. Jenkins hashing is less 
> computational expensive than Murmur hashing, as it involves no integer 
> multiplication. However, the produced hash codes also have good quality in 
> the sense that they are uniformly distributed in the universe.
> * Non-negative hasher: Wrapper for another hasher, to make the generated hash 
> code non-negative. This can be useful for scenarios like hash table.





[jira] [Updated] (ARROW-6193) [GLib] Add missing require in test

2019-08-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6193:
--
Labels: pull-request-available  (was: )

> [GLib] Add missing require in test
> --
>
> Key: ARROW-6193
> URL: https://issues.apache.org/jira/browse/ARROW-6193
> Project: Apache Arrow
>  Issue Type: Test
>  Components: GLib
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (ARROW-6193) [GLib] Add missing require in test

2019-08-09 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-6193:
---

 Summary: [GLib] Add missing require in test
 Key: ARROW-6193
 URL: https://issues.apache.org/jira/browse/ARROW-6193
 Project: Apache Arrow
  Issue Type: Test
  Components: GLib
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei








[jira] [Resolved] (ARROW-6188) [GLib] Add garrow_array_is_in()

2019-08-09 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6188.
-
Resolution: Fixed

Issue resolved by pull request 5047
[https://github.com/apache/arrow/pull/5047]

> [GLib] Add garrow_array_is_in()
> ---
>
> Key: ARROW-6188
> URL: https://issues.apache.org/jira/browse/ARROW-6188
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-6192) [GLib] Use the same SO version as C++

2019-08-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6192:
--
Labels: pull-request-available  (was: )

> [GLib] Use the same SO version as C++
> -
>
> Key: ARROW-6192
> URL: https://issues.apache.org/jira/browse/ARROW-6192
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (ARROW-6192) [GLib] Use the same SO version as C++

2019-08-09 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-6192:
---

 Summary: [GLib] Use the same SO version as C++
 Key: ARROW-6192
 URL: https://issues.apache.org/jira/browse/ARROW-6192
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei








[jira] [Resolved] (ARROW-6093) [Java] reduce branches in algo for first match in VectorRangeSearcher

2019-08-09 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6093.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5011
[https://github.com/apache/arrow/pull/5011]

> [Java] reduce branches in algo for first match in VectorRangeSearcher
> -
>
> Key: ARROW-6093
> URL: https://issues.apache.org/jira/browse/ARROW-6093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This is a follow up Jira for the improvement suggested by [~fsaintjacques] in 
> the PR for 
> [https://github.com/apache/arrow/pull/4925]
>  





[jira] [Resolved] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API

2019-08-09 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6175.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5043
[https://github.com/apache/arrow/pull/5043]

> [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet 
> complex vector API
> 
>
> Key: ARROW-6175
> URL: https://issues.apache.org/jira/browse/ARROW-6175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> i. Currently {{MapVector#getMinorType}} extends {{ListVector}}, which returns 
> the wrong {{MinorType}}.
> ii. {{AbstractContainerVector}} currently only has {{addOrGetList}}, 
> {{addOrGetUnion}}, and {{addOrGetStruct}}, which do not support all complex 
> types, such as {{MapVector}} and {{FixedSizeListVector}}.





[jira] [Updated] (ARROW-6188) [GLib] Add garrow_array_is_in()

2019-08-09 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei updated ARROW-6188:

Summary: [GLib] Add garrow_array_is_in()  (was: [GLib] Add 
garrow_array_isin())

> [GLib] Add garrow_array_is_in()
> ---
>
> Key: ARROW-6188
> URL: https://issues.apache.org/jira/browse/ARROW-6188
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-6186) [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package

2019-08-09 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei updated ARROW-6186:

Component/s: Packaging

> [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev 
> debian package
> ---
>
> Key: ARROW-6186
> URL: https://issues.apache.org/jira/browse/ARROW-6186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma, Packaging
>Affects Versions: 0.14.1
>Reporter: Wannes G
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: debian, packaging
>
> See 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install]
> The issue is still present on the latest master branch; the Debian install 
> script is correct: 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install]
> The first line is missing from the Ubuntu install script, causing no headers 
> to be installed when apt-get is used to install libplasma-dev.





[jira] [Assigned] (ARROW-6186) [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package

2019-08-09 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei reassigned ARROW-6186:
---

Assignee: Sutou Kouhei

> [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian 
> package
> 
>
> Key: ARROW-6186
> URL: https://issues.apache.org/jira/browse/ARROW-6186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Affects Versions: 0.14.1
>Reporter: Wannes G
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: debian, packaging
>
> See 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install]
> Issue is still present on latest master branch, the debian install script is 
> correct: 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install]
> The first line is missing from the ubuntu install script causing no headers 
> to be installed when apt-get is used to install libplasma-dev.





[jira] [Updated] (ARROW-6186) [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package

2019-08-09 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei updated ARROW-6186:

Summary: [Packaging][C++] Plasma headers not included for ubuntu-xenial 
libplasma-dev debian package  (was: [C++] Plasma headers not included for 
ubuntu-xenial libplasma-dev debian package)

> [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev 
> debian package
> ---
>
> Key: ARROW-6186
> URL: https://issues.apache.org/jira/browse/ARROW-6186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Affects Versions: 0.14.1
>Reporter: Wannes G
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: debian, packaging
>
> See 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install]
> Issue is still present on latest master branch, the debian install script is 
> correct: 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install]
> The first line is missing from the ubuntu install script causing no headers 
> to be installed when apt-get is used to install libplasma-dev.





[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-09 Thread Andrey Krivonogov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904082#comment-16904082
 ] 

Andrey Krivonogov edited comment on ARROW-6058 at 8/9/19 8:34 PM:
--

Hi [~wesmckinn],

I have experienced the same issue as [~sid88in]

I also managed to reproduce it with synthetic data:
{code:java}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

table = pa.Table.from_arrays([pa.array(np.arange(3 * 10 ** 7), 
type=pa.int64())], ['col'])
path = 's3://bucket/path/0.parquet'
fs = s3fs.S3FileSystem()
pq.write_table(table, path, filesystem=fs, row_group_size=10 ** 7)
table_read = pq.read_table(path, filesystem=fs){code}
this snippet raises a similar error:
{code:java}
ArrowIOError: Unexpected end of stream: Page was smaller (241959) than expected 
(524605)
{code}
The problem seems to be the s3fs version. The package versions I have are:
{code:java}
python 3.6.7
packages installed with conda (via conda-forge)

boto3==1.9.204
botocore==1.12.204
numpy==1.16.2
pyarrow==0.14.1
{code}
and it raised with
{code:java}
s3fs==0.3.3{code}
but everything worked fine with
{code:java}
s3fs==0.2.2
{code}
 

Thank you in advance for your help !

 


was (Author: krivonogov):
Hi [~wesmckinn],

I have experienced the same issue as [~sid88in]

I also managed to reproduce it with synthetic data:
{code:java}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

table = pa.Table.from_arrays([pa.array(np.arange(3 * 10 ** 7), 
type=pa.int64())], ['col'])
path = 's3://bucket/path/0.parquet'
fs = s3fs.S3FileSystem()
pq.write_table(table, path, filesystem=fs, row_group_size=10 ** 7)
table_read = pq.read_table(path, filesystem=fs){code}
 this snippet raises similar 
{code:java}
ArrowIOError: Unexpected end of stream: Page was smaller (241959) than expected 
(524605)
{code}
This problem seemed to be in s3fs version. Package versions I have
{code:java}
python 3.6.7
packages installed with conda (via conda-forge)

boto3==1.9.204
botocore==1.12.204
numpy==1.16.2
pyarrow==0.14.1
{code}
and it raised with
{code:java}
s3fs==0.3.3{code}
but everything worked fine with

 
{code:java}
s3fs==0.2.2
{code}
 

Thank you in advance for your help !

 

> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
>
> I am reading parquet data from S3 and get  ArrowIOError error.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: Same code works on relatively smaller dataset (approx < 50M records)* 
>  
>  





[jira] [Updated] (ARROW-6191) [C++] buffer size default value will throw an error

2019-08-09 Thread Zherui Cao (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zherui Cao updated ARROW-6191:
--
Summary: [C++] buffer size default value will throw an error  (was: [c++] 
buffer size default value will throw an error)

> [C++] buffer size default value will throw an error
> ---
>
> Key: ARROW-6191
> URL: https://issues.apache.org/jira/browse/ARROW-6191
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zherui Cao
>Priority: Major
>
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L40]
> this set default size as 0,
> but in
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L259]
> it prevent the buffer size being 0.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6191) [c++] buffer size default value will throw an error

2019-08-09 Thread Zherui Cao (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zherui Cao updated ARROW-6191:
--
Summary: [c++] buffer size default value will throw an error  (was:  Arrow 
error: Invalid: Buffer size should be positive)

> [c++] buffer size default value will throw an error
> ---
>
> Key: ARROW-6191
> URL: https://issues.apache.org/jira/browse/ARROW-6191
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Zherui Cao
>Priority: Major
>
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L40]
> this set default size as 0,
> but in
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L259]
> it prevent the buffer size being 0.





[jira] [Created] (ARROW-6191) Arrow error: Invalid: Buffer size should be positive

2019-08-09 Thread Zherui Cao (JIRA)
Zherui Cao created ARROW-6191:
-

 Summary:  Arrow error: Invalid: Buffer size should be positive
 Key: ARROW-6191
 URL: https://issues.apache.org/jira/browse/ARROW-6191
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Zherui Cao


[https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L40]

This sets the default buffer size to 0,

but in

[https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L259]

it prevents the buffer size from being 0.





[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-09 Thread Andrey Krivonogov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904082#comment-16904082
 ] 

Andrey Krivonogov commented on ARROW-6058:
--

Hi [~wesmckinn],

I have experienced the same issue as [~sid88in]

I also managed to reproduce it with synthetic data:
{code:java}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

table = pa.Table.from_arrays([pa.array(np.arange(3 * 10 ** 7), 
type=pa.int64())], ['col'])
path = 's3://bucket/path/0.parquet'
fs = s3fs.S3FileSystem()
pq.write_table(table, path, filesystem=fs, row_group_size=10 ** 7)
table_read = pq.read_table(path, filesystem=fs){code}
 this snippet raises similar 
{code:java}
ArrowIOError: Unexpected end of stream: Page was smaller (241959) than expected 
(524605)
{code}
This problem seemed to be in s3fs version. Package versions I have
{code:java}
python 3.6.7
packages installed with conda (via conda-forge)

boto3==1.9.204
botocore==1.12.204
numpy==1.16.2
pyarrow==0.14.1
{code}
and it raised with
{code:java}
s3fs==0.3.3{code}
but everything worked fine with

 
{code:java}
s3fs==0.2.2
{code}
 

Thank you in advance for your help !

 

> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
>
> I am reading parquet data from S3 and get  ArrowIOError error.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: Same code works on relatively smaller dataset (approx < 50M records)* 
>  
>  





[jira] [Updated] (ARROW-6190) [C++] Define and declare functions regardless of NDEBUG

2019-08-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6190:
--
Labels: pull-request-available  (was: )

> [C++] Define and declare functions regardless of NDEBUG
> ---
>
> Key: ARROW-6190
> URL: https://issues.apache.org/jira/browse/ARROW-6190
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Minor
>  Labels: pull-request-available
>
> NDEBUG is not shipped in linker flags, so I got a linker error with release 
> build on FixedSizeBinaryBuilder::UnsafeAppend(util::string_view value) call, 
> since it makes a call to CheckValueSize.
> This is somewhat a follow-up of ARROW-2313. I took the same path by removing 
> NDEBUG ifdefs around CheckValueSize definition and declaration.
> I applied the same fix to CheckUTF8Initialized as well after grepping the 
> source code for "#ifndef NDEBUG" and figured out it has the same issue.





[jira] [Commented] (ARROW-6190) [C++] Define and declare functions regardless of NDEBUG

2019-08-09 Thread Omer Ozarslan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904070#comment-16904070
 ] 

Omer Ozarslan commented on ARROW-6190:
--

Submitted PR on https://github.com/apache/arrow/pull/5049.

> [C++] Define and declare functions regardless of NDEBUG
> ---
>
> Key: ARROW-6190
> URL: https://issues.apache.org/jira/browse/ARROW-6190
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Minor
>
> NDEBUG is not shipped in linker flags, so I got a linker error with release 
> build on FixedSizeBinaryBuilder::UnsafeAppend(util::string_view value) call, 
> since it makes a call to CheckValueSize.
> This is somewhat a follow-up of ARROW-2313. I took the same path by removing 
> NDEBUG ifdefs around CheckValueSize definition and declaration.
> I applied the same fix to CheckUTF8Initialized as well after grepping the 
> source code for "#ifndef NDEBUG" and figured out it has the same issue.





[jira] [Updated] (ARROW-6190) [C++] Define and declare functions regardless of NDEBUG

2019-08-09 Thread Omer Ozarslan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omer Ozarslan updated ARROW-6190:
-
Summary: [C++] Define and declare functions regardless of NDEBUG  (was: 
Define and declare functions regardless of NDEBUG)

> [C++] Define and declare functions regardless of NDEBUG
> ---
>
> Key: ARROW-6190
> URL: https://issues.apache.org/jira/browse/ARROW-6190
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Minor
>
> NDEBUG is not shipped in linker flags, so I got a linker error with release 
> build on FixedSizeBinaryBuilder::UnsafeAppend(util::string_view value) call, 
> since it makes a call to CheckValueSize.
> This is somewhat a follow-up of ARROW-2313. I took the same path by removing 
> NDEBUG ifdefs around CheckValueSize definition and declaration.
> I applied the same fix to CheckUTF8Initialized as well after grepping the 
> source code for "#ifndef NDEBUG" and figured out it has the same issue.





[jira] [Created] (ARROW-6190) Define and declare functions regardless of NDEBUG

2019-08-09 Thread Omer Ozarslan (JIRA)
Omer Ozarslan created ARROW-6190:


 Summary: Define and declare functions regardless of NDEBUG
 Key: ARROW-6190
 URL: https://issues.apache.org/jira/browse/ARROW-6190
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Omer Ozarslan


NDEBUG is not shipped in linker flags, so I got a linker error in a release 
build on a FixedSizeBinaryBuilder::UnsafeAppend(util::string_view value) call, 
since it makes a call to CheckValueSize.

This is somewhat a follow-up of ARROW-2313. I took the same path by removing 
the NDEBUG ifdefs around the CheckValueSize definition and declaration.

I applied the same fix to CheckUTF8Initialized as well, after grepping the 
source code for "#ifndef NDEBUG" and finding that it has the same issue.





[jira] [Created] (ARROW-6189) [Rust] Plain encoded boolean column chunks limited to 2048 values

2019-08-09 Thread Simon Jones (JIRA)
Simon Jones created ARROW-6189:
--

 Summary: [Rust] Plain encoded boolean column chunks limited to 
2048 values
 Key: ARROW-6189
 URL: https://issues.apache.org/jira/browse/ARROW-6189
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.14.1
Reporter: Simon Jones


encoding::PlainEncoder::new creates a BitWriter with 256 bytes of storage, 
which limits the data page size that can be used. 

I suggest that in

{{impl Encoder for PlainEncoder}}

the return value of put_value be tested and the BitWriter flushed and cleared 
whenever it runs out of space.





[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-08-09 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904020#comment-16904020
 ] 

Wes McKinney commented on ARROW-3246:
-

Making some progress on this. It's a can of worms because of the interplay 
between the ColumnWriter, Encoder, and Statistics types. 

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6186) [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package

2019-08-09 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6186:

Summary: [C++] Plasma headers not included for ubuntu-xenial libplasma-dev 
debian package  (was: Plasma headers not included for ubuntu-xenial 
libplasma-dev debian package)

> [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian 
> package
> 
>
> Key: ARROW-6186
> URL: https://issues.apache.org/jira/browse/ARROW-6186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Affects Versions: 0.14.1
>Reporter: Wannes G
>Priority: Major
>
> See 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install]
> Issue is still present on latest master branch, the debian install script is 
> correct: 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install]
> The first line is missing from the ubuntu install script causing no headers 
> to be installed when apt-get is used to install libplasma-dev.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6186) [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package

2019-08-09 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6186:

Labels: debian packaging  (was: )

> [C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian 
> package
> 
>
> Key: ARROW-6186
> URL: https://issues.apache.org/jira/browse/ARROW-6186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Affects Versions: 0.14.1
>Reporter: Wannes G
>Priority: Major
>  Labels: debian, packaging
>
> See 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install]
> Issue is still present on latest master branch, the debian install script is 
> correct: 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install]
> The first line is missing from the ubuntu install script causing no headers 
> to be installed when apt-get is used to install libplasma-dev.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-09 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903969#comment-16903969
 ] 

Joris Van den Bossche commented on ARROW-6179:
--

Is the BigQuery usage of this open source code? (to familiarize myself with an 
application of the extension types) 
You mean that you use the extension type key (ARROW:extension:name) in the 
metadata without it being an actual extension type?

Certainly, if we create such a generic extension array, I think it should work 
in more places in Arrow than is currently the case (e.g. I opened issues to 
fall back to the storage type when converting to pandas or to Parquet).
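The generic-extension-array idea under discussion can be sketched as a single class parameterised by storage type, extension name, and optional metadata. This is a hypothetical Java illustration of the shape being proposed, not the Arrow C++ or Java API:

```java
import java.util.Objects;

public class GenericExtensionType {
    private final String storageType;    // e.g. "binary"
    private final String extensionName;  // the ARROW:extension:name value
    private final String metadata;       // serialised type details, may be null

    public GenericExtensionType(String storageType, String extensionName,
                                String metadata) {
        this.storageType = Objects.requireNonNull(storageType);
        this.extensionName = Objects.requireNonNull(extensionName);
        this.metadata = metadata;
    }

    public String storageType()   { return storageType; }
    public String extensionName() { return extensionName; }
    public String metadata()      { return metadata; }
}
```

Many instances of this one class can represent different unknown extension types, which is what distinguishes it from the register-a-subclass approach.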

> [C++] ExtensionType subclass for "unknown" types?
> -
>
> Key: ARROW-6179
> URL: https://issues.apache.org/jira/browse/ARROW-6179
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++, when receiving IPC with extension type metadata for a type that is 
> unknown (the name is not registered), we currently fall back to returning the 
> "raw" storage array. The custom metadata (extension name and metadata) is 
> still available in the Field metadata.
> Alternatively, we could also have a generic {{ExtensionType}} class that can 
> hold such "unknown" extension type (eg {{UnknowExtensionType}} or 
> {{GenericExtensionType}}), keeping the extension name and metadata in the 
> Array's type. 
> This could be a single class where several instances can be created given a 
> storage type, extension name and optionally extension metadata. It would be a 
> way to have an unregistered extension type.





[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-09 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903949#comment-16903949
 ] 

Micah Kornfield commented on ARROW-6179:


OK, personally I would like to keep the current behavior as at least the 
default. One example of using unregistered extension types: the BigQuery 
Storage Read API uses them to mark fields that don't have a one-to-one 
correspondence with built-in Arrow types (geography and datetime). In the 
future someone could choose to write custom extension types, but in the 
meantime these fields don't require special handling and flow through without 
any problem when converting to pandas.

> [C++] ExtensionType subclass for "unknown" types?
> -
>
> Key: ARROW-6179
> URL: https://issues.apache.org/jira/browse/ARROW-6179
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++, when receiving IPC with extension type metadata for a type that is 
> unknown (the name is not registered), we currently fall back to returning the 
> "raw" storage array. The custom metadata (extension name and metadata) is 
> still available in the Field metadata.
> Alternatively, we could also have a generic {{ExtensionType}} class that can 
> hold such "unknown" extension type (eg {{UnknowExtensionType}} or 
> {{GenericExtensionType}}), keeping the extension name and metadata in the 
> Array's type. 
> This could be a single class where several instances can be created given a 
> storage type, extension name and optionally extension metadata. It would be a 
> way to have an unregistered extension type.





[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-09 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903935#comment-16903935
 ] 

Joris Van den Bossche commented on ARROW-6179:
--

I suppose, if we go for this, it would replace the automatic fallback. And then 
a user can still fall back to the storage array themselves?

Although, I see that there is a PR adding {{IpcOptions}} for writing, so if 
needed, there could also be such options for reading.

To be honest, I don't have a good enough idea of the potential use cases of the 
ExtensionType mechanism in C++ to really assess whether it would be generally 
useful to keep the array in a generic extension array or rather fall back 
directly to the storage array.  
I was thinking that, for Python usage, this might be useful for sending an 
extension type defined in Python without needing to register a specific 
subclass in C++.


> [C++] ExtensionType subclass for "unknown" types?
> -
>
> Key: ARROW-6179
> URL: https://issues.apache.org/jira/browse/ARROW-6179
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++, when receiving IPC with extension type metadata for a type that is 
> unknown (the name is not registered), we currently fall back to returning the 
> "raw" storage array. The custom metadata (extension name and metadata) is 
> still available in the Field metadata.
> Alternatively, we could also have a generic {{ExtensionType}} class that can 
> hold such "unknown" extension type (eg {{UnknowExtensionType}} or 
> {{GenericExtensionType}}), keeping the extension name and metadata in the 
> Array's type. 
> This could be a single class where several instances can be created given a 
> storage type, extension name and optionally extension metadata. It would be a 
> way to have an unregistered extension type.





[jira] [Resolved] (ARROW-6117) [Java] Fix the set method of FixedSizeBinaryVector

2019-08-09 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-6117.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4995
[https://github.com/apache/arrow/pull/4995]

> [Java] Fix the set method of FixedSizeBinaryVector
> --
>
> Key: ARROW-6117
> URL: https://issues.apache.org/jira/browse/ARROW-6117
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> For the set method, if the parameter is null, it should clear the validity 
> bit. However, the current implementation throws a NullPointerException.
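The intended behaviour can be sketched as follows. This is a simplified stand-in for the vector internals (class and field names invented), showing set(index, null) clearing the validity bit rather than dereferencing the argument:

```java
import java.util.BitSet;

public class FixedWidthColumn {
    private final byte[][] values;
    private final BitSet validity = new BitSet();   // one bit per slot

    public FixedWidthColumn(int capacity) {
        this.values = new byte[capacity][];
    }

    public void set(int index, byte[] value) {
        if (value == null) {
            validity.clear(index);                  // mark the slot null, don't NPE
            return;
        }
        values[index] = value;
        validity.set(index);
    }

    public boolean isNull(int index) {
        return !validity.get(index);
    }
}
```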





[jira] [Updated] (ARROW-6188) [GLib] Add garrow_array_isin()

2019-08-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6188:
--
Labels: pull-request-available  (was: )

> [GLib] Add garrow_array_isin()
> --
>
> Key: ARROW-6188
> URL: https://issues.apache.org/jira/browse/ARROW-6188
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>






[jira] [Created] (ARROW-6188) [GLib] Add garrow_array_isin()

2019-08-09 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-6188:
---

 Summary: [GLib] Add garrow_array_isin()
 Key: ARROW-6188
 URL: https://issues.apache.org/jira/browse/ARROW-6188
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
 Fix For: 0.15.0








[jira] [Resolved] (ARROW-6137) [C++][Gandiva] Change output format of castVARCHAR(timestamp) in Gandiva

2019-08-09 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-6137.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5014
[https://github.com/apache/arrow/pull/5014]

> [C++][Gandiva] Change output format of castVARCHAR(timestamp) in Gandiva
> 
>
> Key: ARROW-6137
> URL: https://issues.apache.org/jira/browse/ARROW-6137
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Format timestamp to yyyy-MM-dd hh:mm:ss.sss
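Assuming the leading date token above is the usual yyyy (year) field, the target layout can be expressed with java.time pattern letters. This Java sketch only illustrates the format; Gandiva implements the cast in C++.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampCast {
    // 24-hour clock (HH) and millisecond precision (SSS).
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
                         .withZone(ZoneOffset.UTC);

    public static String castVarchar(long epochMillis) {
        return FMT.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

For example, castVarchar(0L) yields "1970-01-01 00:00:00.000".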





[jira] [Created] (ARROW-6187) [C++] fallback to storage type when writing ExtensionType to Parquet

2019-08-09 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6187:


 Summary: [C++] fallback to storage type when writing ExtensionType 
to Parquet
 Key: ARROW-6187
 URL: https://issues.apache.org/jira/browse/ARROW-6187
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Writing a table that contains an ExtensionType array to a parquet file is not 
yet implemented. It currently raises "ArrowNotImplementedError: Unhandled type 
for Arrow to Parquet schema conversion: extension" 
(for a PyExtensionType in this case).

I think minimal support can consist of writing the storage type / array. 

We also might want to save the extension name and metadata in the parquet 
FileMetadata. 

Later on, this could potentially be used to restore the extension type when 
reading. This is related to other issues that need to save the arrow schema 
(categorical: ARROW-5480, time zones: ARROW-5888). Only in this case, we 
probably want to store the serialised type in addition to the schema (which 
only has the extension type's name). 





[jira] [Updated] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero

2019-08-09 Thread Praveen Kumar Desabandu (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar Desabandu updated ARROW-6162:
---
Component/s: C++ - Gandiva

> [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len 
> parameter is zero
> ---
>
> Key: ARROW-6162
> URL: https://issues.apache.org/jira/browse/ARROW-6162
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Do not truncate string if length parameter is 0 in castVARCHAR_utf8_int64 
> function.
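The described semantics amount to treating a length of 0 as "no truncation". A minimal Java sketch of that rule (the real function operates on UTF-8 data in Gandiva's C++ runtime; this signature is illustrative):

```java
public class CastVarchar {
    // Truncate in to outLen characters, except outLen == 0 disables truncation.
    public static String castVarchar(String in, long outLen) {
        if (outLen == 0 || outLen >= in.length()) {
            return in;
        }
        return in.substring(0, (int) outLen);
    }
}
```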





[jira] [Updated] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero

2019-08-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6162:
--
Labels: pull-request-available  (was: )

> [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len 
> parameter is zero
> ---
>
> Key: ARROW-6162
> URL: https://issues.apache.org/jira/browse/ARROW-6162
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Do not truncate string if length parameter is 0 in castVARCHAR_utf8_int64 
> function.





[jira] [Resolved] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero

2019-08-09 Thread Praveen Kumar Desabandu (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar Desabandu resolved ARROW-6162.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 5040
[https://github.com/apache/arrow/pull/5040]

> [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len 
> parameter is zero
> ---
>
> Key: ARROW-6162
> URL: https://issues.apache.org/jira/browse/ARROW-6162
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
> Fix For: 1.0.0
>
>
> Do not truncate string if length parameter is 0 in castVARCHAR_utf8_int64 
> function.





[jira] [Resolved] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly

2019-08-09 Thread Praveen Kumar Desabandu (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar Desabandu resolved ARROW-6145.

   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 5023
[https://github.com/apache/arrow/pull/5023]

> [Java] UnionVector created by MinorType#getNewVector could not keep field 
> type info properly
> 
>
> Key: ARROW-6145
> URL: https://issues.apache.org/jira/browse/ARROW-6145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> While working on other items, I found that a {{UnionVector}} created by 
> {{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} could 
> not keep field type info properly. For example, if we set metadata on a 
> {{Field}} in the schema, we could not get it back via {{UnionVector#getField}}.
> This is mainly because {{MinorType.Union.getNewVector}} did not pass the 
> {{FieldType}} to the vector, and {{UnionVector#getField}} creates a new 
> {{Field}}, which causes the inconsistency.
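The fix boils down to threading the original FieldType through to the vector and returning it from getField(). A simplified Java stand-in (not the actual Arrow Java classes):

```java
import java.util.Collections;
import java.util.Map;

public class MetadataPreservingVector {
    // Simplified stand-in for Arrow's FieldType, carrying field metadata.
    public static final class FieldType {
        private final Map<String, String> metadata;

        public FieldType(Map<String, String> metadata) {
            this.metadata = Collections.unmodifiableMap(metadata);
        }

        public Map<String, String> metadata() { return metadata; }
    }

    private final FieldType fieldType;   // kept from construction time

    public MetadataPreservingVector(FieldType fieldType) {
        this.fieldType = fieldType;
    }

    // Returns the FieldType the vector was created with, instead of
    // constructing a fresh one (which is what dropped the metadata).
    public FieldType getField() {
        return fieldType;
    }
}
```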





[jira] [Created] (ARROW-6186) Plasma headers not included for ubuntu-xenial libplasma-dev debian package

2019-08-09 Thread Wannes G (JIRA)
Wannes G created ARROW-6186:
---

 Summary: Plasma headers not included for ubuntu-xenial 
libplasma-dev debian package
 Key: ARROW-6186
 URL: https://issues.apache.org/jira/browse/ARROW-6186
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Plasma
Affects Versions: 0.14.1
Reporter: Wannes G


See 
[https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install]

Issue is still present on latest master branch, the debian install script is 
correct: 
[https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install]

The first line is missing from the ubuntu install script causing no headers to 
be installed when apt-get is used to install libplasma-dev.





[jira] [Resolved] (ARROW-6069) [Rust] [Parquet] Implement Converter to convert record reader to arrow primitive array.

2019-08-09 Thread Neville Dipale (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-6069.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4997
[https://github.com/apache/arrow/pull/4997]

> [Rust] [Parquet] Implement Converter to convert record reader to arrow 
> primitive array.
> ---
>
> Key: ARROW-6069
> URL: https://issues.apache.org/jira/browse/ARROW-6069
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-5638) [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are enabled

2019-08-09 Thread Hatem Helal (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hatem Helal reassigned ARROW-5638:
--

Assignee: Hatem Helal

> [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are 
> enabled
> -
>
> Key: ARROW-5638
> URL: https://issues.apache.org/jira/browse/ARROW-5638
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See comment with error here:
> https://github.com/apache/arrow/pull/4596#issuecomment-502954709





[jira] [Updated] (ARROW-5638) [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are enabled

2019-08-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5638:
--
Labels: pull-request-available  (was: )

> [C++] cmake fails to generate Xcode project when Gandiva JNI bindings are 
> enabled
> -
>
> Key: ARROW-5638
> URL: https://issues.apache.org/jira/browse/ARROW-5638
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
>
> See comment with error here:
> https://github.com/apache/arrow/pull/4596#issuecomment-502954709





[jira] [Created] (ARROW-6185) [Java] Provide hash table based dictionary builder

2019-08-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6185:
---

 Summary: [Java] Provide hash table based dictionary builder
 Key: ARROW-6185
 URL: https://issues.apache.org/jira/browse/ARROW-6185
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is related to ARROW-5862. We provide another type of dictionary builder, 
based on a hash table. Compared with a search-based dictionary encoder, a 
hash-table-based encoder processes each new element in O(1) expected time, but 
requires extra memory space.
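The trade-off described above can be sketched with a plain HashMap; names here are illustrative, not the proposed Arrow Java API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashDictionaryBuilder<T> {
    private final Map<T, Integer> index = new HashMap<>();   // the extra memory
    private final List<T> dictionary = new ArrayList<>();

    // Returns the dictionary id of value, adding it on first sight;
    // expected O(1) per element versus O(log n) for a search-based builder.
    public int add(T value) {
        Integer id = index.get(value);
        if (id != null) {
            return id;
        }
        int newId = dictionary.size();
        index.put(value, newId);
        dictionary.add(value);
        return newId;
    }

    public List<T> dictionary() {
        return dictionary;
    }
}
```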





[jira] [Created] (ARROW-6184) [Java] Provide hash table based dictionary encoder

2019-08-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6184:
---

 Summary: [Java] Provide hash table based dictionary encoder
 Key: ARROW-6184
 URL: https://issues.apache.org/jira/browse/ARROW-6184
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is the second part of ARROW-5917. We provide a sort-based encoder, as well 
as a hash-table-based encoder, to solve the problems with the current dictionary 
encoder. 

In particular, we solve the following problems with the current encoder:
 # There are repeated conversions between Java objects and bytes (e.g. 
vector.getObject(i)).
 # There is unnecessary memory copying (the vector data must be copied to the 
hash table).
 # The hash table cannot be reused for encoding multiple vectors (other data 
structures & results cannot be reused either).
 # The output vector should not be created/managed by the encoder (just like in 
the out-of-place sorter).
 # The hash table requires that the hashCode & equals methods be implemented 
appropriately, but this is not guaranteed.
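Points 3 and 4 above in particular suggest an encoder that owns a reusable table and writes into caller-provided output. A hypothetical Java sketch of that shape (the remaining points concern Arrow-internal byte handling and are elided):

```java
import java.util.HashMap;
import java.util.Map;

public class ReusableDictionaryEncoder {
    private final Map<String, Integer> table = new HashMap<>();

    // Encodes input into the caller-provided out array (the encoder does not
    // create or manage the output) and returns the dictionary size so far.
    public int encode(String[] input, int[] out) {
        for (int i = 0; i < input.length; i++) {
            out[i] = table.computeIfAbsent(input[i], k -> table.size());
        }
        return table.size();
    }

    // Clears the table so the same instance can encode another vector.
    public void reset() {
        table.clear();
    }
}
```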


