[jira] [Resolved] (ARROW-6208) [Java] Correct byte order before comparing in ByteFunctionHelpers

2019-08-15 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-6208.
---
   Resolution: Fixed
Fix Version/s: 0.15.0 (was: 1.0.0)

Issue resolved by pull request 5063
[https://github.com/apache/arrow/pull/5063]

> [Java] Correct byte order before comparing in ByteFunctionHelpers
> -
>
> Key: ARROW-6208
> URL: https://issues.apache.org/jira/browse/ARROW-6208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.0.0
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6265) [Java] Avro adapter implement Array/Map/Fixed type

2019-08-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6265:
--
Labels: pull-request-available  (was: )

> [Java] Avro adapter implement Array/Map/Fixed type
> --
>
> Key: ARROW-6265
> URL: https://issues.apache.org/jira/browse/ARROW-6265
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
>
> Support Array/Map/Fixed type in avro adapter.





[jira] [Created] (ARROW-6265) [Java] Avro adapter implement Array/Map/Fixed type

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6265:
-

 Summary: [Java] Avro adapter implement Array/Map/Fixed type
 Key: ARROW-6265
 URL: https://issues.apache.org/jira/browse/ARROW-6265
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Support Array/Map/Fixed type in avro adapter.





[jira] [Commented] (ARROW-6255) [Rust] [Parquet] Cannot use any published parquet crate due to parquet-format breaking change

2019-08-15 Thread Neville Dipale (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908692#comment-16908692
 ] 

Neville Dipale commented on ARROW-6255:
---

Hi [~andygrove] [~csun], would yanking 2.6.0 help in the interim?

> [Rust] [Parquet] Cannot use any published parquet crate due to parquet-format 
> breaking change
> -
>
> Key: ARROW-6255
> URL: https://issues.apache.org/jira/browse/ARROW-6255
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.14.1
>Reporter: Andy Grove
>Priority: Major
> Fix For: 0.15.0
>
>
> As a user who wants to use the Rust version of Arrow, I am unable to use any 
> of the previously published versions due to the recent breaking change in 
> parquet-format 2.5.0.
> To reproduce, simply create an empty Rust project using "cargo init example 
> --bin", add a dependency on "parquet-0.14.1" and attempt to build the project.
> {code:java}
>    Compiling parquet v0.13.0
> error[E0599]: no variant or associated item named `BOOLEAN` found for type 
> `parquet_format::parquet_format::Type` in the current scope
>    --> 
> /Users/agrove/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.13.0/src/basic.rs:408:28
>     |
> 408 |             parquet::Type::BOOLEAN => Type::BOOLEAN,
>     |                            ^^^ variant or associated item not found 
> in `parquet_format::parquet_format::Type`{code}
> This bug has already been fixed in master, but there is no usable published 
> crate. We could consider publishing a 0.14.2 to resolve this or just wait 
> until the 0.15.0 release. We could also consider using this Jira to at least 
> document a workaround, if one exists (maybe Cargo provides a mechanism for 
> overriding transitive dependencies?).
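
Cargo does offer mechanisms along these lines. The sketch below is an unverified workaround idea, under the assumption that the breakage comes from a newer parquet-format release satisfying the published parquet crate's semver requirement:

```toml
# Cargo.toml of the consuming project (hypothetical workaround sketch, not a
# confirmed fix). Because Cargo unifies semver-compatible versions, adding a
# direct pin on the transitive dependency forces the whole graph onto the
# pre-breakage release -- assuming that release still satisfies the parquet
# crate's own version requirement.
[dependencies]
parquet = "0.14.1"
parquet-format = "=2.5.0"
```

Alternatively, `cargo update -p parquet-format --precise 2.5.0` pins the version in Cargo.lock without touching Cargo.toml, under the same assumption.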





[jira] [Updated] (ARROW-6264) [Java] There is no need to consider byte order in ArrowBufHasher

2019-08-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6264:
--
Labels: pull-request-available  (was: )

> [Java] There is no need to consider byte order in ArrowBufHasher
> 
>
> Key: ARROW-6264
> URL: https://issues.apache.org/jira/browse/ARROW-6264
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
>
> According to the discussion in 
> https://github.com/apache/arrow/pull/5063#issuecomment-521276547, Arrow has 
> a mechanism to ensure that data is stored in little-endian byte order, so 
> there is no need to check the byte order.





[jira] [Created] (ARROW-6264) [Java] There is no need to consider byte order in ArrowBufHasher

2019-08-15 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6264:
---

 Summary: [Java] There is no need to consider byte order in 
ArrowBufHasher
 Key: ARROW-6264
 URL: https://issues.apache.org/jira/browse/ARROW-6264
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


According to the discussion in 
https://github.com/apache/arrow/pull/5063#issuecomment-521276547, Arrow has a 
mechanism to ensure that data is stored in little-endian byte order, so there 
is no need to check the byte order.
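As a toy illustration of the reasoning (a Python sketch, not the Java ArrowBufHasher code), fixing the storage order means the hasher can consume raw buffer bytes directly:

```python
import struct

def hash_buffer(raw):
    """Toy byte-wise hash standing in for ArrowBufHasher (an assumption for
    illustration, not the real Java code)."""
    h = 0
    for b in raw:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

# Arrow fixes the storage order to little-endian ("<"), so equal values
# always serialize to identical raw bytes, and the hasher can consume the
# buffer directly without any byte-order normalization.
a = struct.pack("<q", 1234567890)
b = struct.pack("<q", 1234567890)
assert hash_buffer(a) == hash_buffer(b)

# Byte order would only matter if it could vary: the same value serialized
# big-endian yields different bytes.
assert struct.pack("<q", 1) != struct.pack(">q", 1)
```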





[jira] [Resolved] (ARROW-6185) [Java] Provide hash table based dictionary builder

2019-08-15 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6185.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5054
[https://github.com/apache/arrow/pull/5054]

> [Java] Provide hash table based dictionary builder
> --
>
> Key: ARROW-6185
> URL: https://issues.apache.org/jira/browse/ARROW-6185
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> This is related to ARROW-5862. We provide another type of dictionary builder 
> based on a hash table. Compared with a search-based dictionary encoder, a 
> hash-table-based encoder processes each new element in O(1) time, but 
> requires extra memory space.
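The time/space trade-off can be sketched in a few lines of Python (illustrative only, not the Arrow Java API):

```python
def build_dictionary(values):
    """Hash-table-based dictionary encoding sketch: each element is looked
    up and, if unseen, inserted in O(1) expected time, at the cost of the
    hash table's extra memory."""
    index = {}        # value -> dictionary position (the hash table)
    dictionary = []   # distinct values in first-seen order
    encoded = []      # the input re-expressed as dictionary indices
    for v in values:
        pos = index.get(v)
        if pos is None:               # unseen value: grow the dictionary
            pos = len(dictionary)
            index[v] = pos
            dictionary.append(v)
        encoded.append(pos)
    return dictionary, encoded

dictionary, encoded = build_dictionary(["a", "b", "a", "c", "b"])
assert dictionary == ["a", "b", "c"]
assert encoded == [0, 1, 0, 2, 1]
```

A search-based encoder would instead scan or binary-search the dictionary for every element, trading the hash table's memory for extra lookup time.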





[jira] [Resolved] (ARROW-5862) [Java] Provide dictionary builder

2019-08-15 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5862.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4813
[https://github.com/apache/arrow/pull/4813]

> [Java] Provide dictionary builder
> -
>
> Key: ARROW-5862
> URL: https://issues.apache.org/jira/browse/ARROW-5862
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> The dictionary builder serves the following scenario, frequently encountered 
> in practice when dictionary encoding is involved: the dictionary values are 
> not known a priori, so they are determined dynamically as new data arrive 
> continually.
> In particular, when a new value arrives, it is checked against the 
> dictionary. If it is already present, it is simply ignored; otherwise, it is 
> added to the dictionary.
>  
> When all values have been evaluated, the dictionary is complete, and 
> encoding can start.
> The code snippet using a dictionary builder should be like this:
> {{DictionaryBuilder dictionaryBuilder = ...}}
> {{dictionaryBuilder.startBuild();}}
> {{...}}
> {{dictionaryBuilder.addValue(newValue);}}
> {{...}}
> {{dictionaryBuilder.endBuild();}}





[jira] [Resolved] (ARROW-6199) [Java] Avro adapter avoid potential resource leak.

2019-08-15 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6199.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5059
[https://github.com/apache/arrow/pull/5059]

> [Java] Avro adapter avoid potential resource leak.
> --
>
> Key: ARROW-6199
> URL: https://issues.apache.org/jira/browse/ARROW-6199
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Currently, the avro consumer interface has no close API, which may cause a 
> resource leak, e.g. via {{AvroBytesConsumer#cacheBuffer}}.
> To resolve this, make consumers extend {{AutoCloseable}} and create a 
> {{CompositeAvroConsumer}} to encompass the consume and close logic. 
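The composite idea can be sketched in Python (names loosely mirror the issue; the API shown is an assumption, not the Arrow Java code):

```python
class BytesConsumer:
    """Toy consumer with a cached buffer, loosely mirroring the issue's
    AvroBytesConsumer#cacheBuffer."""
    def __init__(self):
        self.cache = []        # stands in for the cached buffer
        self.closed = False
    def consume(self, value):
        self.cache.append(value)
    def close(self):
        self.cache.clear()     # release the cached resource
        self.closed = True

class CompositeConsumer:
    """Bundles per-field consumers so consume and close logic live in one
    place, mirroring the CompositeAvroConsumer idea."""
    def __init__(self, consumers):
        self._consumers = consumers
    def consume(self, record):
        for consumer, value in zip(self._consumers, record):
            consumer.consume(value)
    def close(self):           # one call releases every child's resources
        for consumer in self._consumers:
            consumer.close()

fields = [BytesConsumer(), BytesConsumer()]
composite = CompositeConsumer(fields)
composite.consume(("x", "y"))
composite.close()
assert all(c.closed for c in fields)
```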





[jira] [Assigned] (ARROW-5952) [Python] Segfault when reading empty table with category as pandas dataframe

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5952:
---

Assignee: Joris Van den Bossche

> [Python] Segfault when reading empty table with category as pandas dataframe
> 
>
> Key: ARROW-5952
> URL: https://issues.apache.org/jira/browse/ARROW-5952
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1
> Environment: Linux 3.10.0-327.36.3.el7.x86_64
> Python 3.6.8
> Pandas 0.24.2
> Pyarrow 0.14.0
>Reporter: Daniel Nugent
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I have two short sample programs which demonstrate the issue:
> {code:java}
> import pyarrow as pa
> import pandas as pd
> empty = pd.DataFrame({'foo':[]},dtype='category')
> table = pa.Table.from_pandas(empty)
> outfile = pa.output_stream('bar')
> writer = pa.RecordBatchFileWriter(outfile,table.schema)
> writer.write(table)
> writer.close()
> {code}
> {code:java}
> import pyarrow as pa
> pa.ipc.open_file('bar').read_pandas()
> Segmentation fault
> {code}
> My apologies if this was already reported elsewhere, I searched but could not 
> find an issue which seemed to refer to the same behavior.





[jira] [Resolved] (ARROW-5952) [Python] Segfault when reading empty table with category as pandas dataframe

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5952.
-
Resolution: Fixed

Issue resolved by pull request 5081
[https://github.com/apache/arrow/pull/5081]

> [Python] Segfault when reading empty table with category as pandas dataframe
> 
>
> Key: ARROW-5952
> URL: https://issues.apache.org/jira/browse/ARROW-5952
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0, 0.14.1
> Environment: Linux 3.10.0-327.36.3.el7.x86_64
> Python 3.6.8
> Pandas 0.24.2
> Pyarrow 0.14.0
>Reporter: Daniel Nugent
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I have two short sample programs which demonstrate the issue:
> {code:java}
> import pyarrow as pa
> import pandas as pd
> empty = pd.DataFrame({'foo':[]},dtype='category')
> table = pa.Table.from_pandas(empty)
> outfile = pa.output_stream('bar')
> writer = pa.RecordBatchFileWriter(outfile,table.schema)
> writer.write(table)
> writer.close()
> {code}
> {code:java}
> import pyarrow as pa
> pa.ipc.open_file('bar').read_pandas()
> Segmentation fault
> {code}
> My apologies if this was already reported elsewhere, I searched but could not 
> find an issue which seemed to refer to the same behavior.





[jira] [Resolved] (ARROW-6262) [Developer] Show JIRA issue before merging

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6262.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5097
[https://github.com/apache/arrow/pull/5097]

> [Developer] Show JIRA issue before merging
> --
>
> Key: ARROW-6262
> URL: https://issues.apache.org/jira/browse/ARROW-6262
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It's useful to confirm whether the associated JIRA issue is right or not.
> 
> We failed to catch a wrongly associated JIRA issue before merging pull 
> request https://github.com/apache/arrow/pull/5050 .





[jira] [Resolved] (ARROW-6212) [Java] Support vector rank operation

2019-08-15 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6212.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5066
[https://github.com/apache/arrow/pull/5066]

> [Java] Support vector rank operation
> 
>
> Key: ARROW-6212
> URL: https://issues.apache.org/jira/browse/ARROW-6212
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Given an unsorted vector, we want to get the index of the ith smallest 
> element in the vector. This is supported by the rank operation. 
> We provide an implementation that gets the index with the desired rank 
> without sorting the vector (the vector is left intact); the implementation 
> takes O(n) time, where n is the vector length.
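A standard way to get O(n) expected time without sorting is quickselect; the sketch below is illustrative Python, not the Arrow Java implementation:

```python
import random

def index_of_rank(values, rank):
    """Return the index, in the original vector, of the element with the
    given rank (rank 0 = smallest) without sorting the input.
    Expected O(n) quickselect on indices; the input list stays intact."""
    candidates = list(range(len(values)))   # work on indices, not values
    while True:
        pivot = values[random.choice(candidates)]
        less = [i for i in candidates if values[i] < pivot]
        equal = [i for i in candidates if values[i] == pivot]
        if rank < len(less):
            candidates = less                   # target is below the pivot
        elif rank < len(less) + len(equal):
            return equal[rank - len(less)]      # pivot bucket holds the target
        else:
            rank -= len(less) + len(equal)      # target is above the pivot
            candidates = [i for i in candidates if values[i] > pivot]

v = [42, 7, 19, 3, 25]
assert v[index_of_rank(v, 0)] == 3     # smallest element
assert v[index_of_rank(v, 2)] == 19    # third smallest
assert v == [42, 7, 19, 3, 25]         # vector left untouched
```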





[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908648#comment-16908648
 ] 

Wes McKinney commented on ARROW-6058:
-

Thank you, that's great! I added it to the 0.15.0 milestone. I've been working 
a lot on Parquet lately, so if no one looks at it first I'll try to look 
before the release horizon closes.

> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files of 90 MB each (approx. 3 GB)
> Number of records: approx. 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: Same code works on relatively smaller dataset (approx < 50M records)* 
>  
>  





[jira] [Updated] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6058:

Fix Version/s: 0.15.0

> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files of 90 MB each (approx. 3 GB)
> Number of records: approx. 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: Same code works on relatively smaller dataset (approx < 50M records)* 
>  
>  





[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-15 Thread Wong Chung Hoi (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908622#comment-16908622
 ] 

Wong Chung Hoi edited comment on ARROW-6058 at 8/16/19 2:39 AM:


Hi all,

below is a simple piece of code to reproduce the issue using:

 
{code:java}
s3fs==0.3.3
pyarrow==0.14.1
pandas==0.24.0 
{code}
 

The file generated is roughly 170MB

 
{code:java}
import pandas as pd
import numpy as np
pd.DataFrame(np.random.randint(0, 1, (1000, 10)), columns=[str(i) for i 
in range(10)]).to_parquet('s3://path/to/file.snappy.parquet')
pd.read_parquet('s3://path/to/file.snappy.parquet')
{code}
{code:java}
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 282, in read_parquet
 return impl.read(path, columns=columns, **kwargs)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 129, in read
 **kwargs).to_pandas()
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 1216, in read_table
 use_pandas_metadata=use_pandas_metadata)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 216, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1086, in 
pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
 pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) 
than expected (979599){code}
 


was (Author: hoi):
Hi all,

below is a simple piece of code to reproduce the issue using:

 
{code:java}
s3fs==0.3.3
pyarrow==0.14.1
pandas==0.24.0 
{code}
 

The file generated is roughly 170MB

 
{code:java}
import pandas as pd
>>> import numpy as np
>>> pd.DataFrame(np.random.randint(0, 1, (1000, 10)), columns=[str(i) 
>>> for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet')
>>> pd.read_parquet('s3://path/to/file.snappy.parquet')
{code}
{code:java}
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 282, in read_parquet
 return impl.read(path, columns=columns, **kwargs)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 129, in read
 **kwargs).to_pandas()
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 1216, in read_table
 use_pandas_metadata=use_pandas_metadata)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 216, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1086, in 
pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
 pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) 
than expected (979599){code}
 

> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files of 90 MB each (approx. 3 GB)
> Number of records: approx. 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: Same code works on relatively smaller dataset (approx < 50M records)* 
>  
>  





[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908642#comment-16908642
 ] 

Wes McKinney commented on ARROW-6038:
-

I confirmed that the MWE is behaving properly now

{code}
$ python ~/Downloads/segfault_ex.py 
Creating table
Traceback (most recent call last):
  File "/home/wesm/Downloads/segfault_ex.py", line 11, in <module>
pa.RecordBatch.from_arrays([pa.array(["C", "C", "C"])], schema),
  File "pyarrow/table.pxi", line 1117, in pyarrow.lib.Table.from_batches
return pyarrow_wrap_table(c_table)
  File "pyarrow/public-api.pxi", line 316, in pyarrow.lib.pyarrow_wrap_table
check_status(ctable.get().Validate())
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: Column 0: In chunk 1 expected type string but saw null
{code}

This is still weird and dangerous though:

{code}
In [4]: pa.RecordBatch.from_arrays([pa.array([])], schema)
Out[4]: <pyarrow.lib.RecordBatch object at 0x...>

In [5]: rb = pa.RecordBatch.from_arrays([pa.array([])], schema)

In [6]: rb
Out[6]: <pyarrow.lib.RecordBatch object at 0x...>

In [7]: rb.schema
Out[7]: col: string

In [8]: rb[0]
Out[8]: <pyarrow.lib.NullArray object at 0x...>
0 nulls
{code}

I opened ARROW-6263

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the 
> batches were empty
> -
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0, 0.14.0, 0.14.1
>Reporter: Piotr Bajger
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available, windows
> Fix For: 0.15.0
>
> Attachments: segfault_ex.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When creating a Table from a list/iterator of batches that contains an 
> "empty" RecordBatch, a Table is produced, but attempts to run any pyarrow 
> built-in function (such as unique()) occasionally result in a segfault.
> The MWE is attached: [^segfault_ex.py]
>  # The segfaults happen randomly, around 30% of the time.
>  # Commenting out line 10 in the MWE results in no segfaults.
>  # The segfault is triggered using the unique() function, but I doubt the 
> behaviour is specific to that function; from what I gather, the problem lies 
> in Table creation.
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip 
> (problem also occurs with 0.13.0 from conda-forge).





[jira] [Resolved] (ARROW-6219) [Java] Add API for JDBC adapter that can convert less than the full result set at a time.

2019-08-15 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6219.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5075
[https://github.com/apache/arrow/pull/5075]

> [Java] Add API for JDBC adapter that can convert less than the full result 
> set at a time.
> -
>
> Key: ARROW-6219
> URL: https://issues.apache.org/jira/browse/ARROW-6219
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> Somehow we should configure the number of rows per batch and either let 
> clients iterate or provide an iterator API. Otherwise, for large result sets 
> we might run out of memory.
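The iterator idea can be sketched in a few lines (illustrative Python, not the actual JDBC adapter API):

```python
def batches(result_set, rows_per_batch):
    """Yield rows from a result set a fixed number at a time, so at most
    one batch is held in memory at once."""
    batch = []
    for row in result_set:            # any row iterator works here
        batch.append(row)
        if len(batch) == rows_per_batch:
            yield batch               # hand off a full batch to the client
            batch = []
    if batch:                         # flush the final, possibly short batch
        yield batch

# A large result set is consumed in bounded-memory chunks:
sizes = [len(b) for b in batches(iter(range(10)), rows_per_batch=4)]
assert sizes == [4, 4, 2]
```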





[jira] [Created] (ARROW-6263) [Python] RecordBatch.from_arrays does not check array types against a passed schema

2019-08-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6263:
---

 Summary: [Python] RecordBatch.from_arrays does not check array 
types against a passed schema
 Key: ARROW-6263
 URL: https://issues.apache.org/jira/browse/ARROW-6263
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0


Example came from ARROW-6038

{code}
In [4]: pa.RecordBatch.from_arrays([pa.array([])], schema)
Out[4]: <pyarrow.lib.RecordBatch object at 0x...>

In [5]: rb = pa.RecordBatch.from_arrays([pa.array([])], schema)

In [6]: rb
Out[6]: <pyarrow.lib.RecordBatch object at 0x...>

In [7]: rb.schema
Out[7]: col: string

In [8]: rb[0]
Out[8]: <pyarrow.lib.NullArray object at 0x...>
0 nulls
{code}





[jira] [Updated] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6038:

Component/s: Python
 C++

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the 
> batches were empty
> -
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0, 0.14.0, 0.14.1
>Reporter: Piotr Bajger
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available, windows
> Fix For: 0.15.0
>
> Attachments: segfault_ex.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When creating a Table from a list/iterator of batches that contains an 
> "empty" RecordBatch, a Table is produced, but attempts to run any pyarrow 
> built-in function (such as unique()) occasionally result in a segfault.
> The MWE is attached: [^segfault_ex.py]
>  # The segfaults happen randomly, around 30% of the time.
>  # Commenting out line 10 in the MWE results in no segfaults.
>  # The segfault is triggered using the unique() function, but I doubt the 
> behaviour is specific to that function; from what I gather, the problem lies 
> in Table creation.
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip 
> (problem also occurs with 0.13.0 from conda-forge).





[jira] [Assigned] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6038:
---

Assignee: Antoine Pitrou

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the 
> batches were empty
> -
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0, 0.14.1
>Reporter: Piotr Bajger
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available, windows
> Fix For: 0.15.0
>
> Attachments: segfault_ex.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When creating a Table from a list/iterator of batches that contains an 
> "empty" RecordBatch, a Table is produced, but attempts to run any pyarrow 
> built-in function (such as unique()) occasionally result in a segfault.
> The MWE is attached: [^segfault_ex.py]
>  # The segfaults happen randomly, around 30% of the time.
>  # Commenting out line 10 in the MWE results in no segfaults.
>  # The segfault is triggered using the unique() function, but I doubt the 
> behaviour is specific to that function; from what I gather, the problem lies 
> in Table creation.
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip 
> (problem also occurs with 0.13.0 from conda-forge).





[jira] [Resolved] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6038.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4983
[https://github.com/apache/arrow/pull/4983]

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the 
> batches were empty
> -
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0, 0.14.1
>Reporter: Piotr Bajger
>Priority: Minor
>  Labels: pull-request-available, windows
> Fix For: 0.15.0
>
> Attachments: segfault_ex.py
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When creating a Table from a list/iterator of batches which contains an 
> "empty" RecordBatch, a Table is produced, but attempts to run any pyarrow 
> built-in function (such as unique()) occasionally result in a segfault.
> The MWE is attached: [^segfault_ex.py]
>  # The segfaults happen randomly, around 30% of the time.
>  # Commenting out line 10 in the MWE results in no segfaults.
>  # The segfault is triggered using the unique() function, but I doubt the 
> behaviour is specific to that function; from what I gather, the problem lies 
> in Table creation.
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip 
> (problem also occurs with 0.13.0 from conda-forge).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper

2019-08-15 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6249.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5093
[https://github.com/apache/arrow/pull/5093]

> [Java] Remove useless class ByteArrayWrapper
> 
>
> Key: ARROW-6249
> URL: https://issues.apache.org/jira/browse/ARROW-6249
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This class was introduced into the encoding part to compare byte[] values for 
> equality.
> Since we now compare value/vector equality via the newly added visitor API 
> from ARROW-6022 instead of comparing {{getObject}}, this class is no longer 
> needed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-15 Thread Wong Chung Hoi (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908622#comment-16908622
 ] 

Wong Chung Hoi edited comment on ARROW-6058 at 8/16/19 2:03 AM:


Hi all,

below is a simple piece of code to reproduce the issue using:

 
{code:java}
s3fs==0.3.3
pyarrow==0.14.1
pandas==0.24.0 
{code}
 

The file generated is roughly 170MB

 
{code:java}
import pandas as pd
import numpy as np
pd.DataFrame(np.random.randint(0, 1, (1000, 10)),
             columns=[str(i) for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet')
pd.read_parquet('s3://path/to/file.snappy.parquet')
{code}
{code:java}
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 282, in read_parquet
 return impl.read(path, columns=columns, **kwargs)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 129, in read
 **kwargs).to_pandas()
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 1216, in read_table
 use_pandas_metadata=use_pandas_metadata)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 216, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1086, in 
pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
 pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) 
than expected (979599){code}
 


was (Author: hoi):
Hi all,

below is a simple piece of code to reproduce the issue using:

s3fs==0.3.3

pyarrow==0.14.1

pandas==0.24.0

 

The file generated is roughly 170MB

```

import pandas as pd
import numpy as np
pd.DataFrame(np.random.randint(0, 1, (1000, 10)),
             columns=[str(i) for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet')
pd.read_parquet('s3://path/to/file.snappy.parquet')

```

```
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 282, in read_parquet
 return impl.read(path, columns=columns, **kwargs)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 129, in read
 **kwargs).to_pandas()
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 1216, in read_table
 use_pandas_metadata=use_pandas_metadata)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 216, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1086, in 
pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) 
than expected (979599)

```

 

> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: Same code works on relatively smaller dataset (approx < 50M records)* 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-15 Thread Wong Chung Hoi (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908622#comment-16908622
 ] 

Wong Chung Hoi commented on ARROW-6058:
---

Hi all,

below is a simple piece of code to reproduce the issue using:

s3fs==0.3.3

pyarrow==0.14.1

pandas==0.24.0

 

The file generated is roughly 170MB

```

import pandas as pd
import numpy as np
pd.DataFrame(np.random.randint(0, 1, (1000, 10)),
             columns=[str(i) for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet')
pd.read_parquet('s3://path/to/file.snappy.parquet')

```

```
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 282, in read_parquet
 return impl.read(path, columns=columns, **kwargs)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py",
 line 129, in read
 **kwargs).to_pandas()
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 1216, in read_table
 use_pandas_metadata=use_pandas_metadata)
 File 
"/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py",
 line 216, in read
 use_threads=use_threads)
 File "pyarrow/_parquet.pyx", line 1086, in 
pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) 
than expected (979599)

```

 

> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: Same code works on relatively smaller dataset (approx < 50M records)* 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6130) [Release] Use 0.15.0 as the next release

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6130.
-
Resolution: Fixed

Issue resolved by pull request 5007
[https://github.com/apache/arrow/pull/5007]

> [Release] Use 0.15.0 as the next release
> 
>
> Key: ARROW-6130
> URL: https://issues.apache.org/jira/browse/ARROW-6130
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6262) [Developer] Show JIRA issue before merging

2019-08-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6262:
--
Labels: pull-request-available  (was: )

> [Developer] Show JIRA issue before merging
> --
>
> Key: ARROW-6262
> URL: https://issues.apache.org/jira/browse/ARROW-6262
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Minor
>  Labels: pull-request-available
>
> It's useful to confirm whether the associated JIRA issue is right or not.
> 
> We failed to notice a wrongly associated JIRA issue before we merged the pull 
> request https://github.com/apache/arrow/pull/5050 .



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6262) [Developer] Show JIRA issue before merging

2019-08-15 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-6262:
---

 Summary: [Developer] Show JIRA issue before merging
 Key: ARROW-6262
 URL: https://issues.apache.org/jira/browse/ARROW-6262
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei


It's useful to confirm whether the associated JIRA issue is right or not.

We failed to notice a wrongly associated JIRA issue before we merged the pull 
request https://github.com/apache/arrow/pull/5050 .




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6259.
-
Resolution: Fixed

Issue resolved by pull request 5096
[https://github.com/apache/arrow/pull/5096]

> [C++][CI] Flatbuffers-related failures in CI on macOS
> -
>
> Key: ARROW-6259
> URL: https://issues.apache.org/jira/browse/ARROW-6259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This seemingly has just started happening randomly today
> https://travis-ci.org/apache/arrow/jobs/572381802#L2864



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908555#comment-16908555
 ] 

Wes McKinney commented on ARROW-4844:
-

I opened https://issues.apache.org/jira/browse/ARROW-6261 to be the umbrella 
issue for the project

> Static libarrow is missing vendored libdouble-conversion
> 
>
> Key: ARROW-4844
> URL: https://issues.apache.org/jira/browse/ARROW-4844
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Priority: Major
>
> When trying to statically link the R bindings to libarrow.a, I get linking 
> errors which suggest that libdouble-conversion.a was not properly embedded in 
> libarrow.a. This problem happens on both MacOS and Windows.
> Here is the arrow build log: 
> https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7
> {code}
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6261) [C++] Install any bundled components and add installed CMake or pkgconfig configuration to enable downstream linkers to utilize bundled libraries when statically linking

2019-08-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6261:
---

 Summary: [C++] Install any bundled components and add installed 
CMake or pkgconfig configuration to enable downstream linkers to utilize 
bundled libraries when statically linking
 Key: ARROW-6261
 URL: https://issues.apache.org/jira/browse/ARROW-6261
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


The objective of this change would be to make it easier for toolchain builders 
to ship bundled thirdparty libraries together with the Arrow libraries in case 
there is a particular library version that is only used when linking with 
{{libarrow.a}}. In theory configuration could be added to arrowTargets.cmake 
(or pkgconfig) to simplify static linking
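The pkgconfig half of this can be sketched as follows; the field values and library names below are assumptions for illustration, not Arrow's actual generated arrow.pc:

```
# Hypothetical arrow.pc fragment (illustration only): bundled static
# dependencies listed in Libs.private would be emitted by
#   pkg-config --static --libs arrow
# so a downstream static link picks them up automatically.
Libs: -L${libdir} -larrow
Libs.private: -ldouble-conversion -ljemalloc
```

The CMake side would be analogous: the installed arrowTargets.cmake could declare imported targets for the bundled archives and attach them to the static Arrow target's INTERFACE_LINK_LIBRARIES.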



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908546#comment-16908546
 ] 

Wes McKinney commented on ARROW-4844:
-

If you're statically linking, that's the correct approach for right now. 
Shipping a complete vendored library toolchain is probably a fairly extensive 
project, so a volunteer is free to take up that work in the future. 

> Static libarrow is missing vendored libdouble-conversion
> 
>
> Key: ARROW-4844
> URL: https://issues.apache.org/jira/browse/ARROW-4844
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Priority: Major
>
> When trying to statically link the R bindings to libarrow.a, I get linking 
> errors which suggest that libdouble-conversion.a was not properly embedded in 
> libarrow.a. This problem happens on both MacOS and Windows.
> Here is the arrow build log: 
> https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7
> {code}
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion

2019-08-15 Thread Jeroen (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908545#comment-16908545
 ] 

Jeroen commented on ARROW-4844:
---

I'm working around it now by linking an external libdouble-conversion rather 
than the vendored one.

> Static libarrow is missing vendored libdouble-conversion
> 
>
> Key: ARROW-4844
> URL: https://issues.apache.org/jira/browse/ARROW-4844
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Priority: Major
>
> When trying to statically link the R bindings to libarrow.a, I get linking 
> errors which suggest that libdouble-conversion.a was not properly embedded in 
> libarrow.a. This problem happens on both MacOS and Windows.
> Here is the arrow build log: 
> https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7
> {code}
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4844.
---
Resolution: Not A Problem
  Assignee: (was: Uwe L. Korn)

If I'm not mistaken, this issue is not causing problems anymore

> Static libarrow is missing vendored libdouble-conversion
> 
>
> Key: ARROW-4844
> URL: https://issues.apache.org/jira/browse/ARROW-4844
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Priority: Major
>
> When trying to statically link the R bindings to libarrow.a, I get linking 
> errors which suggest that libdouble-conversion.a was not properly embedded in 
> libarrow.a. This problem happens on both MacOS and Windows.
> Here is the arrow build log: 
> https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7
> {code}
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6015) [Python] pyarrow: `DLL load failed` when importing on windows

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6015:

Fix Version/s: 0.15.0

> [Python] pyarrow:  `DLL load failed` when importing on windows
> --
>
> Key: ARROW-6015
> URL: https://issues.apache.org/jira/browse/ARROW-6015
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.14.1
>Reporter: Ruslan Kuprieiev
>Priority: Major
> Fix For: 0.15.0
>
>
> When installing pyarrow 0.14.1 on windows 10 x64 with python 3.7, you get:
> >>> import pyarrow
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> <module>
>     from pyarrow.lib import cpu_count, set_cpu_count
>   ImportError: DLL load failed: The specified module could not be found.
>  On 0.14.0 everything works fine.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-3243) [C++] Upgrade jemalloc to version 5

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3243:
---

Assignee: Antoine Pitrou

> [C++] Upgrade jemalloc to version 5
> ---
>
> Key: ARROW-3243
> URL: https://issues.apache.org/jira/browse/ARROW-3243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Assignee: Antoine Pitrou
>Priority: Major
>
> Is it possible/feasible to upgrade jemalloc to version 5 and assume that 
> version? I'm asking because I've been working towards replacing dlmalloc in 
> plasma with jemalloc, which makes some of the code much nicer and removes 
> some of the issues we had with dlmalloc, but it requires jemalloc APIs that 
> are only available starting from jemalloc version 5, in particular, I'm using 
> the extent_hooks_t capability.
> For now I can submit a patch that uses a different version of jemalloc in 
> plasma and then we can figure out how to deal with it (maybe there is a way 
> to make it work with older versions). What are your thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3243) [C++] Upgrade jemalloc to version 5

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908542#comment-16908542
 ] 

Wes McKinney commented on ARROW-3243:
-

See 
https://github.com/apache/arrow/commit/f913d8f0adff71c288a10f6c1b0ad2d1ab3e9e32

> [C++] Upgrade jemalloc to version 5
> ---
>
> Key: ARROW-3243
> URL: https://issues.apache.org/jira/browse/ARROW-3243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>
> Is it possible/feasible to upgrade jemalloc to version 5 and assume that 
> version? I'm asking because I've been working towards replacing dlmalloc in 
> plasma with jemalloc, which makes some of the code much nicer and removes 
> some of the issues we had with dlmalloc, but it requires jemalloc APIs that 
> are only available starting from jemalloc version 5, in particular, I'm using 
> the extent_hooks_t capability.
> For now I can submit a patch that uses a different version of jemalloc in 
> plasma and then we can figure out how to deal with it (maybe there is a way 
> to make it work with older versions). What are your thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-3243) [C++] Upgrade jemalloc to version 5

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3243.
-
Resolution: Fixed

We're using jemalloc 5.2.0 now

> [C++] Upgrade jemalloc to version 5
> ---
>
> Key: ARROW-3243
> URL: https://issues.apache.org/jira/browse/ARROW-3243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>
> Is it possible/feasible to upgrade jemalloc to version 5 and assume that 
> version? I'm asking because I've been working towards replacing dlmalloc in 
> plasma with jemalloc, which makes some of the code much nicer and removes 
> some of the issues we had with dlmalloc, but it requires jemalloc APIs that 
> are only available starting from jemalloc version 5, in particular, I'm using 
> the extent_hooks_t capability.
> For now I can submit a patch that uses a different version of jemalloc in 
> plasma and then we can figure out how to deal with it (maybe there is a way 
> to make it work with older versions). What are your thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5980) [Python] Missing libarrow.so and libarrow_python.so in wheel file

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908539#comment-16908539
 ] 

Wes McKinney commented on ARROW-5980:
-

setuptools does not understand symlinks during the wheel build -- previously 
the shared libraries were being duplicated inside the wheel instead of 
symlinked. If you can resolve the issue without duplicating the shared 
libraries, please submit a PR.
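The symlink-vs-copy distinction is easy to check from Python. The sketch below is illustrative only (it simulates a package directory with `tempfile` rather than inspecting an actual wheel): `os.path.islink` reports whether a shared-library entry is a preserved symlink or a duplicated regular file, which is what setuptools currently produces.

```python
import os
import tempfile

def classify_shared_libs(directory):
    """Map each shared-library entry in `directory` to 'symlink' or 'copy'."""
    return {
        name: "symlink" if os.path.islink(os.path.join(directory, name)) else "copy"
        for name in sorted(os.listdir(directory))
        if ".so" in name
    }

# Simulate a wheel's package directory: the versioned library is the real
# file; libarrow.so should be a symlink to it, but a setuptools-built wheel
# would ship a second full copy instead.
with tempfile.TemporaryDirectory() as d:
    real = os.path.join(d, "libarrow.so.14")
    with open(real, "wb") as f:
        f.write(b"\x7fELF")  # stand-in bytes for the actual library
    os.symlink(real, os.path.join(d, "libarrow.so"))
    print(classify_shared_libs(d))
    # {'libarrow.so': 'symlink', 'libarrow.so.14': 'copy'}
```

Running the same check against a real pyarrow installation directory would show every `.so` entry as `'copy'`, confirming the duplication described above.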

> [Python] Missing libarrow.so and libarrow_python.so in wheel file
> -
>
> Key: ARROW-5980
> URL: https://issues.apache.org/jira/browse/ARROW-5980
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Haowei Yu
>Priority: Major
>  Labels: wheel
>
> I have installed the pyarrow 0.14.0 but it seems that by default you did not 
> provide symlink of libarrow.so and libarrow_python.so. Only .so file with 
> version suffix is provided. Hence, I cannot use the output of 
> pyarrow.get_libraries() and pyarrow.get_library_dirs() to build my link 
> option. 
> If you provide symlink, I can pass following to the linker to specify the 
> library to link. e.g. g++ -L/ -larrow -larrow_python 
> However, right now, the ld output complains about not being able to find 
> -larrow and -larrow_python.





[jira] [Commented] (ARROW-5980) [Python] Missing libarrow.so and libarrow_python.so in wheel file

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908540#comment-16908540
 ] 

Wes McKinney commented on ARROW-5980:
-

In the meantime I would suggest developing against the conda packages, which 
don't have this issue.

> [Python] Missing libarrow.so and libarrow_python.so in wheel file
> -
>
> Key: ARROW-5980
> URL: https://issues.apache.org/jira/browse/ARROW-5980
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Haowei Yu
>Priority: Major
>  Labels: wheel
>
> I have installed the pyarrow 0.14.0 but it seems that by default you did not 
> provide symlink of libarrow.so and libarrow_python.so. Only .so file with 
> version suffix is provided. Hence, I cannot use the output of 
> pyarrow.get_libraries() and pyarrow.get_library_dirs() to build my link 
> option. 
> If you provide symlink, I can pass following to the linker to specify the 
> library to link. e.g. g++ -L/ -larrow -larrow_python 
> However, right now, the ld output complains about not being able to find 
> -larrow and -larrow_python.





[jira] [Updated] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6119:

Fix Version/s: 0.15.0

> [Python] PyArrow import fails on Windows Python 3.7
> ---
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
> Fix For: 0.15.0
>
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.





[jira] [Resolved] (ARROW-6170) [R] "docker-compose build r" is slow

2019-08-15 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6170.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5039
[https://github.com/apache/arrow/pull/5039]

> [R] "docker-compose build r" is slow
> 
>
> Key: ARROW-6170
> URL: https://issues.apache.org/jira/browse/ARROW-6170
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, R
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Apparently it installs and compiles all packages in single-thread mode.





[jira] [Updated] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS

2019-08-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6259:
--
Labels: pull-request-available  (was: )

> [C++][CI] Flatbuffers-related failures in CI on macOS
> -
>
> Key: ARROW-6259
> URL: https://issues.apache.org/jira/browse/ARROW-6259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> This seemingly has just started happening randomly today
> https://travis-ci.org/apache/arrow/jobs/572381802#L2864





[jira] [Created] (ARROW-6260) [Website] Use deploy key on Travis to build and push to asf-site

2019-08-15 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6260:
--

 Summary: [Website] Use deploy key on Travis to build and push to 
asf-site
 Key: ARROW-6260
 URL: https://issues.apache.org/jira/browse/ARROW-6260
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson


ARROW-4473 added CI/CD for the website, but there was some discomfort about 
having a committer provide a GitHub personal access token to do the pushing of 
the built site to the asf-site branch. Investigate using GitHub Deploy Keys 
instead, which are scoped to a single repository, not all public repositories 
that a user has access to.





[jira] [Commented] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908497#comment-16908497
 ] 

Wes McKinney commented on ARROW-6259:
-

Reported upstream to Flatbuffers

https://github.com/google/flatbuffers/issues/5482

> [C++][CI] Flatbuffers-related failures in CI on macOS
> -
>
> Key: ARROW-6259
> URL: https://issues.apache.org/jira/browse/ARROW-6259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 0.15.0
>
>
> This seemingly has just started happening randomly today
> https://travis-ci.org/apache/arrow/jobs/572381802#L2864





[jira] [Updated] (ARROW-6258) [R] Add macOS build scripts

2019-08-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6258:
--
Labels: pull-request-available  (was: )

> [R] Add macOS build scripts
> ---
>
> Key: ARROW-6258
> URL: https://issues.apache.org/jira/browse/ARROW-6258
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>
> CRAN builds binary packages for Windows and macOS. It generally does this by 
> building on its servers and bundling all dependencies in the R package. This 
> has been accomplished by having separate processes for building and hosting 
> system dependencies, and then downloading and bundling those with scripts 
> that get executed at install time (and then create the binary package as a 
> side effect).
> ARROW-3758 added the Windows PKGBUILD and related packaging scripts and ran 
> them on our Appveyor. This ticket is to do the same for the macOS scripts.
> The purpose of these tickets is to bring the whole build pipeline under our 
> version control and CI so that we can address any C++ build and dependency 
> changes as they arise and not be surprised when it comes time to cut a 
> release. A side benefit is that they also enable us to offer a nightly binary 
> package repository with minimal additional effort.





[jira] [Commented] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908493#comment-16908493
 ] 

Wes McKinney commented on ARROW-6259:
-

conda-forge confirms the compiler switch occurred this afternoon

https://gitter.im/conda-forge/conda-forge.github.io?at=5d55d1e0beba830fff9ce0b3

Probably we'll have to suppress the compiler warning.
> [C++][CI] Flatbuffers-related failures in CI on macOS
> -
>
> Key: ARROW-6259
> URL: https://issues.apache.org/jira/browse/ARROW-6259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 0.15.0
>
>
> This seemingly has just started happening randomly today
> https://travis-ci.org/apache/arrow/jobs/572381802#L2864





[jira] [Assigned] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6259:
---

Assignee: Wes McKinney

> [C++][CI] Flatbuffers-related failures in CI on macOS
> -
>
> Key: ARROW-6259
> URL: https://issues.apache.org/jira/browse/ARROW-6259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 0.15.0
>
>
> This seemingly has just started happening randomly today
> https://travis-ci.org/apache/arrow/jobs/572381802#L2864





[jira] [Commented] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908490#comment-16908490
 ] 

Wes McKinney commented on ARROW-6259:
-

Comparing 

* failure https://api.travis-ci.org/v3/job/572381802/log.txt
* success (1 commit prior) https://api.travis-ci.org/v3/job/572286191/log.txt

it appears that the conda toolchain upgraded from clang 4.0.1 to clang 8.0.0

> [C++][CI] Flatbuffers-related failures in CI on macOS
> -
>
> Key: ARROW-6259
> URL: https://issues.apache.org/jira/browse/ARROW-6259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 0.15.0
>
>
> This seemingly has just started happening randomly today
> https://travis-ci.org/apache/arrow/jobs/572381802#L2864





[jira] [Resolved] (ARROW-6204) [GLib] Add garrow_array_is_in_chunked_array()

2019-08-15 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6204.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5086
[https://github.com/apache/arrow/pull/5086]

> [GLib] Add garrow_array_is_in_chunked_array()
> -
>
> Key: ARROW-6204
> URL: https://issues.apache.org/jira/browse/ARROW-6204
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> This is follow-up of 
> [https://github.com/apache/arrow/pull/5047#issuecomment-520103706].





[jira] [Created] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS

2019-08-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6259:
---

 Summary: [C++][CI] Flatbuffers-related failures in CI on macOS
 Key: ARROW-6259
 URL: https://issues.apache.org/jira/browse/ARROW-6259
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


This seemingly has just started happening randomly today

https://travis-ci.org/apache/arrow/jobs/572381802#L2864





[jira] [Resolved] (ARROW-6186) [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package

2019-08-15 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6186.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5050
[https://github.com/apache/arrow/pull/5050]

> [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev 
> debian package
> ---
>
> Key: ARROW-6186
> URL: https://issues.apache.org/jira/browse/ARROW-6186
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma, Packaging
>Affects Versions: 0.14.1
>Reporter: Wannes G
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: debian, packaging, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install]
> Issue is still present on latest master branch, the debian install script is 
> correct: 
> [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install]
> The first line is missing from the ubuntu install script causing no headers 
> to be installed when apt-get is used to install libplasma-dev.





[jira] [Created] (ARROW-6258) [R] Add macOS build scripts

2019-08-15 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6258:
--

 Summary: [R] Add macOS build scripts
 Key: ARROW-6258
 URL: https://issues.apache.org/jira/browse/ARROW-6258
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson


CRAN builds binary packages for Windows and macOS. It generally does this by 
building on its servers and bundling all dependencies in the R package. This 
has been accomplished by having separate processes for building and hosting 
system dependencies, and then downloading and bundling those with scripts that 
get executed at install time (and then create the binary package as a side 
effect).

ARROW-3758 added the Windows PKGBUILD and related packaging scripts and ran 
them on our Appveyor. This ticket is to do the same for the macOS scripts.

The purpose of these tickets is to bring the whole build pipeline under our 
version control and CI so that we can address any C++ build and dependency 
changes as they arise and not be surprised when it comes time to cut a release. 
A side benefit is that they also enable us to offer a nightly binary package 
repository with minimal additional effort.





[jira] [Commented] (ARROW-5956) [R] Ability for R to link to C++ libraries from pyarrow Wheel

2019-08-15 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908480#comment-16908480
 ] 

Neal Richardson commented on ARROW-5956:


[~jeffreyw] could you try setting {{R_LD_LIBRARY_PATH}} instead of 
{{LD_LIBRARY_PATH}}? 
[https://github.com/apache/arrow/blob/master/r/README.Rmd#L132]

(For context, see discussion starting here: 
[https://github.com/apache/arrow/pull/5036#issuecomment-519703937])

> [R] Ability for R to link to C++ libraries from pyarrow Wheel
> -
>
> Key: ARROW-5956
> URL: https://issues.apache.org/jira/browse/ARROW-5956
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
> Environment: Ubuntu 16.04, R 3.4.4, python 3.6.5
>Reporter: Jeffrey Wong
>Priority: Major
>
> I have installed pyarrow 0.14.0 and want to be able to also use R arrow. In 
> my work I use rpy2 a lot to exchange python data structures with R data 
> structures, so would like R arrow to link against the exact same .so files 
> found in pyarrow
>  
>  
> When I pass in include_dir and lib_dir to R's configure, pointing to 
> pyarrow's include and pyarrow's root directories, I am able to compile R's 
> arrow.so file. However, I am unable to load it in an R session, getting the 
> error:
>  
> {code:java}
> > dyn.load('arrow.so')
> Error in dyn.load("arrow.so") :
>  unable to load shared object '/tmp/arrow2/r/src/arrow.so':
>  /tmp/arrow2/r/src/arrow.so: undefined symbol: 
> _ZNK5arrow11StructArray14GetFieldByNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE{code}
>  
>  
> Steps to reproduce:
>  
> Install pyarrow, which also ships libarrow.so and libparquet.so
>  
> {code:java}
> pip3 install pyarrow --upgrade --user
> PY_ARROW_PATH=$(python3 -c "import pyarrow, os; 
> print(os.path.dirname(pyarrow.__file__))")
> PY_ARROW_VERSION=$(python3 -c "import pyarrow; print(pyarrow.__version__)")
> ln -s $PY_ARROW_PATH/libarrow.so.14 $PY_ARROW_PATH/libarrow.so
> ln -s $PY_ARROW_PATH/libparquet.so.14 $PY_ARROW_PATH/libparquet.so
> {code}
>  
>  
> Add to LD_LIBRARY_PATH
>  
> {code:java}
> sudo tee -a /usr/lib/R/etc/ldpaths <<LINES
> LD_LIBRARY_PATH="\${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> export LD_LIBRARY_PATH
> LINES
> sudo tee -a /usr/lib/rstudio-server/bin/r-ldpath <<LINES
> LD_LIBRARY_PATH="\${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> export LD_LIBRARY_PATH
> LINES
> export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> {code}
>  
>  
> Install r arrow from source
> {code:java}
> git clone https://github.com/apache/arrow.git /tmp/arrow2
> cd /tmp/arrow2/r
> git checkout tags/apache-arrow-0.14.0
> R CMD INSTALL ./ --configure-vars="INCLUDE_DIR=$PY_ARROW_PATH/include 
> LIB_DIR=$PY_ARROW_PATH"{code}
>  
> I have noticed that the R package for arrow no longer has an RcppExports, but 
> instead an arrowExports. Could it be that the lack of RcppExports has made it 
> difficult to find GetFieldByName?





[jira] [Assigned] (ARROW-5134) [R][CI] Run nightly tests against multiple R versions

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-5134:
--

Assignee: Neal Richardson

> [R][CI] Run nightly tests against multiple R versions
> -
>
> Key: ARROW-5134
> URL: https://issues.apache.org/jira/browse/ARROW-5134
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Krisztian Szucs
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> This requires to fix the docker-compose build of R, which is failing 
> currently:
> https://travis-ci.org/kszucs/crossbow/builds/508343597
> Reproducible locally with command:
> {code}
> docker-compose build cpp
> docker-compose build r
> docker-compose run r
> {code}
> Then introduce an {{R_VERSION}} build argument to the dockerfile, similarly 
> like
> the python docker-compose defines and uses {{PYTHON_VERSION}}, see:
> - https://github.com/apache/arrow/blob/master/python/Dockerfile#L21
> - https://github.com/apache/arrow/blob/master/docker-compose.yml#L247-L259
> Then add to the nightly builds, similarly like python:
> - https://github.com/apache/arrow/blob/master/dev/tasks/tests.yml#L29-L31
> - https://github.com/apache/arrow/blob/master/dev/tasks/tests.yml#L153-L184
> There is already a {{docker-r}} definition, the only difference is to export 
> an {{R_VERSION}} environment variable.





[jira] [Assigned] (ARROW-6170) [R] "docker-compose build r" is slow

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6170:
--

Assignee: Neal Richardson

> [R] "docker-compose build r" is slow
> 
>
> Key: ARROW-6170
> URL: https://issues.apache.org/jira/browse/ARROW-6170
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, R
>Reporter: Antoine Pitrou
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Apparently it installs and compiles all packages in single-thread mode.





[jira] [Assigned] (ARROW-6170) [R] "docker-compose build r" is slow

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6170:
--

Assignee: Antoine Pitrou  (was: Neal Richardson)

> [R] "docker-compose build r" is slow
> 
>
> Key: ARROW-6170
> URL: https://issues.apache.org/jira/browse/ARROW-6170
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, R
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Apparently it installs and compiles all packages in single-thread mode.





[jira] [Updated] (ARROW-6182) [R] Add note to README about r-arrow conda installation

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6182:
---
Summary: [R] Add note to README about r-arrow conda installation   (was: 
[R] Package fails to load with error `CXXABI_1.3.11' not found )

> [R] Add note to README about r-arrow conda installation 
> 
>
> Key: ARROW-6182
> URL: https://issues.apache.org/jira/browse/ARROW-6182
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
> Environment: Ubuntu 16.04.6
>Reporter: Ian Cook
>Priority: Major
>
> I'm able to successfully install the C++ and Python libraries from 
> conda-forge, then successfully install the R package from CRAN if I use 
> {{--no-test-load}}. But after installation, the R package fails to load 
> because {{dyn.load("arrow.so")}} fails. It throws this error when loading:
> {code:java}
> unable to load shared object '~/R/arrow/libs/arrow.so':
>  /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found 
> (required by ~/.conda/envs/python3.6/lib/libarrow.so.14)
> {code}
> Do the Arrow C++ libraries actually require GCC 7.1.0 / CXXABI_1.3.11? If 
> not, what might explain this error message? Thanks.





[jira] [Resolved] (ARROW-4316) Reusing arrow.so for both Python and R

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-4316.

Resolution: Duplicate

https://issues.apache.org/jira/browse/ARROW-5956 looks to be a more 
contemporary request of the same thing, so closing in favor of that one.

> Reusing arrow.so for both Python and R
> --
>
> Key: ARROW-4316
> URL: https://issues.apache.org/jira/browse/ARROW-4316
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 0.12.0
> Environment: Ubuntu 16.04, R 3.4.4, pyarrow 0.12, cmake 3.12
>Reporter: Jeffrey Wong
>Priority: Major
>
> My team uses both pyarrow and R arrow, we'd like both libraries to link to 
> the same arrow.so file for consistency. pyarrow ships both arrow.so and 
> parquet.so, if I can reuse those .so's to  link R that would guarantee 
> consistency. 
>  Under arrow v0.11.1 I was able to link R against libarrow.so found under 
> pyarrow by passing LIB_DIR to the R [configure 
> file|https://github.com/apache/arrow/blob/master/r/configure]. However, in 
> v0.12.0 I am no longer able to do that. Here is a reproducible example on 
> Ubuntu 16.04 which produces the error:
>  
> {code:java}
> sh: line 1: 5404 Segmentation fault (core dumped) '/usr/lib/R/bin/R' 
> --no-save --slave 2>&1 < '/tmp/RtmpyOuz4g/file14716feda8fc'
> *** caught segfault ***
> address 0x7f160f026250, cause 'invalid permissions'
> An irrecoverable exception occurred. R is aborting now ...
> {code}
>  
>  Reproducible example:
> {code:java}
>  # get the parquet headers which are not shipped with pyarrow
>   
> tee /etc/apt/sources.list.d/apache-arrow.list <<APT_LINE
> deb [arch=amd64] https://dl.bintray.com/apache/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/ $(lsb_release --codename --short) main
> deb-src [] https://dl.bintray.com/apache/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/ $(lsb_release --codename --short) main
> APT_LINE
> apt-get update
> mkdir /tmp/arrow_headers; cd /tmp/arrow_headers
> apt-get download --allow-unauthenticated libparquet-dev
> ar -x libparquet-dev_0.12.0-1_amd64.deb
> tar -xJvf data.tar.xz
>   
>  #get pyarrow v0.12
>   
>  pip3 install pyarrow --upgrade
>  #figure out where pyarrow is
>  PY_ARROW_PATH=$(python3 -c "import pyarrow, os; 
> print(os.path.dirname(pyarrow.__file__))")
>  PY_ARROW_VERSION=$(python3 -c "import pyarrow; print(pyarrow.__version__)")
>  PYTHON_LIBDIR=$(python3 -c "import sysconfig; 
> print(sysconfig.get_config_var('LIBDIR'))")
>   
>  # pyarrow doesn't ship parquet headers. Copy the ones from apt into the 
> pyarrow dir
>  mkdir $PY_ARROW_PATH/include/parquet
>  cp -r /tmp/arrow_headers/usr/include/parquet/* 
> $PY_ARROW_PATH/include/parquet/
>   
>  #install R arrow
>  echo "export 
> LD_LIBRARY_PATH=\"\${LD_LIBRARY_PATH}:${PYTHON_LIBDIR}:${PY_ARROW_PATH}\"" | 
> tee -a /usr/lib/R/etc/ldpaths
>  git clone https://github.com/apache/arrow.git /tmp/arrow
>  cd /tmp/arrow/r
>  git checkout "apache-arrow-${PY_ARROW_VERSION}"
>  sed -i "/Depends: R/c\Depends: R (>= 3.4)" DESCRIPTION
>  sed -i "s/PKG_CXXFLAGS=/PKG_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 /g" 
> src/Makevars.in
>  R CMD INSTALL ./ --configure-vars="INCLUDE_DIR=$PY_ARROW_PATH/include 
> LIB_DIR=$PY_ARROW_PATH" {code}





[jira] [Commented] (ARROW-6151) [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information

2019-08-15 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908474#comment-16908474
 ] 

Neal Richardson commented on ARROW-6151:


Any further thoughts [~wesmckinn] or can we close this?

> [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate 
> information
> ---
>
> Key: ARROW-6151
> URL: https://issues.apache.org/jira/browse/ARROW-6151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
>
> I noticed this file -- I am concerned about its maintainability. 





[jira] [Updated] (ARROW-6139) [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6139:
---
Labels:   (was: pull-request-available)

> [Documentation][R] Build R docs (pkgdown) site and add to arrow-site
> 
>
> Key: ARROW-6139
> URL: https://issues.apache.org/jira/browse/ARROW-6139
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R, Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Now that the R package is up on CRAN, we should publish the documentation 
> site. We should get this up before we publish the blog post (ARROW-6041) so 
> that we can link to it in the post.





[jira] [Resolved] (ARROW-6139) [Documentation][R] Build R docs (pkgdown) site and add to arrow-site

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6139.

   Resolution: Fixed
Fix Version/s: 0.15.0

> [Documentation][R] Build R docs (pkgdown) site and add to arrow-site
> 
>
> Key: ARROW-6139
> URL: https://issues.apache.org/jira/browse/ARROW-6139
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R, Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Now that the R package is up on CRAN, we should publish the documentation 
> site. We should get this up before we publish the blog post (ARROW-6041) so 
> that we can link to it in the post.





[jira] [Created] (ARROW-6257) [C++] Add fnmatch compatible globbing function

2019-08-15 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6257:


 Summary: [C++] Add fnmatch compatible globbing function
 Key: ARROW-6257
 URL: https://issues.apache.org/jira/browse/ARROW-6257
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


This will be useful for the filesystems module and in datasource discovery, 
which uses it





[jira] [Updated] (ARROW-6257) [C++] Add fnmatch compatible globbing function

2019-08-15 Thread Benjamin Kietzman (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman updated ARROW-6257:
-
Description: 
This will be useful for the filesystems module and in datasource discovery, 
which uses it.

Behavior should be compatible with 
http://pubs.opengroup.org/onlinepubs/95399/functions/fnmatch.html

  was:This will be useful for the filesystems module and in datasource 
discovery, which uses it
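As a reference point for the expected semantics, Python's standard-library fnmatch module implements a close cousin of fnmatch(3). The sketch below is illustrative only (it is not the proposed C++ API); it shows the metacharacter behavior a POSIX-compatible implementation needs, plus one place where Python diverges:

```python
from fnmatch import fnmatchcase

# The core fnmatch(3) metacharacters: '*' matches any run of characters,
# '?' matches exactly one character, and '[...]' matches a character class.
assert fnmatchcase("data_2019.parquet", "data_*.parquet")
assert fnmatchcase("part-0.csv", "part-?.csv")
assert fnmatchcase("file1", "file[0-9]")
assert not fnmatchcase("file10", "file[0-9]")  # class matches one char only

# One divergence from POSIX FNM_PATHNAME behavior: Python's fnmatch lets
# '*' cross '/' boundaries, which a filesystem glob usually forbids.
assert fnmatchcase("a/b.txt", "*.txt")
print("all patterns behaved as expected")
```

Whether the Arrow function should honor the FNM_PATHNAME-style slash restriction is a design decision for the filesystems/discovery use case.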


> [C++] Add fnmatch compatible globbing function
> --
>
> Key: ARROW-6257
> URL: https://issues.apache.org/jira/browse/ARROW-6257
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Major
>
> This will be useful for the filesystems module and in datasource discovery, 
> which uses it.
> Behavior should be compatible with 
> http://pubs.opengroup.org/onlinepubs/95399/functions/fnmatch.html





[jira] [Created] (ARROW-6256) [Rust] parquet-format should be released by Apache process

2019-08-15 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6256:
-

 Summary: [Rust] parquet-format should be released by Apache process
 Key: ARROW-6256
 URL: https://issues.apache.org/jira/browse/ARROW-6256
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 0.14.1
Reporter: Andy Grove
 Fix For: 0.15.0


The Arrow parquet crate depends on the parquet-format crate. Parquet-format 
2.5.0 was recently released and has breaking changes compared to 2.4.0.

This means that previously published Arrow Parquet/DataFusion crates are now 
unusable out of the box (see https://issues.apache.org/jira/browse/ARROW-6255).

We should bring parquet-format into an Apache release process to avoid this 
type of issue in the future.





[jira] [Created] (ARROW-6255) [Rust] [Parquet] Cannot use any published parquet crate due to parquet-format breaking change

2019-08-15 Thread Andy Grove (JIRA)
Andy Grove created ARROW-6255:
-

 Summary: [Rust] [Parquet] Cannot use any published parquet crate 
due to parquet-format breaking change
 Key: ARROW-6255
 URL: https://issues.apache.org/jira/browse/ARROW-6255
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.14.1, 0.14.0, 0.13.0, 0.12.1, 0.12.0
Reporter: Andy Grove
 Fix For: 0.15.0


As a user who wants to use the Rust version of Arrow, I am unable to use any of 
the previously published versions due to the recent breaking change in 
parquet-format 2.5.0.

To reproduce, simply create an empty Rust project using "cargo init example 
--bin", add a dependency on "parquet-0.14.1" and attempt to build the project.
{code:java}
   Compiling parquet v0.13.0

error[E0599]: no variant or associated item named `BOOLEAN` found for type 
`parquet_format::parquet_format::Type` in the current scope

   --> 
/Users/agrove/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.13.0/src/basic.rs:408:28

    |

408 |             parquet::Type::BOOLEAN => Type::BOOLEAN,

    |                            ^^^ variant or associated item not found 
in `parquet_format::parquet_format::Type`{code}
This bug has already been fixed in master, but there is no usable published 
crate. We could consider publishing a 0.14.2 to resolve this or just wait until 
the 0.15.0 release. We could also consider using this Jira to at least document 
a workaround, if one exists (maybe Cargo provides a mechanism for overriding 
transitive dependencies?).
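One candidate workaround (hedged; it only works if the parquet crate's declared requirement, e.g. "2.4", also admits 2.4.0) is to pin the transitive dependency back to the older release from the consuming project:

```shell
# Hypothetical workaround: pin the transitive parquet-format dependency
# back to 2.4.0 within the semver range the parquet crate declares.
cargo update -p parquet-format --precise 2.4.0
```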



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6254) [Rust][Parquet] Parquet dependency fails to compile

2019-08-15 Thread Dongha Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908433#comment-16908433
 ] 

Dongha Lee commented on ARROW-6254:
---

As far as I remember, it didn't work. But I can double check it later.

And in `rust/parquet` I couldn't find any line `extern crate arrow`. I am a 
rust newbie, but I guess it's always using the local dependencies.

> [Rust][Parquet] Parquet dependency fails to compile
> ---
>
> Key: ARROW-6254
> URL: https://issues.apache.org/jira/browse/ARROW-6254
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.14.1
>Reporter: Dongha Lee
>Priority: Major
>
> Hi,
> I set up a blank rust project, added dependency `parquet = "0.14.1"` and ran 
> `cargo build`. But unfortunately, it failed with a large error message.
> I used rust nightly: `cargo 1.38.0-nightly` and `rustc 1.38.0-nightly`. It 
> failed both on arch and ubuntu.
> I tried to build directly in 
> `.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.14.1` but it 
> failed.
> I cloned arrow repository and tried to build in the directory `rust/parquet` 
> and it succeeded. But as soon I moved the rust/parquet to some other 
> location, the build failed. So my guess is that the failure has to do 
> something with dependent modules `rust/arrow`.
> Is this a known issue? I couldn't find any ticket for that.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6254) [Rust][Parquet] Parquet dependency fails to compile

2019-08-15 Thread Paddy Horan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908424#comment-16908424
 ] 

Paddy Horan commented on ARROW-6254:


This is not a known error, we use relative paths within the workspace for 
development like 
[here]([https://github.com/apache/arrow/blob/master/rust/parquet/Cargo.toml#L43]).
  I guess when publishing to crates.io we need to publish arrow, parquet then 
datafusion and update the Cargo.toml for parquet and datafusion before we 
publish.

If you change parquet's Cargo.toml to: arrow = "0.14.1" does it compile when 
moved as you described above?

> [Rust][Parquet] Parquet dependency fails to compile
> ---
>
> Key: ARROW-6254
> URL: https://issues.apache.org/jira/browse/ARROW-6254
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.14.1
>Reporter: Dongha Lee
>Priority: Major
>
> Hi,
> I set up a blank rust project, added dependency `parquet = "0.14.1"` and ran 
> `cargo build`. But unfortunately, it failed with a large error message.
> I used rust nightly: `cargo 1.38.0-nightly` and `rustc 1.38.0-nightly`. It 
> failed both on arch and ubuntu.
> I tried to build directly in 
> `.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.14.1` but it 
> failed.
> I cloned arrow repository and tried to build in the directory `rust/parquet` 
> and it succeeded. But as soon I moved the rust/parquet to some other 
> location, the build failed. So my guess is that the failure has to do 
> something with dependent modules `rust/arrow`.
> Is this a known issue? I couldn't find any ticket for that.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6254) [Rust][Parquet] Parquet dependency fails to compile

2019-08-15 Thread Dongha Lee (JIRA)
Dongha Lee created ARROW-6254:
-

 Summary: [Rust][Parquet] Parquet dependency fails to compile
 Key: ARROW-6254
 URL: https://issues.apache.org/jira/browse/ARROW-6254
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.14.1
Reporter: Dongha Lee


Hi,

I set up a blank rust project, added dependency `parquet = "0.14.1"` and ran 
`cargo build`. But unfortunately, it failed with a large error message.

I used rust nightly: `cargo 1.38.0-nightly` and `rustc 1.38.0-nightly`. It 
failed both on arch and ubuntu.

I tried to build directly in 
`.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.14.1` but it failed.

I cloned arrow repository and tried to build in the directory `rust/parquet` 
and it succeeded. But as soon I moved the rust/parquet to some other location, 
the build failed. So my guess is that the failure has to do something with 
dependent modules `rust/arrow`.

Is this a known issue? I couldn't find any ticket for that.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6253) [Python] Expose "enable_buffered_stream" option from parquet::ReaderProperties in pyarrow.parquet.read_table

2019-08-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6253:
---

 Summary: [Python] Expose "enable_buffered_stream" option from 
parquet::ReaderProperties in pyarrow.parquet.read_table
 Key: ARROW-6253
 URL: https://issues.apache.org/jira/browse/ARROW-6253
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0


See also PARQUET-1370



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6180) [C++] Create InputStream that is an isolated reader of a segment of a RandomAccessFile

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6180.
-
Resolution: Fixed

Issue resolved by pull request 5085
[https://github.com/apache/arrow/pull/5085]

> [C++] Create InputStream that is an isolated reader of a segment of a 
> RandomAccessFile
> --
>
> Key: ARROW-6180
> URL: https://issues.apache.org/jira/browse/ARROW-6180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> If different threads wants to do buffered reads over different portions of a 
> file (and they are unable to create their own separate file handles), they 
> may clobber each other. I would propose creating an object that keeps the 
> RandomAccessFile internally and implements the InputStream API in a way that 
> is safe from other threads changing the file position
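The proposed object can be sketched in Python (names and shapes are illustrative, not the eventual C++ API): each stream keeps its own position over a shared handle, and the seek+read pair is serialized so concurrent readers cannot clobber each other.

```python
import io
import threading

class SegmentStream:
    """Sketch: an isolated InputStream over a segment of a shared
    random-access file. Each instance tracks its own position; the
    shared handle's seek+read happens under a lock so other streams
    cannot change the file position mid-read."""

    def __init__(self, shared_file, offset, length, lock):
        self._f = shared_file
        self._lock = lock
        self._pos = offset
        self._end = offset + length

    def read(self, nbytes=-1):
        avail = self._end - self._pos
        n = avail if nbytes < 0 else min(nbytes, avail)
        with self._lock:              # seek+read is atomic w.r.t. other streams
            self._f.seek(self._pos)
            data = self._f.read(n)
        self._pos += len(data)
        return data

lock = threading.Lock()
f = io.BytesIO(b"0123456789")
a = SegmentStream(f, 0, 5, lock)
b = SegmentStream(f, 5, 5, lock)
# Interleaved reads do not disturb each other's position.
assert a.read(2) == b"01"
assert b.read(2) == b"56"
assert a.read() == b"234"
assert b.read() == b"789"
```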



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6230.
-
Resolution: Fixed

> [R] Reading in Parquet files are 20x slower than reading fst files in R
> ---
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.14.0
> Environment: Windows 10 Pro and Ubuntu 
>Reporter: Zhuo Jia Dai
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format 
> in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
>  library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-6230:
-

> [R] Reading in Parquet files are 20x slower than reading fst files in R
> ---
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.14.0
> Environment: Windows 10 Pro and Ubuntu 
>Reporter: Zhuo Jia Dai
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format 
> in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
>  library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6230.
-
   Resolution: Cannot Reproduce
 Assignee: Wes McKinney
Fix Version/s: 0.15.0

Resolving for 0.15.0. If after 0.15.0 comes out there are performance or memory 
use problems please reopen this issue or open a new issue. Thanks!

> [R] Reading in Parquet files are 20x slower than reading fst files in R
> ---
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.14.0
> Environment: Windows 10 Pro and Ubuntu 
>Reporter: Zhuo Jia Dai
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format 
> in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
>  library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6230:
---
Affects Version/s: 0.14.0

> [R] Reading in Parquet files are 20x slower than reading fst files in R
> ---
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.14.0
> Environment: Windows 10 Pro and Ubuntu 
>Reporter: Zhuo Jia Dai
>Priority: Major
>  Labels: parquet
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format 
> in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
>  library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6230:
---
Labels: parquet  (was: paragraph)

> [R] Reading in Parquet files are 20x slower than reading fst files in R
> ---
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
> Environment: Windows 10 Pro and Ubuntu 
>Reporter: Zhuo Jia Dai
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.1
>
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format 
> in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
>  library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6230:
---
Fix Version/s: (was: 0.14.1)

> [R] Reading in Parquet files are 20x slower than reading fst files in R
> ---
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
> Environment: Windows 10 Pro and Ubuntu 
>Reporter: Zhuo Jia Dai
>Priority: Major
>  Labels: parquet
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format 
> in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
>  library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6230:
---
Labels: paragraph  (was: )

> [R] Reading in Parquet files are 20x slower than reading fst files in R
> ---
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
> Environment: Windows 10 Pro and Ubuntu 
>Reporter: Zhuo Jia Dai
>Priority: Major
>  Labels: paragraph
> Fix For: 0.14.1
>
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format 
> in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
>  library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R

2019-08-15 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6230:
---
Summary: [R] Reading in Parquet files are 20x slower than reading fst files 
in R  (was: [R] Reading in parquent files are 20x slower than reading fst files 
in R)

> [R] Reading in Parquet files are 20x slower than reading fst files in R
> ---
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
> Environment: Windows 10 Pro and Ubuntu 
>Reporter: Zhuo Jia Dai
>Priority: Major
> Fix For: 0.14.1
>
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format 
> in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
>  library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6252) [Python] Add pyarrow.Array.diff_contents method

2019-08-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6252:
---

 Summary: [Python] Add pyarrow.Array.diff_contents method
 Key: ARROW-6252
 URL: https://issues.apache.org/jira/browse/ARROW-6252
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0


This would expose the Array diffing functionality in Python to make it easier 
to see why arrays are unequal



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type

2019-08-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5610:
--
Labels: pull-request-available  (was: )

> [Python] Define extension type API in Python to "receive" or "send" a foreign 
> extension type
> 
>
> Key: ARROW-5610
> URL: https://issues.apache.org/jira/browse/ARROW-5610
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. 
> There will be cases where an extension type is coming from another 
> programming language (e.g. Java), so it would be useful to be able to "plug 
> in" a Python extension type subclass that will be used to deserialize the 
> extension type coming over the wire. This has some different API requirements 
> since the serialized representation of the type will not have knowledge of 
> Python pickling, etc. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5374) [Python] Misleading error message when calling pyarrow.read_record_batch on a complete IPC stream

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5374:

Summary: [Python] Misleading error message when calling 
pyarrow.read_record_batch on a complete IPC stream  (was: [Python] 
pa.read_record_batch() doesn't work)

> [Python] Misleading error message when calling pyarrow.read_record_batch on a 
> complete IPC stream
> -
>
> Key: ARROW-5374
> URL: https://issues.apache.org/jira/browse/ARROW-5374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> {code:python}
> >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], 
> >>> names=['strs'])   
> >>> 
> >>> stream = pa.BufferOutputStream()
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write_batch(batch) 
> >>>   
> >>>
> >>> writer.close()
> >>>   
> >>>
> >>> buf = stream.getvalue()   
> >>>   
> >>>
> >>> pa.read_record_batch(buf, batch.schema)   
> >>>   
> >>>
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.read_record_batch(buf, batch.schema)
>   File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch
> check_status(ReadRecordBatch(deref(message.message.get()),
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> raise ArrowIOError(message)
> ArrowIOError: Expected IPC message of type schema got record batch
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5374) [Python] Misleading error message when calling pyarrow.read_record_batch on a complete IPC stream

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908168#comment-16908168
 ] 

Wes McKinney commented on ARROW-5374:
-

I updated the issue title so it does not mislead contributors

> [Python] Misleading error message when calling pyarrow.read_record_batch on a 
> complete IPC stream
> -
>
> Key: ARROW-5374
> URL: https://issues.apache.org/jira/browse/ARROW-5374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> {code:python}
> >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], 
> >>> names=['strs'])   
> >>> 
> >>> stream = pa.BufferOutputStream()
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write_batch(batch) 
> >>>   
> >>>
> >>> writer.close()
> >>>   
> >>>
> >>> buf = stream.getvalue()   
> >>>   
> >>>
> >>> pa.read_record_batch(buf, batch.schema)   
> >>>   
> >>>
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.read_record_batch(buf, batch.schema)
>   File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch
> check_status(ReadRecordBatch(deref(message.message.get()),
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> raise ArrowIOError(message)
> ArrowIOError: Expected IPC message of type schema got record batch
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6248) [Python] Use FileNotFoundError in HadoopFileSystem.open() in Python 3

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908165#comment-16908165
 ] 

Wes McKinney commented on ARROW-6248:
-

Seems reasonable. Would you like to submit a PR?

> [Python] Use FileNotFoundError in HadoopFileSystem.open() in Python 3 
> --
>
> Key: ARROW-6248
> URL: https://issues.apache.org/jira/browse/ARROW-6248
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Alexander Schepanovski
>Priority: Minor
>
> When file is absent pyarrow throws 
> {code:python}
> ArrowIOError('HDFS file does not exist: ...')
> {code}
> which inherits from {{IOError}} and {{pyarrow.lib.ArrowException}}, it would 
> be better if that was {{FileNotFoundError}} a subclass of {{IOError}} for 
> this particular purpose. Also, {{.errno}} property is empty (should be 2) so 
> one needs to match by error message to check for particular error.
> *P.S.* There is no  {{FileNotFoundError}} in Python 2, but there is 
> {{.errno}} property there.
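The suggested mapping can be sketched in plain Python (the wrapper, the stand-in error class, and the message check are all assumptions for illustration, not pyarrow's actual code):

```python
import errno

class ArrowIOError(IOError):
    """Stand-in for pyarrow.lib.ArrowIOError (illustration only)."""

def open_checked(path, raw_open):
    # raw_open is a callable standing in for the underlying HDFS open().
    try:
        return raw_open(path)
    except ArrowIOError as exc:
        # Assumed message check; a real fix would map the error at the source.
        if "does not exist" in str(exc):
            raise FileNotFoundError(errno.ENOENT,
                                    "HDFS file does not exist", path)
        raise

def missing(path):
    raise ArrowIOError("HDFS file does not exist: %s" % path)

caught = None
try:
    open_checked("/data/absent.parquet", missing)
except FileNotFoundError as exc:
    caught = exc
# FileNotFoundError subclasses IOError, and .errno is now 2 as requested.
assert caught is not None and caught.errno == errno.ENOENT
```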



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5374) [Python] pa.read_record_batch() doesn't work

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5374:

Fix Version/s: 0.15.0

> [Python] pa.read_record_batch() doesn't work
> 
>
> Key: ARROW-5374
> URL: https://issues.apache.org/jira/browse/ARROW-5374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> {code:python}
> >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], 
> >>> names=['strs'])   
> >>> 
> >>> stream = pa.BufferOutputStream()
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write_batch(batch) 
> >>>   
> >>>
> >>> writer.close()
> >>>   
> >>>
> >>> buf = stream.getvalue()   
> >>>   
> >>>
> >>> pa.read_record_batch(buf, batch.schema)   
> >>>   
> >>>
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.read_record_batch(buf, batch.schema)
>   File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch
> check_status(ReadRecordBatch(deref(message.message.get()),
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> raise ArrowIOError(message)
> ArrowIOError: Expected IPC message of type schema got record batch
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908138#comment-16908138
 ] 

Wes McKinney commented on ARROW-6058:
-

So far we don't have a minimal reproduction of the issue so it's very hard for 
other developers in this project to help. Since you are encountering the 
problem, you are the best positioned to reproduce the issue or determine the 
root cause. 

> [Python][Parquet] Failure when reading Parquet file from S3 
> 
>
> Key: ARROW-6058
> URL: https://issues.apache.org/jira/browse/ARROW-6058
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Siddharth
>Priority: Major
>  Labels: parquet
>
> I am reading parquet data from S3 and get  ArrowIOError error.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 
> 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, 
> in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, 
> in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in 
> pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) 
> than expected (263929)
> {code}
>  
> *Note: Same code works on relatively smaller dataset (approx < 50M records)* 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type

2019-08-15 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908131#comment-16908131
 ] 

Joris Van den Bossche commented on ARROW-5610:
--

OK, I am making some progress on this (I initially was disregarding the 
parametrized type case, so we indeed need C++ <-> Python interaction). I have 
basic roundtripping with a parametrized type working. Eg in Python an 
implementor can do:
{code:python}
class PeriodType(pa.GenericExtensionType):

    def __init__(self, freq):
        # attributes need to be set first before calling super init
        # (as that calls serialize)
        self.freq = freq
        pa.lib.GenericExtensionType.__init__(self, pa.int64(), 'pandas.period')

    def __arrow_ext_serialize__(self):
        return "freq={}".format(self.freq).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        serialized = serialized.decode()
        assert serialized.startswith("freq=")
        freq = serialized.split('=')[1]
        return PeriodType(freq)

period_type = PeriodType('D')
pa.lib.register(period_type)
{code}
and that can roundtrip IPC with the "pandas.period" extension name (so not a 
generic "arrow.py_extension").

I based the above interface (the {{__arrow_ext_serialize__}} and 
{{__arrow_ext_deserialize__}} methods to implement) on the existing 
{{PyExtensionType}} that Antoine implemented.
{quote}> I assume the generic ExtensionType would have a Python "vtable" for 
Python subclasses to implement the C++ methods
{quote}
So currently I based myself on the existing {{PyExtensionType}} and copied the 
approach there to store a weakref to an instance and the class of the Python 
subclass the user defines. 
 That seems to work, but I am not familiar enough with this to judge if the 
vtable approach (as used in PyFlightServer) would be better.
{quote}> The registration method would need to support parameterized types as 
well (i.e. registering multiple instances of the same type with different 
parameters).
{quote}
Is that needed? My current idea is that you would register a certain type once 
(with _some_ parametrization, so you don't have to register each possible 
parametrization). Because we register in C++ based on the name, so otherwise 
the name would need to include the parameter. Actually, writing this down now, 
that could also be an option (currently I use the serialized metadata for 
storing the parametrization).

Other questions I still need to answer:
 - What to do with registration and unregistration? It would be nice if a user 
didn't need to register a type manually (in python that could be done with a 
metaclass to register the subclass on definition, but not sure that is possible 
in cython)
 Also for unregistering, since that is needed to avoid segfaults on shutdown, 
we probably need to keep a python side registry of the C++-registered types to 
ensure we unregister them on shutdown.
 - Do we want to keep the current {{PyExtensionType}} based on pickle? I think 
the main advantage compared to the new implementation is that when reading an 
IPC message, the type does not need to be registered to be recognized (for the 
unpickling, it is enough that the module is importable, but does not need to be 
imported manually by the user). But on the other hand it gives two largely 
overlapping alternatives.

I will try to clean up and push to a draft PR, which will be easier to get an 
idea. 

 

> [Python] Define extension type API in Python to "receive" or "send" a foreign 
> extension type
> 
>
> Key: ARROW-5610
> URL: https://issues.apache.org/jira/browse/ARROW-5610
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. 
> There will be cases where an extension type is coming from another 
> programming language (e.g. Java), so it would be useful to be able to "plug 
> in" a Python extension type subclass that will be used to deserialize the 
> extension type coming over the wire. This has some different API requirements 
> since the serialized representation of the type will not have knowledge of 
> Python pickling, etc. 





[jira] [Commented] (ARROW-6250) [Java] Implement ApproxEqualsVisitor comparing approx for floating point

2019-08-15 Thread Ji Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908116#comment-16908116
 ] 

Ji Liu commented on ARROW-6250:
---

cc [~pravindra]

> [Java] Implement ApproxEqualsVisitor comparing approx for floating point
> 
>
> Key: ARROW-6250
> URL: https://issues.apache.org/jira/browse/ARROW-6250
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>
> We have already implemented {{RangeEqualsVisitor}}/{{VectorEqualsVisitor}} 
> for comparing ranges/vectors, and ARROW-6211 tracks making {{ValueVector}} 
> work with a generic visitor.
> We should also implement {{ApproxEqualsVisitor}} to compare floating-point 
> values approximately, as the C++ implementation does:
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc]





[jira] [Resolved] (ARROW-6246) [Website] Add link to R documentation site

2019-08-15 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6246.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

https://github.com/apache/arrow-site/commit/41d02ac5e96fafd3dc7663d5214cdc7cd0dedb26

> [Website] Add link to R documentation site
> --
>
> Key: ARROW-6246
> URL: https://issues.apache.org/jira/browse/ARROW-6246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ARROW-6139 added the R documentation at /docs/r/, but we still need to link 
> to it from the website header.





[jira] [Created] (ARROW-6251) [Developer] Add PR merge tool to apache/arrow-site

2019-08-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6251:
---

 Summary: [Developer] Add PR merge tool to apache/arrow-site
 Key: ARROW-6251
 URL: https://issues.apache.org/jira/browse/ARROW-6251
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 0.15.0


This will help with creating clean patches and also keeping JIRA clean





[jira] [Created] (ARROW-6250) [Java] Implement ApproxEqualsVisitor comparing approx for floating point

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6250:
-

 Summary: [Java] Implement ApproxEqualsVisitor comparing approx for 
floating point
 Key: ARROW-6250
 URL: https://issues.apache.org/jira/browse/ARROW-6250
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


We have already implemented {{RangeEqualsVisitor}}/{{VectorEqualsVisitor}} for 
comparing ranges/vectors.

ARROW-6211 tracks making {{ValueVector}} work with a generic visitor.

We should also implement {{ApproxEqualsVisitor}} to compare floating-point 
values approximately, as the C++ implementation does:

[https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc]
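As an illustration of the per-element comparison rule such a visitor would apply (shown in Python rather than Java; the absolute-epsilon default and null handling here are illustrative assumptions, not the eventual Java API):

```python
# Sketch: approximate equality for two sequences of floating-point values,
# with None standing in for null slots. Mirrors the idea behind the C++
# ApproxEquals in arrow/compare.cc; epsilon default is an assumption.
def approx_equals(left, right, epsilon=1e-6):
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        # two nulls compare equal; a null and a value do not
        if a is None or b is None:
            if a is not b:
                return False
            continue
        if abs(a - b) > epsilon:
            return False
    return True

print(approx_equals([1.0, 2.0], [1.0, 2.0000001]))  # True
print(approx_equals([1.0], [1.1]))  # False
```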





[jira] [Updated] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper

2019-08-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6249:
--
Labels: pull-request-available  (was: )

> [Java] Remove useless class ByteArrayWrapper
> 
>
> Key: ARROW-6249
> URL: https://issues.apache.org/jira/browse/ARROW-6249
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>
> This class was introduced in the encoding code to compare byte[] values for 
> equality.
> Since value/vector equality is now checked via the visitor API added in 
> ARROW-6022 instead of comparing {{getObject}} results, this class is no 
> longer needed.





[jira] [Created] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper

2019-08-15 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6249:
-

 Summary: [Java] Remove useless class ByteArrayWrapper
 Key: ARROW-6249
 URL: https://issues.apache.org/jira/browse/ARROW-6249
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


This class was introduced in the encoding code to compare byte[] values for 
equality.

Since value/vector equality is now checked via the visitor API added in 
ARROW-6022 instead of comparing {{getObject}} results, this class is no longer 
needed.





[jira] [Resolved] (ARROW-6240) [Ruby] Arrow::Decimal128Array returns BigDecimal

2019-08-15 Thread Yosuke Shiro (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro resolved ARROW-6240.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5089
[https://github.com/apache/arrow/pull/5089]

> [Ruby] Arrow::Decimal128Array returns BigDecimal
> 
>
> Key: ARROW-6240
> URL: https://issues.apache.org/jira/browse/ARROW-6240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-4176) [C++/Python] Human readable arrow schema comparison

2019-08-15 Thread lidavidm (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lidavidm updated ARROW-4176:

Labels: beginner  (was: )

> [C++/Python] Human readable arrow schema comparison
> ---
>
> Key: ARROW-4176
> URL: https://issues.apache.org/jira/browse/ARROW-4176
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Florian Jetter
>Priority: Minor
>  Labels: beginner
>
> When working with arrow schemas it would be helpful to have a human readable 
> representation of the diff between two schemas.
> This could be either exposed as a function returning a string/diff object or 
> via a function raising an Exception with this information.
> For instance:
> {code}
> schema_diff = get_schema_diff(schema1, schema2)
> expected_diff = """
> - col_changed: int8
> + col_changed: double
> + col_additional: int8
> """
> assert schema_diff == expected_diff
> {code}
>  





[jira] [Updated] (ARROW-2619) [Rust] Move JSON serde code to separate file/module

2019-08-15 Thread lidavidm (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lidavidm updated ARROW-2619:

Labels: beginner  (was: )

> [Rust] Move JSON serde code to separate file/module
> ---
>
> Key: ARROW-2619
> URL: https://issues.apache.org/jira/browse/ARROW-2619
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Priority: Minor
>  Labels: beginner
>






[jira] [Updated] (ARROW-3776) [Rust] Mark methods that do not perform bounds checking as unsafe

2019-08-15 Thread lidavidm (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lidavidm updated ARROW-3776:

Labels: beginner  (was: )

> [Rust] Mark methods that do not perform bounds checking as unsafe
> -
>
> Key: ARROW-3776
> URL: https://issues.apache.org/jira/browse/ARROW-3776
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Priority: Minor
>  Labels: beginner
>






[jira] [Updated] (ARROW-5248) [Python] support dateutil timezones

2019-08-15 Thread lidavidm (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lidavidm updated ARROW-5248:

Labels: beginner  (was: )

> [Python] support dateutil timezones
> ---
>
> Key: ARROW-5248
> URL: https://issues.apache.org/jira/browse/ARROW-5248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: beginner
>
> The {{dateutil}} packages also provides a set of timezone objects 
> (https://dateutil.readthedocs.io/en/stable/tz.html) in addition to {{pytz}}. 
> In pyarrow, we only support pytz timezones (and the stdlib datetime.timezone 
> fixed offset):
> {code}
> In [2]: import dateutil.tz
>   
>   
> In [3]: import pyarrow as pa  
>   
>   
> In [5]: pa.timestamp('us', dateutil.tz.gettz('Europe/Brussels'))  
>   
>   
> ...
> ~/miniconda3/envs/dev37/lib/python3.7/site-packages/pyarrow/types.pxi in 
> pyarrow.lib.tzinfo_to_string()
> ValueError: Unable to convert timezone 
> `tzfile('/usr/share/zoneinfo/Europe/Brussels')` to string
> {code}
> But pandas also supports dateutil timezones. As a consequence, when having a 
> pandas DataFrame that uses a dateutil timezone, you get an error when 
> converting to an arrow table.





[jira] [Updated] (ARROW-3552) [Python] Implement pa.RecordBatch.serialize_to to write single message to an OutputStream

2019-08-15 Thread lidavidm (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lidavidm updated ARROW-3552:

Labels: beginner  (was: )

> [Python] Implement pa.RecordBatch.serialize_to to write single message to an 
> OutputStream
> -
>
> Key: ARROW-3552
> URL: https://issues.apache.org/jira/browse/ARROW-3552
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: beginner
>
> {{RecordBatch.serialize}} writes into memory. Writing a single message 
> directly to an OutputStream would help with shared-memory workflows. See 
> also pyarrow.ipc.write_tensor





[jira] [Updated] (ARROW-5374) [Python] pa.read_record_batch() doesn't work

2019-08-15 Thread lidavidm (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lidavidm updated ARROW-5374:

Labels: beginner  (was: begin)

> [Python] pa.read_record_batch() doesn't work
> 
>
> Key: ARROW-5374
> URL: https://issues.apache.org/jira/browse/ARROW-5374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: beginner
>
> {code:python}
> >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], 
> >>> names=['strs'])   
> >>> 
> >>> stream = pa.BufferOutputStream()
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write_batch(batch) 
> >>>   
> >>>
> >>> writer.close()
> >>>   
> >>>
> >>> buf = stream.getvalue()   
> >>>   
> >>>
> >>> pa.read_record_batch(buf, batch.schema)   
> >>>   
> >>>
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.read_record_batch(buf, batch.schema)
>   File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch
> check_status(ReadRecordBatch(deref(message.message.get()),
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> raise ArrowIOError(message)
> ArrowIOError: Expected IPC message of type schema got record batch
> {code}




