[jira] [Commented] (ARROW-6220) [Java] Add API to avro adapter to limit number of rows returned at a time.

2019-09-06 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924720#comment-16924720
 ] 

Ji Liu commented on ARROW-6220:
---

Issue resolved by pull request 5305
[https://github.com/apache/arrow/pull/5305]

> [Java] Add API to avro adapter to limit number of rows returned at a time.
> --
>
> Key: ARROW-6220
> URL: https://issues.apache.org/jira/browse/ARROW-6220
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: avro, pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> We can either let clients iterate or, ideally, provide an iterator interface.  
> This is important for large avro data and was also discussed as something 
> readers/adapters should have.
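The batching idea described in the ticket can be sketched in plain Python (a hypothetical `iter_batches` helper, not the actual Arrow Java avro adapter API): the caller pulls a bounded number of rows per call, so peak memory stays proportional to the batch size rather than the file size.

```python
from itertools import islice

def iter_batches(records, batch_size):
    """Yield successive batches of at most batch_size records.

    Instead of materializing all rows at once, the consumer pulls a
    bounded number of rows per iteration, which is what the iterator
    API in this ticket is meant to enable for large avro data.
    """
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Example: 10 rows consumed in batches of 3.
sizes = [len(b) for b in iter_batches(range(10), 3)]
print(sizes)  # [3, 3, 3, 1]
```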



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6460) [Java] Add unit test for large avro data

2019-09-06 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924713#comment-16924713
 ] 

Micah Kornfield commented on ARROW-6460:


As part of this, can we also add a performance test to get a baseline number?

> [Java] Add unit test for large avro data
> 
>
> Key: ARROW-6460
> URL: https://issues.apache.org/jira/browse/ARROW-6460
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>
> To avoid OOM, we implemented an iterator API in ARROW-6220.
> This issue is about adding tests with a large fake data set (say 6MM rows, as 
> in the JDBC adapter test) and ensuring no OOMs occur.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6356:
---
Component/s: Java

> [Java] Avro adapter implement Enum type and nested Record type
> --
>
> Key: ARROW-6356
> URL: https://issues.apache.org/jira/browse/ARROW-6356
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Implement conversion of the avro {{Enum}} type.
> Convert the nested avro {{Record}} type to an Arrow {{StructVector}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-6220) [Java] Add API to avro adapter to limit number of rows returned at a time.

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield closed ARROW-6220.
--
Resolution: Fixed

> [Java] Add API to avro adapter to limit number of rows returned at a time.
> --
>
> Key: ARROW-6220
> URL: https://issues.apache.org/jira/browse/ARROW-6220
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: avro, pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> We can either let clients iterate or, ideally, provide an iterator interface.  
> This is important for large avro data and was also discussed as something 
> readers/adapters should have.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6356.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5305
[https://github.com/apache/arrow/pull/5305]

> [Java] Avro adapter implement Enum type and nested Record type
> --
>
> Key: ARROW-6356
> URL: https://issues.apache.org/jira/browse/ARROW-6356
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Implement conversion of the avro {{Enum}} type.
> Convert the nested avro {{Record}} type to an Arrow {{StructVector}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6315) [Java] Make change to ensure flatbuffer reads are aligned

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6315.

Resolution: Fixed

Issue resolved by pull request 5229
[https://github.com/apache/arrow/pull/5229]

> [Java] Make change to ensure flatbuffer reads are aligned 
> --
>
> Key: ARROW-6315
> URL: https://issues.apache.org/jira/browse/ARROW-6315
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> See parent bug for details on requirements.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6480) [Developer] Add command to generate and send e-mail report for a Crossbow run

2019-09-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6480:
---

 Summary: [Developer] Add command to generate and send e-mail 
report for a Crossbow run
 Key: ARROW-6480
 URL: https://issues.apache.org/jira/browse/ARROW-6480
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 0.15.0


We also need a simple wrapper to poll a Crossbow job periodically for 
completion and then send the report once all the tasks have finished. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6360) [R] Update support for compression

2019-09-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6360:
--

Assignee: Neal Richardson

> [R] Update support for compression
> --
>
> Key: ARROW-6360
> URL: https://issues.apache.org/jira/browse/ARROW-6360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.15.0
>
>
> At least two issues:
>  * [https://github.com/apache/arrow/blob/master/r/R/compression.R#L46] says 
> that compression is not supported on Windows, but ARROW-5683 added Snappy 
> support for Windows.
>  * ARROW-6216 added more compression options, including for Parquet
> Update/implement those, and we should also add some convenience arguments to 
> {{write_parquet()}} for selecting compression.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3943) [R] Write vignette for R package

2019-09-06 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924648#comment-16924648
 ] 

Neal Richardson commented on ARROW-3943:


This is being done in ARROW-5505.

> [R] Write vignette for R package
> 
>
> Key: ARROW-3943
> URL: https://issues.apache.org/jira/browse/ARROW-3943
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain François
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> a vignette similar to https://arrow.apache.org/docs/python/index.html



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-5176) [Python] Automate formatting of python files

2019-09-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-5176:
--

Assignee: (was: Neal Richardson)

> [Python] Automate formatting of python files
> 
>
> Key: ARROW-5176
> URL: https://issues.apache.org/jira/browse/ARROW-5176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Benjamin Kietzman
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> [Black](https://github.com/ambv/black) is a tool for automatically formatting 
> python code in ways which flake8 and our other linters approve of. Adding it 
> to the project will allow more reliably formatted python code and fill a 
> similar role to {{clang-format}} for C++ and {{cmake-format}} for CMake.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6449) [R] io "tell()" methods are inconsistently named and untested

2019-09-06 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924647#comment-16924647
 ] 

Neal Richardson commented on ARROW-6449:


This is being done in ARROW-5505.

> [R] io "tell()" methods are inconsistently named and untested
> -
>
> Key: ARROW-6449
> URL: https://issues.apache.org/jira/browse/ARROW-6449
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
> Fix For: 0.15.0
>
>
> Looking in r/R/io.R, there is inconsistency as to whether the tell method of 
> the various streams should be spelled "tell" or "Tell", and most are 
> untested. Also, [~pitrou] asks, "is it necessary to redefine it [for all 
> subclasses]? It should be inherited."



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6449) [R] io "tell()" methods are inconsistently named and untested

2019-09-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6449:
--

Assignee: Neal Richardson  (was: Romain François)

> [R] io "tell()" methods are inconsistently named and untested
> -
>
> Key: ARROW-6449
> URL: https://issues.apache.org/jira/browse/ARROW-6449
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>
> Looking in r/R/io.R, there is inconsistency as to whether the tell method of 
> the various streams should be spelled "tell" or "Tell", and most are 
> untested. Also, [~pitrou] asks, "is it necessary to redefine it [for all 
> subclasses]? It should be inherited."



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6449) [R] io "tell()" methods are inconsistently named and untested

2019-09-06 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6449:
---
Fix Version/s: 0.15.0

> [R] io "tell()" methods are inconsistently named and untested
> -
>
> Key: ARROW-6449
> URL: https://issues.apache.org/jira/browse/ARROW-6449
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
> Fix For: 0.15.0
>
>
> Looking in r/R/io.R, there is inconsistency as to whether the tell method of 
> the various streams should be spelled "tell" or "Tell", and most are 
> untested. Also, [~pitrou] asks, "is it necessary to redefine it [for all 
> subclasses]? It should be inherited."



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-1741) [C++] Comparison function for DictionaryArray to determine if indices are "compatible"

2019-09-06 Thread Benjamin Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman reassigned ARROW-1741:


Assignee: Benjamin Kietzman

> [C++] Comparison function for DictionaryArray to determine if indices are 
> "compatible"
> --
>
> Key: ARROW-1741
> URL: https://issues.apache.org/jira/browse/ARROW-1741
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
> Fix For: 0.15.0
>
>
> For example, if one array's dictionary is larger than the other, but the 
> overlapping beginning portion is the same, then the respective dictionary 
> indices correspond to the same values. Therefore, in analytics, one may 
> choose to drop the smaller dictionary in favor of the larger dictionary, and 
> this need not incur any computational overhead (beyond comparing the 
> dictionary prefixes -- there may be some way to engineer "dictionary lineage" 
> to make this comparison even cheaper)
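The prefix check described above can be sketched in Python (a hypothetical helper, not the actual Arrow C++ API): if the smaller dictionary is a prefix of the larger one, every index into the smaller dictionary decodes to the same value under the larger one, so the smaller dictionary can be dropped without remapping indices.

```python
def dictionaries_compatible(small, large):
    """Return True if `small` is a prefix of `large`.

    When it is, indices encoded against `small` are also valid
    indices into `large` and decode to the same values, so the
    larger dictionary can be used for both arrays with no
    computational overhead beyond this prefix comparison.
    """
    return len(small) <= len(large) and large[:len(small)] == small

print(dictionaries_compatible(["a", "b"], ["a", "b", "c"]))  # True
print(dictionaries_compatible(["a", "x"], ["a", "b", "c"]))  # False
```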



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5104) [Python/C++] Schema for empty tables include index column as integer

2019-09-06 Thread Benjamin Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924627#comment-16924627
 ] 

Benjamin Kietzman commented on ARROW-5104:
--

[~jorisvandenbossche] Is there work to be done here, or should this issue be 
closed?

> [Python/C++] Schema for empty tables include index column as integer
> 
>
> Key: ARROW-5104
> URL: https://issues.apache.org/jira/browse/ARROW-5104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.15.0
>
>
> The schema for an empty table/dataframe still includes the index as an 
> integer column instead of being serialized solely as a metadata reference 
> (see ARROW-1639)
> In the example below, the empty dataframe still holds `__index_level_0__` as 
> an integer column. Proper behavior would be to exclude it and reference the 
> index information in the pandas metadata, as is the case for a non-empty 
> dataframe.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: non_empty =  pd.DataFrame({"col": [1]})
> In [4]: empty = non_empty.drop(0)
> In [5]: empty
> Out[5]:
> Empty DataFrame
> Columns: [col]
> Index: []
> In [6]: pa.Table.from_pandas(non_empty)
> Out[6]:
> pyarrow.Table
> col: int64
> metadata
> 
> OrderedDict([(b'pandas',
>   b'{"index_columns": [{"kind": "range", "name": null, "start": '
>   b'0, "stop": 1, "step": 1}], "column_indexes": [{"name": null,'
>   b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
>   b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
>   b'{"name": "col", "field_name": "col", "pandas_type": "int64",'
>   b' "numpy_type": "int64", "metadata": null}], "creator": {"lib'
>   b'rary": "pyarrow", "version": "0.13.0"}, "pandas_version": nu'
>   b'll}')])
> In [7]: pa.Table.from_pandas(empty)
> Out[7]:
> pyarrow.Table
> col: int64
> __index_level_0__: int64
> metadata
> 
> OrderedDict([(b'pandas',
>   b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
>   b'{"name": null, "field_name": null, "pandas_type": "unicode",'
>   b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
>   b', "columns": [{"name": "col", "field_name": "col", "pandas_t'
>   b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
>   b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
>   b': "int64", "numpy_type": "int64", "metadata": null}], "creat'
>   b'or": {"library": "pyarrow", "version": "0.13.0"}, "pandas_ve'
>   b'rsion": null}')])
> In [8]: pa.__version__
> Out[8]: '0.13.0'
> In [9]: ! python --version
> Python 3.6.7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-5337) [C++] Add RecordBatch::field method, possibly deprecate "column"

2019-09-06 Thread Benjamin Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman closed ARROW-5337.

Resolution: Not A Problem

> [C++] Add RecordBatch::field method, possibly deprecate "column"
> 
>
> Key: ARROW-5337
> URL: https://issues.apache.org/jira/browse/ARROW-5337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> As a matter of consistency, it might be better to rename 
> {{RecordBatch::column}} to {{RecordBatch::field}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray

2019-09-06 Thread Benjamin Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924624#comment-16924624
 ] 

Benjamin Kietzman commented on ARROW-3762:
--

[~wesmckinn] is this resolved by https://github.com/apache/arrow/pull/5268 ?

> [C++] Parquet arrow::Table reads error when overflowing capacity of 
> BinaryArray
> ---
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Chris Ellison
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0, 0.15.0
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> # When reading a parquet file with binary data > 2 GiB, we get an 
> ArrowIOError due to it not creating chunked arrays. Reading each row group 
> individually and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6256) [Rust] parquet-format should be released by Apache process

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924611#comment-16924611
 ] 

Wes McKinney commented on ARROW-6256:
-

Moving out of 0.15.0

> [Rust] parquet-format should be released by Apache process
> --
>
> Key: ARROW-6256
> URL: https://issues.apache.org/jira/browse/ARROW-6256
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.14.1
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> The Arrow parquet crate depends on the parquet-format crate [1]. 
> Parquet-format 2.6.0 was recently released and has breaking changes compared 
> to 2.5.0.
> This means that previously published Arrow Parquet/DataFusion crates are now 
> unusable out of the box [2].
> We should bring parquet-format into an Apache release process to avoid this 
> type of issue in the future.
>  
> [1] [https://github.com/sunchao/parquet-format-rs]
> [2] https://issues.apache.org/jira/browse/ARROW-6255



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6256) [Rust] parquet-format should be released by Apache process

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6256:

Fix Version/s: (was: 0.15.0)
   1.0.0

> [Rust] parquet-format should be released by Apache process
> --
>
> Key: ARROW-6256
> URL: https://issues.apache.org/jira/browse/ARROW-6256
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.14.1
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> The Arrow parquet crate depends on the parquet-format crate [1]. 
> Parquet-format 2.6.0 was recently released and has breaking changes compared 
> to 2.5.0.
> This means that previously published Arrow Parquet/DataFusion crates are now 
> unusable out of the box [2].
> We should bring parquet-format into an Apache release process to avoid this 
> type of issue in the future.
>  
> [1] [https://github.com/sunchao/parquet-format-rs]
> [2] https://issues.apache.org/jira/browse/ARROW-6255



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6312) [C++] Declare required Libs.private in arrow.pc package config

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924609#comment-16924609
 ] 

Wes McKinney commented on ARROW-6312:
-

I moved this out of 0.15.0. Patches welcome

> [C++] Declare required Libs.private in arrow.pc package config
> --
>
> Key: ARROW-6312
> URL: https://issues.apache.org/jira/browse/ARROW-6312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Michael Maguire
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current arrow.pc package config file produced is deficient and doesn't 
> properly declare the static library prerequisites that must be linked in 
> order to *statically* link in libarrow.a
> Currently it just has:
> ```
>  Libs: -L${libdir} -larrow
> ```
> But in cases, e.g. where you enabled snappy, brotli or zlib support in arrow, 
> our toolchains need to see an arrow.pc file something more like:
> ```
>  Libs: -L${libdir} -larrow
>  Libs.private: -lsnappy -lboost_system -lz -llz4 -lbrotlidec -lbrotlienc 
> -lbrotlicommon -lzstd
> ```
> If not, we get linkage errors.  I'm told the convention is that if the .a has 
> an UNDEF, the Requires.private plus the Libs.private should resolve all the 
> undefs. See the Libs.private info in [https://linux.die.net/man/1/pkg-config]
>  
> Note, however, as Sutou Kouhei pointed out in 
> [https://github.com/apache/arrow/pull/5123#issuecomment-522771452], the 
> additional Libs.private entries need to be generated dynamically based on 
> whether functionality like snappy, brotli or zlib is enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6312) [C++] Declare required Libs.private in arrow.pc package config

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6312:
---

Assignee: (was: Michael Maguire)

> [C++] Declare required Libs.private in arrow.pc package config
> --
>
> Key: ARROW-6312
> URL: https://issues.apache.org/jira/browse/ARROW-6312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Michael Maguire
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current arrow.pc package config file produced is deficient and doesn't 
> properly declare the static library prerequisites that must be linked in 
> order to *statically* link in libarrow.a
> Currently it just has:
> ```
>  Libs: -L${libdir} -larrow
> ```
> But in cases, e.g. where you enabled snappy, brotli or zlib support in arrow, 
> our toolchains need to see an arrow.pc file something more like:
> ```
>  Libs: -L${libdir} -larrow
>  Libs.private: -lsnappy -lboost_system -lz -llz4 -lbrotlidec -lbrotlienc 
> -lbrotlicommon -lzstd
> ```
> If not, we get linkage errors.  I'm told the convention is that if the .a has 
> an UNDEF, the Requires.private plus the Libs.private should resolve all the 
> undefs. See the Libs.private info in [https://linux.die.net/man/1/pkg-config]
>  
> Note, however, as Sutou Kouhei pointed out in 
> [https://github.com/apache/arrow/pull/5123#issuecomment-522771452], the 
> additional Libs.private entries need to be generated dynamically based on 
> whether functionality like snappy, brotli or zlib is enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6312) [C++] Declare required Libs.private in arrow.pc package config

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6312:

Fix Version/s: (was: 0.15.0)
   1.0.0

> [C++] Declare required Libs.private in arrow.pc package config
> --
>
> Key: ARROW-6312
> URL: https://issues.apache.org/jira/browse/ARROW-6312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Michael Maguire
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current arrow.pc package config file produced is deficient and doesn't 
> properly declare the static library prerequisites that must be linked in 
> order to *statically* link in libarrow.a
> Currently it just has:
> ```
>  Libs: -L${libdir} -larrow
> ```
> But in cases, e.g. where you enabled snappy, brotli or zlib support in arrow, 
> our toolchains need to see an arrow.pc file something more like:
> ```
>  Libs: -L${libdir} -larrow
>  Libs.private: -lsnappy -lboost_system -lz -llz4 -lbrotlidec -lbrotlienc 
> -lbrotlicommon -lzstd
> ```
> If not, we get linkage errors.  I'm told the convention is that if the .a has 
> an UNDEF, the Requires.private plus the Libs.private should resolve all the 
> undefs. See the Libs.private info in [https://linux.die.net/man/1/pkg-config]
>  
> Note, however, as Sutou Kouhei pointed out in 
> [https://github.com/apache/arrow/pull/5123#issuecomment-522771452], the 
> additional Libs.private entries need to be generated dynamically based on 
> whether functionality like snappy, brotli or zlib is enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6171) [R] "docker-compose run r" fails

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6171.
-
Resolution: Fixed

Issue resolved by pull request 5295
[https://github.com/apache/arrow/pull/5295]

> [R] "docker-compose run r" fails
> 
>
> Key: ARROW-6171
> URL: https://issues.apache.org/jira/browse/ARROW-6171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, R
>Reporter: Antoine Pitrou
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I get the following failure:
> {code}
> ** testing if installed package can be loaded from temporary location
> Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath 
> = DLLpath, ...):
>  unable to load shared object 
> '/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
>   /opt/conda/lib/libarrow.so.100: undefined symbol: 
> LZ4F_resetDecompressionContext
> Error: loading failed
> Execution halted
> ERROR: loading failed
> * removing '/usr/local/lib/R/site-library/arrow'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-5743) [C++] Add CMake option to enable "large memory" unit tests

2019-09-06 Thread Benjamin Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman reassigned ARROW-5743:


Assignee: Benjamin Kietzman

> [C++] Add CMake option to enable "large memory" unit tests
> --
>
> Key: ARROW-5743
> URL: https://issues.apache.org/jira/browse/ARROW-5743
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
> Fix For: 0.15.0
>
>
> We have a number of unit tests that need to exercise code paths where memory 
> in excess of 2-4GB is allocated. Some of these are marked as {{DISABLED_*}} 
> in googletest which seems to be a recipe for bitrot.
> I propose instead to have a CMake option that sets a compiler definition to 
> enable these tests at build time, so that they can be run regularly on 
> machines that have adequate RAM (i.e. not public CI services)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6360) [R] Update support for compression

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924604#comment-16924604
 ] 

Wes McKinney commented on ARROW-6360:
-

I don't think an error message is needed in R. If a codec is not built with the 
project, the error will be raised from C++.

> [R] Update support for compression
> --
>
> Key: ARROW-6360
> URL: https://issues.apache.org/jira/browse/ARROW-6360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 0.15.0
>
>
> At least two issues:
>  * [https://github.com/apache/arrow/blob/master/r/R/compression.R#L46] says 
> that compression is not supported on Windows, but ARROW-5683 added Snappy 
> support for Windows.
>  * ARROW-6216 added more compression options, including for Parquet
> Update/implement those, and we should also add some convenience arguments to 
> {{write_parquet()}} for selecting compression.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6434) [CI][Crossbow] Nightly HDFS integration job fails

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6434.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/e29e26737ff4fcdf47fa75a14cc26e2ecc559e76

> [CI][Crossbow] Nightly HDFS integration job fails
> -
>
> Key: ARROW-6434
> URL: https://issues.apache.org/jira/browse/ARROW-6434
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2322. Either fix, skip job and 
> create followup Jira to unskip, or delete job.
> See also ARROW-2248.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5550) [C++] Refactor Buffers method on concatenate to consolidate code.

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5550:

Fix Version/s: (was: 0.15.0)
   1.0.0

> [C++] Refactor Buffers method on concatenate to consolidate code.
> -
>
> Key: ARROW-5550
> URL: https://issues.apache.org/jira/browse/ARROW-5550
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Minor
> Fix For: 1.0.0
>
>
> See https://github.com/apache/arrow/pull/4498/files for reference.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5508) [C++] Create reusable Iterator interface

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924598#comment-16924598
 ] 

Wes McKinney commented on ARROW-5508:
-

Moved out of 0.15.0, as there is more thinking to do on this.

> [C++] Create reusable Iterator interface 
> 
>
> Key: ARROW-5508
> URL: https://issues.apache.org/jira/browse/ARROW-5508
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We have various iterator-like classes. I envision a reusable interface like
> {code}
> template <typename T>
> class Iterator {
>  public:
>   virtual ~Iterator() = default;
>   virtual Status Next(T* out) = 0;
> };
> {code}
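For illustration, a minimal Python analogue of the proposed Status-returning interface. All names here (`Status`, `RangeIterator`) are hypothetical stand-ins, not Arrow classes; the point is the `Status Next(T* out)` calling convention, where the callee writes into an out parameter and signals errors or exhaustion via the returned status:

```python
class Status:
    """Minimal stand-in for arrow::Status: OK, or carries an error message."""
    def __init__(self, msg=None):
        self.msg = msg

    def ok(self):
        return self.msg is None


class RangeIterator:
    """Yields successive integers 0..n-1, mirroring `Status Next(T* out)`."""
    def __init__(self, n):
        self._i = 0
        self._n = n

    def next(self, out):
        if self._i >= self._n:
            return Status("iteration exhausted")
        out.append(self._i)  # emulate writing through the T* out parameter
        self._i += 1
        return Status()


# Drive the iterator the way C++ callers would: loop until Next() fails.
it = RangeIterator(3)
values = []
while True:
    out = []
    if not it.next(out).ok():
        break
    values.extend(out)
assert values == [0, 1, 2]
```

In this sketch exhaustion is reported as a non-OK status; a real design might instead return OK with a sentinel value, which is one of the open questions for such an interface.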



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5508) [C++] Create reusable Iterator interface

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5508:

Fix Version/s: (was: 0.15.0)
   1.0.0

> [C++] Create reusable Iterator interface 
> 
>
> Key: ARROW-5508
> URL: https://issues.apache.org/jira/browse/ARROW-5508
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We have various iterator-like classes. I envision a reusable interface like
> {code}
> template <typename T>
> class Iterator {
>  public:
>   virtual ~Iterator() = default;
>   virtual Status Next(T* out) = 0;
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5450:
---

Assignee: Wes McKinney

> [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too 
> large to convert to C long
> ---
>
> Key: ARROW-5450
> URL: https://issues.apache.org/jira/browse/ARROW-5450
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Tim Swast
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> When I attempt to roundtrip from a list of moderately large (beyond what can 
> be represented in nanosecond precision, but within microsecond precision) 
> datetime objects to pyarrow and back, I get an OverflowError: Python int too 
> large to convert to C long.
> pyarrow version:
> {noformat}
> $ pip freeze | grep pyarrow
> pyarrow==0.13.0{noformat}
>  
> Reproduction:
> {code:java}
> import datetime
> import pandas
> import pyarrow
> import pytz
> timestamp_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> None,
> datetime.datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=pytz.utc),
> datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> ]
> timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", 
> tz="UTC"))
> timestamp_roundtrip = timestamp_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 timestamp_roundtrip = timestamp_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}
> For good measure, I also tested with timezone-naive timestamps with the same 
> error:
> {code:java}
> naive_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0),
> None,
> datetime.datetime(9999, 12, 31, 23, 59, 59, 999999),
> datetime.datetime(1970, 1, 1, 0, 0, 0),
> ]
> naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None))
> naive_roundtrip = naive_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 naive_roundtrip = naive_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}
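The overflow is consistent with the description above: pandas Timestamps are nanosecond-based, so values near `datetime.max` fit in int64 at microsecond precision (pyarrow's storage here) but not at nanosecond precision (what the conversion path produces). A quick check in plain Python, independent of pyarrow:

```python
import datetime as dt

EPOCH = dt.datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1

def us_since_epoch(value):
    # Exact integer microseconds since the Unix epoch (naive datetimes).
    return (value - EPOCH) // dt.timedelta(microseconds=1)

big = dt.datetime(9999, 12, 31, 23, 59, 59, 999999)
micros = us_since_epoch(big)

assert micros <= INT64_MAX        # representable at microsecond precision
assert micros * 1000 > INT64_MAX  # overflows int64 at nanosecond precision
```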



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4880) [Python] python/asv-build.sh is probably broken after CMake refactor

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4880:
--
Labels: pull-request-available  (was: )

> [Python] python/asv-build.sh is probably broken after CMake refactor
> 
>
> Key: ARROW-4880
> URL: https://issues.apache.org/jira/browse/ARROW-4880
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> uses {{$ARROW_BUILD_TOOLCHAIN}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-4880) [Python] python/asv-build.sh is probably broken after CMake refactor

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4880:
---

Assignee: Wes McKinney

> [Python] python/asv-build.sh is probably broken after CMake refactor
> 
>
> Key: ARROW-4880
> URL: https://issues.apache.org/jira/browse/ARROW-4880
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> uses {{$ARROW_BUILD_TOOLCHAIN}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-5292) [C++] Static libraries are built on AppVeyor

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5292.
-
Resolution: Fixed

Issue resolved by pull request 5283
[https://github.com/apache/arrow/pull/5283]

> [C++] Static libraries are built on AppVeyor
> 
>
> Key: ARROW-5292
> URL: https://issues.apache.org/jira/browse/ARROW-5292
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Building both static and shared libraries on Windows needs to compile all 
> source files twice, making CI slower.
> Normally, only the shared libraries are needed for testing (except for 
> Parquet, see PARQUET-1420).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-1984) [Java] NullableDateMilliVector.getObject() should return a LocalDate, not a LocalDateTime

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1984:
--
Labels: beginner pull-request-available  (was: beginner)

> [Java] NullableDateMilliVector.getObject() should return a LocalDate, not a 
> LocalDateTime
> -
>
> Key: ARROW-1984
> URL: https://issues.apache.org/jira/browse/ARROW-1984
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Vanco Buca
>Priority: Minor
>  Labels: beginner, pull-request-available
>
> NullableDateMilliVector.getObject() today returns a LocalDateTime. However, 
> this vector is used to store date information, and thus, getObject() should 
> return a LocalDate. 
> Please note: there already exists a vector that returns LocalDateTime --
>  the NullableTimestampMilliVector.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3933:
--
Labels: parquet pull-request-available  (was: parquet)

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
> Attachments: 
> part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet, it reads the file just fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924566#comment-16924566
 ] 

Wes McKinney commented on ARROW-3933:
-

As step one, I have it returning a reasonable error message instead of 
segfaulting:

{code}
pyarrow.lib.ArrowIOError: Parquet struct decoding error. Expected to decode 
1777 definition levels from child field "bytes: binary not null" in parent "gs: 
struct not null" but was only able to 
decode 0
In ../src/parquet/arrow/reader.cc, line 561, code: 
GetDefLevels(_levels_data, _levels_length)
In ../src/parquet/arrow/reader.cc, line 659, code: 
DefLevelsToNullArray(_bitmap, _count)
In ../src/parquet/arrow/reader.cc, line 795, code: final_status
{code}

Note that it won't be possible to read this file anyway because it contains 
repeated structs (see ARROW-1644)

https://gist.github.com/wesm/fefdfc74bd5acffb92a6cbd3ec6e3c20

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: 
> part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet, it reads the file just fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3933:

Attachment: part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: 
> part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet, it reads the file just fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924554#comment-16924554
 ] 

Wes McKinney edited comment on ARROW-3933 at 9/6/19 7:41 PM:
-

This is still core dumping. I'm going to take a look and see if I can figure it 
out

{code}
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x77805801 in __GI_abort () at abort.c:79
#2  0x7fffa4a37eee in arrow::util::CerrLog::~CerrLog (this=0x7fff84003b60) 
at ../src/arrow/util/logging.cc:50
#3  0x7fffa4a37f29 in arrow::util::CerrLog::~CerrLog (this=0x7fff84003b60) 
at ../src/arrow/util/logging.cc:44
#4  0x7fffa4a37b40 in arrow::util::ArrowLog::~ArrowLog 
(this=0x7fff9e63d188) at ../src/arrow/util/logging.cc:228
#5  0x7fffa2f8e0cc in parquet::arrow::StructReader::GetDefLevels 
(this=0x7fff8400b4b0, data=0x7fff9e63d328, length=0x7fff9e63d320)
at ../src/parquet/arrow/reader.cc:607
#6  0x7fffa2f8d615 in parquet::arrow::StructReader::DefLevelsToNullArray 
(this=0x7fff8400b4b0, null_bitmap_out=0x7fff9e63d4b0, 
null_count_out=0x7fff9e63d4a8)
at ../src/parquet/arrow/reader.cc:561
#7  0x7fffa2f8e75b in parquet::arrow::StructReader::NextBatch 
(this=0x7fff8400b4b0, records_to_read=1777, out=0x5607cd60)
at ../src/parquet/arrow/reader.cc:647
#8  0x7fffa2fa403b in parquet::arrow::FileReaderImpl::ReadSchemaField 
(this=0x55eb6310, i=2, indices=..., row_groups=..., 
out_field=0x5607cd20, 
out=0x5607cd60) at ../src/parquet/arrow/reader.cc:182
{code}


was (Author: wesmckinn):
This is core dumping. I'm going to take a look and see if I can figure it out

{code}
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x77805801 in __GI_abort () at abort.c:79
#2  0x7fffa4a37eee in arrow::util::CerrLog::~CerrLog (this=0x7fff84003b60) 
at ../src/arrow/util/logging.cc:50
#3  0x7fffa4a37f29 in arrow::util::CerrLog::~CerrLog (this=0x7fff84003b60) 
at ../src/arrow/util/logging.cc:44
#4  0x7fffa4a37b40 in arrow::util::ArrowLog::~ArrowLog 
(this=0x7fff9e63d188) at ../src/arrow/util/logging.cc:228
#5  0x7fffa2f8e0cc in parquet::arrow::StructReader::GetDefLevels 
(this=0x7fff8400b4b0, data=0x7fff9e63d328, length=0x7fff9e63d320)
at ../src/parquet/arrow/reader.cc:607
#6  0x7fffa2f8d615 in parquet::arrow::StructReader::DefLevelsToNullArray 
(this=0x7fff8400b4b0, null_bitmap_out=0x7fff9e63d4b0, 
null_count_out=0x7fff9e63d4a8)
at ../src/parquet/arrow/reader.cc:561
#7  0x7fffa2f8e75b in parquet::arrow::StructReader::NextBatch 
(this=0x7fff8400b4b0, records_to_read=1777, out=0x5607cd60)
at ../src/parquet/arrow/reader.cc:647
#8  0x7fffa2fa403b in parquet::arrow::FileReaderImpl::ReadSchemaField 
(this=0x55eb6310, i=2, indices=..., row_groups=..., 
out_field=0x5607cd20, 
out=0x5607cd60) at ../src/parquet/arrow/reader.cc:182
{code}

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: 
> part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet, it reads the file just fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924555#comment-16924555
 ] 

Wes McKinney commented on ARROW-3933:
-

I attached the offending file for convenience

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: 
> part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet, it reads the file just fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924554#comment-16924554
 ] 

Wes McKinney commented on ARROW-3933:
-

This is core dumping. I'm going to take a look and see if I can figure it out

{code}
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x77805801 in __GI_abort () at abort.c:79
#2  0x7fffa4a37eee in arrow::util::CerrLog::~CerrLog (this=0x7fff84003b60) 
at ../src/arrow/util/logging.cc:50
#3  0x7fffa4a37f29 in arrow::util::CerrLog::~CerrLog (this=0x7fff84003b60) 
at ../src/arrow/util/logging.cc:44
#4  0x7fffa4a37b40 in arrow::util::ArrowLog::~ArrowLog 
(this=0x7fff9e63d188) at ../src/arrow/util/logging.cc:228
#5  0x7fffa2f8e0cc in parquet::arrow::StructReader::GetDefLevels 
(this=0x7fff8400b4b0, data=0x7fff9e63d328, length=0x7fff9e63d320)
at ../src/parquet/arrow/reader.cc:607
#6  0x7fffa2f8d615 in parquet::arrow::StructReader::DefLevelsToNullArray 
(this=0x7fff8400b4b0, null_bitmap_out=0x7fff9e63d4b0, 
null_count_out=0x7fff9e63d4a8)
at ../src/parquet/arrow/reader.cc:561
#7  0x7fffa2f8e75b in parquet::arrow::StructReader::NextBatch 
(this=0x7fff8400b4b0, records_to_read=1777, out=0x5607cd60)
at ../src/parquet/arrow/reader.cc:647
#8  0x7fffa2fa403b in parquet::arrow::FileReaderImpl::ReadSchemaField 
(this=0x55eb6310, i=2, indices=..., row_groups=..., 
out_field=0x5607cd20, 
out=0x5607cd60) at ../src/parquet/arrow/reader.cc:182
{code}

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: 
> part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet, it reads the file just fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5374) [Python] Misleading error message when calling pyarrow.read_record_batch on a complete IPC stream

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5374:
--
Labels: beginner pull-request-available  (was: beginner)

> [Python] Misleading error message when calling pyarrow.read_record_batch on a 
> complete IPC stream
> -
>
> Key: ARROW-5374
> URL: https://issues.apache.org/jira/browse/ARROW-5374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 0.15.0
>
>
> {code:python}
> >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())],
> ... names=['strs'])
> >>> stream = pa.BufferOutputStream()
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write_batch(batch)
> >>> writer.close()
> >>> buf = stream.getvalue()
> >>> pa.read_record_batch(buf, batch.schema)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.read_record_batch(buf, batch.schema)
>   File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch
>     check_status(ReadRecordBatch(deref(message.message.get()),
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
>     raise ArrowIOError(message)
> ArrowIOError: Expected IPC message of type schema got record batch
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-5374) [Python] Misleading error message when calling pyarrow.read_record_batch on a complete IPC stream

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5374:
---

Assignee: Wes McKinney

> [Python] Misleading error message when calling pyarrow.read_record_batch on a 
> complete IPC stream
> -
>
> Key: ARROW-5374
> URL: https://issues.apache.org/jira/browse/ARROW-5374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> {code:python}
> >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())],
> ... names=['strs'])
> >>> stream = pa.BufferOutputStream()
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write_batch(batch)
> >>> writer.close()
> >>> buf = stream.getvalue()
> >>> pa.read_record_batch(buf, batch.schema)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.read_record_batch(buf, batch.schema)
>   File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch
>     check_status(ReadRecordBatch(deref(message.message.get()),
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
>     raise ArrowIOError(message)
> ArrowIOError: Expected IPC message of type schema got record batch
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-5161) [Python] Cannot convert struct type from Pandas object column

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5161.
---
  Assignee: (was: Wes McKinney)
Resolution: Duplicate

This was fixed and tested in ARROW-5286

> [Python] Cannot convert struct type from Pandas object column
> -
>
> Key: ARROW-5161
> URL: https://issues.apache.org/jira/browse/ARROW-5161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.15.0
>
>
> As reported on [https://github.com/apache/arrow/issues/4045]. Interestingly, 
> the datatype is inferred correctly.
> {code:python}
> >>> df = pd.DataFrame({'col': [{'ints': 5, 'strs': 'foo'},
> ... {'ints': 6, 'strs': 'bar'}]})
> >>> df
>                           col
> 0  {'ints': 5, 'strs': 'foo'}
> 1  {'ints': 6, 'strs': 'bar'}
> >>> pa.Table.from_pandas(df)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
>     names, arrays, metadata = dataframe_to_arrays(
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 480, 
> in dataframe_to_arrays
>     types)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 209, 
> in construct_metadata
>     field_name=sanitized_name)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 151, 
> in get_column_metadata
>     logical_type = get_logical_type(arrow_type)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 79, in 
> get_logical_type
>     raise NotImplementedError(str(arrow_type))
> NotImplementedError: struct<ints: int64, strs: string>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5161) [Python] Cannot convert struct type from Pandas object column

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924539#comment-16924539
 ] 

Wes McKinney commented on ARROW-5161:
-

This is working in master. I'll add a unit test

> [Python] Cannot convert struct type from Pandas object column
> -
>
> Key: ARROW-5161
> URL: https://issues.apache.org/jira/browse/ARROW-5161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> As reported on [https://github.com/apache/arrow/issues/4045]. Interestingly, 
> the datatype is inferred correctly.
> {code:python}
> >>> df = pd.DataFrame({'col': [{'ints': 5, 'strs': 'foo'},
> ... {'ints': 6, 'strs': 'bar'}]})
> >>> df
>                           col
> 0  {'ints': 5, 'strs': 'foo'}
> 1  {'ints': 6, 'strs': 'bar'}
> >>> pa.Table.from_pandas(df)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
>     names, arrays, metadata = dataframe_to_arrays(
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 480, 
> in dataframe_to_arrays
>     types)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 209, 
> in construct_metadata
>     field_name=sanitized_name)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 151, 
> in get_column_metadata
>     logical_type = get_logical_type(arrow_type)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 79, in 
> get_logical_type
>     raise NotImplementedError(str(arrow_type))
> NotImplementedError: struct<ints: int64, strs: string>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-5161) [Python] Cannot convert struct type from Pandas object column

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5161:
---

Assignee: Wes McKinney

> [Python] Cannot convert struct type from Pandas object column
> -
>
> Key: ARROW-5161
> URL: https://issues.apache.org/jira/browse/ARROW-5161
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> As reported on [https://github.com/apache/arrow/issues/4045]. Interestingly, 
> the datatype is inferred correctly.
> {code:python}
> >>> df = pd.DataFrame({'col': [{'ints': 5, 'strs': 'foo'}, {'ints': 6, 'strs': 'bar'}]})
> >>> df
>                           col
> 0  {'ints': 5, 'strs': 'foo'}
> 1  {'ints': 6, 'strs': 'bar'}
> >>> pa.Table.from_pandas(df)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 1139, in pyarrow.lib.Table.from_pandas
>     names, arrays, metadata = dataframe_to_arrays(
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 480, in dataframe_to_arrays
>     types)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 209, in construct_metadata
>     field_name=sanitized_name)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 151, in get_column_metadata
>     logical_type = get_logical_type(arrow_type)
>   File "/home/antoine/arrow/dev/python/pyarrow/pandas_compat.py", line 79, in get_logical_type
>     raise NotImplementedError(str(arrow_type))
> NotImplementedError: struct<ints: int64, strs: string>
> {code}





[jira] [Updated] (ARROW-3651) [Python] Datetimes from non-DateTimeIndex cannot be deserialized

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3651:
--
Labels: parquet pull-request-available  (was: parquet)

> [Python] Datetimes from non-DateTimeIndex cannot be deserialized
> 
>
> Key: ARROW-3651
> URL: https://issues.apache.org/jira/browse/ARROW-3651
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Armin Berres
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
>
> Given an index which contains datetimes but is not a DateTimeIndex, writing the 
> file works but reading it back fails.
> {code:python}
> df = pd.DataFrame(1, index=pd.MultiIndex.from_arrays([[1,2],[3,4]]), 
> columns=[pd.to_datetime("2018/01/01")])
> # the columns index is no longer a DateTimeIndex
> df = df.reset_index().set_index(['level_0', 'level_1'])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> pq.read_pandas('test.parquet').to_pandas()
> {code}
> results in 
> {code}
> KeyError  Traceback (most recent call last)
> ~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in 
> _pandas_type_to_numpy_type(pandas_type)
> 676 try:
> --> 677 return _pandas_logical_type_map[pandas_type]
> 678 except KeyError:
> KeyError: 'datetime'
> {code}
> The created schema:
> {code}
> 2018-01-01 00:00:00: int64
> level_0: int64
> level_1: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["level_0", "level_1"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "datetime", "numpy_type": "object", "metadata": null}], "columns": [{"name": "2018-01-01 00:00:00", "field_name": "2018-01-01 00:00:00", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "level_0", "field_name": "level_0", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "level_1", "field_name": "level_1", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "pandas_version": "0.23.4"}'}
> {code}





[jira] [Assigned] (ARROW-3651) [Python] Datetimes from non-DateTimeIndex cannot be deserialized

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3651:
---

Assignee: Wes McKinney

> [Python] Datetimes from non-DateTimeIndex cannot be deserialized
> 
>
> Key: ARROW-3651
> URL: https://issues.apache.org/jira/browse/ARROW-3651
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Armin Berres
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> Given an index which contains datetimes but is not a DateTimeIndex, writing the 
> file works but reading it back fails.
> {code:python}
> df = pd.DataFrame(1, index=pd.MultiIndex.from_arrays([[1,2],[3,4]]), 
> columns=[pd.to_datetime("2018/01/01")])
> # the columns index is no longer a DateTimeIndex
> df = df.reset_index().set_index(['level_0', 'level_1'])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> pq.read_pandas('test.parquet').to_pandas()
> {code}
> results in 
> {code}
> KeyError  Traceback (most recent call last)
> ~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in 
> _pandas_type_to_numpy_type(pandas_type)
> 676 try:
> --> 677 return _pandas_logical_type_map[pandas_type]
> 678 except KeyError:
> KeyError: 'datetime'
> {code}
> The created schema:
> {code}
> 2018-01-01 00:00:00: int64
> level_0: int64
> level_1: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["level_0", "level_1"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "datetime", "numpy_type": "object", "metadata": null}], "columns": [{"name": "2018-01-01 00:00:00", "field_name": "2018-01-01 00:00:00", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "level_0", "field_name": "level_0", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "level_1", "field_name": "level_1", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "pandas_version": "0.23.4"}'}
> {code}





[jira] [Updated] (ARROW-412) [Format] Handling of buffer padding in the IPC metadata

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-412:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [Format] Handling of buffer padding in the IPC metadata
> ---
>
> Key: ARROW-412
> URL: https://issues.apache.org/jira/browse/ARROW-412
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> See discussion in ARROW-399. Do we include padding bytes in the metadata or 
> set the actual used bytes? In the latter case, the padding would be a part of 
> the format (any buffers continue to be expected to be 64-byte padded, to 
> permit AVX512 instructions)





[jira] [Resolved] (ARROW-6476) [Java][CI] Travis java all-jdks job is broken

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6476.
-
Resolution: Fixed

Issue resolved by pull request 5308
[https://github.com/apache/arrow/pull/5308]

> [Java][CI] Travis java all-jdks job is broken
> -
>
> Key: ARROW-6476
> URL: https://issues.apache.org/jira/browse/ARROW-6476
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Introduced by ARROW-6433, fixing the shade check enabled evaluation of the 
> incorrect body. 





[jira] [Updated] (ARROW-6476) [Java][CI] Travis java all-jdks job is broken

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6476:
--
Labels: pull-request-available  (was: )

> [Java][CI] Travis java all-jdks job is broken
> -
>
> Key: ARROW-6476
> URL: https://issues.apache.org/jira/browse/ARROW-6476
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> Introduced by ARROW-6433, fixing the shade check enabled evaluation of the 
> incorrect body. 





[jira] [Updated] (ARROW-6476) [Java][CI] Travis java all-jdks job is broken

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6476:

Fix Version/s: 0.15.0

> [Java][CI] Travis java all-jdks job is broken
> -
>
> Key: ARROW-6476
> URL: https://issues.apache.org/jira/browse/ARROW-6476
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.15.0
>
>
> Introduced by ARROW-6433, fixing the shade check enabled evaluation of the 
> incorrect body. 





[jira] [Assigned] (ARROW-6435) [CI][Crossbow] Nightly dask integration job fails

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6435:
---

Assignee: Wes McKinney  (was: Benjamin Kietzman)

> [CI][Crossbow] Nightly dask integration job fails
> -
>
> Key: ARROW-6435
> URL: https://issues.apache.org/jira/browse/ARROW-6435
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2326. Either fix, skip job and 
> create followup Jira to unskip, or delete job.





[jira] [Resolved] (ARROW-6435) [CI][Crossbow] Nightly dask integration job fails

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6435.
-
Resolution: Fixed

Issue resolved by pull request 5302
[https://github.com/apache/arrow/pull/5302]

> [CI][Crossbow] Nightly dask integration job fails
> -
>
> Key: ARROW-6435
> URL: https://issues.apache.org/jira/browse/ARROW-6435
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Assignee: Benjamin Kietzman
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2326. Either fix, skip job and 
> create followup Jira to unskip, or delete job.





[jira] [Created] (ARROW-6479) [C++] inline errors from external projects' build logs

2019-09-06 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6479:


 Summary: [C++] inline errors from external projects' build logs
 Key: ARROW-6479
 URL: https://issues.apache.org/jira/browse/ARROW-6479
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman


Currently when an external project build fails, we get a very uninformative 
message:

{code}
[88/543] Performing build step for 'flatbuffers_ep'
FAILED: flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build 
flatbuffers_ep-prefix/src/flatbuffers_ep-install/bin/flatc 
flatbuffers_ep-prefix/src/flatbuffers_ep-install/lib/libflatbuffers.a 
cd /build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-build && /usr/bin/cmake 
-P 
/build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build-DEBUG.cmake
 && /usr/bin/cmake -E touch 
/build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build
CMake Error at 
/build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build-DEBUG.cmake:16
 (message):
  Command failed: 1

   '/usr/bin/cmake' '--build' '.'

  See also


/build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build-*.log
{code}

It would be far more useful if the error were caught and the relevant section (or 
even the entirety) of 
{{/build/cpp/flatbuffers_ep-prefix/src/flatbuffers_ep-stamp/flatbuffers_ep-build-*.log}}
were output instead. This is doubly the case on CI, where accessing those logs 
is non-trivial.
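One possible shape for this, sketched in Python rather than the CMake that would actually implement it (the helper name and log-glob argument are hypothetical):

```python
import glob
import os
import subprocess
import sys
import tempfile

def run_and_surface_logs(cmd, log_glob, tail_lines=50):
    """Run a build command; on failure, print the tails of matching log files."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        for path in sorted(glob.glob(log_glob)):
            with open(path) as f:
                tail = f.readlines()[-tail_lines:]
            print('--- %s ---' % path, file=sys.stderr)
            sys.stderr.writelines(tail)
    return result.returncode

# Demo: a failing "build" step whose log then gets surfaced inline.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, 'ep-build-err.log'), 'w') as f:
    f.write('CMake Error: something actually informative\n')
rc = run_and_surface_logs([sys.executable, '-c', 'raise SystemExit(1)'],
                          os.path.join(workdir, 'ep-build-*.log'))
```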





[jira] [Updated] (ARROW-3643) [Rust] Optimize `push_slice` of `BufferBuilder`

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3643:
--
Labels: pull-request-available  (was: )

> [Rust] Optimize `push_slice` of `BufferBuilder`
> -
>
> Key: ARROW-3643
> URL: https://issues.apache.org/jira/browse/ARROW-3643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Priority: Minor
>  Labels: pull-request-available
>
> Current implementation just repeatedly calls `push`, this should be optimized.





[jira] [Resolved] (ARROW-6478) [C++] Roll back to jemalloc stable-4 branch until performance issues in 5.2.x addressed

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6478.
-
Resolution: Fixed

Issue resolved by pull request 5297
[https://github.com/apache/arrow/pull/5297]

> [C++] Roll back to jemalloc stable-4 branch until performance issues in 5.2.x 
> addressed
> ---
>
> Key: ARROW-6478
> URL: https://issues.apache.org/jira/browse/ARROW-6478
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> New JIRA for changelog per ongoing thread in ARROW-6417





[jira] [Updated] (ARROW-6478) [C++] Roll back to jemalloc stable-4 branch until performance issues in 5.2.x addressed

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6478:
--
Labels: pull-request-available  (was: )

> [C++] Roll back to jemalloc stable-4 branch until performance issues in 5.2.x 
> addressed
> ---
>
> Key: ARROW-6478
> URL: https://issues.apache.org/jira/browse/ARROW-6478
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> New JIRA for changelog per ongoing thread in ARROW-6417





[jira] [Created] (ARROW-6478) [C++] Roll back to jemalloc stable-4 branch until performance issues in 5.2.x addressed

2019-09-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6478:
---

 Summary: [C++] Roll back to jemalloc stable-4 branch until 
performance issues in 5.2.x addressed
 Key: ARROW-6478
 URL: https://issues.apache.org/jira/browse/ARROW-6478
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.15.0


New JIRA for changelog per ongoing thread in ARROW-6417





[jira] [Commented] (ARROW-6474) [Python] Provide mechanism for python to write out old format

2019-09-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924349#comment-16924349
 ] 

Wes McKinney commented on ARROW-6474:
-

I would expose this as an option in the stream writer API. If Spark downstream 
releases are unable to apply patches to accommodate new Arrow releases, I'm not 
sure how much we should bend over backwards to accommodate them.

> [Python] Provide mechanism for python to write out old format
> -
>
> Key: ARROW-6474
> URL: https://issues.apache.org/jira/browse/ARROW-6474
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 0.15.0
>
>
> I think this needs to be an environment variable, so it can be made to work 
> with old versions of the Java library in the pyspark integration.
>  
>  [~bryanc] can you check if this captures the requirements?





[jira] [Updated] (ARROW-6474) [Python] Provide mechanism for python to write out old format

2019-09-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6474:

Summary: [Python] Provide mechanism for python to write out old format  
(was: Provide mechanism for python to write out old format)

> [Python] Provide mechanism for python to write out old format
> -
>
> Key: ARROW-6474
> URL: https://issues.apache.org/jira/browse/ARROW-6474
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 0.15.0
>
>
> I think this needs to be an environment variable, so it can be made to work 
> with old versions of the Java library in the pyspark integration.
>  
>  [~bryanc] can you check if this captures the requirements?





[jira] [Commented] (ARROW-4914) [Rust] Array slice returns incorrect bitmask

2019-09-06 Thread lidavidm (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924348#comment-16924348
 ] 

lidavidm commented on ARROW-4914:
-

It looks like this was fixed as part of ARROW-4853, can it be closed?

> [Rust] Array slice returns incorrect bitmask
> 
>
> Key: ARROW-4914
> URL: https://issues.apache.org/jira/browse/ARROW-4914
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.13.0
>Reporter: Neville Dipale
>Priority: Blocker
>  Labels: beginner
>
> Slicing arrays changes the offset, length and null count of their array data, 
> but the bitmask is not changed.
> This results in the correct null count, but the array values might be marked 
> incorrectly as valid/invalid based on the old bitmask positions before the 
> offset.
> To reproduce, create an array with some null values, slice the array, and 
> then dbg!() it (after downcasting).





[jira] [Updated] (ARROW-5722) [Rust] Implement std::fmt::Debug for ListArray, BinaryArray and StructArray

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5722:
--
Labels: beginner pull-request-available  (was: beginner)

> [Rust] Implement std::fmt::Debug for ListArray, BinaryArray and StructArray
> ---
>
> Key: ARROW-5722
> URL: https://issues.apache.org/jira/browse/ARROW-5722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Chao Sun
>Priority: Major
>  Labels: beginner, pull-request-available
>






[jira] [Commented] (ARROW-1984) [Java] NullableDateMilliVector.getObject() should return a LocalDate, not a LocalDateTime

2019-09-06 Thread Yongbo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924322#comment-16924322
 ] 

Yongbo Zhang commented on ARROW-1984:
-

I'm looking at this ticket.

> [Java] NullableDateMilliVector.getObject() should return a LocalDate, not a 
> LocalDateTime
> -
>
> Key: ARROW-1984
> URL: https://issues.apache.org/jira/browse/ARROW-1984
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Vanco Buca
>Priority: Minor
>  Labels: beginner
>
> NullableDateMilliVector.getObject() today returns a LocalDateTime. However, 
> this vector is used to store date information, and thus, getObject() should 
> return a LocalDate. 
> Please note: there already exists a vector that returns LocalDateTime --
>  the NullableTimestampMilliVector.





[jira] [Commented] (ARROW-5722) [Rust] Implement std::fmt::Debug for ListArray, BinaryArray and StructArray

2019-09-06 Thread lidavidm (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924290#comment-16924290
 ] 

lidavidm commented on ARROW-5722:
-

[~csun], I have some basic implementations. Printing nested arrays is 
difficult; I've punted on that for StructArray/ListArray. Really, we need Array 
to have a Debug trait bound as well - is that acceptable?

In the future, we may also want a pretty-printer API to make nested arrays look 
better (with indentation, etc).

> [Rust] Implement std::fmt::Debug for ListArray, BinaryArray and StructArray
> ---
>
> Key: ARROW-5722
> URL: https://issues.apache.org/jira/browse/ARROW-5722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Chao Sun
>Priority: Major
>  Labels: beginner
>






[jira] [Updated] (ARROW-6477) [Packaging][Crossbow] Use Azure Pipelines to build linux packages

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6477:
--
Labels: pull-request-available  (was: )

> [Packaging][Crossbow] Use Azure Pipelines to build linux packages
> -
>
> Key: ARROW-6477
> URL: https://issues.apache.org/jira/browse/ARROW-6477
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> We have hit the time limitation of Travis for the Debian builds, so we need 
> to move these builds to another CI provider.





[jira] [Created] (ARROW-6477) [Packaging][Crossbow] Use Azure Pipelines to build linux packages

2019-09-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6477:
--

 Summary: [Packaging][Crossbow] Use Azure Pipelines to build linux 
packages
 Key: ARROW-6477
 URL: https://issues.apache.org/jira/browse/ARROW-6477
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.15.0


We have hit the time limitation of Travis for the Debian builds, so we need to 
move these builds to another CI provider.





[jira] [Commented] (ARROW-2428) [Python] Add API to map Arrow types (including extension types) to pandas ExtensionArray instances for to_pandas conversions

2019-09-06 Thread lidavidm (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924221#comment-16924221
 ] 

lidavidm commented on ARROW-2428:
-

It sounds like a new registry isn't needed, but adding parameters to to_pandas 
would be useful for customizing conversions of built-in types; Joris notes 
Fletcher would want to use that.

> [Python] Add API to map Arrow types (including extension types) to pandas 
> ExtensionArray instances for to_pandas conversions
> 
>
> Key: ARROW-2428
> URL: https://issues.apache.org/jira/browse/ARROW-2428
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 1.0.0
>
>
> With the next release of Pandas, it will be possible to define custom column 
> types that back a {{pandas.Series}}. Thus we will not be able to cover all 
> possible column types in the {{to_pandas}} conversion by default as we won't 
> be aware of all extension arrays.
> To enable users to create {{ExtensionArray}} instances from Arrow columns in 
> the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}} 
> call where they can overload the default conversion routines with the ones 
> that produce their {{ExtensionArray}} instances.
> This should avoid additional copies in the case where we would nowadays first 
> convert the Arrow column into a default Pandas column (probably of object 
> type) and the user would afterwards convert it to a more efficient 
> {{ExtensionArray}}. This hook here will be especially useful when you build 
> {{ExtensionArrays}} where the storage is backed by Arrow.
> The meta-issue that tracks the implementation inside of Pandas is: 
> https://github.com/pandas-dev/pandas/issues/19696





[jira] [Updated] (ARROW-6476) [Java][CI] Travis java all-jdks job is broken

2019-09-06 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6476:
--
Component/s: Java
 Continuous Integration

> [Java][CI] Travis java all-jdks job is broken
> -
>
> Key: ARROW-6476
> URL: https://issues.apache.org/jira/browse/ARROW-6476
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> Introduced by ARROW-6433, fixing the shade check enabled evaluation of the 
> incorrect body. 





[jira] [Assigned] (ARROW-6476) [Java][CI] Travis java all-jdks job is broken

2019-09-06 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6476:
-

Assignee: Francois Saint-Jacques

> [Java][CI] Travis java all-jdks job is broken
> -
>
> Key: ARROW-6476
> URL: https://issues.apache.org/jira/browse/ARROW-6476
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>
> Introduced by ARROW-6433, fixing the shade check enabled evaluation of the 
> incorrect body. 





[jira] [Created] (ARROW-6476) [Java][CI] Travis java all-jdks job is broken

2019-09-06 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6476:
-

 Summary: [Java][CI] Travis java all-jdks job is broken
 Key: ARROW-6476
 URL: https://issues.apache.org/jira/browse/ARROW-6476
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Francois Saint-Jacques


Introduced by ARROW-6433, fixing the shade check enabled evaluation of the 
incorrect body. 





[jira] [Commented] (ARROW-6472) [Java] ValueVector#accept may has potential cast exception

2019-09-06 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924198#comment-16924198
 ] 

Liya Fan commented on ARROW-6472:
-

My suggestions:

1. We need to keep both left and right vectors in the constructors. The reason 
is performance: this is a short-cut that avoids repeated type checks.

2. For each visit invocation, we first check whether the parameter is the same 
as the left vector (which is usually true). If it is, we go ahead and call the 
compare methods. If it is not, we 1) override the left vector; 2) do the type 
check; 3) call the compare methods.

{code:java}
public void visit(ValueVector left, Range range) {
  if (left != this.left) {
    this.left = left;
    typeCheck();
  }
  compareXXX();
}
{code}

This approach keeps the good performance while avoiding the problem indicated 
by [~Pindikura Ravindra].

What do you think?


> [Java] ValueVector#accept may has potential cast exception
> --
>
> Key: ARROW-6472
> URL: https://issues.apache.org/jira/browse/ARROW-6472
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>
> Per discussion 
> [https://github.com/apache/arrow/pull/5195#issuecomment-528425302]
> We may use the API this way:
> {code:java}
> RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
> vector3.accept(visitor, range){code}
> If vector1/vector2 are, say, {{StructVector}}s and vector3 is an {{IntVector}}, 
> things can go bad: we'll use {{compareBaseFixedWidthVectors()}} and do 
> wrong type-casts for vector1/vector2.





[jira] [Updated] (ARROW-6448) [CI] Add crossbow notifications

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6448:
--
Labels: pull-request-available  (was: )

> [CI] Add crossbow notifications
> ---
>
> Key: ARROW-6448
> URL: https://issues.apache.org/jira/browse/ARROW-6448
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Continuous Integration
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: pull-request-available
>






[jira] [Updated] (ARROW-6475) [C++] Don't try to dictionary encode dictionary arrays

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6475:
--
Labels: pull-request-available  (was: )

> [C++] Don't try to dictionary encode dictionary arrays
> --
>
> Key: ARROW-6475
> URL: https://issues.apache.org/jira/browse/ARROW-6475
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> With #5077 (or possibly #4949) behavior with dictionary arrays changed, 
> leaving the explicit call to DictionaryEncode() redundant.





[jira] [Resolved] (ARROW-6475) [C++] Don't try to dictionary encode dictionary arrays

2019-09-06 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-6475.

Resolution: Fixed

Issue resolved by pull request 5299
[https://github.com/apache/arrow/pull/5299]

> [C++] Don't try to dictionary encode dictionary arrays
> --
>
> Key: ARROW-6475
> URL: https://issues.apache.org/jira/browse/ARROW-6475
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> With #5077 (or possibly #4949) the behavior with dictionary arrays changed, 
> making the explicit call to DictionaryEncode() redundant.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6475) [C++] Don't try to dictionary encode dictionary arrays

2019-09-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6475:
--

 Summary: [C++] Don't try to dictionary encode dictionary arrays
 Key: ARROW-6475
 URL: https://issues.apache.org/jira/browse/ARROW-6475
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Krisztian Szucs
Assignee: Benjamin Kietzman
 Fix For: 0.15.0


With #5077 (or possibly #4949) the behavior with dictionary arrays changed, 
making the explicit call to DictionaryEncode() redundant.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6120) [C++][Gandiva] including some headers causes decimal_test to fail

2019-09-06 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-6120.

Resolution: Fixed

Issue resolved by pull request 5300
[https://github.com/apache/arrow/pull/5300]

> [C++][Gandiva] including some headers causes decimal_test to fail
> -
>
> Key: ARROW-6120
> URL: https://issues.apache.org/jira/browse/ARROW-6120
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Benjamin Kietzman
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It seems this is due to precompiled code being contaminated with undesired 
> headers. For example, {{#include }} in {{arrow/compare.h}} causes:
> {code}
> [ RUN  ] TestDecimal.TestCastFunctions
> ../../src/gandiva/tests/decimal_test.cc:478: Failure
> Value of: (array_dec)->Equals(outputs[2], 
> arrow::EqualOptions().nans_equal(true))
>   Actual: false
> Expected: true
> expected array: [
>   1.23,
>   1.58,
>   -1.23,
>   -1.58
> ] actual array: [
>   0.00,
>   0.00,
>   0.00,
>   0.00
> ]
> ../../src/gandiva/tests/decimal_test.cc:481: Failure
> Value of: (array_dec)->Equals(outputs[2], 
> arrow::EqualOptions().nans_equal(true))
>   Actual: false
> Expected: true
> expected array: [
>   1.23,
>   1.58,
>   -1.23,
>   -1.58
> ] actual array: [
>   0.00,
>   0.00,
>   0.00,
>   0.00
> ]
> ../../src/gandiva/tests/decimal_test.cc:484: Failure
> Value of: (array_dec)->Equals(outputs[3], 
> arrow::EqualOptions().nans_equal(true))
>   Actual: false
> Expected: true
> expected array: [
>   1.23,
>   1.58,
>   -1.23,
>   -1.58
> ] actual array: [
>   0.00,
>   0.00,
>   0.00,
>   0.00
> ]
> ../../src/gandiva/tests/decimal_test.cc:497: Failure
> Value of: (array_float64)->Equals(outputs[6], 
> arrow::EqualOptions().nans_equal(true))
>   Actual: false
> Expected: true
> expected array: [
>   1.23,
>   1.58,
>   -1.23,
>   -1.58
> ] actual array: [
>   inf,
>   inf,
>   -inf,
>   -inf
> ]
> [  FAILED  ] TestDecimal.TestCastFunctions (134 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6369) [Python] Support list-of-boolean in Array.to_pandas conversion

2019-09-06 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-6369.

Resolution: Fixed

Issue resolved by pull request 5301
[https://github.com/apache/arrow/pull/5301]

> [Python] Support list-of-boolean in Array.to_pandas conversion
> --
>
> Key: ARROW-6369
> URL: https://issues.apache.org/jira/browse/ARROW-6369
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See
> {code}
> In [4]: paste
> a = pa.array(np.array([[True, False], [True, True, True]]))
> ## -- End pasted text --
> In [5]: a
> Out[5]: 
> 
> [
>   [
> true,
> false
>   ],
>   [
> true,
> true,
> true
>   ]
> ]
> In [6]: a.to_pandas()
> ---
> ArrowNotImplementedError  Traceback (most recent call last)
>  in 
> > 1 a.to_pandas()
> ~/code/arrow/python/pyarrow/array.pxi in 
> pyarrow.lib._PandasConvertible.to_pandas()
> 439 deduplicate_objects=deduplicate_objects)
> 440 
> --> 441 return self._to_pandas(options, categories=categories,
> 442ignore_metadata=ignore_metadata)
> 443 
> ~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array._to_pandas()
> 815 
> 816 with nogil:
> --> 817 check_status(ConvertArrayToPandas(c_options, 
> self.sp_array,
> 818   self, ))
> 819 return wrap_array_output(out)
> ~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
>  84 raise ArrowKeyError(message)
>  85 elif status.IsNotImplemented():
> ---> 86 raise ArrowNotImplementedError(message)
>  87 elif status.IsTypeError():
>  88 raise ArrowTypeError(message)
> ArrowNotImplementedError: Not implemented type for lists: bool
> In ../src/arrow/python/arrow_to_pandas.cc, line 1910, code: 
> VisitTypeInline(*data_->type(), this)
> {code}
> as reported in https://github.com/apache/arrow/issues/5203
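Conceptually, converting a list-of-boolean array to per-row values slices the flat child values by the list offsets, just as for the list child types that were already supported. A pure-Python sketch under that assumption (the function name is hypothetical; the real conversion lives in arrow_to_pandas.cc):

```python
# Pure-Python sketch: an Arrow list<bool> array stores a flat child array
# of booleans plus an offsets array with len(rows) + 1 entries; row i is
# the slice child_values[offsets[i]:offsets[i + 1]].

def list_array_to_rows(offsets, child_values):
    return [child_values[offsets[i]:offsets[i + 1]]
            for i in range(len(offsets) - 1)]

offsets = [0, 2, 5]
values = [True, False, True, True, True]
rows = list_array_to_rows(offsets, values)
# rows == [[True, False], [True, True, True]]
```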



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6472) [Java] ValueVector#accept may have a potential cast exception

2019-09-06 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924014#comment-16924014
 ] 

Ji Liu commented on ARROW-6472:
---

cc [~pravindra] [~fan_li_ya] for comments on this issue. Do you have any ideas 
on how to resolve this problem?

Here are some of my initial thoughts:

I am against having visitors keep both leftVector and rightVector, for a few 
reasons:

i. It makes the visitor and accept APIs inconsistent, as we mentioned before.

ii. I think a visitor should be reusable across multiple left vectors (with one 
visitor: vector1.accept(visitor, IN), vector2.accept(visitor, IN), …), which is 
not possible when the left and right vectors are fixed.

 

If we remove leftVector and pass it in via the accept API, the main problem to 
solve is the repeated type checks in ListVector. We could do the following to 
avoid them:

i. Keep the Range param, since it makes it possible to compare different ranges 
without creating extra visitor instances.

ii. Move the typeCheck flag into Range (default true); when comparing a 
ListVector's data vector, clear the flag after the first compare so that 
follow-up compares can skip the type checks.

 

These are only initial thoughts; it would be great to hear other ideas.

If we cannot find a proper solution or reach a consensus in this JIRA, it's 
fine with me to start a discussion on the mailing list.

Thanks!
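Idea ii could look roughly like the following. This is a Python sketch for brevity (the real API is Java, and all names here are hypothetical): the type-check flag lives on the Range, defaults to true, and is cleared after the first successful check so follow-up compares skip it.

```python
# Hypothetical sketch of moving the typeCheck flag into Range.
# Left/right "vectors" are plain sequences here for illustration.

class Range:
    def __init__(self, left_start, right_start, length, type_check=True):
        self.left_start = left_start
        self.right_start = right_start
        self.length = length
        self.type_check = type_check  # checked once, then cleared

class RangeEqualsVisitor:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def range_equals(self, rng):
        if rng.type_check:
            if type(self.left) is not type(self.right):
                return False
            rng.type_check = False  # follow-up compares skip the check
        return (self.left[rng.left_start:rng.left_start + rng.length] ==
                self.right[rng.right_start:rng.right_start + rng.length])
```

Reusing one Range across the repeated per-element compares of a ListVector's data vector then pays the type-check cost only once.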

> [Java] ValueVector#accept may have a potential cast exception
> --
>
> Key: ARROW-6472
> URL: https://issues.apache.org/jira/browse/ARROW-6472
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>
> Per discussion 
> [https://github.com/apache/arrow/pull/5195#issuecomment-528425302]
> We may use API this way:
> {code:java}
> RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
> vector3.accept(visitor, range){code}
> If vector1/vector2 are, say, {{StructVector}}s and vector3 is an {{IntVector}}, 
> things can go bad: we'll use {{compareBaseFixedWidthVectors()}} and do wrong 
> type-casts for vector1/vector2.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5125) [Python] Cannot roundtrip extreme dates through pyarrow

2019-09-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-5125:
---
Labels: pull-request-available windows  (was: parquet 
pull-request-available windows)

> [Python] Cannot roundtrip extreme dates through pyarrow
> ---
>
> Key: ARROW-5125
> URL: https://issues.apache.org/jira/browse/ARROW-5125
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 
> 2019, 22:22:05)
>Reporter: Max Bolingbroke
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available, windows
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> You can roundtrip many dates through a pyarrow array:
>  
> {noformat}
> >>> pa.array([datetime.date(1980, 1, 1)], type=pa.date32())[0]
> datetime.date(1980, 1, 1){noformat}
>  
> But (on Windows at least), not extreme ones:
>  
> {noformat}
> >>> pa.array([datetime.date(1960, 1, 1)], type=pa.date32())[0]
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
>  File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
> OSError: [Errno 22] Invalid argument
> >>> pa.array([datetime.date(3200, 1, 1)], type=pa.date32())[0]
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
>  File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
> {noformat}
> This is because datetime.utcfromtimestamp and datetime.timestamp fail on 
> these dates, but it seems we should be able to totally avoid invoking this 
> function when deserializing dates. Ideally we would be able to roundtrip 
> these as datetimes too, of course, but it's less clear that this will be 
> easy. For some context on this see [https://bugs.python.org/issue29097].
> This may be related to ARROW-3176 and ARROW-4746
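As the report suggests, deserializing a date32 value does not need to go through a timestamp at all: the value is a day count since the UNIX epoch, so plain date arithmetic handles pre-epoch and far-future dates on any platform. A minimal sketch (the function name is hypothetical, not pyarrow's API):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def date32_to_py(days):
    """Convert an Arrow date32 value (days since the UNIX epoch) to a
    datetime.date without calling datetime.utcfromtimestamp, which fails
    on Windows for dates outside the C runtime's supported range."""
    return EPOCH + timedelta(days=days)

date32_to_py(3652)   # datetime.date(1980, 1, 1)
date32_to_py(-3653)  # datetime.date(1960, 1, 1), works on Windows too
```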



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5125) [Python] Cannot roundtrip extreme dates through pyarrow

2019-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5125:
--
Labels: parquet pull-request-available windows  (was: parquet windows)

> [Python] Cannot roundtrip extreme dates through pyarrow
> ---
>
> Key: ARROW-5125
> URL: https://issues.apache.org/jira/browse/ARROW-5125
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Windows 10, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 
> 2019, 22:22:05)
>Reporter: Max Bolingbroke
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available, windows
> Fix For: 0.15.0
>
>
> You can roundtrip many dates through a pyarrow array:
>  
> {noformat}
> >>> pa.array([datetime.date(1980, 1, 1)], type=pa.date32())[0]
> datetime.date(1980, 1, 1){noformat}
>  
> But (on Windows at least), not extreme ones:
>  
> {noformat}
> >>> pa.array([datetime.date(1960, 1, 1)], type=pa.date32())[0]
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
>  File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
> OSError: [Errno 22] Invalid argument
> >>> pa.array([datetime.date(3200, 1, 1)], type=pa.date32())[0]
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "pyarrow\scalar.pxi", line 74, in pyarrow.lib.ArrayValue.__repr__
>  File "pyarrow\scalar.pxi", line 226, in pyarrow.lib.Date32Value.as_py
> {noformat}
> This is because datetime.utcfromtimestamp and datetime.timestamp fail on 
> these dates, but it seems we should be able to totally avoid invoking this 
> function when deserializing dates. Ideally we would be able to roundtrip 
> these as datetimes too, of course, but it's less clear that this will be 
> easy. For some context on this see [https://bugs.python.org/issue29097].
> This may be related to ARROW-3176 and ARROW-4746



--
This message was sent by Atlassian Jira
(v8.3.2#803003)