[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-07-17 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887610#comment-16887610
 ] 

Micah Kornfield commented on ARROW-3772:


"I'm looking at this. This is not a small project – the assumption that values 
are fully materialized is pretty deeply baked into the library. We also have to 
deal with the "fallback" case where a column chunk starts out dictionary 
encoded and switches mid-stream because the dictionary got too big"

I don't have context on how we originally decided to designate an entire column 
as dictionary-encoded rather than a chunk/record-batch column, but it seems like 
this might be another use case where the proposal on encoding/compression could 
make things easier to code (i.e., specify dictionary encoding only on 
SparseRecordBatches where it makes sense, and fall back to dense encoding where 
it no longer makes sense).

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow 
> DictionaryArray
> ---------------------------------------------------------------------------
>
> Key: ARROW-3772
> URL: https://issues.apache.org/jira/browse/ARROW-3772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Stav Nir
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Dictionary data is very common in parquet, in the current implementation 
> parquet-cpp decodes dictionary encoded data always before creating a plain 
> arrow array. This process is wasteful since we could use arrow's 
> DictionaryArray directly and achieve several benefits:
>  # Smaller memory footprint - both in the decoding process and in the 
> resulting arrow table - especially when the dict values are large
>  # Better decoding performance - mostly as a result of the first bullet - 
> less memory fetches and less allocations.
> I think those benefits could achieve significant improvements in runtime.
> My direction for the implementation is to read the indices (through the 
> DictionaryDecoder, after the RLE decoding) and values separately into 2 
> arrays and create a DictionaryArray using them.
> There are some questions to discuss:
>  # Should this be the default behavior for dictionary encoded data
>  # Should it be controlled with a parameter in the API
>  # What should be the policy in case some of the chunks are dictionary 
> encoded and some are not.
> I started implementing this but would like to hear your opinions.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3772) [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray

2019-07-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887577#comment-16887577
 ] 

Wes McKinney commented on ARROW-3772:
---------------------------------------------

I'm looking at this. This is not a small project -- the assumption that values 
are fully materialized is pretty deeply baked into the library. We also have to 
deal with the "fallback" case where a column chunk starts out dictionary 
encoded and switches mid-stream because the dictionary got too big. What to do 
in that case is ambiguous:

* One option is to dictionary-encode the additional pages, so we could end up 
with one big dictionary
* Another option is to optimistically leave things dictionary-encoded, and if 
we hit the fallback case then we fully materialize. We can always do a cast on 
the Arrow side after the fact in this case

FWIW, the fallback scenario is not at all esoteric, because the default 
dictionary pagesize limit in the C++ library is 1MB. I think Java uses the same 
default:

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L44

I think adding an option to raise the limit to 2GB or so when writing Arrow 
DictionaryArray would help. 

Things are made a bit more complex by the code duplication between 
parquet/column_reader.cc and parquet/arrow/record_reader.cc. I'll see if 
there's something I can do to fix that while I'm working on this.
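
To make the second option concrete, here is a minimal pyarrow sketch of the
"cast on the Arrow side after the fact" idea: materialize the fallback chunk
densely, then re-encode it as a dictionary. The data is hypothetical and this
is client-level illustration, not the reader internals.

{code:python}
import pyarrow as pa

# Stand-in for a column that had to be fully materialized because the
# writer fell back from dictionary to plain encoding mid-chunk.
dense = pa.chunked_array([pa.array(["a", "b", "a", "c", "a"])])

# Re-encode as a dictionary after reading.
dict_encoded = dense.dictionary_encode()
print(dict_encoded.type)  # dictionary<values=string, indices=int32, ordered=0>
{code}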

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow 
> DictionaryArray
> ---------------------------------------------------------------------------
>
> Key: ARROW-3772
> URL: https://issues.apache.org/jira/browse/ARROW-3772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Stav Nir
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Dictionary data is very common in parquet, in the current implementation 
> parquet-cpp decodes dictionary encoded data always before creating a plain 
> arrow array. This process is wasteful since we could use arrow's 
> DictionaryArray directly and achieve several benefits:
>  # Smaller memory footprint - both in the decoding process and in the 
> resulting arrow table - especially when the dict values are large
>  # Better decoding performance - mostly as a result of the first bullet - 
> less memory fetches and less allocations.
> I think those benefits could achieve significant improvements in runtime.
> My direction for the implementation is to read the indices (through the 
> DictionaryDecoder, after the RLE decoding) and values separately into 2 
> arrays and create a DictionaryArray using them.
> There are some questions to discuss:
>  # Should this be the default behavior for dictionary encoded data
>  # Should it be controlled with a parameter in the API
>  # What should be the policy in case some of the chunks are dictionary 
> encoded and some are not.
> I started implementing this but would like to hear your opinions.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5974) read_csv returns truncated read for some valid gzip files

2019-07-17 Thread Jordan Samuels (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan Samuels updated ARROW-5974:
---------------------------------------------
Affects Version/s: 0.13.0

Confirmed same issue for 0.13.0

> read_csv returns truncated read for some valid gzip files
> ---------------------------------------------------------------------------
>
> Key: ARROW-5974
> URL: https://issues.apache.org/jira/browse/ARROW-5974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Jordan Samuels
>Priority: Minor
>
> If two gzipped files are concatenated together, the result is a valid gzip 
> file.  However, it appears that pyarrow.csv.read_csv will only read the 
> portion related to the first file.
> If the repro script 
> [here|https://gist.github.com/jordansamuels/d69f1c22c58418f5dfa0785b9ecd211e] 
> is run, the output is:
> {{$ python repro.py}}
> {{pyarrow.csv only reads one row:}}
> {{ x}}
> {{0 1}}
> {{pandas reads two rows:}}
> {{ x}}
> {{0 1}}
> {{1 2}}
> {{pyarrow version: 0.14.0}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5974) read_csv returns truncated read for some valid gzip files

2019-07-17 Thread Jordan Samuels (JIRA)
Jordan Samuels created ARROW-5974:
---------------------------------------------

 Summary: read_csv returns truncated read for some valid gzip files
 Key: ARROW-5974
 URL: https://issues.apache.org/jira/browse/ARROW-5974
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
Reporter: Jordan Samuels


If two gzipped files are concatenated together, the result is a valid gzip 
file.  However, it appears that pyarrow.csv.read_csv will only read the portion 
related to the first file.

If the repro script 
[here|https://gist.github.com/jordansamuels/d69f1c22c58418f5dfa0785b9ecd211e] 
is run, the output is:

{{$ python repro.py}}
{{pyarrow.csv only reads one row:}}
{{ x}}
{{0 1}}
{{pandas reads two rows:}}
{{ x}}
{{0 1}}
{{1 2}}
{{pyarrow version: 0.14.0}}
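
The linked gist is not reproduced above, but a minimal sketch of the same shape
(file name and data are illustrative) assumes both readers infer gzip from the
file extension:

{code:python}
import gzip
import pandas as pd
import pyarrow.csv as pv

# Two complete gzip members concatenated back to back -- still one valid
# gzip file per the spec.
with open("concat.csv.gz", "wb") as f:
    f.write(gzip.compress(b"x\n1\n") + gzip.compress(b"2\n"))

print(pv.read_csv("concat.csv.gz").to_pandas())  # stops after the first member
print(pd.read_csv("concat.csv.gz"))              # reads both rows
{code}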



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5973) [Java] Variable width vectors' get methods should return null when the underlying data is null

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5973:
---------------------------------------------
Labels: pull-request-available  (was: )

> [Java] Variable width vectors' get methods should return null when the 
> underlying data is null
> ---------------------------------------------------------------------------
>
> Key: ARROW-5973
> URL: https://issues.apache.org/jira/browse/ARROW-5973
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> For variable-width vectors (VarCharVector and VarBinaryVector), when the 
> validity bit is not set, it means the underlying data is null, so the get 
> method should return null.
> However, the current implementation throws an IllegalStateException when 
> NULL_CHECKING_ENABLED is set, or returns an empty array when the flag is 
> clear.
> Maybe the purpose of this design is to be consistent with fixed-width 
> vectors. However, the scenario is different: fixed-width vectors (e.g. 
> IntVector) throw an IllegalStateException, simply because the primitive types 
> are non-nullable.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5973) [Java] Variable width vectors' get methods should return null when the underlying data is null

2019-07-17 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-5973:

Summary: [Java] Variable width vectors' get methods should return null when 
the underlying data is null  (was: [Java] Variable width vectors' get methods 
should return return null when the underlying data is null)

> [Java] Variable width vectors' get methods should return null when the 
> underlying data is null
> ---------------------------------------------------------------------------
>
> Key: ARROW-5973
> URL: https://issues.apache.org/jira/browse/ARROW-5973
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>
> For variable-width vectors (VarCharVector and VarBinaryVector), when the 
> validity bit is not set, it means the underlying data is null, so the get 
> method should return null.
> However, the current implementation throws an IllegalStateException when 
> NULL_CHECKING_ENABLED is set, or returns an empty array when the flag is 
> clear.
> Maybe the purpose of this design is to be consistent with fixed-width 
> vectors. However, the scenario is different: fixed-width vectors (e.g. 
> IntVector) throw an IllegalStateException, simply because the primitive types 
> are non-nullable.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5973) [Java] Variable width vectors' get methods should return return null when the underlying data is null

2019-07-17 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5973:
---------------------------------------------

 Summary: [Java] Variable width vectors' get methods should return 
return null when the underlying data is null
 Key: ARROW-5973
 URL: https://issues.apache.org/jira/browse/ARROW-5973
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For variable-width vectors (VarCharVector and VarBinaryVector), when the 
validity bit is not set, it means the underlying data is null, so the get 
method should return null.

However, the current implementation throws an IllegalStateException when 
NULL_CHECKING_ENABLED is set, or returns an empty array when the flag is clear.

Maybe the purpose of this design is to be consistent with fixed-width vectors. 
However, the scenario is different: fixed-width vectors (e.g. IntVector) throw 
an IllegalStateException, simply because the primitive types are non-nullable.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (ARROW-5815) [Java] Support swap functionality for fixed-width vectors

2019-07-17 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan closed ARROW-5815.
---------------------------------------------
Resolution: Won't Fix

> [Java] Support swap functionality for fixed-width vectors
> ---------------------------------------------------------------------------
>
> Key: ARROW-5815
> URL: https://issues.apache.org/jira/browse/ARROW-5815
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Support swapping data elements for fixed-width vectors.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5970) [Java] Provide pointer to Arrow buffer

2019-07-17 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-5970:

Description: 
Introduce pointer to a memory region within an ArrowBuf.

This pointer will be used as the basis for calculating the hash code within a 
vector, and equality determination.

This data structure can be considered as a "universal value holder".

  was:
Introduce pointer to a memory region within an ArrowBuf.

This pointer will be used as the basis for calculating the hash code within a 
vector, and equality determination.


> [Java] Provide pointer to Arrow buffer
> ---------------------------------------------------------------------------
>
> Key: ARROW-5970
> URL: https://issues.apache.org/jira/browse/ARROW-5970
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Introduce pointer to a memory region within an ArrowBuf.
> This pointer will be used as the basis for calculating the hash code within a 
> vector, and equality determination.
> This data structure can be considered as a "universal value holder".



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5762) [Integration][JS] Integration Tests for Map Type

2019-07-17 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor reassigned ARROW-5762:
---------------------------------------------

Assignee: Paul Taylor

> [Integration][JS] Integration Tests for Map Type
> ---------------------------------------------------------------------------
>
> Key: ARROW-5762
> URL: https://issues.apache.org/jira/browse/ARROW-5762
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration, JavaScript
>Reporter: Bryan Cutler
>Assignee: Paul Taylor
>Priority: Major
> Fix For: 1.0.0
>
>
> ARROW-1279 enabled integration tests for MapType between Java and C++, but 
> JavaScript had to be disabled for the map case due to an error.  Once this is 
> fixed, {{generate_map_case}} could be moved under {{generate_nested_case}} 
> with the other nested types.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5762) [Integration][JS] Integration Tests for Map Type

2019-07-17 Thread Paul Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Taylor updated ARROW-5762:
---------------------------------------------
Fix Version/s: 1.0.0

> [Integration][JS] Integration Tests for Map Type
> ---------------------------------------------------------------------------
>
> Key: ARROW-5762
> URL: https://issues.apache.org/jira/browse/ARROW-5762
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration, JavaScript
>Reporter: Bryan Cutler
>Priority: Major
> Fix For: 1.0.0
>
>
> ARROW-1279 enabled integration tests for MapType between Java and C++, but 
> JavaScript had to be disabled for the map case due to an error.  Once this is 
> fixed, {{generate_map_case}} could be moved under {{generate_nested_case}} 
> with the other nested types.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5762) [Integration][JS] Integration Tests for Map Type

2019-07-17 Thread Paul Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887544#comment-16887544
 ] 

Paul Taylor commented on ARROW-5762:


After reviewing the C++ implementation, I found that the JS version of the Map 
type is not the same (it's essentially a Struct, except that child fields are 
accessed by name instead of by field index). We should absolutely update the JS 
Map implementation before the 1.0 release.


> [Integration][JS] Integration Tests for Map Type
> ---------------------------------------------------------------------------
>
> Key: ARROW-5762
> URL: https://issues.apache.org/jira/browse/ARROW-5762
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration, JavaScript
>Reporter: Bryan Cutler
>Priority: Major
>
> ARROW-1279 enabled integration tests for MapType between Java and C++, but 
> JavaScript had to be disabled for the map case due to an error.  Once this is 
> fixed, {{generate_map_case}} could be moved under {{generate_nested_case}} 
> with the other nested types.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5747) [C++] Better column name and header support in CSV reader

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5747:
---------------------------------------------
Labels: csv pull-request-available  (was: csv)

> [C++] Better column name and header support in CSV reader
> ---------------------------------------------------------------------------
>
> Key: ARROW-5747
> URL: https://issues.apache.org/jira/browse/ARROW-5747
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv, pull-request-available
> Fix For: 1.0.0
>
>
> While working on ARROW-5500, I found a number of issues around the CSV parse 
> options {{header_rows}}:
>  * If header_rows is 0, [the reader 
> errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150]
>  * It's not possible to supply your own column names, as [this 
> TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149]
>  notes. ARROW-4912 allows renaming columns after reading in, which _maybe_ is 
> enough as long as header_rows == 0 doesn't error, but then you can't 
> naturally specify column types in the convert options because that takes a 
> map of column name to type.
>  * If header_rows is > 1, every cell gets turned into a column name, so if 
> header_rows == 2, you get twice the number of column names as columns. This 
> doesn't error, but it leads to unexpected results.
> IMO a better interface would be to have a {{skip_rows}} argument to let you 
> ignore a large header, and a {{column_names}} argument that, if provided, 
> gives the column names. If not provided, the first row after {{skip_rows}} is 
> taken to be the column names. If it were also possible for {{column_names}} 
> to take a {{false}} or {{null}} argument, then we could support the case of 
> autogenerating names when none are provided and there's no header row. 
> Alternatively, we could use a boolean {{header}} argument to govern whether 
> the first (non-skipped) row should be interpreted as column names. (For 
> reference, R's 
> [readr|https://github.com/tidyverse/readr/blob/master/R/read_delim.R#L14-L27] 
> takes TRUE/FALSE/array of strings in one arg; the base 
> [read.csv|https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html]
>  uses separate args for header and col.names. Both have a {{skip}} argument.)
> I don't think there's value in trying to be clever about multirow headers and 
> converting those to column names; if there's meaningful information in a tall 
> header, let the user parse it themselves.
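
For illustration, a sketch of how the proposed interface might read. The
parameter names follow the suggestion above and are assumptions against a
future API, not the 0.14 one; the file name and types are illustrative.

{code:python}
import pyarrow as pa
import pyarrow.csv as pv

# Skip a three-line header, supply our own names, and type one column
# explicitly by name -- possible because the names are now known up front.
read_opts = pv.ReadOptions(skip_rows=3, column_names=["id", "value"])
convert_opts = pv.ConvertOptions(column_types={"id": pa.int64()})
table = pv.read_csv("data.csv", read_options=read_opts,
                    convert_options=convert_opts)
{code}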



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5763) [JS] enable integration tests for MapVector

2019-07-17 Thread Paul Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887479#comment-16887479
 ] 

Paul Taylor commented on ARROW-5763:


After reviewing the C++ implementation, I found that the JS version of the Map 
type is not the same (it's essentially a Struct, except that child fields are 
accessed by name instead of by field index). We should absolutely update the JS 
Map implementation before the 1.0 release.

> [JS] enable integration tests for MapVector
> ---------------------------------------------------------------------------
>
> Key: ARROW-5763
> URL: https://issues.apache.org/jira/browse/ARROW-5763
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: JavaScript
>Reporter: Benjamin Kietzman
>Priority: Minor
>
> As of 0.14, C++ and Java support Map arrays and those implementations pass 
> integration tests. JS has a MapVector and some unit tests for it, but it 
> should be tested against the other implementations as well.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5894) [C++] libgandiva.so.14 is exporting libstdc++ symbols

2019-07-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5894:
---------------------------------------------

Assignee: Zhuo Peng

> [C++] libgandiva.so.14 is exporting libstdc++ symbols
> ---------------------------------------------------------------------------
>
> Key: ARROW-5894
> URL: https://issues.apache.org/jira/browse/ARROW-5894
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Affects Versions: 0.14.0
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> For example:
> $ nm libgandiva.so.14 | grep "once_proxy"
> 018c0a10 T __once_proxy
>  
> many other symbols are also exported which I guess shouldn't be (e.g. LLVM 
> symbols)
>  
> There seems to be no linker script for libgandiva.so (there was, but was 
> never used and got deleted? 
> [https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map]).
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5894) [C++] libgandiva.so.14 is exporting libstdc++ symbols

2019-07-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5894.
---------------------------------------------
Resolution: Fixed

Issue resolved by pull request 4883
[https://github.com/apache/arrow/pull/4883]

> [C++] libgandiva.so.14 is exporting libstdc++ symbols
> ---------------------------------------------------------------------------
>
> Key: ARROW-5894
> URL: https://issues.apache.org/jira/browse/ARROW-5894
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Affects Versions: 0.14.0
>Reporter: Zhuo Peng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> For example:
> $ nm libgandiva.so.14 | grep "once_proxy"
> 018c0a10 T __once_proxy
>  
> many other symbols are also exported which I guess shouldn't be (e.g. LLVM 
> symbols)
>  
> There seems to be no linker script for libgandiva.so (there was, but was 
> never used and got deleted? 
> [https://github.com/apache/arrow/blob/9265fe35b67db93f5af0b47e92e039c637ad5b3e/cpp/src/gandiva/symbols-helpers.map]).
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5964) [C++][Gandiva] Cast double to decimal with rounding returns 0

2019-07-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5964:

Component/s: C++ - Gandiva

> [C++][Gandiva] Cast double to decimal with rounding returns 0
> ---------------------------------------------------------------------------
>
> Key: ARROW-5964
> URL: https://issues.apache.org/jira/browse/ARROW-5964
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Pindikura Ravindra
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Casting 1.15470053838 to decimal(18,0) gives 0; it should return 1. There is 
> a bug in the overflow check after rounding.
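
A quick reference check of the expected rounding, using Python's decimal module
(this asserts only the arithmetic; Gandiva's exact rounding mode is an
assumption here):

{code:python}
from decimal import Decimal, ROUND_HALF_UP

# decimal(18, 0) keeps zero fractional digits, so the cast should round
# 1.15470053838 to 1 -- not overflow-check its way to 0.
v = Decimal("1.15470053838").quantize(Decimal("1"), rounding=ROUND_HALF_UP)
print(v)  # 1
{code}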



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5964) [C++][Gandiva] Cast double to decimal with rounding returns 0

2019-07-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5964.
---------------------------------------------
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4894
[https://github.com/apache/arrow/pull/4894]

> [C++][Gandiva] Cast double to decimal with rounding returns 0
> ---------------------------------------------------------------------------
>
> Key: ARROW-5964
> URL: https://issues.apache.org/jira/browse/ARROW-5964
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Pindikura Ravindra
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Casting 1.15470053838 to decimal(18,0) gives 0; it should return 1. There is 
> a bug in the overflow check after rounding.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5741) [JS] Make numeric vector from functions consistent with TypedArray.from

2019-07-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5741:

Fix Version/s: 1.0.0

> [JS] Make numeric vector from functions consistent with TypedArray.from
> ---------------------------------------------------------------------------
>
> Key: ARROW-5741
> URL: https://issues.apache.org/jira/browse/ARROW-5741
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Described in 
> https://lists.apache.org/thread.html/b648a781cba7f10d5a6072ff2e7dab6c03e2d1f12e359d9261891486@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread H. Vetinari (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352
 ] 

H. Vetinari edited comment on ARROW-5965 at 7/17/19 7:11 PM:
---------------------------------------------

[~wesmckinn]
Would like to provide it, but would only be able to install through conda 
(which has a hole in the firewall).
Unfortunately,
{{
# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible 
with each other:

  - pip -> python[version='>=3.7,<3.8.0a0']
}}

which, I believe, is due to the fact that gdb has [not 
yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for 
python 3.7. (although, just as I was preparing this message, I triggered a 
rerender there and this has caused some further action and the first passing 
3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the 
program also crashes or hangs on v0.14. However, I haven't yet been able to get 
a gdb output. Might need some more reading of the GDB manual...

EDIT: can't seem to format the code-block correctly, sorry.


was (Author: h-vetinari):
[~wesmckinn]
Would like to provide it, but would only be able to install through conda 
(which has a hole in the firewall).
Unfortunately,
{{
{quote}# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible 
with each other:

  - pip -> python[version='>=3.7,<3.8.0a0']{quote}
}}

which, I believe, is due to the fact that gdb has [not 
yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for 
python 3.7. (although, just as I was preparing this message, I triggered a 
rerender there and this has caused some further action and the first passing 
3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the 
program also crashes or hangs on v0.14. However, I haven't yet been able to get 
a gdb output. Might need some more reading of the GDB manual...

> [Python] Regression: segfault when reading hive table with v0.14
> ---------------------------------------------------------------------------
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> ```
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> ```
> Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I 
> cannot report much more, but this is a pretty severe usability restriction. 
> So far the solution is to enforce `pyarrow<0.14`



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887356#comment-16887356
 ] 

Wes McKinney commented on ARROW-5965:
---------------------------------------------

Note I linked this with ARROW-2652, since many users aren't familiar with 
producing gdb backtraces for Python programs.
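
As an aside, when gdb cannot be installed (as in the comment above), the
standard-library faulthandler module at least captures the Python-level stack
on a hard crash; this is an alternative technique, not the gdb workflow that
ARROW-2652 describes, and it won't help if the process is SIGKILLed rather
than segfaulting:

{code:python}
import faulthandler
faulthandler.enable()  # print the Python stack on SIGSEGV/SIGABRT

import pyarrow.parquet as pq
pq.ParquetDataset('hdfs:///data/raw/source/table').read()
{code}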

> [Python] Regression: segfault when reading hive table with v0.14
> ---------------------------------------------------------------------------
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> ```
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> ```
> Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I 
> cannot report much more, but this is a pretty severe usability restriction. 
> So far the solution is to enforce `pyarrow<0.14`



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread H. Vetinari (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352
 ] 

H. Vetinari edited comment on ARROW-5965 at 7/17/19 7:10 PM:
---------------------------------------------

[~wesmckinn]
Would like to provide it, but would only be able to install through conda 
(which has a hole in the firewall).
Unfortunately,
{{
{quote}# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible 
with each other:

  - pip -> python[version='>=3.7,<3.8.0a0']{quote}
}}

which, I believe, is due to the fact that gdb has [not 
yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for 
python 3.7. (although, just as I was preparing this message, I triggered a 
rerender there and this has caused some further action and the first passing 
3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the 
program also crashes or hangs on v0.14. However, I haven't yet been able to get 
a gdb output. Might need some more reading of the GDB manual...


was (Author: h-vetinari):
[~wesmckinn]
Would like to provide it, but would only be able to install through conda 
(which has a hole in the firewall).
Unfortunately,
{{# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible 
with each other:

  - pip -> python[version='>=3.7,<3.8.0a0']}}

which, I believe, is due to the fact that gdb has [not 
yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for 
python 3.7. (although, just as I was preparing this message, I triggered a 
rerender there and this has caused some further action and the first passing 
3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the 
program also crashes or hangs on v0.14. However, I haven't yet been able to get 
a gdb output. Might need some more reading of the GDB manual...

> [Python] Regression: segfault when reading hive table with v0.14
> ---------------------------------------------------------------------------
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> ```
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> ```
> Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I 
> cannot report much more, but this is a pretty severe usability restriction. 
> So far the solution is to enforce `pyarrow<0.14`



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread H. Vetinari (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352
 ] 

H. Vetinari edited comment on ARROW-5965 at 7/17/19 7:09 PM:
---------------------------------------------

[~wesmckinn]
Would like to provide it, but would only be able to install through conda 
(which has a hole in the firewall).
Unfortunately,
{{# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible 
with each other:

  - pip -> python[version='>=3.7,<3.8.0a0']}}

which, I believe, is due to the fact that gdb has [not 
yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for 
python 3.7. (although, just as I was preparing this message, I triggered a 
rerender there and this has caused some further action and the first passing 
3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the 
program also crashes or hangs on v0.14. However, I haven't yet been able to get 
a gdb output. Might need some more reading of the GDB manual...


was (Author: h-vetinari):
[~wesmckinn]
Would like to provide it, but would only be able to install through conda 
(which has a hole in the firewall).
Unfortunately,
```
# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible 
with each other:

  - pip -> python[version='>=3.7,<3.8.0a0']
```
which, I believe, is due to the fact that gdb has [not 
yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for 
python 3.7. (although, just as I was preparing this message, I triggered a 
rerender there and this has caused some further action and the first passing 
3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the 
program also crashes or hangs on v0.14. However, I haven't yet been able to get 
a gdb output. Might need some more reading of the GDB manual...

> [Python] Regression: segfault when reading hive table with v0.14
> ---------------------------------------------------------------------------
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> ```
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> ```
> Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I 
> cannot report much more, but this is a pretty severe usability restriction. 
> So far the solution is to enforce `pyarrow<0.14`



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread H. Vetinari (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352
 ] 

H. Vetinari commented on ARROW-5965:


[~wesmckinn]
Would like to provide it, but would only be able to install through conda 
(which has a hole in the firewall).
Unfortunately,
```
# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible 
with each other:

  - pip -> python[version='>=3.7,<3.8.0a0']
```
which, I believe, is due to the fact that gdb has [not 
yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for 
python 3.7. (although, just as I was preparing this message, I triggered a 
rerender there and this has caused some further action and the first passing 
3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the 
program also crashes or hangs on v0.14. However, I haven't yet been able to get 
a gdb output. Might need some more reading of the GDB manual...

> [Python] Regression: segfault when reading hive table with v0.14
> ---------------------------------------------------------------------------
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> ```
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> ```
> Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I 
> cannot report much more, but this is a pretty severe usability restriction. 
> So far the solution is to enforce `pyarrow<0.14`



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3032) [Python] Clean up NumPy-related C++ headers

2019-07-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887347#comment-16887347
 ] 

Wes McKinney commented on ARROW-3032:
---------------------------------------------

We decided in the PR not to combine any of the headers

> [Python] Clean up NumPy-related C++ headers
> ---------------------------------------------------------------------------
>
> Key: ARROW-3032
> URL: https://issues.apache.org/jira/browse/ARROW-3032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> There are 4 different headers. After ARROW-2814, we can probably eliminate 
> numpy_convert.h and combine with numpy_to_arrow.h



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-3032) [Python] Clean up NumPy-related C++ headers

2019-07-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3032.
---------------------------------------------
Resolution: Fixed

Issue resolved by pull request 4899
[https://github.com/apache/arrow/pull/4899]

> [Python] Clean up NumPy-related C++ headers
> ---------------------------------------------------------------------------
>
> Key: ARROW-3032
> URL: https://issues.apache.org/jira/browse/ARROW-3032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There are 4 different headers. After ARROW-2814, we can probably eliminate 
> numpy_convert.h and combine with numpy_to_arrow.h



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-3032) [Python] Clean up NumPy-related C++ headers

2019-07-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3032:
---------------------------------------------

Assignee: Antoine Pitrou

> [Python] Clean up NumPy-related C++ headers
> ---------------------------------------------------------------------------
>
> Key: ARROW-3032
> URL: https://issues.apache.org/jira/browse/ARROW-3032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> There are 4 different headers. After ARROW-2814, we can probably eliminate 
> numpy_convert.h and combine with numpy_to_arrow.h



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5963) [R] R Appveyor job does not test changes in the C++ library

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5963:
---------------------------------------------
Labels: pull-request-available  (was: )

> [R] R Appveyor job does not test changes in the C++ library
> ---------------------------------------------------------------------------
>
> Key: ARROW-5963
> URL: https://issues.apache.org/jira/browse/ARROW-5963
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Wes McKinney
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> It seems like master is being used 
> https://github.com/apache/arrow/blob/master/ci/PKGBUILD#L42
> I observed this in 
> https://ci.appveyor.com/project/wesm/arrow/builds/26030853/job/7vn8q3l8e24t83jh?fullLog=true
> from this PR
> https://github.com/apache/arrow/pull/4841 for ARROW-5893



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (ARROW-5953) Thrift download ERRORS with apache-arrow-0.14.0

2019-07-17 Thread Brian (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian closed ARROW-5953.

Resolution: Information Provided

This is a user-specific build environment issue unrelated to the 
apache-arrow-0.14.0 codebase. 

> Thrift download ERRORS with apache-arrow-0.14.0
> ---------------------------------------------------------------------------
>
> Key: ARROW-5953
> URL: https://issues.apache.org/jira/browse/ARROW-5953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0, 0.14.0
> Environment: RHEL 6.7
>Reporter: Brian
>Priority: Major
>
> {color:#33}cmake returns:{color}
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either 
> of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> {color:#33}during check for thrift download location.  {color}
> {color:#33}This occurs with a freshly inflated arrow source release tree 
> where cmake is running for the first time. {color}
> {color:#33}Reproducible with the release levels of apache-arrow-0.14.0 
> and  0.13.0.  I tried this 3-5x on 15Jul2019 and see it consistently each 
> time.{color}
> {color:#33}Here's the full context from cmake output: {color}
> {quote}-- Checking for module 'thrift'
> --   No package 'thrift' found
> -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR 
> THRIFT_COMPILER)
> Building Apache Thrift from source
> Downloading Apache Thrift from Traceback (most recent call last):
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 
> 38, in <module>
>     suggested_mirror = get_url('[https://www.apache.org/dyn/]'
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 
> 27, in get_url
>     return requests.get(url).content
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get
>     return request('get', url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request
>     response = session.request(method=method, url=url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in 
> request
>     resp = self.send(prep, **send_kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in 
> send
>     r = adapter.send(request, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in 
> send
>     raise SSLError(e, request=request)
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either 
> of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> {quote}
> {color:#FF} {color}
> {color:#FF}{color:#33}Per Wes' suggestion I ran the following 
> directly:{color}{color}
> {color:#FF}{color:#33}python cpp/build-support/get_apache_mirror.py 
> [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/]
> {color}{color}
> {color:#FF}{color:#33}with this output:{color}{color}
> [https://www-eu.apache.org/dist/]  [http://us.mirrors.quenda.co/apache/]
>  
>  
> *NOTE:* here are the cmake thrift log lines from a build of apache-arrow git 
> clone on 06Jul2019 where cmake/make ran fine.
>  
> {quote}-- Checking for module 'thrift'
> -- No package 'thrift' found
> -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) 
> Building Apache Thrift from source
> Downloading Apache Thrift from 
> http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz
> {quote}
> Currently, cmake runs successfully on this apache-arrow-0.14.0 directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5953) Thrift download ERRORS with apache-arrow-0.14.0

2019-07-17 Thread Brian (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887288#comment-16887288
 ] 

Brian commented on ARROW-5953:
---------------------------------------------

This turns out to be a problem with cert validation when cmake sets up to 
download thrift, due to back-level RHEL 6 and Python 2.6.6 on the internal SAS 
build node where this fails.
We need to update to a newer supported version of Python on these machines.

> Thrift download ERRORS with apache-arrow-0.14.0
> ---------------------------------------------------------------------------
>
> Key: ARROW-5953
> URL: https://issues.apache.org/jira/browse/ARROW-5953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0, 0.14.0
> Environment: RHEL 6.7
>Reporter: Brian
>Priority: Major
>
> {color:#33}cmake returns:{color}
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either 
> of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> {color:#33}during check for thrift download location.  {color}
> {color:#33}This occurs with a freshly inflated arrow source release tree 
> where cmake is running for the first time. {color}
> {color:#33}Reproducible with the release levels of apache-arrow-0.14.0 
> and  0.13.0.  I tried this 3-5x on 15Jul2019 and see it consistently each 
> time.{color}
> {color:#33}Here's the full context from cmake output: {color}
> {quote}-- Checking for module 'thrift'
> --   No package 'thrift' found
> -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR 
> THRIFT_COMPILER)
> Building Apache Thrift from source
> Downloading Apache Thrift from Traceback (most recent call last):
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 
> 38, in <module>
>     suggested_mirror = get_url('[https://www.apache.org/dyn/]'
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 
> 27, in get_url
>     return requests.get(url).content
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get
>     return request('get', url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request
>     response = session.request(method=method, url=url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in 
> request
>     resp = self.send(prep, **send_kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in 
> send
>     r = adapter.send(request, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in 
> send
>     raise SSLError(e, request=request)
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either 
> of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> {quote}
> {color:#FF} {color}
> {color:#FF}{color:#33}Per Wes' suggestion I ran the following 
> directly:{color}{color}
> {color:#FF}{color:#33}python cpp/build-support/get_apache_mirror.py 
> [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/]
> {color}{color}
> {color:#FF}{color:#33}with this output:{color}{color}
> [https://www-eu.apache.org/dist/]  [http://us.mirrors.quenda.co/apache/]
>  
>  
> *NOTE:* here are the cmake thrift log lines from a build of apache-arrow git 
> clone on 06Jul2019 where cmake/make ran fine.
>  
> {quote}-- Checking for module 'thrift'
> -- No package 'thrift' found
> -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) 
> Building Apache Thrift from source
> Downloading Apache Thrift from 
> http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz
> {quote}
> Currently, cmake runs successfully on this apache-arrow-0.14.0 directory.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large UTF32 numpy array to arrow array

2019-07-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5966:
---------------------------------------------
Summary: [Python] Capacity error when converting large UTF32 numpy array to 
arrow array  (was: [Python] Capacity error when converting large string numpy 
array to arrow array)

> [Python] Capacity error when converting large UTF32 numpy array to arrow array
> ---------------------------------------------------------------------------
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
>     li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5972) [Rust] Installing cargo-tarpaulin and generating coverage report takes over 20 minutes

2019-07-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5972:
---------------------------------------------

 Summary: [Rust] Installing cargo-tarpaulin and generating coverage 
report takes over 20 minutes
 Key: ARROW-5972
 URL: https://issues.apache.org/jira/browse/ARROW-5972
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Wes McKinney
 Fix For: 1.0.0


See example build:

https://travis-ci.org/apache/arrow/jobs/558986931

Here, installing cargo-tarpaulin takes 13m32s. Running the coverage report 
takes another 7m40s. 

Given the Travis CI build queue issues we're having, this might be worth 
optimizing or moving to Docker/Buildbot



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5971) [Website] Blog post introducing Arrow Flight

2019-07-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887257#comment-16887257
 ] 

Wes McKinney commented on ARROW-5971:
---------------------------------------------

Yeah, I was thinking to use Python for all the benchmarking, both server and 
client. Good dogfooding exercise.
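
A minimal pyarrow.flight sketch of the kind of localhost throughput test
described here, with the server and client both in Python. Class and method
names follow later pyarrow releases and are assumptions against the 0.14-era
API; the table size and port are illustrative.

{code:python}
import time
import pyarrow as pa
import pyarrow.flight as flight

class BenchServer(flight.FlightServerBase):
    """Serves one in-memory table for throughput measurement."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self.table = pa.table({"x": pa.array(range(1_000_000), pa.int64())})

    def do_get(self, context, ticket):
        # Stream the whole table back for any ticket.
        return flight.RecordBatchStream(self.table)

def measure(location="grpc://localhost:8815"):
    client = flight.connect(location)
    start = time.time()
    table = client.do_get(flight.Ticket(b"")).read_all()
    elapsed = time.time() - start
    print(f"{table.nbytes / elapsed / 1e6:.1f} MB/s")

# Usage (two processes): BenchServer().serve() on one side, measure() on the other.
{code}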

> [Website] Blog post introducing Arrow Flight
> ---------------------------------------------------------------------------
>
> Key: ARROW-5971
> URL: https://issues.apache.org/jira/browse/ARROW-5971
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it's a good time to be bringing more attention to our work over the 
> last 12-14 months on Arrow Flight. 
> I would be OK to draft an initial version of the blog post, and I can 
> circulate to others for review / edit / comment. If there are particular 
> benchmarks you would like to see included, contributing code for that would 
> also be helpful. My plan would be to show tcp throughput on localhost, and 
> node-to-node throughput on a local gigabit ethernet network. I think the 
> localhost throughput is important to show that Flight is a tool that you 
> would want to reach for for faster throughput in high performance networking 
> (e.g. 10/40 gigabit)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5971) [Website] Blog post introducing Arrow Flight

2019-07-17 Thread lidavidm (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887246#comment-16887246
 ] 

lidavidm commented on ARROW-5971:
-

I'd be happy to look over anything. We're also working on a post of our own, 
though that probably won't come in the near future.

It might be interesting to show Python numbers as well - it actually performs 
better than Java in our tests (don't think I can share actual data though).

> [Website] Blog post introducing Arrow Flight
> ---------------------------------------------------------------------------
>
> Key: ARROW-5971
> URL: https://issues.apache.org/jira/browse/ARROW-5971
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it's a good time to be bringing more attention to our work over the 
> last 12-14 months on Arrow Flight. 
> I would be OK to draft an initial version of the blog post, and I can 
> circulate to others for review / edit / comment. If there are particular 
> benchmarks you would like to see included, contributing code for that would 
> also be helpful. My plan would be to show TCP throughput on localhost, and 
> node-to-node throughput on a local gigabit ethernet network. I think the 
> localhost throughput is important to show that Flight is a tool that you 
> would want to reach for to get faster throughput in high-performance 
> networking (e.g. 10/40 gigabit)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887219#comment-16887219
 ] 

Antoine Pitrou commented on ARROW-5966:
---

I think that's because you're going through a NumPy array (which also uses a 
wasteful UTF-32 encoding). Just call pa.array() directly on the source, or use 
another dtype for the NumPy array.
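For illustration, a minimal sketch of both workarounds (assuming pyarrow 0.14; 
the list here is tiny, the real data would be much larger):
{code:python}
import uuid
import numpy as np
import pyarrow as pa

li = [uuid.uuid4().hex for _ in range(1000)]  # small stand-in for the real data

# Option 1: skip NumPy entirely and convert the Python list directly.
parr = pa.array(li)

# Option 2: use a fixed-width bytes dtype (1 byte per character) instead of
# NumPy's 4-bytes-per-character unicode dtype.
narr = np.array(li, dtype='S32')  # uuid4().hex is always 32 ASCII chars
parr2 = pa.array(narr)
{code}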

> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
>     li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887197#comment-16887197
 ] 

Igor Yastrebov commented on ARROW-5966:
---

I tried your example and it worked, but the uuid array still fails. I have 
pyarrow 0.14.0 (from conda-forge).

> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
>     li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887195#comment-16887195
 ] 

Wes McKinney commented on ARROW-5811:
-

Yeah, so we could define a conversion rule to return string or binary, and then 
add an option to set a default conversion rule (where currently we have an 
implicit default of "use type inference").
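A hypothetical sketch of what that option could look like from Python 
({{default_column_type}} is an invented name, not a real pyarrow parameter):
{code:python}
import pyarrow as pa
from pyarrow import csv

# Hypothetical: replace the implicit "use type inference" default with an
# explicit default conversion rule for any column not listed in column_types.
opts = csv.ConvertOptions(default_column_type=pa.string())
table = csv.read_csv('data.csv', convert_options=opts)
{code}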

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSVs as is, with all columns as strings. I don't know the 
> schema of these CSVs, and they will vary as they are provided by users.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files, it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887187#comment-16887187
 ] 

Antoine Pitrou commented on ARROW-5811:
---

We're talking about C++ here. Dynamic typing isn't terribly idiomatic (though 
it's possible using std::variant) :-)

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSVs as is, with all columns as strings. I don't know the 
> schema of these CSVs, and they will vary as they are provided by users.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files, it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-3032) [Python] Clean up NumPy-related C++ headers

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3032:
--
Labels: pull-request-available  (was: )

> [Python] Clean up NumPy-related C++ headers
> ---
>
> Key: ARROW-3032
> URL: https://issues.apache.org/jira/browse/ARROW-3032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> There are 4 different headers. After ARROW-2814, we can probably eliminate 
> numpy_convert.h and combine with numpy_to_arrow.h



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-17 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887183#comment-16887183
 ] 

Neal Richardson commented on ARROW-5811:


In principle, a user could parse the header row of the CSV separately to 
identify the column names, then use that to define {{column_types}} mapping 
each name to string type. So are we just talking about how to facilitate that, 
whether/how to internalize that logic and expose it as a simple argument? Or is 
there something else?

If {{column_types}} didn't have to be a map, maybe that would help. Perhaps it 
could also accept an array of length equal to the number of columns, or a 
single value, in which case it would recycle that type for every column. 
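The first idea works with the existing API; a sketch, assuming a placeholder 
file 'data.csv' with a header row:
{code:python}
import csv as stdcsv
import pyarrow as pa
from pyarrow import csv

# Read the header row ourselves to learn the column names...
with open('data.csv', newline='') as f:
    names = next(stdcsv.reader(f))

# ...then map every name to string so no type inference happens.
opts = csv.ConvertOptions(column_types={name: pa.string() for name in names})
table = csv.read_csv('data.csv', convert_options=opts)
{code}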

 

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSVs as is, with all columns as strings. I don't know the 
> schema of these CSVs, and they will vary as they are provided by users.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files, it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5962) [CI][Python] Do not test manylinux1 wheels in Travis CI

2019-07-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5962.
-
Resolution: Fixed

Issue resolved by pull request 4893
[https://github.com/apache/arrow/pull/4893]

> [CI][Python] Do not test manylinux1 wheels in Travis CI
> ---
>
> Key: ARROW-5962
> URL: https://issues.apache.org/jira/browse/ARROW-5962
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> These can be tested via Crossbow either on demand or nightly. Removing these 
> from Travis CI will save 30 minutes of build time, resulting in better team 
> productivity.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5971) [Website] Blog post introducing Arrow Flight

2019-07-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5971:
---

 Summary: [Website] Blog post introducing Arrow Flight
 Key: ARROW-5971
 URL: https://issues.apache.org/jira/browse/ARROW-5971
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Website
Reporter: Wes McKinney
 Fix For: 1.0.0


I think it's a good time to be bringing more attention to our work over the 
last 12-14 months on Arrow Flight. 

I would be OK to draft an initial version of the blog post, and I can circulate 
to others for review / edit / comment. If there are particular benchmarks you 
would like to see included, contributing code for that would also be helpful. 
My plan would be to show TCP throughput on localhost, and node-to-node 
throughput on a local gigabit ethernet network. I think the localhost 
throughput is important to show that Flight is a tool that you would want to 
reach for to get faster throughput in high-performance networking (e.g. 10/40 
gigabit)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887172#comment-16887172
 ] 

Antoine Pitrou commented on ARROW-5811:
---

The request is for no inference to occur, without knowing the column names or 
the number of columns in advance (so you cannot pass an explicit 
{{column_types}}).

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSVs as is, with all columns as strings. I don't know the 
> schema of these CSVs, and they will vary as they are provided by users.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files, it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-17 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887170#comment-16887170
 ] 

Neal Richardson commented on ARROW-5811:


I think I'm not understanding the problem. What's missing from the 
{{column_types}} we already support? 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L69]

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSVs as is, with all columns as strings. I don't know the 
> schema of these CSVs, and they will vary as they are provided by users.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files, it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-750) [Format] Add LargeBinary and LargeString types

2019-07-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-750:
-
Component/s: C++

> [Format] Add LargeBinary and LargeString types
> --
>
> Key: ARROW-750
> URL: https://issues.apache.org/jira/browse/ARROW-750
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> These are string and binary types that use 64-bit offsets. Java will not need 
> to implement these types for the time being, but they are needed when 
> representing very large datasets in C++



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-4810) [Format][C++] Add "LargeList" type with 64-bit offsets

2019-07-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4810:
--
Priority: Minor  (was: Major)

> [Format][C++] Add "LargeList" type with 64-bit offsets
> --
>
> Key: ARROW-4810
> URL: https://issues.apache.org/jira/browse/ARROW-4810
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Wes McKinney
>Assignee: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Mentioned in https://github.com/apache/arrow/issues/3845



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887166#comment-16887166
 ] 

Wes McKinney commented on ARROW-5811:
-

I think we need to create an abstract C++ type (or similar) that is a 
{{ConversionRule}}. We have other types of conversion rules where we have not 
defined an API yet, for example "timestamp with strptime-like format of 
$FORMAT". Whatever API we have, it needs to be extensible to accommodate new 
kinds of logic.
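Sketched in Python for brevity (the real thing would be a C++ class hierarchy, 
and every name below is hypothetical), the extensible shape might be:
{code:python}
# Hypothetical API shape only; none of these classes exist in Arrow.
class ConversionRule:
    """Decides how a CSV column is converted to an Arrow type."""

class InferType(ConversionRule):
    """Today's implicit default: run type inference."""

class AsType(ConversionRule):
    """Convert to a fixed type, e.g. AsType(pa.string())."""
    def __init__(self, type_):
        self.type_ = type_

class TimestampWithFormat(ConversionRule):
    """Parse as a timestamp with a strptime-like format."""
    def __init__(self, format_):
        self.format_ = format_
{code}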

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSVs as is, with all columns as strings. I don't know the 
> schema of these CSVs, and they will vary as they are provided by users.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files, it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5811) [C++] CSV reader: Ability to not infer column types.

2019-07-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887162#comment-16887162
 ] 

Antoine Pitrou commented on ARROW-5811:
---

[~wesmckinn] [~npr] do you have an idea about a desirable API here?

> [C++] CSV reader: Ability to not infer column types.
> 
>
> Key: ARROW-5811
> URL: https://issues.apache.org/jira/browse/ARROW-5811
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.13.0
> Environment: Ubuntu Xenial
>Reporter: Bogdan Klichuk
>Priority: Minor
>  Labels: csv, csvparser, pyarrow
> Fix For: 1.0.0
>
>
> I'm trying to read CSVs as is, with all columns as strings. I don't know the 
> schema of these CSVs, and they will vary as they are provided by users.
> Right now I'm using pandas.read_csv(dtype=str), which works great, but since 
> the final destination of these CSVs is parquet files, it seems much more 
> efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes 
> available :)
> I tried things like 
> `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda:
>  'string')))` but it doesn't work.
> Maybe I just didn't find something that already exists? :)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (ARROW-5839) [Python] Test manylinux2010 in CI

2019-07-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-5839.
-
Resolution: Won't Fix

> [Python] Test manylinux2010 in CI
> -
>
> Key: ARROW-5839
> URL: https://issues.apache.org/jira/browse/ARROW-5839
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently we test manylinux1 builds on Travis-CI. At some point we should 
> test manylinux2010 builds too.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5839) [Python] Test manylinux2010 in CI

2019-07-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887160#comment-16887160
 ] 

Antoine Pitrou commented on ARROW-5839:
---

manylinux2010 is already tested on Crossbow, so closing this since we aren't 
going to test it on Travis.

> [Python] Test manylinux2010 in CI
> -
>
> Key: ARROW-5839
> URL: https://issues.apache.org/jira/browse/ARROW-5839
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently we test manylinux1 builds on Travis-CI. At some point we should 
> test manylinux2010 builds too.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887156#comment-16887156
 ] 

Antoine Pitrou commented on ARROW-5966:
---

Which version are you using?

> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
>     li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887155#comment-16887155
 ] 

Antoine Pitrou commented on ARROW-5966:
---

I am not seeing this issue:

{code:python}
>>> import pyarrow as pa
>>> import numpy as np
>>> l = []
>>> x = b"x"*1024
>>> for i in range(4 * (1024**2)): l.append(x)
... 
>>> arr = pa.array(l)
>>> arr.type
DataType(binary)
>>> type(arr)
pyarrow.lib.ChunkedArray
>>> len(arr)
4194304
>>> len(arr.chunks)
3
>>> del arr
>>> narr = np.array(l)
>>> narr.nbytes
4294967296
>>> arr = pa.array(narr)
>>> type(arr)
pyarrow.lib.ChunkedArray
>>> len(arr.chunks)
256
>>> len(arr)
4194304
{code}


> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
>     li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5963) [R] R Appveyor job does not test changes in the C++ library

2019-07-17 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-5963:
--

Assignee: Neal Richardson

> [R] R Appveyor job does not test changes in the C++ library
> ---
>
> Key: ARROW-5963
> URL: https://issues.apache.org/jira/browse/ARROW-5963
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Wes McKinney
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> It seems like master is being used 
> https://github.com/apache/arrow/blob/master/ci/PKGBUILD#L42
> I observed this in 
> https://ci.appveyor.com/project/wesm/arrow/builds/26030853/job/7vn8q3l8e24t83jh?fullLog=true
> from this PR
> https://github.com/apache/arrow/pull/4841 for ARROW-5893



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887142#comment-16887142
 ] 

Wes McKinney commented on ARROW-5965:
-

A gdb backtrace would help us a lot. Do you know how to get one?

> [Python] Regression: segfault when reading hive table with v0.14
> 
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> {code:python}
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> {code}
> Since it completely crashes my notebook (or rather, my REPL ends with 
> "Killed"), I cannot report much more, but this is a pretty severe usability 
> restriction. So far the workaround is to enforce {{pyarrow<0.14}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread H. Vetinari (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887129#comment-16887129
 ] 

H. Vetinari commented on ARROW-5965:


Hey Neal,

I tried a couple of times before filing the report: all (~5) invocations on 
0.14 crashed, and all invocations on 0.13 worked. The machine itself has lots 
of memory, so I don't think it's that. I'm not sure I'll be able to pare this 
down to a minimal reproducing Parquet file, but I'll try.

> [Python] Regression: segfault when reading hive table with v0.14
> 
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> {code:python}
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> {code}
> Since it completely crashes my notebook (or rather, my REPL ends with 
> "Killed"), I cannot report much more, but this is a pretty severe usability 
> restriction. So far the workaround is to enforce {{pyarrow<0.14}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887119#comment-16887119
 ] 

Neal Richardson commented on ARROW-5965:


Thanks for the report. A few questions:
 # Is this reproducible if you try again with the same file? (I wonder if 
"Killed" means OOM and not segfault)
 # Could you provide a (preferably as small as possible) Parquet file that 
triggers this behavior? I think we'll need that in order to identify and fix 
any issues.

> [Python] Regression: segfault when reading hive table with v0.14
> 
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> {code:python}
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> {code}
> Since it completely crashes my notebook (or rather, my REPL ends with 
> "Killed"), I cannot report much more, but this is a pretty severe usability 
> restriction. So far the workaround is to enforce {{pyarrow<0.14}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5965:
---
Labels: parquet  (was: )

> [Python] Regression: segfault when reading hive table with v0.14
> 
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> {code:python}
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> {code}
> Since it completely crashes my notebook (or rather, my REPL ends with 
> "Killed"), I cannot report much more, but this is a pretty severe usability 
> restriction. So far the workaround is to enforce {{pyarrow<0.14}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5965:
---
Component/s: Python

> [Python] Regression: segfault when reading hive table with v0.14
> 
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> {code:python}
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> {code}
> Since it completely crashes my notebook (or rather, my REPL ends with 
> "Killed"), I cannot report much more, but this is a pretty severe usability 
> restriction. So far the workaround is to enforce {{pyarrow<0.14}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-5965:
---
Summary: [Python] Regression: segfault when reading hive table with v0.14  
(was: Regression: segfault when reading hive table with v0.14)

> [Python] Regression: segfault when reading hive table with v0.14
> 
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> {code:python}
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> {code}
> Since it completely crashes my notebook (or rather, my REPL ends with 
> "Killed"), I cannot report much more, but this is a pretty severe usability 
> restriction. So far the workaround is to enforce {{pyarrow<0.14}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5747) [C++] Better column name and header support in CSV reader

2019-07-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-5747:
-

Assignee: Antoine Pitrou

> [C++] Better column name and header support in CSV reader
> -
>
> Key: ARROW-5747
> URL: https://issues.apache.org/jira/browse/ARROW-5747
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv
> Fix For: 1.0.0
>
>
> While working on ARROW-5500, I found a number of issues around the CSV parse 
> options {{header_rows}}:
>  * If header_rows is 0, [the reader 
> errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150]
>  * It's not possible to supply your own column names, as [this 
> TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149]
>  notes. ARROW-4912 allows renaming columns after reading in, which _maybe_ is 
> enough as long as header_rows == 0 doesn't error, but then you can't 
> naturally specify column types in the convert options because that takes a 
> map of column name to type.
>  * If header_rows is > 1, every cell gets turned into a column name, so if 
> header_rows == 2, you get twice the number of column names as columns. This 
> doesn't error, but it leads to unexpected results.
> IMO a better interface would be to have a {{skip_rows}} argument to let you 
> ignore a large header, and a {{column_names}} argument that, if provided, 
> gives the column names. If not provided, the first row after {{skip_rows}} is 
> taken to be the column names. If it were also possible for {{column_names}} 
> to take a {{false}} or {{null}} argument, then we could support the case of 
> autogenerating names when none are provided and there's no header row. 
> Alternatively, we could use a boolean {{header}} argument to govern whether 
> the first (non-skipped) row should be interpreted as column names. (For 
> reference, R's 
> [readr|https://github.com/tidyverse/readr/blob/master/R/read_delim.R#L14-L27] 
> takes TRUE/FALSE/array of strings in one arg; the base 
> [read.csv|https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html]
>  uses separate args for header and col.names. Both have a {{skip}} argument.)
> I don't think there's value in trying to be clever about multirow headers and 
> converting those to column names; if there's meaningful information in a tall 
> header, let the user parse it themselves.
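A hypothetical sketch of the proposed interface (the option names follow the 
description above; none of them exist in pyarrow.csv yet):
{code:python}
from pyarrow import csv

# skip_rows and column_names as proposed; both names are hypothetical here.
opts = csv.ReadOptions(skip_rows=5,                   # ignore a large header
                       column_names=['a', 'b', 'c'])  # omit to use the next row
table = csv.read_csv('data.csv', read_options=opts)
{code}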



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5864) [Python] simplify cython wrapping of Result

2019-07-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5864.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4848
[https://github.com/apache/arrow/pull/4848]

> [Python] simplify cython wrapping of Result
> ---
>
> Key: ARROW-5864
> URL: https://issues.apache.org/jira/browse/ARROW-5864
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See answer in https://github.com/cython/cython/issues/3018



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5970) [Java] Provide pointer to Arrow buffer

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5970:
--
Labels: pull-request-available  (was: )

> [Java] Provide pointer to Arrow buffer
> --
>
> Key: ARROW-5970
> URL: https://issues.apache.org/jira/browse/ARROW-5970
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> Introduce a pointer to a memory region within an ArrowBuf.
> This pointer will be used as the basis for calculating hash codes within a 
> vector, and for determining equality.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5970) [Java] Provide pointer to Arrow buffer

2019-07-17 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5970:
---

 Summary: [Java] Provide pointer to Arrow buffer
 Key: ARROW-5970
 URL: https://issues.apache.org/jira/browse/ARROW-5970
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Introduce a pointer to a memory region within an ArrowBuf.

This pointer will be used as the basis for calculating hash codes within a 
vector, and for determining equality.
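The idea, sketched in Python for brevity (the Java type and its name are still 
to be defined in the pull request):
{code:python}
# Conceptual sketch: a "pointer" is (buffer, offset, length); equality and
# hashing are defined over the bytes it covers, not over buffer identity.
class ArrowBufPointer:
    def __init__(self, buf, offset, length):
        self.buf, self.offset, self.length = buf, offset, length

    def _covered_bytes(self):
        return bytes(self.buf[self.offset:self.offset + self.length])

    def __eq__(self, other):
        return self._covered_bytes() == other._covered_bytes()

    def __hash__(self):
        return hash(self._covered_bytes())
{code}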



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5969) [CI] [R] Lint failures

2019-07-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5969.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

Issue resolved by pull request 4895
[https://github.com/apache/arrow/pull/4895]

> [CI] [R] Lint failures
> --
>
> Key: ARROW-5969
> URL: https://issues.apache.org/jira/browse/ARROW-5969
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, R
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5969) [CI] [R] Lint failures

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5969:
--
Labels: pull-request-available  (was: )

> [CI] [R] Lint failures
> --
>
> Key: ARROW-5969
> URL: https://issues.apache.org/jira/browse/ARROW-5969
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, R
>Reporter: Antoine Pitrou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5969) [CI] [R] Lint failures

2019-07-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-5969:
-

Assignee: Antoine Pitrou

> [CI] [R] Lint failures
> --
>
> Key: ARROW-5969
> URL: https://issues.apache.org/jira/browse/ARROW-5969
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, R
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5969) [CI] [R] Lint failures

2019-07-17 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5969:
-

 Summary: [CI] [R] Lint failures
 Key: ARROW-5969
 URL: https://issues.apache.org/jira/browse/ARROW-5969
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration, R
Reporter: Antoine Pitrou






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5968) [Java] Remove duplicate Preconditions check in JDBC adapter

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5968:
--
Labels: pull-request-available  (was: )

> [Java] Remove duplicate Preconditions check in JDBC adapter
> ---
>
> Key: ARROW-5968
> URL: https://issues.apache.org/jira/browse/ARROW-5968
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>
> Some Preconditions checks are duplicated in {{JdbcToArrow#sqlToArrow}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5968) [Java] Remove duplicate Preconditions check in JDBC adapter

2019-07-17 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5968:
-

 Summary: [Java] Remove duplicate Preconditions check in JDBC 
adapter
 Key: ARROW-5968
 URL: https://issues.apache.org/jira/browse/ARROW-5968
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Some Preconditions checks are duplicated in {{JdbcToArrow#sqlToArrow}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5967) [Java] DateUtility#timeZoneList is not correct

2019-07-17 Thread Ji Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu updated ARROW-5967:
--
Component/s: Java

> [Java] DateUtility#timeZoneList is not correct
> --
>
> Key: ARROW-5967
> URL: https://issues.apache.org/jira/browse/ARROW-5967
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>
> Currently, {{timeZoneList}} in {{DateUtility}} is based on Joda time.
> Since we replaced Joda time with Java time in ARROW-2015, this should 
> also be changed.
> {{TimeStampXXTZVectors}} have a timezone member which seems unused now, and 
> their {{getObject}} returns {{Long}} (unlike {{TimeStampXXVectors}}, whose 
> {{getObject}} returns {{LocalDateTime}}). Should they return 
> {{LocalDateTime}} with their timezone?
> Would it be reasonable to do the following:
>  # replace the Joda {{timezoneList}} with a Java {{timezoneList}} in {{DateUtility}}
>  # add a method like {{getLocalDateTimeFromEpochMilli(long epochMillis, String 
> timezone)}} in {{DateUtility}}
>  # (not sure) make {{TimeStampXXTZVectors}} return {{LocalDateTime}}?
> cc [~emkornfi...@gmail.com]  [~bryanc]  [~siddteotia]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5967) [Java] DateUtility#timeZoneList is not correct

2019-07-17 Thread Ji Liu (JIRA)
Ji Liu created ARROW-5967:
-

 Summary: [Java] DateUtility#timeZoneList is not correct
 Key: ARROW-5967
 URL: https://issues.apache.org/jira/browse/ARROW-5967
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Currently, {{timeZoneList}} in {{DateUtility}} is based on Joda time.

Since we replaced Joda time with Java time in ARROW-2015, this should also 
be changed.

{{TimeStampXXTZVectors}} have a timezone member which seems unused now, and 
their {{getObject}} returns {{Long}} (unlike {{TimeStampXXVectors}}, whose 
{{getObject}} returns {{LocalDateTime}}). Should they return {{LocalDateTime}} 
with their timezone?

Would it be reasonable to do the following:
 # replace the Joda {{timezoneList}} with a Java {{timezoneList}} in {{DateUtility}}
 # add a method like {{getLocalDateTimeFromEpochMilli(long epochMillis, String 
timezone)}} in {{DateUtility}}
 # (not sure) make {{TimeStampXXTZVectors}} return {{LocalDateTime}}?

cc [~emkornfi...@gmail.com]  [~bryanc]  [~siddteotia]
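Step 2 sketched in Python for brevity (the real helper would live in Java's 
{{DateUtility}}; the intended semantics mirror 
{{Instant.ofEpochMilli(ms).atZone(ZoneId.of(tz)).toLocalDateTime()}}):
{code:python}
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+, standing in for Java's ZoneId

def get_local_date_time_from_epoch_milli(epoch_millis, tz):
    """Epoch milliseconds + zone name -> wall-clock time in that zone."""
    utc = datetime.fromtimestamp(epoch_millis / 1000.0, tz=timezone.utc)
    # Drop the tzinfo to mirror Java's zone-less LocalDateTime.
    return utc.astimezone(ZoneInfo(tz)).replace(tzinfo=None)

# get_local_date_time_from_epoch_milli(0, "America/New_York")
# -> datetime(1969, 12, 31, 19, 0)
{code}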



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5957) [C++][Gandiva] Implement div function in Gandiva

2019-07-17 Thread Prudhvi Porandla (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-5957:

Description: 
Implement 'div' function for int32, int64, float32, and float64 (gandiva) types.
 div is integer division - divide and return quotient after discarding the 
fractional part.
 The function signature is {{type div(type, type)}}

 

 

  was:
Implement 'div' function for int32, int64, float32, float64, and decimal128 
(gandiva) types.
 div is integer division - divide and return quotient after discarding the 
fractional part.
 The function signature is {{type div(type, type)}}

 

 


> [C++][Gandiva] Implement div function in Gandiva
> 
>
> Key: ARROW-5957
> URL: https://issues.apache.org/jira/browse/ARROW-5957
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Implement 'div' function for int32, int64, float32, and float64 (gandiva) 
> types.
>  div is integer division - divide and return quotient after discarding the 
> fractional part.
>  The function signature is {{type div(type, type)}}
>  
>  
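Reference semantics of the div described above, sketched in Python (note that 
it truncates toward zero, unlike Python's floor-dividing {{//}} on negative 
operands):
{code:python}
import math

def div(a, b):
    # Divide, then discard the fractional part of the quotient (truncate).
    return type(a)(math.trunc(a / b))

assert div(7, 2) == 3
assert div(-7, 2) == -3        # floor division // would give -4
assert div(7.5, 2.0) == 3.0    # float flavours keep the input type
{code}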



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-5966:
--
Description: 
Trying to create a large string array fails with 

ArrowCapacityError: Encoded string length exceeds maximum size (2GB)

instead of creating a chunked array.

 

A reproducible example:
{code:java}
import uuid
import numpy as np
import pyarrow as pa

li = []
for i in range(1):
    li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}
Is it a regression or was it never properly fixed: 
[https://github.com/apache/arrow/issues/1855]?

 

 

  was:
Trying to create a large string array fails with 

ArrowCapacityError: Encoded string length exceeds maximum size (2GB)

instead of creating a chunked array.

 

A reproducible example:
{code:java}
import uuid
import numpy as np
import pyarrow as pa

li = []
for i in range(1):
li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}
Is it a regression or was it never properly fixed: [link 
title|[https://github.com/apache/arrow/issues/1855]]?

 

 


> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
>     li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-5966:
--
External issue URL:   (was: https://github.com/apache/arrow/issues/1855)
   Description: 
Trying to create a large string array fails with 

ArrowCapacityError: Encoded string length exceeds maximum size (2GB)

instead of creating a chunked array.

 

A reproducible example:
{code:java}
import uuid
import numpy as np
import pyarrow as pa

li = []
for i in range(1):
li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}
Is it a regression or was it never properly fixed: [link 
title|[https://github.com/apache/arrow/issues/1855]]?

 

 

  was:
Trying to create a large string array fails with 

ArrowCapacityError: Encoded string length exceeds maximum size (2GB)

instead of creating a chunked array.

 

A reproducible example:
{code:java}
import uuid
import numpy as np
import pyarrow as pa

li = []
for i in range(1):
li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}
Is it a regression or was it never properly fixed?

 


> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
>     li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: [link 
> title|[https://github.com/apache/arrow/issues/1855]]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5965) Regression: segfault when reading hive table with v0.14

2019-07-17 Thread H. Vetinari (JIRA)
H. Vetinari created ARROW-5965:
--

 Summary: Regression: segfault when reading hive table with v0.14
 Key: ARROW-5965
 URL: https://issues.apache.org/jira/browse/ARROW-5965
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: H. Vetinari


I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
installed in a conda env.

The data I'm reading is a hive(-registered) table written as parquet, and with 
v0.13, reading this table (that is partitioned) does not cause any issues.

The code that worked before and now crashes with v0.14 is simply:

{code:python}
import pyarrow.parquet as pq
pq.ParquetDataset('hdfs:///data/raw/source/table').read()
{code}

Since it completely crashes my notebook (or rather, my REPL ends with "Killed"), 
I cannot report much more, but this is a pretty severe usability restriction. So 
far the workaround is to enforce {{pyarrow<0.14}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5964) [C++][Gandiva] Cast double to decimal with rounding returns 0

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5964:
--
Labels: pull-request-available  (was: )

> [C++][Gandiva] Cast double to decimal with rounding returns 0
> -
>
> Key: ARROW-5964
> URL: https://issues.apache.org/jira/browse/ARROW-5964
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Pindikura Ravindra
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: pull-request-available
>
> Casting 1.15470053838 to decimal(18,0) gives 0; it should return 1.
> There is a bug in the overflow check after rounding.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5964) [C++][Gandiva] Cast double to decimal with rounding returns 0

2019-07-17 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5964:
-

 Summary: [C++][Gandiva] Cast double to decimal with rounding 
returns 0
 Key: ARROW-5964
 URL: https://issues.apache.org/jira/browse/ARROW-5964
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


Casting 1.15470053838 to decimal(18,0) gives 0; it should return 1.

There is a bug in the overflow check after rounding.
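For reference, the expected behaviour sketched with Python's decimal module 
(decimal(18,0) means precision 18, scale 0):
{code:python}
from decimal import Decimal

# Any reasonable rounding of 1.15470053838 to scale 0 yields 1, never 0.
d = Decimal("1.15470053838").quantize(Decimal("1"))
assert d == Decimal("1")
{code}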



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5957) [C++][Gandiva] Implement div function in Gandiva

2019-07-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5957:
--
Labels: pull-request-available  (was: )

> [C++][Gandiva] Implement div function in Gandiva
> 
>
> Key: ARROW-5957
> URL: https://issues.apache.org/jira/browse/ARROW-5957
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
>
> Implement 'div' function for int32, int64, float32, float64, and decimal128 
> (gandiva) types.
>  div is integer division - divide and return quotient after discarding the 
> fractional part.
>  The function signature is {{type div(type, type)}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5351) [Rust] Add support for take kernel functions

2019-07-17 Thread Neville Dipale (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-5351.
---
   Resolution: Fixed
Fix Version/s: 0.14.1

Issue resolved by pull request 4330
[https://github.com/apache/arrow/pull/4330]

> [Rust] Add support for take kernel functions
> 
>
> Key: ARROW-5351
> URL: https://issues.apache.org/jira/browse/ARROW-5351
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Similar to https://issues.apache.org/jira/browse/ARROW-772, a take function 
> would give us random access on arrays, which is useful for sorting and 
> (potentially) filtering.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)