[jira] [Assigned] (ARROW-9810) [C++][Parquet] Generalize existing null bitmap generation

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9810:


Assignee: Apache Arrow JIRA Bot  (was: Micah Kornfield)

> [C++][Parquet] Generalize existing null bitmap generation 
> --
>
> Key: ARROW-9810
> URL: https://issues.apache.org/jira/browse/ARROW-9810
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, null bitmap generation assumes only list nesting. Generalize and 
> refactor the existing code, without changing existing functionality, to accept 
> additional parameters to support Arrow nested types:
>  
> 1.  Repeated ancestor def level
> 2.  Null slot usage (for fixed size lists)
>  
>  
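To make the two parameters above concrete, here is a minimal sketch of how Parquet definition levels map to a validity bitmap once a repeated-ancestor def level is taken into account. This is purely illustrative (the actual Arrow code is C++, and `def_levels_to_validity` is a hypothetical name, not an Arrow function); levels below the repeated-ancestor threshold mean the slot does not exist at this nesting depth at all, rather than being null:

```rust
// Illustrative sketch, not Arrow's actual implementation.
// A def level >= max_def_level means the value is present;
// a level below repeated_ancestor_def_level means an ancestor
// repeated field was empty, so no slot exists at this level;
// anything in between is a null slot.
fn def_levels_to_validity(
    def_levels: &[i16],
    max_def_level: i16,
    repeated_ancestor_def_level: i16,
) -> Vec<bool> {
    def_levels
        .iter()
        .filter(|&&d| d >= repeated_ancestor_def_level) // drop absent slots
        .map(|&d| d >= max_def_level) // valid iff fully defined
        .collect()
}

fn main() {
    // Levels 3 and 2 produce slots (valid and null respectively);
    // level 0 is below the repeated-ancestor threshold and produces none.
    let validity = def_levels_to_validity(&[3, 2, 0, 3], 3, 2);
    assert_eq!(validity, vec![true, false, true]);
    println!("{:?}", validity);
}
```

Null-slot usage for fixed-size lists (the second parameter) would additionally force child slots to be materialized even for null parents, which this sketch does not model.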



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9810) [C++][Parquet] Generalize existing null bitmap generation

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9810:


Assignee: Micah Kornfield  (was: Apache Arrow JIRA Bot)

> [C++][Parquet] Generalize existing null bitmap generation 
> --
>
> Key: ARROW-9810
> URL: https://issues.apache.org/jira/browse/ARROW-9810
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, null bitmap generation assumes only list nesting. Generalize and 
> refactor the existing code, without changing existing functionality, to accept 
> additional parameters to support Arrow nested types:
>  
> 1.  Repeated ancestor def level
> 2.  Null slot usage (for fixed size lists)
>  
>  





[jira] [Updated] (ARROW-9810) [C++][Parquet] Generalize existing null bitmap generation

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9810:
--
Labels: pull-request-available  (was: )

> [C++][Parquet] Generalize existing null bitmap generation 
> --
>
> Key: ARROW-9810
> URL: https://issues.apache.org/jira/browse/ARROW-9810
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, null bitmap generation assumes only list nesting. Generalize and 
> refactor the existing code, without changing existing functionality, to accept 
> additional parameters to support Arrow nested types:
>  
> 1.  Repeated ancestor def level
> 2.  Null slot usage (for fixed size lists)
>  
>  





[jira] [Commented] (ARROW-9946) [R] ParquetFileWriter segfaults when `sink` is a string

2020-09-09 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193317#comment-17193317
 ] 

Karl Dunkle Werner commented on ARROW-9946:
---

I'd be happy to, though it might be a few weeks before I have time.

> [R] ParquetFileWriter segfaults when `sink` is a string
> ---
>
> Key: ARROW-9946
> URL: https://issues.apache.org/jira/browse/ARROW-9946
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: Ubuntu 20.04
>Reporter: Karl Dunkle Werner
>Priority: Minor
> Fix For: 2.0.0
>
>
> Hello again! I have another minor R arrow issue.
>  
> The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a 
> "string which is interpreted as a file path". However, when I try to use a 
> string, I get a segfault because the memory isn't mapped.
>  
> Maybe this is a separate request, but it would also be helpful to have 
> documentation for the methods of the writer created by 
> {{ParquetFileWriter$create()}}.
> Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html]
>  
> {code:r}
> library(arrow)
> sch = schema(a = float32())
> writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> #> *** caught segfault ***
> #> address 0x1417d, cause 'memory not mapped'
> #> 
> #> Traceback:
> #> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties)
> #> 2: shared_ptr_is_null(xp)
> #> 3: shared_ptr(ParquetFileWriter, 
> parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties))
> #> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> # This works as expected:
> sink = FileOutputStream$create("test.parquet")
> writer = ParquetFileWriter$create(schema = sch, sink = sink)
> {code}





[jira] [Resolved] (ARROW-9837) [Rust] Add provider for variable

2020-09-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9837.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8135
[https://github.com/apache/arrow/pull/8135]

> [Rust] Add provider for variable
> 
>
> Key: ARROW-9837
> URL: https://issues.apache.org/jira/browse/ARROW-9837
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: qingcheng wu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> {{SELECT @@version;}}
> @@version is a variable; if we want to get its value, we have to obtain it 
> from outside the system.
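A "provider" for such variables could look like the following sketch. The trait and struct names (`VarProvider`, `SystemVars`) are hypothetical and chosen for illustration; this is not the actual DataFusion API, just the shape of the idea: the engine asks an externally supplied provider to resolve `@@`-variables at query time.

```rust
use std::collections::HashMap;

// Hypothetical interface: the query engine resolves system variables
// like @@version through a user-supplied provider instead of knowing
// their values itself.
trait VarProvider {
    fn get_value(&self, name: &str) -> Option<String>;
}

// One possible provider backed by a simple map of variable values.
struct SystemVars {
    vars: HashMap<String, String>,
}

impl VarProvider for SystemVars {
    fn get_value(&self, name: &str) -> Option<String> {
        self.vars.get(name).cloned()
    }
}

fn main() {
    let mut vars = HashMap::new();
    vars.insert("@@version".to_string(), "1.0.0".to_string());
    let provider = SystemVars { vars };
    // A planner evaluating `SELECT @@version` would consult the provider:
    assert_eq!(provider.get_value("@@version"), Some("1.0.0".to_string()));
    assert_eq!(provider.get_value("@@missing"), None);
}
```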





[jira] [Resolved] (ARROW-9944) [Rust] Implement TO_TIMESTAMP function

2020-09-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9944.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8142
[https://github.com/apache/arrow/pull/8142]

> [Rust] Implement TO_TIMESTAMP function
> --
>
> Key: ARROW-9944
> URL: https://issues.apache.org/jira/browse/ARROW-9944
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Implement the TO_TIMESTAMP function, as described in 
> https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit
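As a rough illustration of what such a function does on its strict "happy path" input, here is a stand-alone toy that converts `YYYY-MM-DDTHH:MM:SSZ` strings to Unix epoch seconds. It is not DataFusion's implementation (which operates on Arrow arrays and handles more formats); the `to_timestamp` name and the no-dependency date math are assumptions for the sketch.

```rust
// Toy sketch of a TO_TIMESTAMP happy path: strict RFC 3339 with 'T' and
// 'Z' only, converted to seconds since the Unix epoch. Illustrative, not
// DataFusion's actual function.

// Howard Hinnant's civil-date algorithm: (y, m, d) -> days since 1970-01-01.
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;
    let doy = (153 * (if m > 2 { m - 3 } else { m + 9 }) + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146097 + doe - 719468
}

fn to_timestamp(s: &str) -> Option<i64> {
    let b = s.as_bytes();
    // Accept only the strict 20-byte "YYYY-MM-DDTHH:MM:SSZ" form.
    if b.len() != 20
        || b[4] != b'-' || b[7] != b'-' || b[10] != b'T'
        || b[13] != b':' || b[16] != b':' || b[19] != b'Z'
    {
        return None;
    }
    let num = |r: std::ops::Range<usize>| s[r].parse::<i64>().ok();
    let (y, mo, d) = (num(0..4)?, num(5..7)?, num(8..10)?);
    let (h, mi, sec) = (num(11..13)?, num(14..16)?, num(17..19)?);
    Some(days_from_civil(y, mo, d) * 86400 + h * 3600 + mi * 60 + sec)
}

fn main() {
    assert_eq!(to_timestamp("1970-01-01T00:00:00Z"), Some(0));
    assert_eq!(to_timestamp("2020-09-09T00:00:00Z"), Some(1_599_609_600));
    assert_eq!(to_timestamp("not a timestamp"), None);
}
```

A real implementation would also need the slower fallback paths for other separators and offsets, which is exactly what ARROW-9955 proposes to benchmark.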





[jira] [Updated] (ARROW-9657) [R][Dataset] Expose more FileSystemDatasetFactory options

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9657:
---
Fix Version/s: (was: 2.0.0)

> [R][Dataset] Expose more FileSystemDatasetFactory options
> -
>
> Key: ARROW-9657
> URL: https://issues.apache.org/jira/browse/ARROW-9657
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: dataset
>
> Among the features:
> * ignore_prefixes option
> * Pass an explicit list of files + base directory
> * Exclude invalid files (boolean) option





[jira] [Assigned] (ARROW-9187) [R] Add bindings for arithmetic kernels

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9187:
--

Fix Version/s: (was: 2.0.0)
 Assignee: (was: Neal Richardson)

> [R] Add bindings for arithmetic kernels
> ---
>
> Key: ARROW-9187
> URL: https://issues.apache.org/jira/browse/ARROW-9187
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9946) [R] ParquetFileWriter segfaults when `sink` is a string

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9946:
---
Issue Type: Bug  (was: Improvement)

> [R] ParquetFileWriter segfaults when `sink` is a string
> ---
>
> Key: ARROW-9946
> URL: https://issues.apache.org/jira/browse/ARROW-9946
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: Ubuntu 20.04
>Reporter: Karl Dunkle Werner
>Priority: Minor
>
> Hello again! I have another minor R arrow issue.
>  
> The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a 
> "string which is interpreted as a file path". However, when I try to use a 
> string, I get a segfault because the memory isn't mapped.
>  
> Maybe this is a separate request, but it would also be helpful to have 
> documentation for the methods of the writer created by 
> {{ParquetFileWriter$create()}}.
> Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html]
>  
> {code:r}
> library(arrow)
> sch = schema(a = float32())
> writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> #> *** caught segfault ***
> #> address 0x1417d, cause 'memory not mapped'
> #> 
> #> Traceback:
> #> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties)
> #> 2: shared_ptr_is_null(xp)
> #> 3: shared_ptr(ParquetFileWriter, 
> parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties))
> #> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> # This works as expected:
> sink = FileOutputStream$create("test.parquet")
> writer = ParquetFileWriter$create(schema = sch, sink = sink)
> {code}





[jira] [Updated] (ARROW-9946) [R] ParquetFileWriter segfaults when `sink` is a string

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9946:
---
Fix Version/s: 2.0.0

> [R] ParquetFileWriter segfaults when `sink` is a string
> ---
>
> Key: ARROW-9946
> URL: https://issues.apache.org/jira/browse/ARROW-9946
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: Ubuntu 20.04
>Reporter: Karl Dunkle Werner
>Priority: Minor
> Fix For: 2.0.0
>
>
> Hello again! I have another minor R arrow issue.
>  
> The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a 
> "string which is interpreted as a file path". However, when I try to use a 
> string, I get a segfault because the memory isn't mapped.
>  
> Maybe this is a separate request, but it would also be helpful to have 
> documentation for the methods of the writer created by 
> {{ParquetFileWriter$create()}}.
> Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html]
>  
> {code:r}
> library(arrow)
> sch = schema(a = float32())
> writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> #> *** caught segfault ***
> #> address 0x1417d, cause 'memory not mapped'
> #> 
> #> Traceback:
> #> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties)
> #> 2: shared_ptr_is_null(xp)
> #> 3: shared_ptr(ParquetFileWriter, 
> parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties))
> #> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> # This works as expected:
> sink = FileOutputStream$create("test.parquet")
> writer = ParquetFileWriter$create(schema = sch, sink = sink)
> {code}





[jira] [Resolved] (ARROW-9890) [R] Add zstandard compression codec in macOS build

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9890.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8154
[https://github.com/apache/arrow/pull/8154]

> [R] Add zstandard compression codec in macOS build
> --
>
> Key: ARROW-9890
> URL: https://issues.apache.org/jira/browse/ARROW-9890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.0, 1.0.1
> Environment: macOS
>Reporter: Liang-Bo Wang
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I am using the default macOS build of R arrow 1.0.1 (R 4.0.2) and it doesn't 
> support zstandard/zstd for compression:
> {code:r}
> > arrow::write_parquet(cars, '~/Downloads/cars.parquet', compression = 'zstd')
> Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : 
>   NotImplemented: ZSTD codec support not built
> > arrow::codec_is_available('zstd')
> [1] FALSE
> {code}
> Like ARROW-6960, which added lz4/zstd support on Windows, it'd be great to 
> have zstd support by default on macOS as well.
> I don't know if I have the right knowledge to add such support, but let me 
> know how I can help. Thank you for making this great package!





[jira] [Resolved] (ARROW-9806) [R] More compute kernel bindings

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9806.

Resolution: Fixed

Issue resolved by pull request 8012
[https://github.com/apache/arrow/pull/8012]

> [R] More compute kernel bindings
> 
>
> Key: ARROW-9806
> URL: https://issues.apache.org/jira/browse/ARROW-9806
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-9955) [Rust][DataFusion] Benchmark and potentially optimize the to_timestamp function

2020-09-09 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9955:
--

 Summary: [Rust][DataFusion] Benchmark and potentially optimize the 
to_timestamp function
 Key: ARROW-9955
 URL: https://issues.apache.org/jira/browse/ARROW-9955
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb


As suggested by [~jhorstmann] on 
https://github.com/apache/arrow/pull/8142#discussion_r485468664

> I'd be interested in a benchmark of the to_string function or the kernel, 
> maybe one for the happy case with `T` and `Z` and one for the last fallback. 
> From my experience with Java, [parsing timestamps can be rather 
> slow|https://github.com/jhorstmann/packedtime#benchmarks] and it might be 
> worth writing a specialized implementation.









[jira] [Assigned] (ARROW-9954) [Rust] [DataFusion] Simplify code of aggregate planning

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9954:


Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] Simplify code of aggregate planning
> ---
>
> Key: ARROW-9954
> URL: https://issues.apache.org/jira/browse/ARROW-9954
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-9954) [Rust] [DataFusion] Simplify code of aggregate planning

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9954:


Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] [DataFusion] Simplify code of aggregate planning
> ---
>
> Key: ARROW-9954
> URL: https://issues.apache.org/jira/browse/ARROW-9954
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-9954) [Rust] [DataFusion] Simplify code of aggregate planning

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9954:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Simplify code of aggregate planning
> ---
>
> Key: ARROW-9954
> URL: https://issues.apache.org/jira/browse/ARROW-9954
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-9954) [Rust] [DataFusion] Simplify code of aggregate planning

2020-09-09 Thread Jorge (Jira)
Jorge created ARROW-9954:


 Summary: [Rust] [DataFusion] Simplify code of aggregate planning
 Key: ARROW-9954
 URL: https://issues.apache.org/jira/browse/ARROW-9954
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge
Assignee: Jorge








[jira] [Assigned] (ARROW-9782) [C++][Dataset] Ability to write ".feather" files with IpcFileFormat

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9782:
--

Assignee: Ben Kietzman

> [C++][Dataset] Ability to write ".feather" files with IpcFileFormat
> ---
>
> Key: ARROW-9782
> URL: https://issues.apache.org/jira/browse/ARROW-9782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python, R
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
> Fix For: 2.0.0
>
>
> With the new dataset writing bindings, one can do {{ds.write_dataset(data, 
> format="feather")}} (Python) or {{write_dataset(data, format = "feather")}} 
> (R) to write a dataset to feather files. 
> However, because "feather" is just an alias for the IpcFileFormat, it will 
> currently write all files with the {{.ipc}} extension.   
> I think this can be a bit confusing, since many people will be more familiar 
> with "feather" and expect such an extension. 
> (More generally, ".ipc" is maybe not the best default, since it's not a very 
> descriptive extension. Something like ".arrow" might be better?)
> cc [~npr] [~bkietz]





[jira] [Updated] (ARROW-9782) [C++][Dataset] Ability to write ".feather" files with IpcFileFormat

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9782:
---
Labels: dataset  (was: )

> [C++][Dataset] Ability to write ".feather" files with IpcFileFormat
> ---
>
> Key: ARROW-9782
> URL: https://issues.apache.org/jira/browse/ARROW-9782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python, R
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 2.0.0
>
>
> With the new dataset writing bindings, one can do {{ds.write_dataset(data, 
> format="feather")}} (Python) or {{write_dataset(data, format = "feather")}} 
> (R) to write a dataset to feather files. 
> However, because "feather" is just an alias for the IpcFileFormat, it will 
> currently write all files with the {{.ipc}} extension.   
> I think this can be a bit confusing, since many people will be more familiar 
> with "feather" and expect such an extension. 
> (More generally, ".ipc" is maybe not the best default, since it's not a very 
> descriptive extension. Something like ".arrow" might be better?)
> cc [~npr] [~bkietz]





[jira] [Updated] (ARROW-9676) [R] Error converting Table with nested structs

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9676:
---
Issue Type: Bug  (was: New Feature)

> [R] Error converting Table with nested structs
> --
>
> Key: ARROW-9676
> URL: https://issues.apache.org/jira/browse/ARROW-9676
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.0
> Environment: Amazon Linux, 32gb of ram
>Reporter: Nick DiQuattro
>Priority: Major
>
> When trying to collect data from a dataset based on parquet files with nested 
> structs (a column that is a struct containing 2 nested structs) of moderate 
> size (~1M rows), R crashes. If I add a filter to reduce the number of rows, 
> the data is parsed. If I select out the struct column, it works great (up to 
> 21M rows). My hunch is that the structs resulting in data.frame columns may 
> be the issue. I am curious whether there's a way to have arrow import structs 
> as lists instead of data.frames. Thanks for directing me here, [~neilr8133]!





[jira] [Updated] (ARROW-9903) [R] open_dataset freezes opening feather files on Windows

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9903:
---
Summary: [R] open_dataset freezes opening feather files on Windows  (was: 
[R] open_dataset freezes opening feather files)

> [R] open_dataset freezes opening feather files on Windows
> -
>
> Key: ARROW-9903
> URL: https://issues.apache.org/jira/browse/ARROW-9903
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: Rstudio
>Reporter: Sean Clement
>Priority: Major
>
> Session info:
> {code:java}
> R version 4.0.2 (2020-06-22)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
>  [1] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.1     purrr_0.3.4     readr_1.3.1     tidyr_1.1.1
>  [7] tibble_3.0.3    ggplot2_3.3.2   tidyverse_1.3.0 arrow_1.0.1
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5       cellranger_1.1.0 pillar_1.4.6     compiler_4.0.2   dbplyr_1.4.4     tools_4.0.2
>  [7] bit_1.1-15.2     lubridate_1.7.9  jsonlite_1.7.0   lifecycle_0.2.0  gtable_0.3.0     pkgconfig_2.0.3
> [13] rlang_0.4.7      reprex_0.3.0     cli_2.0.2        DBI_1.1.0        rstudioapi_0.11  haven_2.3.1
> [19] withr_2.2.0      xml2_1.3.2       httr_1.4.2       fs_1.4.1         generics_0.0.2   vctrs_0.3.2
> [25] hms_0.5.3        bit64_0.9-7      grid_4.0.2       tidyselect_1.1.0 glue_1.4.1       R6_2.4.1
> [31] fansi_0.4.1      readxl_1.3.1     modelr_0.1.8     blob_1.2.1       magrittr_1.5     backports_1.1.7
> [37] scales_1.1.1     ellipsis_0.3.1   rvest_0.3.5      assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.6
> [43] munsell_0.5.0    broom_0.7.0      crayon_1.3.4
> {code}
> While cycling through and processing files using open_dataset(..., format = 
> "feather") in R, the function hangs randomly and will not proceed to the next 
> file. The freeze does not occur on the same file each time; additionally, the 
> same call occasionally freezes when used on a single file. 
> When open_dataset hangs, the only way to stop R is via Task Manager, as 
> RStudio becomes totally unresponsive.





[jira] [Updated] (ARROW-9870) [R] Friendly interface for filesystems (S3)

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9870:
---
Fix Version/s: 2.0.0

> [R] Friendly interface for filesystems (S3)
> ---
>
> Key: ARROW-9870
> URL: https://issues.apache.org/jira/browse/ARROW-9870
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> The Filesystem methods don't provide a human-friendly interface for basic 
> operations like ls, mkdir, etc. Since we provide access to S3 and potentially 
> other cloud storage, it would be nice to have simple methods for exploring it.





[jira] [Assigned] (ARROW-9890) [R] Add zstandard compression codec in macOS build

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9890:
--

Assignee: Neal Richardson

> [R] Add zstandard compression codec in macOS build
> --
>
> Key: ARROW-9890
> URL: https://issues.apache.org/jira/browse/ARROW-9890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.0, 1.0.1
> Environment: macOS
>Reporter: Liang-Bo Wang
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I am using the default macOS build of R arrow 1.0.1 (R 4.0.2) and it doesn't 
> support zstandard/zstd for compression:
> {code:r}
> > arrow::write_parquet(cars, '~/Downloads/cars.parquet', compression = 'zstd')
> Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : 
>   NotImplemented: ZSTD codec support not built
> > arrow::codec_is_available('zstd')
> [1] FALSE
> {code}
> Like ARROW-6960, which added lz4/zstd support on Windows, it'd be great to 
> have zstd support by default on macOS as well.
> I don't know if I have the right knowledge to add such support, but let me 
> know how I can help. Thank you for making this great package!





[jira] [Updated] (ARROW-9890) [R] Add zstandard compression codec in macOS build

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9890:
--
Labels: pull-request-available  (was: )

> [R] Add zstandard compression codec in macOS build
> --
>
> Key: ARROW-9890
> URL: https://issues.apache.org/jira/browse/ARROW-9890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.0, 1.0.1
> Environment: macOS
>Reporter: Liang-Bo Wang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I am using the default macOS build of R arrow 1.0.1 (R 4.0.2) and it doesn't 
> support zstandard/zstd for compression:
> {code:r}
> > arrow::write_parquet(cars, '~/Downloads/cars.parquet', compression = 'zstd')
> Error in parquet___arrow___FileWriter__WriteTable(self, table, chunk_size) : 
>   NotImplemented: ZSTD codec support not built
> > arrow::codec_is_available('zstd')
> [1] FALSE
> {code}
> Like ARROW-6960, which added lz4/zstd support on Windows, it'd be great to 
> have zstd support by default on macOS as well.
> I don't know if I have the right knowledge to add such support, but let me 
> know how I can help. Thank you for making this great package!





[jira] [Commented] (ARROW-9946) ParquetFileWriter segfaults when `sink` is a string

2020-09-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193163#comment-17193163
 ] 

Neal Richardson commented on ARROW-9946:


The docs were very recently updated to say that it requires an OutputStream: 
https://ursalabs.org/arrow-r-nightly/reference/ParquetFileWriter.html

But {{ParquetFileWriter$create()}} should validate that. Would you be 
interested in submitting a PR to fix it? You could also add brief docstrings 
for ParquetFileWriter's two methods while you're there; it's all within the 
same 10 lines of parquet.R.

> ParquetFileWriter segfaults when `sink` is a string
> ---
>
> Key: ARROW-9946
> URL: https://issues.apache.org/jira/browse/ARROW-9946
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.1
> Environment: Ubuntu 20.04
>Reporter: Karl Dunkle Werner
>Priority: Minor
>
> Hello again! I have another minor R arrow issue.
>  
> The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a 
> "string which is interpreted as a file path". However, when I try to use a 
> string, I get a segfault because the memory isn't mapped.
>  
> Maybe this is a separate request, but it would also be helpful to have 
> documentation for the methods of the writer created by 
> {{ParquetFileWriter$create()}}.
> Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html]
>  
> {code:r}
> library(arrow)
> sch = schema(a = float32())
> writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> #> *** caught segfault ***
> #> address 0x1417d, cause 'memory not mapped'
> #> 
> #> Traceback:
> #> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties)
> #> 2: shared_ptr_is_null(xp)
> #> 3: shared_ptr(ParquetFileWriter, 
> parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties))
> #> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> # This works as expected:
> sink = FileOutputStream$create("test.parquet")
> writer = ParquetFileWriter$create(schema = sch, sink = sink)
> {code}





[jira] [Updated] (ARROW-9946) [R] ParquetFileWriter segfaults when `sink` is a string

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9946:
---
Summary: [R] ParquetFileWriter segfaults when `sink` is a string  (was: 
ParquetFileWriter segfaults when `sink` is a string)

> [R] ParquetFileWriter segfaults when `sink` is a string
> ---
>
> Key: ARROW-9946
> URL: https://issues.apache.org/jira/browse/ARROW-9946
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.1
> Environment: Ubuntu 20.04
>Reporter: Karl Dunkle Werner
>Priority: Minor
>
> Hello again! I have another minor R arrow issue.
>  
> The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a 
> "string which is interpreted as a file path". However, when I try to use a 
> string, I get a segfault because the memory isn't mapped.
>  
> Maybe this is a separate request, but it would also be helpful to have 
> documentation for the methods of the writer created by 
> {{ParquetFileWriter$create()}}.
> Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html]
>  
> {code:r}
> library(arrow)
> sch = schema(a = float32())
> writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> #> *** caught segfault ***
> #> address 0x1417d, cause 'memory not mapped'
> #> 
> #> Traceback:
> #> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties)
> #> 2: shared_ptr_is_null(xp)
> #> 3: shared_ptr(ParquetFileWriter, 
> parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
> arrow_properties))
> #> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet")
> # This works as expected:
> sink = FileOutputStream$create("test.parquet")
> writer = ParquetFileWriter$create(schema = sch, sink = sink)
> {code}
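[Editor's note] The usual fix for this class of binding bug is to coerce a string sink into an output stream before handing it to the native writer, rather than passing the external pointer through unchecked. A minimal sketch of that coercion pattern in plain Python (hypothetical helper names; not the actual arrow R binding code):

```python
import io
import os
import tempfile

def open_sink(sink):
    """Coerce `sink` into a writable binary stream.

    Sketches what ParquetFileWriter$create() would need to do: a string
    is interpreted as a file path and opened, while an object that is
    already a stream is passed through unchanged.
    """
    if isinstance(sink, (str, os.PathLike)):
        return open(sink, "wb")          # path -> newly opened output stream
    if hasattr(sink, "write"):
        return sink                      # already a stream, pass through
    raise TypeError(f"unsupported sink type: {type(sink).__name__}")

# Both call patterns from the bug report behave the same way:
path = os.path.join(tempfile.mkdtemp(), "test.parquet")
with open_sink(path) as f:               # string path, as the docs promise
    f.write(b"PAR1")
buf = io.BytesIO()
assert open_sink(buf) is buf             # pre-opened stream passes through
```

With this shape, the segfaulting call and the working `FileOutputStream` workaround above become equivalent at the API boundary.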





[jira] [Assigned] (ARROW-9953) [R] Declare minimum version for bit64

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9953:
--

Assignee: Ofek Shilon

> [R] Declare minimum version for bit64
> -
>
> Key: ARROW-9953
> URL: https://issues.apache.org/jira/browse/ARROW-9953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.4, linux, bit64 0.9-5 pre-installed
>Reporter: Ofek Shilon
>Assignee: Ofek Shilon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DESCRIPTION file lists -
> {{ Imports:}}
>  {{  assertthat,}}
>  {{  bit64,}}
>  {{  ...}}
> However for us (R3.4 over linux, with various old package versions installed) 
> the installation fails. The error message is
> {{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
>  {{ERROR: lazy loading failed for package 'arrow'}}
>  We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
> 0.9-7)', as previous versions of bit64 do not export 'str.integer64'.
>  
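[Editor's note] The proposed fix is a one-line change in the package's DESCRIPTION file, using the versioned-dependency syntax from "Writing R Extensions" (surrounding entries abbreviated here):

```text
Imports:
    assertthat,
    bit64 (>= 0.9-7),
    ...
```

R checks the declared minimum at install time, so users on bit64 0.9-5 would get a clear version error instead of the lazy-loading failure above.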





[jira] [Updated] (ARROW-9953) [R] Declare minimum version for bit64

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9953:
--
Labels: pull-request-available  (was: )

> [R] Declare minimum version for bit64
> -
>
> Key: ARROW-9953
> URL: https://issues.apache.org/jira/browse/ARROW-9953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.4, linux, bit64 0.9-5 pre-installed
>Reporter: Ofek Shilon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DESCRIPTION file lists -
> {{ Imports:}}
>  {{  assertthat,}}
>  {{  bit64,}}
>  {{  ...}}
> However for us (R3.4 over linux, with various old package versions installed) 
> the installation fails. The error message is
> {{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
>  {{ERROR: lazy loading failed for package 'arrow'}}
>  We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
> 0.9-7)', as previous versions of bit64 do not export 'str.integer64'.
>  





[jira] [Commented] (ARROW-9616) [C++] Support LTO for R

2020-09-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193126#comment-17193126
 ] 

Neal Richardson commented on ARROW-9616:


I started a branch to turn this on, and then to enable the cmake IPO as well: 
https://github.com/apache/arrow/pull/8153

Even with that on in cmake, it fails: 
[https://github.com/nealrichardson/arrow/runs/1088290997?check_suite_focus=true]

{code}
C:/rtools40/mingw32/bin/g++ -shared -s -static-libgcc -o arrow.dll tmp.def 
array.o array_from_vector.o array_to_vector.o arraydata.o arrowExports.o 
buffer.o chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
expression.o feather.o field.o filesystem.o imports.o io.o json.o memorypool.o 
message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
-L../windows/arrow-1.0.1.9000/lib-8.3.0/i386 
-L../windows/arrow-1.0.1.9000/lib/i386 -lparquet -larrow_dataset -larrow 
-larrow_bundled_dependencies -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt 
-lpsapi -lcrypto -lcrypt32 -laws-cpp-sdk-config -laws-cpp-sdk-transfer 
-laws-cpp-sdk-identity-management -laws-cpp-sdk-cognito-identity 
-laws-cpp-sdk-sts -laws-cpp-sdk-s3 -laws-cpp-sdk-core -laws-c-event-stream 
-laws-checksums -laws-c-common -lUserenv -lversion -lws2_32 -lBcrypt -lWininet 
-lwinhttp -LC:/R/bin/i386 -lR
lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
lto/lto-partition.c:155
libbacktrace could not find executable to open
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.
lto-wrapper.exe: fatal error: C:\rtools40\mingw32\bin\g++.exe returned 1 exit 
status
compilation terminated.
C:/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe:
 error: lto-wrapper failed
collect2.exe: error: ld returned 1 exit status
C:\rtools40\mingw64\bin\nm.exe: 
D:/a/arrow/arrow/check/arrow.Rcheck/00_pkg_src/arrow/src-i386/array.o: plugin 
needed to handle lto object
C:\rtools40\mingw64\bin\nm.exe: 
D:/a/arrow/arrow/check/arrow.Rcheck/00_pkg_src/arrow/src-i386/array_from_vector.o:
 plugin needed to handle lto object
...
C:\rtools40\mingw64\bin\nm.exe: 
D:/a/arrow/arrow/check/arrow.Rcheck/00_pkg_src/arrow/src-i386/threadpool.o: 
plugin needed to handle lto object
no DLL was created
{code}

Googling the "plugin needed to handle lto object" error message, it looks like 
there may be some necessary library missing from Rtools? 
https://stackoverflow.com/questions/32221221/mingw-x64-windows-plugin-needed-to-handle-lto-object/32461766#32461766


> [C++] Support LTO for R
> ---
>
> Key: ARROW-9616
> URL: https://issues.apache.org/jira/browse/ARROW-9616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0
>Reporter: Jeroen
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The next version of R might enable LTO on Windows, i.e. R packages will be 
> compiled with {{-flto}} by default. This works out of the box for most 
> packages, but for arrow, the linker crashes as below. 
> {code}
>  C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 
> -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o 
> array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o 
> chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o imports.o io.o json.o 
> memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
> -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset 
> -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto 
> -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR
>  lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
> lto/lto-partition.c:153
>  libbacktrace could not find executable to open
>  Please submit a full bug report,
>  with preprocessed source if appropriate.
>  See <[https://github.com/r-windows]> for instructions.
>  lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 
> exit status
>  compilation terminated.
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  error: lto-wrapper failed
> {code}
> You can reproduce this in R on Windows for example like so:
> {code:r}
> dir.create("~/.R")
> writeLines("CPPFLAGS=-flto", con = "~/.R/Makevars")
> install.packages("arrow", type = 'source')
> {code}
> I am not sure if this is a bug in the toolchain, or in arrow. I tried with 
> both gcc-8.3.0 and 

[jira] [Assigned] (ARROW-9616) [C++] Support LTO for R

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9616:


Assignee: Apache Arrow JIRA Bot  (was: Neal Richardson)

> [C++] Support LTO for R
> ---
>
> Key: ARROW-9616
> URL: https://issues.apache.org/jira/browse/ARROW-9616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0
>Reporter: Jeroen
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The next version of R might enable LTO on Windows, i.e. R packages will be 
> compiled with {{-flto}} by default. This works out of the box for most 
> packages, but for arrow, the linker crashes as below. 
> {code}
>  C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 
> -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o 
> array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o 
> chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o imports.o io.o json.o 
> memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
> -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset 
> -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto 
> -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR
>  lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
> lto/lto-partition.c:153
>  libbacktrace could not find executable to open
>  Please submit a full bug report,
>  with preprocessed source if appropriate.
>  See <[https://github.com/r-windows]> for instructions.
>  lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 
> exit status
>  compilation terminated.
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  error: lto-wrapper failed
> {code}
> You can reproduce this in R on Windows for example like so:
> {code:r}
> dir.create("~/.R")
> writeLines("CPPFLAGS=-flto", con = "~/.R/Makevars")
> install.packages("arrow", type = 'source')
> {code}
> I am not sure if this is a bug in the toolchain, or in arrow. I tried with 
> both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this 
> issue|https://github.com/cycfi/elements/pull/56] in another project which 
> suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto 
> code with non-lto code (which is the case when we only build the r bindings 
> with lto, but not the c++ library).
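[Editor's note] The cmake route mentioned in the description can be sketched with stock CMake IPO support — `CheckIPOSupported` and the `INTERPROCEDURAL_OPTIMIZATION` target property (illustrative target and source names; not Arrow's actual build files):

```cmake
# Enable IPO/LTO only after verifying the toolchain supports it.
cmake_minimum_required(VERSION 3.9)
project(arrow_bindings_demo CXX)

include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_ok OUTPUT ipo_msg)

add_library(bindings SHARED array.cpp table.cpp)  # placeholder sources
if(ipo_ok)
  set_property(TARGET bindings
               PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
else()
  message(STATUS "IPO/LTO not supported: ${ipo_msg}")
endif()
```

This keeps the C++ library and the bindings consistent when one side is built with `-flto`, which is the mixed lto/non-lto situation the linked issue describes.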





[jira] [Assigned] (ARROW-9616) [C++] Support LTO for R

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9616:


Assignee: Neal Richardson  (was: Apache Arrow JIRA Bot)

> [C++] Support LTO for R
> ---
>
> Key: ARROW-9616
> URL: https://issues.apache.org/jira/browse/ARROW-9616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0
>Reporter: Jeroen
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The next version of R might enable LTO on Windows, i.e. R packages will be 
> compiled with {{-flto}} by default. This works out of the box for most 
> packages, but for arrow, the linker crashes as below. 
> {code}
>  C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 
> -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o 
> array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o 
> chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o imports.o io.o json.o 
> memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
> -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset 
> -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto 
> -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR
>  lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
> lto/lto-partition.c:153
>  libbacktrace could not find executable to open
>  Please submit a full bug report,
>  with preprocessed source if appropriate.
>  See <[https://github.com/r-windows]> for instructions.
>  lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 
> exit status
>  compilation terminated.
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  error: lto-wrapper failed
> {code}
> You can reproduce this in R on Windows for example like so:
> {code:r}
> dir.create("~/.R")
> writeLines("CPPFLAGS=-flto", con = "~/.R/Makevars")
> install.packages("arrow", type = 'source')
> {code}
> I am not sure if this is a bug in the toolchain, or in arrow. I tried with 
> both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this 
> issue|https://github.com/cycfi/elements/pull/56] in another project which 
> suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto 
> code with non-lto code (which is the case when we only build the r bindings 
> with lto, but not the c++ library).





[jira] [Updated] (ARROW-9616) [C++] Support LTO for R

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9616:
--
Labels: pull-request-available  (was: )

> [C++] Support LTO for R
> ---
>
> Key: ARROW-9616
> URL: https://issues.apache.org/jira/browse/ARROW-9616
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 1.0.0
>Reporter: Jeroen
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The next version of R might enable LTO on Windows, i.e. R packages will be 
> compiled with {{-flto}} by default. This works out of the box for most 
> packages, but for arrow, the linker crashes as below. 
> {code}
>  C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 
> -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o 
> array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o 
> chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o imports.o io.o json.o 
> memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o 
> -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset 
> -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto 
> -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR
>  lto1.exe: internal compiler error: in add_symbol_to_partition_1, at 
> lto/lto-partition.c:153
>  libbacktrace could not find executable to open
>  Please submit a full bug report,
>  with preprocessed source if appropriate.
>  See <[https://github.com/r-windows]> for instructions.
>  lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 
> exit status
>  compilation terminated.
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  error: lto-wrapper failed
> {code}
> You can reproduce this in R on Windows for example like so:
> {code:r}
> dir.create("~/.R")
> writeLines("CPPFLAGS=-flto", con = "~/.R/Makevars")
> install.packages("arrow", type = 'source')
> {code}
> I am not sure if this is a bug in the toolchain, or in arrow. I tried with 
> both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this 
> issue|https://github.com/cycfi/elements/pull/56] in another project which 
> suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto 
> code with non-lto code (which is the case when we only build the r bindings 
> with lto, but not the c++ library).





[jira] [Commented] (ARROW-9923) [R] arrow R package build error: illegal instruction

2020-09-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193076#comment-17193076
 ] 

Neal Richardson commented on ARROW-9923:


Why should this be R-specific? IIUC the correct solution is to expose the 
option to turn it off in cmake, and/or detect the capability rather than assume 
it. 

> [R] arrow R package build error: illegal instruction
> 
>
> Key: ARROW-9923
> URL: https://issues.apache.org/jira/browse/ARROW-9923
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: Platform: Linux node 5.8.5-arch1-1 #1 SMP PREEMPT Thu, 
> 27 Aug 2020 18:53:02 + x86_64 GNU/Linux
> CPU: AMD Athlon(tm) II X4 651 Quad-Core Processor (does not support SSE4, 
> AVX/AVX2)
>Reporter: Maxim Terpilowski
>Priority: Major
>  Labels: build
>
> arrow R package (v1.0.1) installing from CRAN results in an error.
> Build log: [https://pastebin.com/Zq1iMTzB]
>  





[jira] [Commented] (ARROW-9953) [R] Declare minimum version for bit64

2020-09-09 Thread Ofek Shilon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193039#comment-17193039
 ] 

Ofek Shilon commented on ARROW-9953:


I really hope adding an explicit Import would be enough:  
[https://stackoverflow.com/questions/32259422/r-package-versioned-dependencies]

 

> [R] Declare minimum version for bit64
> -
>
> Key: ARROW-9953
> URL: https://issues.apache.org/jira/browse/ARROW-9953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.4, linux, bit64 0.9-5 pre-installed
>Reporter: Ofek Shilon
>Priority: Major
>
> The DESCRIPTION file lists -
> {{ Imports:}}
>  {{  assertthat,}}
>  {{  bit64,}}
>  {{  ...}}
> However for us (R3.4 over linux, with various old package versions installed) 
> the installation fails. The error message is
> {{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
>  {{ERROR: lazy loading failed for package 'arrow'}}
>  We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
> 0.9-7)', as previous versions of bit64 do not export 'str.integer64'.
>  





[jira] [Commented] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?

2020-09-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193036#comment-17193036
 ] 

Wes McKinney commented on ARROW-8199:
-

Sure, it would need to be contributed at least as pull request -- depending on 
discussions on the mailing list about the origins of the software, since it was 
externally-developed we might need to obtain a software grant from your 
company. Then there is the question of "productionizing" it -- conforming it to 
the code style of the project and writing unit tests. 

For what it's worth, people have a lot of different expectations when they hear 
"data frame", and realistically we may end up with different kinds of data 
frame interfaces. From what I can see in the code, this is different than what 
I've proposed in 
https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
 but have not been able to do any development on personally. I'm not personally 
able to invest time in this project in the near term unfortunately. 

> [C++] Guidance for creating multi-column sort on Table example?
> ---
>
> Key: ARROW-8199
> URL: https://issues.apache.org/jira/browse/ARROW-8199
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Scott Wilson
>Priority: Minor
>  Labels: c++, newbie
> Attachments: ArrowCsv.cpp, DataFrame.h
>
>
> I'm just coming up to speed with Arrow and am noticing a dearth of examples 
> ... maybe I can help here.
> I'd like to implement multi-column sorting for Tables and just want to ensure 
> that I'm not duplicating existing work or proposing a bad design.
> My thought was to create a Table-specific version of SortToIndices() where 
> you can specify the columns and sort order.
> Then I'd create Array "views" that use the Indices to remap from the original 
> Array values to the values in sorted order. (Original data is not sorted, but 
> could be as a second step.) I noticed some of the array list variants keep 
> offsets, but didn't see anything that supports remapping per a list of 
> indices, but this may just be my oversight?
> Thanks in advance, Scott
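[Editor's note] The design sketched in the description — compute one row permutation from several key columns, then remap each column through it — can be illustrated in miniature with plain Python lists (hypothetical helper names; not Arrow's C++ API):

```python
def sort_to_indices(columns, orders):
    """Return the row permutation that sorts a columnar table.

    `columns` is a list of equal-length key columns and `orders` gives
    "ascending"/"descending" per key. Only indices are produced; the
    original arrays stay untouched, matching the SortToIndices-style
    design proposed in the issue.
    """
    n = len(columns[0])
    indices = list(range(n))
    # Sort by the least-significant key first; Python's sort is stable,
    # so each later pass preserves the order established by earlier keys.
    for col, order in reversed(list(zip(columns, orders))):
        indices.sort(key=lambda i: col[i], reverse=(order == "descending"))
    return indices

def take(column, indices):
    """Materialize a sorted 'view' of one column via the permutation."""
    return [column[i] for i in indices]

a = [2, 1, 2, 1]
b = [9, 8, 7, 6]
idx = sort_to_indices([a, b], ["ascending", "descending"])
assert take(a, idx) == [1, 1, 2, 2]
assert take(b, idx) == [8, 6, 9, 7]
```

The second step is exactly the index-remapping "view" the reporter asks about: applying the permutation per column rather than physically reordering the table.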





[jira] [Commented] (ARROW-9953) [R] Declare minimum version for bit64

2020-09-09 Thread Ofek Shilon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193032#comment-17193032
 ] 

Ofek Shilon commented on ARROW-9953:


[~npr] gladly:  [https://github.com/apache/arrow/pull/8152]

 

> [R] Declare minimum version for bit64
> -
>
> Key: ARROW-9953
> URL: https://issues.apache.org/jira/browse/ARROW-9953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.4, linux, bit64 0.9-5 pre-installed
>Reporter: Ofek Shilon
>Priority: Major
>
> The DESCRIPTION file lists -
> {{ Imports:}}
>  {{  assertthat,}}
>  {{  bit64,}}
>  {{  ...}}
> However for us (R3.4 over linux, with various old package versions installed) 
> the installation fails. The error message is
> {{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
>  {{ERROR: lazy loading failed for package 'arrow'}}
>  We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
> 0.9-7)', as previous versions of bit64 do not export 'str.integer64'.
>  





[jira] [Commented] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?

2020-09-09 Thread Scott Wilson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193028#comment-17193028
 ] 

Scott Wilson commented on ARROW-8199:
-

Is there a way it could become part of the Apache Arrow project?




-- 
Scott B. Wilson
Chairman and Chief Scientist
Persyst Development Corporation
420 Stevens Avenue, Suite 210
Solana Beach, CA 92075


> [C++] Guidance for creating multi-column sort on Table example?
> ---
>
> Key: ARROW-8199
> URL: https://issues.apache.org/jira/browse/ARROW-8199
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Scott Wilson
>Priority: Minor
>  Labels: c++, newbie
> Attachments: ArrowCsv.cpp, DataFrame.h
>
>
> I'm just coming up to speed with Arrow and am noticing a dearth of examples 
> ... maybe I can help here.
> I'd like to implement multi-column sorting for Tables and just want to ensure 
> that I'm not duplicating existing work or proposing a bad design.
> My thought was to create a Table-specific version of SortToIndices() where 
> you can specify the columns and sort order.
> Then I'd create Array "views" that use the Indices to remap from the original 
> Array values to the values in sorted order. (Original data is not sorted, but 
> could be as a second step.) I noticed some of the array list variants keep 
> offsets, but didn't see anything that supports remapping per a list of 
> indices, but this may just be my oversight?
> Thanks in advance, Scott





[jira] [Updated] (ARROW-9953) [R] Declare minimum version for bit64

2020-09-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9953:
---
Summary: [R] Declare minimum version for bit64  (was: R package missing 
dependencies versions)

> [R] Declare minimum version for bit64
> -
>
> Key: ARROW-9953
> URL: https://issues.apache.org/jira/browse/ARROW-9953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.4, linux, bit64 0.9-5 pre-installed
>Reporter: Ofek Shilon
>Priority: Major
>
> The DESCRIPTION file lists -
> {{ Imports:}}
>  {{  assertthat,}}
>  {{  bit64,}}
>  {{  ...}}
> However for us (R3.4 over linux, with various old package versions installed) 
> the installation fails. The error message is
> {{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
>  {{ERROR: lazy loading failed for package 'arrow'}}
>  We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
> 0.9-7)', as previous versions of bit64 do not export 'str.integer64'.
>  





[jira] [Commented] (ARROW-9953) R package missing dependencies versions

2020-09-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193017#comment-17193017
 ] 

Neal Richardson commented on ARROW-9953:


Thanks. Would you like to submit a pull request to add this? [bit64 version 
0.9-5 is from 2015|https://cran.r-project.org/src/contrib/Archive/bit64/], so 
that's probably why we haven't encountered this before.

> R package missing dependencies versions
> ---
>
> Key: ARROW-9953
> URL: https://issues.apache.org/jira/browse/ARROW-9953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.4, linux, bit64 0.9-5 pre-installed
>Reporter: Ofek Shilon
>Priority: Major
>
> The DESCRIPTION file lists -
> {{ Imports:}}
>  {{  assertthat,}}
>  {{  bit64,}}
>  {{  ...}}
> However for us (R3.4 over linux, with various old package versions installed) 
> the installation fails. The error message is
> {{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
>  {{ERROR: lazy loading failed for package 'arrow'}}
>  We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
> 0.9-7)', as previous versions of bit64 do not export 'str.integer64'.
>  





[jira] [Updated] (ARROW-9953) R package missing dependencies versions

2020-09-09 Thread Ofek Shilon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofek Shilon updated ARROW-9953:
---
Description: 
The DESCRIPTION file lists -

{{ Imports:}}
 {{  assertthat,}}
 {{  bit64,}}
 {{  ...}}

However for us (R3.4 over linux, with various old package versions installed) 
the installation fails. The error message is

{{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
 {{ERROR: lazy loading failed for package 'arrow'}}

 We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
0.9-7)', as previous versions of bit64 do not export 'str.integer64'.

 

  was:
The DESCRIPTION file lists -

{{ Imports:}}
{{ {{  assertthat,
{{ {{  bit64,
{{ {{  ...

However for us (R3.4 over linux, with various old package versions installed) 
the installation fails. The error message is

{{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
{{ERROR: lazy loading failed for package 'arrow'}}

 We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
0.9-7)', as previous versions of bit64 do not export 'str.integer64'.

 


> R package missing dependencies versions
> ---
>
> Key: ARROW-9953
> URL: https://issues.apache.org/jira/browse/ARROW-9953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.4, linux, bit64 0.9-5 pre-installed
>Reporter: Ofek Shilon
>Priority: Major
>
> The DESCRIPTION file lists -
> {{ Imports:}}
>  {{  assertthat,}}
>  {{  bit64,}}
>  {{  ...}}
> However for us (R3.4 over linux, with various old package versions installed) 
> the installation fails. The error message is
> {{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
>  {{ERROR: lazy loading failed for package 'arrow'}}
>  We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
> 0.9-7)', as previous versions of bit64 do not export 'str.integer64'.
>  





[jira] [Created] (ARROW-9953) R package missing dependencies versions

2020-09-09 Thread Ofek Shilon (Jira)
Ofek Shilon created ARROW-9953:
--

 Summary: R package missing dependencies versions
 Key: ARROW-9953
 URL: https://issues.apache.org/jira/browse/ARROW-9953
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 1.0.1
 Environment: R3.4, linux, bit64 0.9-5 pre-installed
Reporter: Ofek Shilon


The DESCRIPTION file lists -

{{ Imports:}}
{{ {{  assertthat,
{{ {{  bit64,
{{ {{  ...

However for us (R3.4 over linux, with various old package versions installed) 
the installation fails. The error message is

{{Error : object 'str.integer64' is not exported by 'namespace:bit64'}}
{{ERROR: lazy loading failed for package 'arrow'}}

 We believe that 'bit64' in the Imports section should have been 'bit64 (>= 
0.9-7)', as previous versions of bit64 do not export 'str.integer64'.

 





[jira] [Assigned] (ARROW-7302) [C++] CSV: allow converting a column to a specific dictionary type

2020-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-7302:
-

Assignee: Antoine Pitrou

> [C++] CSV: allow converting a column to a specific dictionary type
> --
>
> Key: ARROW-7302
> URL: https://issues.apache.org/jira/browse/ARROW-7302
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> We can probably limit ourselves to {{dictionary(int32, utf8)}}.





[jira] [Updated] (ARROW-5034) [C#] ArrowStreamWriter should expose synchronous Write methods

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5034:
--
Labels: pull-request-available  (was: )

> [C#] ArrowStreamWriter should expose synchronous Write methods
> --
>
> Key: ARROW-5034
> URL: https://issues.apache.org/jira/browse/ARROW-5034
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 72h
>  Time Spent: 10m
>  Remaining Estimate: 71h 50m
>
> There are times when callers are in a synchronous method and need to write an 
> Arrow stream. However, ArrowStreamWriter (and ArrowFileWriter) only expose 
> WriteAsync methods, which means the caller needs to call the Async method, 
> and then block on the resulting Task.
> Instead, we should also expose Write methods that complete in a synchronous 
> fashion, so the callers are free to choose the sync or async methods as they 
> need.
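The request above — native synchronous methods alongside the async ones, so sync callers need not block on a Task — can be sketched in Python; the class and method names here are invented for illustration and are not the actual C# API:

```python
import asyncio
import io


class BatchWriter:
    """Sketch of a writer exposing both a sync and an async entry point."""

    def __init__(self, stream: io.BytesIO):
        self.stream = stream

    def write(self, payload: bytes) -> None:
        # Native synchronous path: no event loop or Task involved.
        self.stream.write(payload)

    async def write_async(self, payload: bytes) -> None:
        # Async path reuses the same serialization logic.
        self.write(payload)


writer = BatchWriter(io.BytesIO())
writer.write(b"batch-1")                      # synchronous caller
asyncio.run(writer.write_async(b"batch-2"))   # asynchronous caller
print(writer.stream.getvalue())
```

Callers in synchronous code can then pick `write` directly instead of blocking on the result of `write_async`.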



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-9951) [C#] ArrowStreamWriter implement sync WriteRecordBatch

2020-09-09 Thread Steve Suh (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Suh closed ARROW-9951.

Resolution: Duplicate

> [C#] ArrowStreamWriter implement sync WriteRecordBatch
> --
>
> Key: ARROW-9951
> URL: https://issues.apache.org/jira/browse/ARROW-9951
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Steve Suh
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently ArrowStreamWriter only supports async writing record batches.  We 
> are currently using this in .NET for Apache Spark when we write arrow records 
> [here|https://github.com/dotnet/spark/blob/aed9214c10470dba8831726251fb2ed171189ecc/src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs#L396].
>   However, we would prefer to use a sync version instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9951) [C#] ArrowStreamWriter implement sync WriteRecordBatch

2020-09-09 Thread Steve Suh (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192917#comment-17192917
 ] 

Steve Suh commented on ARROW-9951:
--

Duplicate of [ARROW-5034|https://issues.apache.org/jira/browse/ARROW-5034]

> [C#] ArrowStreamWriter implement sync WriteRecordBatch
> --
>
> Key: ARROW-9951
> URL: https://issues.apache.org/jira/browse/ARROW-9951
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Steve Suh
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently ArrowStreamWriter only supports async writing record batches.  We 
> are currently using this in .NET for Apache Spark when we write arrow records 
> [here|https://github.com/dotnet/spark/blob/aed9214c10470dba8831726251fb2ed171189ecc/src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs#L396].
>   However, we would prefer to use a sync version instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails

2020-09-09 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192886#comment-17192886
 ] 

Antoine Pitrou commented on ARROW-9859:
---

[~npr] Is there a test bucket with those characteristics?

> [C++] S3 FileSystemFromUri with special char in secret key fails
> 
>
> Key: ARROW-9859
> URL: https://issues.apache.org/jira/browse/ARROW-9859
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Python
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> S3 Secret access keys can contain special characters like {{/}}. When they do
> 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them 
> (e.g. replace / with %2F)
> 2) When you do escape the special characters, requests that require 
> authorization fail with the message "The request signature we calculated does 
> not match the signature you provided. Check your key and signing method." 
> This may suggest that there's some extra URL encoding/decoding that needs to 
> happen inside.
> I was only able to work around this by generating a new access key that 
> happened not to have special characters.
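The workaround hinted at in the report (percent-encoding the special characters before building the URI) can be sketched with Python's standard library; the credentials below are made up:

```python
from urllib.parse import quote, unquote, urlparse

# Made-up credentials for illustration only.
access_key = "AKIAEXAMPLE"
secret_key = "abc/def+ghi"  # contains '/' which breaks naive URI parsing

# Percent-encode both credential parts before embedding them in the URI.
uri = f"s3://{quote(access_key, safe='')}:{quote(secret_key, safe='')}@mybucket/path"
print(uri)

# The consumer of the URI must decode again before signing requests;
# skipping this second step is one way to get a signature mismatch.
parsed = urlparse(uri)
assert unquote(parsed.password) == secret_key
```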



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?

2020-09-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192873#comment-17192873
 ] 

Wes McKinney commented on ARROW-8199:
-

That's great news. Thanks for attaching the code -- if you apply an open source 
license to it (like Apache 2.0) then others may be able to reuse parts of it. 

> [C++] Guidance for creating multi-column sort on Table example?
> ---
>
> Key: ARROW-8199
> URL: https://issues.apache.org/jira/browse/ARROW-8199
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Scott Wilson
>Priority: Minor
>  Labels: c++, newbie
> Attachments: ArrowCsv.cpp, DataFrame.h
>
>
> I'm just coming up to speed with Arrow and am noticing a dearth of examples 
> ... maybe I can help here.
> I'd like to implement multi-column sorting for Tables and just want to ensure 
> that I'm not duplicating existing work or proposing a bad design.
> My thought was to create a Table-specific version of SortToIndices() where 
> you can specify the columns and sort order.
> Then I'd create Array "views" that use the Indices to remap from the original 
> Array values to the values in sorted order. (Original data is not sorted, but 
> could be as a second step.) I noticed some of the array list variants keep 
> offsets, but didn't see anything that supports remapping per a list of 
> indices, but this may just be my oversight?
> Thanks in advance, Scott
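The proposed design — a SortToIndices-style function over several key columns, plus views that remap values through the resulting indices — can be sketched in plain Python (the column data and helper names are illustrative only):

```python
# Sketch: compute a permutation ("sort indices") over several key
# columns, then remap each column through it without sorting the data.
columns = {
    "city":  ["NYC", "LA", "NYC", "LA"],
    "price": [3, 2, 1, 4],
}


def sort_to_indices(cols, keys):
    # Stable multi-key sort over row indices; the data itself is untouched.
    n = len(next(iter(cols.values())))
    return sorted(range(n), key=lambda i: tuple(cols[k][i] for k in keys))


indices = sort_to_indices(columns, ["city", "price"])

# A "view" remaps values through the indices instead of copying them sorted.
sorted_price = [columns["price"][i] for i in indices]
print(indices)       # [1, 3, 2, 0]
print(sorted_price)  # [2, 4, 1, 3]
```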



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9932) [R] Arrow 1.0.1 R package fails to install on R3.4 over linux

2020-09-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9932:

Summary: [R] Arrow 1.0.1 R package fails to install on R3.4 over linux  
(was: Arrow 1.0.1 R package fails to install on R3.4 over linux)

> [R] Arrow 1.0.1 R package fails to install on R3.4 over linux
> -
>
> Key: ARROW-9932
> URL: https://issues.apache.org/jira/browse/ARROW-9932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R version 3.4.0 (2015-04-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.5 LTS
>Reporter: Ofek Shilon
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> 1. From R (3.4) prompt, we run
> {{> install.packages("arrow")}}
> and it seems to succeed.
> 2. Next we run:
> {{> arrow::install_arrow()}}
> This is the full output:
> {{Installing package into '/opt/R-3.4.0.mkl/library'}}
>  {{(as 'lib' is unspecified)}}
>  {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}}
>  {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}}
>  {{==}}
>  {{downloaded 268 KB}}
> {{installing *source* package 'arrow' ...}}
>  {{** package 'arrow' successfully unpacked and MD5 sums checked}}
>  {{*** No C++ binaries found for ubuntu-14.04}}
>  {{*** Successfully retrieved C++ source}}
>  {{*** Building C++ libraries}}
>  {{ cmake}}
>  {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument 
> (FALSE)}}*{color}
>  {color:#ff} *{{Calls: build_libarrow -> paste}}*{color}
>  {color:#ff} *{{Execution halted}}*{color}
>  {{- NOTE ---}}
>  {{After installation, please run arrow::install_arrow()}}
>  {{for help installing required runtime libraries}}
>  {{-}}
>  {{** libs}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array.cpp -o array.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c buffer.cpp -o buffer.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compression.cpp -o compression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compute.cpp -o compute.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c csv.cpp -o csv.o+}}{{g+ -std=gnu++0x 
> -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c dataset.cpp -o dataset.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c datatype.cpp -o datatype.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c expression.cpp -o expression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c feather.cpp -o feather.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" 

[jira] [Created] (ARROW-9952) [Python] Use pyarrow.dataset writing for pq.write_to_dataset

2020-09-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9952:


 Summary: [Python] Use pyarrow.dataset writing for 
pq.write_to_dataset
 Key: ARROW-9952
 URL: https://issues.apache.org/jira/browse/ARROW-9952
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


Now that ARROW-9658 and ARROW-9893 are in, we can explore using the 
{{pyarrow.dataset}} writing capabilities in {{parquet.write_to_dataset}}.

Similarly to what was done in {{pq.read_table}}, we could initially have a 
keyword to switch between both implementations, eventually defaulting to the 
new datasets one, and then deprecate the old (inefficient) Python implementation.
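The keyword-switch migration described above might look like this sketch; the function and keyword names are hypothetical stand-ins, not the actual pyarrow API:

```python
import warnings


def _legacy_write(table, path):
    # Stand-in for the old pure-Python implementation.
    return f"legacy wrote {table} to {path}"


def _dataset_write(table, path):
    # Stand-in for the new pyarrow.dataset-based implementation.
    return f"dataset wrote {table} to {path}"


def write_to_dataset(table, path, use_legacy_dataset=True):
    # Hypothetical keyword: keep the legacy path as the default at first,
    # warn on use, and later flip the default to the datasets implementation.
    if use_legacy_dataset:
        warnings.warn(
            "the legacy implementation is deprecated; "
            "pass use_legacy_dataset=False to use pyarrow.dataset",
            FutureWarning,
        )
        return _legacy_write(table, path)
    return _dataset_write(table, path)


print(write_to_dataset("t", "/tmp/out", use_legacy_dataset=False))
```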



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7345) [Python] Writing partitions with NaNs silently drops data

2020-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7345:
-
Labels: dataset dataset-parquet-write parquet  (was: dataset parquet)

> [Python] Writing partitions with NaNs silently drops data
> -
>
> Key: ARROW-7345
> URL: https://issues.apache.org/jira/browse/ARROW-7345
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Karl Dunkle Werner
>Priority: Minor
>  Labels: dataset, dataset-parquet-write, parquet
>
> When writing a partitioned table, if the partitioning column has NA values, 
> they're silently dropped. I think it would be helpful if there was a warning. 
> Even better, from my perspective, would be writing out those partitions with 
> a directory name like {{partition_col=NaN}}. 
> Here's a small example where only the {{b = 2}} group is written out and the 
> {{b = NaN}} group is dropped.
> {code:python}
> import os
> import tempfile
> import pyarrow.json
> import pyarrow.parquet
> from pathlib import Path
> # Create a dataset with NaN:
> json_str = """
> {"a": 1, "b": 2}
> {"a": 2, "b": null}
> """
> with tempfile.NamedTemporaryFile() as tf:
>     tf = Path(tf.name)
>     tf.write_text(json_str)
>     table = pyarrow.json.read_json(tf)
> # Write out a partitioned dataset, using the NaN-containing column
> with tempfile.TemporaryDirectory() as out_dir:
>     pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
>     print(os.listdir(out_dir))
>     read_table = pyarrow.parquet.read_table(out_dir)
>     print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")
> # Output:
> #> ['b=2.0']
> #> Wrote out 2 rows, read back 1 row
> {code}
>  
> It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} 
> here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7706) [Python] saving a dataframe to the same partitioned location silently doubles the data

2020-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7706:
-
Labels: dataset dataset-parquet-write parquet  (was: dataset parquet)

> [Python] saving a dataframe to the same partitioned location silently doubles 
> the data
> --
>
> Key: ARROW-7706
> URL: https://issues.apache.org/jira/browse/ARROW-7706
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Tsvika Shapira
>Priority: Major
>  Labels: dataset, dataset-parquet-write, parquet
>
> When a user saves a dataframe:
> {code:python}
> df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')
> {code}
> it will create sub-directories named "{{a=val1}}", "{{a=val2}}" in 
> {{/tmp/table}}. Each of them will contain one (or more?) parquet files with 
> random filenames.
> If a user runs the same command again, the code will use the existing 
> sub-directories, but with different (random) filenames. As a result, any data 
> loaded from this folder will be wrong - each row will be present twice.
> For example, when using
> {code:python}
> df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')  # 
> second time
> df2 = pd.read_parquet('/tmp/table', engine='pyarrow')
> assert len(df1) == len(df2)  # raise an error{code}
> This is a subtle change in the data that can pass unnoticed.
>  
> I would expect the code to prevent the user from using a non-empty 
> destination as a partitioned target; an overwrite flag could also be useful.
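The guard the reporter asks for might be sketched like this, using only the standard library (the directory layout and the {{overwrite}} flag are hypothetical):

```python
import os
import tempfile
import uuid


def write_partition(base_dir, part_value, overwrite=False):
    # Hypothetical guard: refuse to write into an existing, non-empty
    # partition directory unless overwrite is requested explicitly.
    part_dir = os.path.join(base_dir, f"col_a={part_value}")
    if os.path.isdir(part_dir) and os.listdir(part_dir) and not overwrite:
        raise FileExistsError(f"{part_dir} is not empty; pass overwrite=True")
    os.makedirs(part_dir, exist_ok=True)
    # Random file names are what makes the silent doubling possible.
    with open(os.path.join(part_dir, f"{uuid.uuid4().hex}.parquet"), "w") as f:
        f.write("fake parquet bytes")


base = tempfile.mkdtemp()
write_partition(base, "val1")          # first write succeeds
try:
    write_partition(base, "val1")      # second write is rejected
except FileExistsError as exc:
    print("blocked:", exc)
```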



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8296) [C++][Dataset] IpcFileFormat should support writing files with compressed buffers

2020-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8296:
-
Fix Version/s: 2.0.0

> [C++][Dataset] IpcFileFormat should support writing files with compressed 
> buffers
> -
>
> Key: ARROW-8296
> URL: https://issues.apache.org/jira/browse/ARROW-8296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)

2020-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7617:
-
Labels: dataset dataset-parquet-write parquet  (was: dataset parquet)

> [Python] parquet.write_to_dataset creates empty partitions for non-observed 
> dictionary items (categories)
> -
>
> Key: ARROW-7617
> URL: https://issues.apache.org/jira/browse/ARROW-7617
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Vladimir
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: dataset, dataset-parquet-write, parquet
>
> Hello,
> it looks like views with a selection along a categorical column are not 
> properly respected.
> For the following dummy dataframe:
>  
> {code:java}
> d = pd.date_range('1990-01-01', freq='D', periods=1)
> vals = pd.np.random.randn(len(d), 4)
> x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
> x['Year'] = x.index.year
> {code}
> The slice by Year is saved to partitioned parquet properly:
> {code:java}
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_a.parquet', 
> partition_cols=['Year']){code}
> However, if we convert Year to pandas.Categorical - it will save the whole 
> original dataframe, not only slice of Year=1990:
> {code:java}
> x['Year'] = x['Year'].astype('category')
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_b.parquet', 
> partition_cols=['Year'])
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

2020-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-2628:
-
Labels: dataset dataset-parquet-write parquet  (was: dataset parquet)

> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> --
>
> Key: ARROW-2628
> URL: https://issues.apache.org/jira/browse/ARROW-2628
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, dataset-parquet-write, parquet
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should 
> consider strategies for writing very large tables to a partitioned directory 
> scheme. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6579) [Python] Parallel pyarrow.parquet.write_to_dataset

2020-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6579:
-
Labels: dataset dataset-parquet-write parquet  (was: dataset parquet)

> [Python] Parallel pyarrow.parquet.write_to_dataset
> --
>
> Key: ARROW-6579
> URL: https://issues.apache.org/jira/browse/ARROW-6579
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Adam Lippai
>Priority: Major
>  Labels: dataset, dataset-parquet-write, parquet
>
> pyarrow.parquet.write_to_dataset() is single-threaded now and converts the 
> table from/to Pandas. We should lower the dataset writing to C++ (dropping 
> Pandas usage) so it's easier to write the partitioned dataset using multiple 
> threads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9936) [Python] Fix / test relative file paths in pyarrow.parquet

2020-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9936.
---
Resolution: Fixed

Issue resolved by pull request 8131
[https://github.com/apache/arrow/pull/8131]

> [Python] Fix / test relative file paths in pyarrow.parquet
> --
>
> Key: ARROW-9936
> URL: https://issues.apache.org/jira/browse/ARROW-9936
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> It seems that I broke writing parquet to relative file paths in ARROW-9718 
> (again, something similar happened in the pyarrow.dataset reading), so we 
> should fix that and properly test it.
> {code}
> In [3]: pq.write_table(table, "test_relative.parquet")
> ...
> ~/scipy/repos/arrow/python/pyarrow/_fs.pyx in 
> pyarrow._fs.FileSystem.from_uri()
> ArrowInvalid: URI has empty scheme: 'test_relative.parquet'
> {code}
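The failure comes from routing a bare relative path through a URI parser. A stdlib sketch of the kind of dispatch that avoids it (the logic is illustrative, not pyarrow's actual code):

```python
from urllib.parse import urlparse


def resolve_path(path_or_uri):
    # A bare relative path like "test_relative.parquet" has no URI scheme;
    # feeding it to a from_uri-style parser is what raised
    # "URI has empty scheme" in the report above.
    parsed = urlparse(path_or_uri)
    if len(parsed.scheme) <= 1:
        # '' (relative/local path) or a one-letter scheme, which is really
        # a Windows drive prefix like "C:" -> treat as a filesystem path.
        return ("local", path_or_uri)
    return ("uri", path_or_uri)


print(resolve_path("test_relative.parquet"))    # ('local', ...)
print(resolve_path("s3://bucket/key.parquet"))  # ('uri', ...)
```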



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9279) [C++] Implement PrettyPrint for Scalars

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9279:


Assignee: Apache Arrow JIRA Bot  (was: Ji Liu)

> [C++] Implement PrettyPrint for Scalars
> ---
>
> Key: ARROW-9279
> URL: https://issues.apache.org/jira/browse/ARROW-9279
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It would be useful, especially for nested scalar objects.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9279) [C++] Implement PrettyPrint for Scalars

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9279:
--
Labels: pull-request-available  (was: )

> [C++] Implement PrettyPrint for Scalars
> ---
>
> Key: ARROW-9279
> URL: https://issues.apache.org/jira/browse/ARROW-9279
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It would be useful, especially for nested scalar objects.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9271) [R] Preserve data frame metadata in round trip

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9271:


Assignee: Apache Arrow JIRA Bot  (was: Romain Francois)

> [R] Preserve data frame metadata in round trip
> --
>
> Key: ARROW-9271
> URL: https://issues.apache.org/jira/browse/ARROW-9271
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-8899 collects R attributes from the columns of a data.frame (including 
> recursively in data frame columns), adds them to arrow schema metadata, and 
> restores them when pulling a data frame into R. However, any attributes on 
> the outer data frame itself are not preserved. These are lost in some 
> autosplice magic code before {{AddMetadataFromDots}} can collect them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)

2020-09-09 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9906.
---
Resolution: Fixed

Issue resolved by pull request 8141
[https://github.com/apache/arrow/pull/8141]

> [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri 
> (closing NativeFile from S3FileSystem)
> ---
>
> Key: ARROW-9906
> URL: https://issues.apache.org/jira/browse/ARROW-9906
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: filesystem, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/7991#discussion_r481247263 and the 
> commented out test added in that PR.
> It doesn't give any clarifying traceback or crash message, but it segfaults 
> when closing the {{NativeFile}} returned from 
> {{S3FileSystem.open_output_stream}} in {{ParquetWriter.close()}}.
> With {{gdb}} I get a bit more context:
> {code}
> Thread 1 "python" received signal SIGSEGV, Segmentation fault.
> 0x7fa1a39df8f2 in arrow::fs::(anonymous 
> namespace)::ObjectOutputStream::UploadPart (this=0x5619a95ce820, 
> data=0x7fa197641ec0, nbytes=15671, owned_buffer=...) at 
> ../src/arrow/filesystem/s3fs.cc:806
> 806   client_->UploadPartAsync(req, handler);
> {code}
> Another S3 crash in the parquet tests: ARROW-9814 (although it doesn't seem 
> fully related)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9645) [Python] Deprecate the legacy pyarrow.filesystem interface

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9645:


Assignee: Joris Van den Bossche  (was: Apache Arrow JIRA Bot)

> [Python] Deprecate the legacy pyarrow.filesystem interface
> --
>
> Key: ARROW-9645
> URL: https://issues.apache.org/jira/browse/ARROW-9645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{pyarrow.filesystem}} interfaces are dubbed "legacy" (in favor of 
> {{pyarrow.fs}}), but at some point we should actually deprecate (and 
> eventually remove) them. 
> There is probably still some work to do before that: ensure the new 
> filesystems can be used instead in all places (eg in pyarrow.parquet), 
> improve the docs about the new filesystems, ..



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9645) [Python] Deprecate the legacy pyarrow.filesystem interface

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9645:


Assignee: Apache Arrow JIRA Bot  (was: Joris Van den Bossche)

> [Python] Deprecate the legacy pyarrow.filesystem interface
> --
>
> Key: ARROW-9645
> URL: https://issues.apache.org/jira/browse/ARROW-9645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{pyarrow.filesystem}} interfaces are dubbed "legacy" (in favor of 
> {{pyarrow.fs}}), but at some point we should actually deprecate (and 
> eventually remove) them. 
> There is probably still some work to do before that: ensure the new 
> filesystems can be used instead in all places (eg in pyarrow.parquet), 
> improve the docs about the new filesystems, ..



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9645) [Python] Deprecate the legacy pyarrow.filesystem interface

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9645:
--
Labels: pull-request-available  (was: )

> [Python] Deprecate the legacy pyarrow.filesystem interface
> --
>
> Key: ARROW-9645
> URL: https://issues.apache.org/jira/browse/ARROW-9645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The {{pyarrow.filesystem}} interfaces are dubbed "legacy" (in favor of 
> {{pyarrow.fs}}), but at some point we should actually deprecate (and 
> eventually remove) them. 
> There is probably still some work to do before that: ensure the new 
> filesystems can be used instead in all places (eg in pyarrow.parquet), 
> improve the docs about the new filesystems, ..



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9645) [Python] Deprecate the legacy pyarrow.filesystem interface

2020-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-9645:


Assignee: Joris Van den Bossche

> [Python] Deprecate the legacy pyarrow.filesystem interface
> --
>
> Key: ARROW-9645
> URL: https://issues.apache.org/jira/browse/ARROW-9645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>
> The {{pyarrow.filesystem}} interfaces are dubbed "legacy" (in favor of 
> {{pyarrow.fs}}), but at some point we should actually deprecate (and 
> eventually remove) them. 
> There is probably still some work to do before that: ensure the new 
> filesystems can be used instead in all places (eg in pyarrow.parquet), 
> improve the docs about the new filesystems, ..



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-09 Thread Revital Sur (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192722#comment-17192722
 ] 

Revital Sur edited comment on ARROW-9104 at 9/9/20, 8:31 AM:
-

[~apitrou] I appreciate if you could review the PR I created to address this 
issue.


was (Author: eres):
[~apitrou]I appropriate if you could review the PR I created to address this 
issue.

> [C++] Parquet encryption tests should write files to a temporary directory 
> instead of the testing submodule's directory
> ---
>
> Key: ARROW-9104
> URL: https://issues.apache.org/jira/browse/ARROW-9104
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Revital Sur
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If the source directory is not writable, the test raises a permission-denied 
> error:
> [ RUN  ] TestEncryptionConfiguration.UniformEncryption
> unknown file: Failure
> C++ exception with description "IOError: Failed to open local file 
> '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'.
>  Detail: [errno 13] Permission denied" thrown in the test body.
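The fix the PR aims at can be sketched in Python (a minimal illustration under stated assumptions, not the actual C++ change; the function name and data are invented): write test artifacts into a fresh temporary directory rather than into the checked-out, possibly read-only, submodule.

```python
import os
import tempfile

# Hypothetical sketch (names and data invented): write test output into a
# fresh temporary directory, which is always writable, instead of the
# source tree.
def write_encrypted_output(data: bytes) -> str:
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "tmp_uniform_encryption.parquet.encrypted")
    with open(path, "wb") as f:
        f.write(data)
    return path

out = write_encrypted_output(b"\x00" * 16)
```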





[jira] [Commented] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-09 Thread Revital Sur (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192722#comment-17192722
 ] 

Revital Sur commented on ARROW-9104:


[~apitrou] I would appreciate it if you could review the PR I created to 
address this issue.

> [C++] Parquet encryption tests should write files to a temporary directory 
> instead of the testing submodule's directory
> ---
>
> Key: ARROW-9104
> URL: https://issues.apache.org/jira/browse/ARROW-9104
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Revital Sur
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the source directory is not writable, the test raises a permission-denied 
> error:
> [ RUN  ] TestEncryptionConfiguration.UniformEncryption
> unknown file: Failure
> C++ exception with description "IOError: Failed to open local file 
> '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'.
>  Detail: [errno 13] Permission denied" thrown in the test body.





[jira] [Assigned] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9104:


Assignee: Apache Arrow JIRA Bot  (was: Revital Sur)

> [C++] Parquet encryption tests should write files to a temporary directory 
> instead of the testing submodule's directory
> ---
>
> Key: ARROW-9104
> URL: https://issues.apache.org/jira/browse/ARROW-9104
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the source directory is not writable, the test raises a permission-denied 
> error:
> [ RUN  ] TestEncryptionConfiguration.UniformEncryption
> unknown file: Failure
> C++ exception with description "IOError: Failed to open local file 
> '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'.
>  Detail: [errno 13] Permission denied" thrown in the test body.





[jira] [Assigned] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9104:


Assignee: Revital Sur  (was: Apache Arrow JIRA Bot)

> [C++] Parquet encryption tests should write files to a temporary directory 
> instead of the testing submodule's directory
> ---
>
> Key: ARROW-9104
> URL: https://issues.apache.org/jira/browse/ARROW-9104
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Revital Sur
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the source directory is not writable, the test raises a permission-denied 
> error:
> [ RUN  ] TestEncryptionConfiguration.UniformEncryption
> unknown file: Failure
> C++ exception with description "IOError: Failed to open local file 
> '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'.
>  Detail: [errno 13] Permission denied" thrown in the test body.





[jira] [Updated] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9104:
--
Labels: pull-request-available  (was: )

> [C++] Parquet encryption tests should write files to a temporary directory 
> instead of the testing submodule's directory
> ---
>
> Key: ARROW-9104
> URL: https://issues.apache.org/jira/browse/ARROW-9104
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Revital Sur
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the source directory is not writable, the test raises a permission-denied 
> error:
> [ RUN  ] TestEncryptionConfiguration.UniformEncryption
> unknown file: Failure
> C++ exception with description "IOError: Failed to open local file 
> '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'.
>  Detail: [errno 13] Permission denied" thrown in the test body.





[jira] [Commented] (ARROW-9915) [Java] getObject API for temporal types is inconsistent and in some cases incorrect

2020-09-09 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192720#comment-17192720
 ] 

Micah Kornfield commented on ARROW-9915:


[~mjadczak_gsa] Discussing it more sounds fine. For topics like API-breaking 
changes, it is best to have a discussion on the dev mailing list.

> [Java] getObject API for temporal types is inconsistent and in some cases 
> incorrect
> ---
>
> Key: ARROW-9915
> URL: https://issues.apache.org/jira/browse/ARROW-9915
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.16.0, 0.17.0, 
> 0.17.1, 1.0.0
>Reporter: Matt Jadczak
>Priority: Major
>
> It seems that the work which has been tracked in ARROW-2015 and merged in 
> [https://github.com/apache/arrow/pull/2966] to change the return types of the 
> various Time and Date vector types when using the getObject API missed some 
> of the vector types which are temporal and so should return a temporal type, 
> and provided an incorrect implementation for others (some of this was pointed 
> out in the initial PR review, but it seems that it slipped through the cracks 
> and was not addressed before merging).
> Here is a table of the various temporal vector types, what they currently 
> return from getObject, and what they should return, in my opinion (I have 
> included ones in which the implementation is correct for completeness, and 
> coloured them green).
>  
>  
> ||Vector class||Current return type||Proposed return type||Comments||
> |DateDayVector|Integer|LocalDate|Currently returns the raw value of days 
> since epoch, should return the actual date|
> |DateMilliVector|LocalDateTime|LocalDate|This type is supposed to encode a 
> date, not a datetime, so even though epoch millis are used, the result should 
> be a LocalDate|
> |{color:#00875a}DurationVector{color}|{color:#00875a}Duration{color}|{color:#00875a}Duration{color}|{color:#00875a}Correct.{color}|
> |IntervalDayVector|Duration|Period|As per 
> [https://github.com/apache/arrow/blob/master/format/Schema.fbs#L251] , 
> Interval should be a calendar-based datatype, not a time-based one. This is 
> represented in Java by a Period type. However, I note that the Java Period 
> class does not support milliseconds, unlike the Arrow type, which might be 
> why Duration is being returned. Some discussion may be needed on the best way 
> to deal with this.|
> |{color:#00875a}IntervalYearVector{color}|{color:#00875a}Period{color}|{color:#00875a}Period{color}|{color:#00875a}Correct.{color}|
> |TimeMicroVector|Long|LocalTime|Currently returns the raw number of micros, 
> should return the actual time|
> |TimeMilliVector|LocalDateTime|LocalTime|Currently returns a datetime on 
> 1970-01-01 with the correct time component, should just return the time|
> |TimeNanoVector|Long|LocalTime|Currently returns the raw number of nanos, 
> should return the actual time|
> |TimeSecVector|Integer|LocalTime|Currently returns the raw number of seconds, 
> should return the actual time|
> |{color:#00875a}TimeStampMicroVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampMilliVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampNanoVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |{color:#00875a}TimeStampSecVector{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}LocalDateTime{color}|{color:#00875a}Correct.{color}|
> |TimeStampMicroTZVector|Long|ZonedDateTime|Currently returns the underlying 
> micros, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> |TimeStampMilliTZVector|Long|ZonedDateTime|Currently returns the underlying 
> millis, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> |TimeStampNanoTZVector|Long|ZonedDateTime|Currently returns the underlying 
> nanos, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> |TimeStampSecTZVector|Long|ZonedDateTime|Currently returns the underlying 
> seconds, and TZ has to be obtained separately. Should return the actual 
> datetime with timezone|
> I am happy to submit a PR to fix this if there is no other reason for this to 
> persist other than these types being rarely used, but note that these would 
> all be breaking API changes.
>  
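The proposed conversions in the table above can be sketched in Python (a hedged illustration of how the raw payloads would map to temporal objects; the variable names and sample values are invented, and the Java types are approximated with `datetime` equivalents):

```python
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical raw payloads (sample values invented for illustration):
days_since_epoch = 1                # what DateDayVector currently returns
micros_since_midnight = 1_000_000   # what TimeMicroVector currently returns
epoch_seconds = 0                   # what TimeStampSecTZVector currently returns (tz = UTC)

# Proposed object forms per the table above:
as_date = date(1970, 1, 1) + timedelta(days=days_since_epoch)       # ~LocalDate
as_time = (datetime(1970, 1, 1)
           + timedelta(microseconds=micros_since_midnight)).time()  # ~LocalTime
as_zoned = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)   # ~ZonedDateTime
```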





[jira] [Resolved] (ARROW-9814) [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs

2020-09-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-9814.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8139
[https://github.com/apache/arrow/pull/8139]

> [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs
> ---
>
> Key: ARROW-9814
> URL: https://issues.apache.org/jira/browse/ARROW-9814
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This seems to happen with some Minio versions, but is definitely a problem in 
> Arrow.
> The crash message says:
> {code}
> pyarrow/tests/test_parquet.py::test_read_partitioned_directory_s3fs[False] 
> ../src/arrow/dataset/discovery.cc:188:  Check failed: relative.has_value() 
> GetFileInfo() yielded path outside selector.base_dir
> {code}
> The underlying problem is that we pass a full URI for the selector base_dir 
> (such as "s3://bucket/path.") and the S3 filesystem implementation then 
> returns regular paths (such as "bucket/path/foo/bar").
> I think we should do two things:
> 1) error out rather than crash (and include the path strings in the error 
> message), which would be more user-friendly
> 2) fix the issue that full URIs are passed in base_dir
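Point 2 can be sketched in Python (a hypothetical helper, not Arrow's actual code): normalize a full URI so that base_dir is expressed in the same bucket-relative form that GetFileInfo() returns, allowing the two to be compared consistently.

```python
from urllib.parse import urlparse

# Hypothetical helper (not Arrow's actual code): turn a full URI such as
# "s3://bucket/path" into the bucket-relative form ("bucket/path") that
# the S3 filesystem implementation returns for listed files.
def uri_to_s3_path(uri: str) -> str:
    parsed = urlparse(uri)
    return parsed.netloc + parsed.path

base_dir = uri_to_s3_path("s3://bucket/path")
```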





[jira] [Assigned] (ARROW-9680) [Java] Support non-nullable vectors

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9680:


Assignee: Liya Fan  (was: Apache Arrow JIRA Bot)

> [Java] Support non-nullable vectors
> ---
>
> Key: ARROW-9680
> URL: https://issues.apache.org/jira/browse/ARROW-9680
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This issue was first discussed in the ML 
> ([https://lists.apache.org/thread.html/r480387ec9ec822f3ed30e9131109e43874a1c4d18af74ede1a7e41c5%40%3Cdev.arrow.apache.org%3E]),
>  from which we have received some feedback.
> We briefly restate it here:
>  
> 1. Non-nullable vectors are widely used in practice. For example, in a 
> database engine, a column can be declared as not null, so it cannot contain 
> null values.
> 2. Non-nullable vectors have significant performance advantages compared with 
> their nullable counterparts, such as:
>   1) the memory space of the validity buffer can be saved;
>   2) manipulation of the validity buffer can be bypassed;
>   3) some if-else branches can be replaced by sequential instructions (by the 
> JIT compiler), leading to higher throughput in the CPU pipeline. 
>  
> We open this Jira to facilitate further discussions, and we may provide a 
> sample PR, which will help us make a clearer decision. 
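The performance argument in point 2 can be sketched with a toy Python example (not Arrow code; data invented): with a guaranteed non-null column, the per-element validity check disappears and accumulation becomes straight-line code.

```python
# Toy illustration of the nullable vs. non-nullable fast path (invented data).
values = [1, 2, 3, 4]
validity = [True, True, False, True]  # validity bitmap of a nullable column

# nullable path: a branch per element to consult the validity buffer
nullable_sum = sum(v for v, ok in zip(values, validity) if ok)

# non-nullable path: straight-line accumulation, no validity buffer at all
non_nullable_sum = sum(values)
```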





[jira] [Assigned] (ARROW-9078) [C++] Parquet writing of extension type with nested storage type fails

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9078:


Assignee: Apache Arrow JIRA Bot  (was: Antoine Pitrou)

> [C++] Parquet writing of extension type with nested storage type fails
> --
>
> Key: ARROW-9078
> URL: https://issues.apache.org/jira/browse/ARROW-9078
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A reproducer in Python:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> class MyStructType(pa.PyExtensionType):
>
>     def __init__(self):
>         pa.PyExtensionType.__init__(
>             self, pa.struct([('left', pa.int64()), ('right', pa.int64())]))
>
>     def __reduce__(self):
>         return MyStructType, ()
>
> struct_array = pa.StructArray.from_arrays(
>     [
>         pa.array([0, 1], type="int64", from_pandas=True),
>         pa.array([1, 2], type="int64", from_pandas=True),
>     ],
>     names=["left", "right"],
> )
>
> # works
> table = pa.table({'a': struct_array})
> pq.write_table(table, "test_struct.parquet")
>
> # doesn't work
> mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
> table = pa.table({'a': mystruct_array})
> pq.write_table(table, "test_struct.parquet")
> {code}
> Writing the simple StructArray nowadays works (and reading it back in as 
> well). 
> But when the struct array is the storage array of an ExtensionType, it fails 
> with the following error:
> {code}
> ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2
> {code}





[jira] [Assigned] (ARROW-9078) [C++] Parquet writing of extension type with nested storage type fails

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9078:


Assignee: Antoine Pitrou  (was: Apache Arrow JIRA Bot)

> [C++] Parquet writing of extension type with nested storage type fails
> --
>
> Key: ARROW-9078
> URL: https://issues.apache.org/jira/browse/ARROW-9078
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A reproducer in Python:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> class MyStructType(pa.PyExtensionType):
>
>     def __init__(self):
>         pa.PyExtensionType.__init__(
>             self, pa.struct([('left', pa.int64()), ('right', pa.int64())]))
>
>     def __reduce__(self):
>         return MyStructType, ()
>
> struct_array = pa.StructArray.from_arrays(
>     [
>         pa.array([0, 1], type="int64", from_pandas=True),
>         pa.array([1, 2], type="int64", from_pandas=True),
>     ],
>     names=["left", "right"],
> )
>
> # works
> table = pa.table({'a': struct_array})
> pq.write_table(table, "test_struct.parquet")
>
> # doesn't work
> mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
> table = pa.table({'a': mystruct_array})
> pq.write_table(table, "test_struct.parquet")
> {code}
> Writing the simple StructArray nowadays works (and reading it back in as 
> well). 
> But when the struct array is the storage array of an ExtensionType, it fails 
> with the following error:
> {code}
> ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2
> {code}





[jira] [Assigned] (ARROW-7871) [Python] Expose more compute kernels

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-7871:


Assignee: Apache Arrow JIRA Bot  (was: Andrew Wieteska)

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently only the sum kernel is exposed.
> Alternatively, consider deprecating or removing the pyarrow.compute module 
> and binding the compute kernels as methods instead.
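The "bind kernels as methods" alternative can be sketched in plain Python (names invented; this is not the pyarrow API): a free compute kernel is attached to the array class so users call a method rather than a module-level function.

```python
# Sketch of binding a free compute kernel as an array method (invented
# names; not the pyarrow API).
class Array:
    def __init__(self, values):
        self.values = values

def sum_kernel(arr):
    # toy kernel operating on the array's values
    return sum(arr.values)

# bind the free function as a method on the class
Array.sum = sum_kernel

arr = Array([1, 2, 3])
```

With this binding, `arr.sum()` dispatches to `sum_kernel(arr)`, which is the ergonomic shape the ticket suggests instead of a separate compute module.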





[jira] [Assigned] (ARROW-7871) [Python] Expose more compute kernels

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-7871:


Assignee: Andrew Wieteska  (was: Apache Arrow JIRA Bot)

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently only the sum kernel is exposed.
> Alternatively, consider deprecating or removing the pyarrow.compute module 
> and binding the compute kernels as methods instead.





[jira] [Assigned] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9950:


Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] Allow UDF usage without registry
> 
>
> Key: ARROW-9950
> URL: https://issues.apache.org/jira/browse/ARROW-9950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This functionality is relevant only for the DataFrame API.
> Sometimes a UDF is declared during planning, and the API is much more 
> expressive when the user does not have to access the registry at all to plan 
> the UDF.





[jira] [Assigned] (ARROW-9949) [C++] Generalize Decimal128::FromString for reuse in Decimal256

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9949:


Assignee: Apache Arrow JIRA Bot  (was: Mingyu Zhong)

> [C++] Generalize Decimal128::FromString for reuse in Decimal256
> ---
>
> Key: ARROW-9949
> URL: https://issues.apache.org/jira/browse/ARROW-9949
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mingyu Zhong
>Assignee: Apache Arrow JIRA Bot
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9906:


Assignee: Antoine Pitrou  (was: Apache Arrow JIRA Bot)

> [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri 
> (closing NativeFile from S3FileSystem)
> ---
>
> Key: ARROW-9906
> URL: https://issues.apache.org/jira/browse/ARROW-9906
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: filesystem, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/7991#discussion_r481247263 and the 
> commented out test added in that PR.
> It doesn't give any clarifying traceback or crash message, but it segfaults 
> when closing the {{NativeFile}} returned from 
> {{S3FileSystem.open_output_stream}} in {{ParquetWriter.close()}}.
> With {{gdb}} I get a bit more context:
> {code}
> Thread 1 "python" received signal SIGSEGV, Segmentation fault.
> 0x7fa1a39df8f2 in arrow::fs::(anonymous 
> namespace)::ObjectOutputStream::UploadPart (this=0x5619a95ce820, 
> data=0x7fa197641ec0, nbytes=15671, owned_buffer=...) at 
> ../src/arrow/filesystem/s3fs.cc:806
> 806   client_->UploadPartAsync(req, handler);
> {code}
> Another S3 crash in the parquet tests: ARROW-9814 (although it doesn't seem 
> fully related)





[jira] [Assigned] (ARROW-9949) [C++] Generalize Decimal128::FromString for reuse in Decimal256

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9949:


Assignee: Mingyu Zhong  (was: Apache Arrow JIRA Bot)

> [C++] Generalize Decimal128::FromString for reuse in Decimal256
> ---
>
> Key: ARROW-9949
> URL: https://issues.apache.org/jira/browse/ARROW-9949
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mingyu Zhong
>Assignee: Mingyu Zhong
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-9814) [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9814:


Assignee: Apache Arrow JIRA Bot  (was: Antoine Pitrou)

> [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs
> ---
>
> Key: ARROW-9814
> URL: https://issues.apache.org/jira/browse/ARROW-9814
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Apache Arrow JIRA Bot
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This seems to happen with some Minio versions, but is definitely a problem in 
> Arrow.
> The crash message says:
> {code}
> pyarrow/tests/test_parquet.py::test_read_partitioned_directory_s3fs[False] 
> ../src/arrow/dataset/discovery.cc:188:  Check failed: relative.has_value() 
> GetFileInfo() yielded path outside selector.base_dir
> {code}
> The underlying problem is that we pass a full URI for the selector base_dir 
> (such as "s3://bucket/path.") and the S3 filesystem implementation then 
> returns regular paths (such as "bucket/path/foo/bar").
> I think we should do two things:
> 1) error out rather than crash (and include the path strings in the error 
> message), which would be more user-friendly
> 2) fix the issue that full URIs are passed in base_dir





[jira] [Assigned] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9906:


Assignee: Apache Arrow JIRA Bot  (was: Antoine Pitrou)

> [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri 
> (closing NativeFile from S3FileSystem)
> ---
>
> Key: ARROW-9906
> URL: https://issues.apache.org/jira/browse/ARROW-9906
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: filesystem, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/7991#discussion_r481247263 and the 
> commented out test added in that PR.
> It doesn't give any clarifying traceback or crash message, but it segfaults 
> when closing the {{NativeFile}} returned from 
> {{S3FileSystem.open_output_stream}} in {{ParquetWriter.close()}}.
> With {{gdb}} I get a bit more context:
> {code}
> Thread 1 "python" received signal SIGSEGV, Segmentation fault.
> 0x7fa1a39df8f2 in arrow::fs::(anonymous 
> namespace)::ObjectOutputStream::UploadPart (this=0x5619a95ce820, 
> data=0x7fa197641ec0, nbytes=15671, owned_buffer=...) at 
> ../src/arrow/filesystem/s3fs.cc:806
> 806   client_->UploadPartAsync(req, handler);
> {code}
> Another S3 crash in the parquet tests: ARROW-9814 (although it doesn't seem 
> fully related)





[jira] [Assigned] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9950:


Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] [DataFusion] Allow UDF usage without registry
> 
>
> Key: ARROW-9950
> URL: https://issues.apache.org/jira/browse/ARROW-9950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This functionality is relevant only for the DataFrame API.
> Sometimes a UDF is declared during planning, and the API is much more 
> expressive when the user does not have to access the registry at all to plan 
> the UDF.





[jira] [Assigned] (ARROW-9814) [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs

2020-09-09 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9814:


Assignee: Antoine Pitrou  (was: Apache Arrow JIRA Bot)

> [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs
> ---
>
> Key: ARROW-9814
> URL: https://issues.apache.org/jira/browse/ARROW-9814
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This seems to happen with some Minio versions, but is definitely a problem in 
> Arrow.
> The crash message says:
> {code}
> pyarrow/tests/test_parquet.py::test_read_partitioned_directory_s3fs[False] 
> ../src/arrow/dataset/discovery.cc:188:  Check failed: relative.has_value() 
> GetFileInfo() yielded path outside selector.base_dir
> {code}
> The underlying problem is that we pass a full URI for the selector base_dir 
> (such as "s3://bucket/path.") and the S3 filesystem implementation then 
> returns regular paths (such as "bucket/path/foo/bar").
> I think we should do two things:
> 1) error out rather than crash (and include the path strings in the error 
> message), which would be more user-friendly
> 2) fix the issue that full URIs are passed in base_dir


