[jira] [Created] (ARROW-9951) [C#] ArrowStreamWriter implement sync WriteRecordBatch

2020-09-08 Thread Steve Suh (Jira)
Steve Suh created ARROW-9951:


 Summary: [C#] ArrowStreamWriter implement sync WriteRecordBatch
 Key: ARROW-9951
 URL: https://issues.apache.org/jira/browse/ARROW-9951
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C#
Reporter: Steve Suh


Currently, ArrowStreamWriter only supports writing record batches asynchronously.  We 
use this in .NET for Apache Spark when we write Arrow records 
[here|https://github.com/dotnet/spark/blob/aed9214c10470dba8831726251fb2ed171189ecc/src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs#L396].
  However, we would prefer to use a sync version instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7871) [Python] Expose more compute kernels

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7871:
--
Labels: pull-request-available  (was: )

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently only the sum kernel is exposed.
> Alternatively, consider deprecating/removing the pyarrow.compute module and 
> binding the compute kernels as methods instead.
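
For context, a minimal sketch of the one kernel that is currently bound, using 
illustrative values (not part of the original report):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3, 4])

# sum is currently the only kernel bound in pyarrow.compute; the goal of this
# issue is to expose the remaining C++ kernels in the same way.
print(pc.sum(arr).as_py())  # -> 10
{code}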



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9950:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Allow UDF usage without registry
> 
>
> Key: ARROW-9950
> URL: https://issues.apache.org/jira/browse/ARROW-9950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This functionality is relevant only for the DataFrame API.
> Sometimes a UDF is declared during planning, and the API is much more 
> expressive when the user does not have to access the registry at all to plan 
> the UDF.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry

2020-09-08 Thread Jorge (Jira)
Jorge created ARROW-9950:


 Summary: [Rust] [DataFusion] Allow UDF usage without registry
 Key: ARROW-9950
 URL: https://issues.apache.org/jira/browse/ARROW-9950
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


This functionality is relevant only for the DataFrame API.

Sometimes a UDF is declared during planning, and the API is much more 
expressive when the user does not have to access the registry at all to plan 
the UDF.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9895) [RUST] Improve sort kernels

2020-09-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9895.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8092
[https://github.com/apache/arrow/pull/8092]

> [RUST] Improve sort kernels
> ---
>
> Key: ARROW-9895
> URL: https://issues.apache.org/jira/browse/ARROW-9895
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Followup from my mailing list post:
> {quote}1. When sorting by multiple columns (lexsort_to_indices) the Float32
> and Float64 data types are not supported because the implementation
> relies on the OrdArray trait. This trait is not implemented because
> f64/f32 only implements PartialOrd. The sort function for a single
> column (sort_to_indices) has some special logic which looks like it
> wants to treat NaN the same as null, but I'm also not convinced this
> is the correct way. For example postgres does the following
> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)
> "In order to allow floating-point values to be sorted and used in
> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> than all non-NaN values."
> I propose to do the same in an OrdArray impl for
> Float64Array/Float32Array and then simplifying the sort_to_indices
> function accordingly.
> 2. Sorting for dictionary encoded strings. The problem here is that
> DictionaryArray does not have a generic parameter for the value type
> so it is not currently possible to only implement OrdArray for string
> dictionaries. Again for the single column case, the value data type
> could be checked and a sort could be implemented by looking up each
> key in the dictionary. An optimization could be to check the is_sorted
> flag of DictionaryArray (which does not seem to be used really) and
> then directly sort by the keys. For the general case I see roughly two
> options
> - Somehow implement an OrdArray view of the dictionary array. This
> could be easier if OrdArray did not extend Array but was a completely
> separate trait.
> - Change the lexicographic sort impl to not use dynamic calls but
> instead sort multiple times. So for a query `ORDER BY a, b`, first
> sort by b and afterwards sort again by a. With a stable sort
> implementation this should result in the same ordering. I'm curious
> about the performance, it could avoid dynamic method calls for each
> comparison, but it would process the indices vector multiple times.
> {quote}
> My plan is to open a draft PR with the following changes:
>  - {{sort_to_indices}} further splits up float64/float32 inputs into 
> nulls/non-NaN/NaN, sorts the non-NaN values, and then concatenates those 3 
> slices according to the sort options. NaNs are distinct from null and sort 
> greater than any other valid value
> - implement a sort method for dictionary arrays with string values. This 
> kernel checks the {{is_ordered}} flag and sorts just by the keys if it is 
> set; otherwise it looks up the string values
> - for the lexical sort use case the above kernels are not used; instead the 
> {{OrdArray}} trait is used. To make that more flexible and allow wrapping 
> arrays with different ordering behavior I will make it no longer extend 
> {{Array}} and instead only contain the {{cmp_value}} method
> - string dictionary sorting can then be implemented with a wrapper struct 
> {{StringDictionaryArrayAsOrdArray}} which implements {{OrdArray}}
> - NaN aware sorting of floats can also be implemented with a wrapper struct 
> and trait implementation
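
As an aside, the "sort by b, then stable-sort by a" idea quoted above can be 
illustrated outside the Rust codebase. The NumPy sketch below is only an 
analogy for the proposed lexicographic behaviour, not the Arrow kernel itself:

{code:python}
import numpy as np

# Illustrative rows to be ordered by (a, b), i.e. ORDER BY a, b.
a = np.array([2, 1, 2, 1])
b = np.array([9, 8, 7, 6])

# Sort by the secondary key first, then stable-sort by the primary key;
# stability preserves the b-order within equal values of a.
order = np.argsort(b, kind="stable")
order = order[np.argsort(a[order], kind="stable")]

print(list(zip(a[order], b[order])))  # -> [(1, 6), (1, 8), (2, 7), (2, 9)]
{code}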



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9751) [Rust] [DataFusion] Extend UDFs to accept more than one type per argument

2020-09-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9751.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7967
[https://github.com/apache/arrow/pull/7967]

> [Rust] [DataFusion] Extend UDFs to accept more than one type per argument
> -
>
> Key: ARROW-9751
> URL: https://issues.apache.org/jira/browse/ARROW-9751
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> Most math functions accept float32 and float64, `length` will accept Utf8 and 
> lists soon, etc.
> The goal of this story is to allow UDFs to accept more than one datatype.
> Design: the accepted datatypes should be a vector ordered by "faster/smaller" 
> to "slower/larger" (cpu/memory). When the plan reaches a UDF, we try to cast 
> the input expression like before, from "faster/smaller" to "slower/larger".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9949) [C++] Generalize Decimal128::FromString for reuse in Decimal256

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9949:
--
Labels: pull-request-available  (was: )

> [C++] Generalize Decimal128::FromString for reuse in Decimal256
> ---
>
> Key: ARROW-9949
> URL: https://issues.apache.org/jira/browse/ARROW-9949
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Mingyu Zhong
>Assignee: Mingyu Zhong
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9948) [C++] Decimal128 does not check scale range when rescaling; can cause buffer overflow

2020-09-08 Thread Mingyu Zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingyu Zhong updated ARROW-9948:

Summary: [C++] Decimal128 does not check scale range when rescaling; can 
cause buffer overflow  (was: Decimal128 does not check scale range when 
rescaling; can cause buffer overflow)

> [C++] Decimal128 does not check scale range when rescaling; can cause buffer 
> overflow
> -
>
> Key: ARROW-9948
> URL: https://issues.apache.org/jira/browse/ARROW-9948
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Mingyu Zhong
>Priority: Major
>
> BasicDecimal128::GetScaleMultiplier has a DCHECK on the scale, but the scale 
> can come from users. For example, Decimal128::FromString("1e100") will cause 
> an out-of-bounds read.
> BasicDecimal128::Rescale and BasicDecimal128::GetWholeAndFraction have the 
> same problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9948) Decimal128 does not check scale range when rescaling; can cause buffer overflow

2020-09-08 Thread Mingyu Zhong (Jira)
Mingyu Zhong created ARROW-9948:
---

 Summary: Decimal128 does not check scale range when rescaling; can 
cause buffer overflow
 Key: ARROW-9948
 URL: https://issues.apache.org/jira/browse/ARROW-9948
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Mingyu Zhong


BasicDecimal128::GetScaleMultiplier has a DCHECK on the scale, but the scale 
can come from users. For example, Decimal128::FromString("1e100") will cause an 
out-of-bounds read.

BasicDecimal128::Rescale and BasicDecimal128::GetWholeAndFraction have the same 
problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9944) [Rust] Implement TO_TIMESTAMP function

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9944:
--
Labels: pull-request-available  (was: )

> [Rust] Implement TO_TIMESTAMP function
> --
>
> Key: ARROW-9944
> URL: https://issues.apache.org/jira/browse/ARROW-9944
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Implement the TO_TIMESTAMP function, as described in 
> https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-9935) [Python] datasets unable to read empty S3 folders with fsspec' s3fs

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-9935.
-
Resolution: Not A Problem

> [Python] datasets unable to read empty S3 folders with fsspec' s3fs
> ---
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console then it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.
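
A minimal sketch of the wrapping workaround described above, assuming pyarrow 
accepts the fsspec filesystem directly; the subclass name and bucket path are 
illustrative, and the attached script may differ:

{code:python}
import s3fs
import pyarrow.dataset as ds

# s3fs filesystem whose find() drops the zero-byte "folder" placeholder objects.
class FilteredS3FS(s3fs.S3FileSystem):
    def find(self, path, **kwargs):
        out = super().find(path, **kwargs)
        if isinstance(out, dict):  # detail=True: {path: info} mapping
            return {k: v for k, v in out.items() if v.get("size", 0) > 0}
        # detail=False: plain list of paths
        return [p for p in out if self.info(p).get("size", 0) > 0]

fs = FilteredS3FS()
dataset = ds.dataset("some-bucket/some-prefix", format="parquet", filesystem=fs)
{code}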



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9935) [Python] datasets unable to read empty S3 folders with fsspec' s3fs

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9935:
--
Summary: [Python] datasets unable to read empty S3 folders with fsspec' 
s3fs  (was: [Python] New filesystem API unable to read empty S3 folders)

> [Python] datasets unable to read empty S3 folders with fsspec' s3fs
> ---
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console then it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reopened ARROW-9935:
---

> [Python] New filesystem API unable to read empty S3 folders
> ---
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console then it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-9935.
-
Resolution: Not A Problem

> [Python] New filesystem API unable to read empty S3 folders
> ---
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console then it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders

2020-09-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192483#comment-17192483
 ] 

Antoine Pitrou commented on ARROW-9935:
---

Thanks for the feedback. I'll close this issue then.

> [Python] New filesystem API unable to read empty S3 folders
> ---
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console then it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders

2020-09-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192481#comment-17192481
 ] 

Weston Pace commented on ARROW-9935:


I tried that out and Arrow's own S3 implementation does not run into this 
issue.  This would only affect the s3fs implementation.  Either way, this is 
not a big problem for me since I have the workaround, and I might switch to the 
built-in implementation anyway if the performance is significantly different.
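
For reference, switching to the built-in implementation mentioned above would 
look roughly like the sketch below; the bucket path and region are placeholders:

{code:python}
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

# Arrow's own S3 filesystem does not exhibit the zero-byte "folder" issue.
fs = S3FileSystem(region="us-east-1")  # placeholder region
dataset = ds.dataset("some-bucket/some-prefix", format="parquet", filesystem=fs)
table = dataset.to_table()
{code}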

> [Python] New filesystem API unable to read empty S3 folders
> ---
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console then it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?

2020-09-08 Thread Scott Wilson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Wilson updated ARROW-8199:

Attachment: DataFrame.h

Hey Wes,

I hope you and yours are doing well in this strange time.

I'm just writing to thank you for all the work you did on Arrow and the
various discussions you've posted about the design decisions that drove
this development, post pandas. I've largely completed my C++ DataFrame and
replaced python/pandas code that we use for our ML pipeline. Using the
Arrow framework, I've been able to create a DataFrame object that wraps one
or more arrow tables. The implementation supports no-copy subsets, joins
and concatenations, and stl-like iterators. Also supported are transforms
using in-place lambda functions. The net result is that a ~1 TB data processing
step that used to take 13 hours now requires 15 minutes.

The only kluge I put into place has to do with support for null values. I
allow in-place editing of values, but no changes to array sizes or types.
This is possible because the typed arrays offer access to the underlying
raw values. To offer the same for null values I had to create derived
classes for Array and ChunkedArray that offer access to the cached null_counts.

I've attached the DataFrame header in case it's of interest.

Thanks again, Scott




-- 
Scott B. Wilson
Chairman and Chief Scientist
Persyst Development Corporation
420 Stevens Avenue, Suite 210
Solana Beach, CA 92075


> [C++] Guidance for creating multi-column sort on Table example?
> ---
>
> Key: ARROW-8199
> URL: https://issues.apache.org/jira/browse/ARROW-8199
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Scott Wilson
>Priority: Minor
>  Labels: c++, newbie
> Attachments: ArrowCsv.cpp, DataFrame.h
>
>
> I'm just coming up to speed with Arrow and am noticing a dearth of examples 
> ... maybe I can help here.
> I'd like to implement multi-column sorting for Tables and just want to ensure 
> that I'm not duplicating existing work or proposing a bad design.
> My thought was to create a Table-specific version of SortToIndices() where 
> you can specify the columns and sort order.
> Then I'd create Array "views" that use the Indices to remap from the original 
> Array values to the values in sorted order. (Original data is not sorted, but 
> could be as a second step.) I noticed some of the array list variants keep 
> offsets, but I didn't see anything that supports remapping per a list of 
> indices; this may just be my oversight?
> Thanks in advance, Scott



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9909) [C++] Provide a (FileSystem, path) pair to locate files across filesystems

2020-09-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9909:
---
Fix Version/s: 2.0.0

> [C++] Provide a (FileSystem, path) pair to locate files across filesystems
> --
>
> Key: ARROW-9909
> URL: https://issues.apache.org/jira/browse/ARROW-9909
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Ben Kietzman
>Priority: Major
> Fix For: 2.0.0
>
>
> https://github.com/apache/arrow/pull/8101#discussion_r482921953
> Paths are sufficient to locate files within a known filesystem, but APIs (for 
> example datasets) do not always have a single known filesystem and in such 
> contexts a (fs, path) pair would be useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9947) [Python][Parquet] Python API for Parquet encryption

2020-09-08 Thread Itamar Turner-Trauring (Jira)
Itamar Turner-Trauring created ARROW-9947:
-

 Summary: [Python][Parquet] Python API for Parquet encryption
 Key: ARROW-9947
 URL: https://issues.apache.org/jira/browse/ARROW-9947
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Itamar Turner-Trauring


Python API wrapper for ARROW-9318.

Design document will eventually live at 
https://docs.google.com/document/d/1i1M5f5azLEmASj9XQZ_aQLl5Fr5F0CvnyPPVu1xaD9U/edit?usp=sharing

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9931) [C++] Fix undefined behaviour on invalid IPC (OSS-Fuzz)

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9931.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8126
[https://github.com/apache/arrow/pull/8126]

> [C++] Fix undefined behaviour on invalid IPC (OSS-Fuzz)
> ---
>
> Key: ARROW-9931
> URL: https://issues.apache.org/jira/browse/ARROW-9931
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9906:
--
Labels: filesystem pull-request-available  (was: filesystem)

> [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri 
> (closing NativeFile from S3FileSystem)
> ---
>
> Key: ARROW-9906
> URL: https://issues.apache.org/jira/browse/ARROW-9906
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: filesystem, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/7991#discussion_r481247263 and the 
> commented out test added in that PR.
> It doesn't give any clarifying traceback or crash message, but it segfaults 
> when closing the {{NativeFile}} returned from 
> {{S3FileSystem.open_output_stream}} in {{ParquetWriter.close()}}.
> With {{gdb}} I get a bit more context:
> {code}
> Thread 1 "python" received signal SIGSEGV, Segmentation fault.
> 0x7fa1a39df8f2 in arrow::fs::(anonymous 
> namespace)::ObjectOutputStream::UploadPart (this=0x5619a95ce820, 
> data=0x7fa197641ec0, nbytes=15671, owned_buffer=...) at 
> ../src/arrow/filesystem/s3fs.cc:806
> 806   client_->UploadPartAsync(req, handler);
> {code}
> Another S3 crash in the parquet tests: ARROW-9814 (although it doesn't seem 
> fully related)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9906:
-

Assignee: Antoine Pitrou

> [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri 
> (closing NativeFile from S3FileSystem)
> ---
>
> Key: ARROW-9906
> URL: https://issues.apache.org/jira/browse/ARROW-9906
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: filesystem
> Fix For: 2.0.0
>
>
> See https://github.com/apache/arrow/pull/7991#discussion_r481247263 and the 
> commented out test added in that PR.
> It doesn't give any clarifying traceback or crash message, but it segfaults 
> when closing the {{NativeFile}} returned from 
> {{S3FileSystem.open_output_stream}} in {{ParquetWriter.close()}}.
> With {{gdb}} I get a bit more context:
> {code}
> Thread 1 "python" received signal SIGSEGV, Segmentation fault.
> 0x7fa1a39df8f2 in arrow::fs::(anonymous 
> namespace)::ObjectOutputStream::UploadPart (this=0x5619a95ce820, 
> data=0x7fa197641ec0, nbytes=15671, owned_buffer=...) at 
> ../src/arrow/filesystem/s3fs.cc:806
> 806   client_->UploadPartAsync(req, handler);
> {code}
> Another S3 crash in the parquet tests: ARROW-9814 (although it doesn't seem 
> fully related)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9946) ParquetFileWriter segfaults when `sink` is a string

2020-09-08 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-9946:
-

 Summary: ParquetFileWriter segfaults when `sink` is a string
 Key: ARROW-9946
 URL: https://issues.apache.org/jira/browse/ARROW-9946
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 1.0.1
 Environment: Ubuntu 20.04
Reporter: Karl Dunkle Werner


Hello again! I have another minor R arrow issue.

 

The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a "string 
which is interpreted as a file path". However, when I try to use a string, I 
get a segfault because the memory isn't mapped.

 

Maybe this is a separate request, but it would also be helpful to have 
documentation for the methods of the writer created by 
{{ParquetFileWriter$create()}}.

Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html]

 
{code:r}
library(arrow)

sch = schema(a = float32())
writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet")

#> *** caught segfault ***
#> address 0x1417d, cause 'memory not mapped'
#> 
#> Traceback:
#> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
arrow_properties)
#> 2: shared_ptr_is_null(xp)
#> 3: shared_ptr(ParquetFileWriter, 
parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, 
arrow_properties))
#> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet")


# This works as expected:
sink = FileOutputStream$create("test.parquet")
writer = ParquetFileWriter$create(schema = sch, sink = sink)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9893) [Python] Bindings for writing datasets to Parquet

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9893.
---
Resolution: Fixed

Issue resolved by pull request 8138
[https://github.com/apache/arrow/pull/8138]

> [Python] Bindings for writing datasets to Parquet
> -
>
> Key: ARROW-9893
> URL: https://issues.apache.org/jira/browse/ARROW-9893
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Added to C++ in ARROW-9646, follow-up on Python bindings of ARROW-9658



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8718) [R] Add str() methods to objects

2020-09-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8718:
---
Fix Version/s: (was: 2.0.0)

> [R] Add str() methods to objects
> 
>
> Key: ARROW-8718
> URL: https://issues.apache.org/jira/browse/ARROW-8718
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> Apparently this will make the RStudio IDE show useful things in the 
> environment panel. Probably this is most useful for Table, RecordBatch, and 
> Dataset (and maybe Schema, which would look similar).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9387) [R] Use new C++ table select method

2020-09-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9387:
--

Assignee: Romain Francois  (was: Neal Richardson)

> [R] Use new C++ table select method
> ---
>
> Key: ARROW-9387
> URL: https://issues.apache.org/jira/browse/ARROW-9387
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ARROW-8314 adds it so we can use it instead of the one we wrote in the R 
> package.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3812) [R] union support

2020-09-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3812:
---
Fix Version/s: (was: 2.0.0)

> [R] union support
> -
>
> Key: ARROW-3812
> URL: https://issues.apache.org/jira/browse/ARROW-3812
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Romain Francois
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9827) [Python] pandas.read_parquet fails for wide parquet files and pyarrow 1.0.X

2020-09-08 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-9827.
-
Resolution: Fixed

Issue resolved by pull request 8037
[https://github.com/apache/arrow/pull/8037]

> [Python] pandas.read_parquet fails for wide parquet files and pyarrow 1.0.X
> ---
>
> Key: ARROW-9827
> URL: https://issues.apache.org/jira/browse/ARROW-9827
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
>Reporter: Kyle Beauchamp
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I recently tried to update my pyarrow from 0.17.1 to 1.0.0 and I'm 
> encountering a serious bug where wide DataFrames fail during 
> pandas.read_parquet.  Small parquet files (m=1) read correctly, medium 
> files (m=4) fail with a "Bus Error: 10", and large files (m=10) 
> completely hang.  I've tried python 3.8.5, pandas 1.0.5, pyarrow 1.0.0, and 
> OSX 10.14.   
> The driver code and output is below:
> {code:python}
> import pandas as pd
> import numpy as np
> import sys
> filename = "test.parquet"
> n = 10
> m = int(sys.argv[1])
> print(m)
> x = np.zeros((n, m))
> x = pd.DataFrame(x, columns=[f"A_{i}" for i in range(m)])
> x.to_parquet(filename)
> y = pd.read_parquet(filename, engine='pyarrow')
> {code}
> {code:java}
> time python test_pyarrow.py  1
> real 0m4.018s user 0m5.286s sys 0m0.514s
> time python test_pyarrow.py  4
> 4
> Bus error: 10
> {code}
>  
> On a pyarrow 0.17.1 environment, the 40,000 case completes in 8 seconds.  
> This was cross-posted on the pandas tracker as well: 
> [https://github.com/pandas-dev/pandas/issues/35846]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-08 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192289#comment-17192289
 ] 

Wes McKinney commented on ARROW-9924:
-

I'd prefer to fix the Datasets implementation rather than kicking the can down 
the road. It doesn't seem reasonable to pay an extra price when using the 
interface to read a single file. 

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> 
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9945) [C++][Dataset] Refactor Expression::Assume to return a Result

2020-09-08 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-9945:
---

 Summary: [C++][Dataset] Refactor Expression::Assume to return a 
Result
 Key: ARROW-9945
 URL: https://issues.apache.org/jira/browse/ARROW-9945
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 2.0.0


Expression::Assume can abort if the two expressions are not valid against a 
single schema. This is not ideal since a schema is not always easily available. 
The method should be able to fail gracefully in the case of a best-effort 
simplification where validation against a schema is not desired.

https://github.com/apache/arrow/pull/8037#discussion_r475594117



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9864) [Python] pathlib.Path not supported in write_to_dataset with partition columns

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9864:
-

Assignee: Joris Van den Bossche

> [Python] pathlib.Path not supported in write_to_dataset with partition columns
> --
>
> Key: ARROW-9864
> URL: https://issues.apache.org/jira/browse/ARROW-9864
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Copying over from https://github.com/pandas-dev/pandas/issues/35902
> {code:python}
> import pathlib
> df = pd.DataFrame({'A':[1,2,3,4], 'B':'C'})
> df.to_parquet('tmp_path1.parquet')  # OK
> df.to_parquet(pathlib.Path('tmp_path2.parquet'))  # OK
> df.to_parquet('tmp_path3.parquet', partition_cols=['B'])  # OK
> df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B'])  # 
> TypeError
> {code}
> The {{to_parquet}} method raises a TypeError when a {{pathlib.Path()}} is 
> passed and the {{partition_cols}} argument is not None. If no partition 
> cols are provided, then {{pathlib.Path()}} is accepted properly.
> {code}
> ---
> TypeError Traceback (most recent call last)
>  in 
>   3 
>   4 df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK
> > 5 df.to_parquet(pathlib.Path('tmp_path4.parquet'), 
> partition_cols=['B'])  # TypeError
> ...
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py in 
> write_to_dataset(table, root_path, partition_cols, partition_filename_cb, 
> filesystem, **kwargs)
>1790 subtable = pa.Table.from_pandas(subgroup, 
> schema=subschema,
>1791 safe=False)
> -> 1792 _mkdir_if_not_exists(fs, '/'.join([root_path, subdir]))
>1793 if partition_filename_cb:
>1794 outfile = partition_filename_cb(keys)
> TypeError: sequence item 0: expected str instance, PosixPath found
> {code}
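
A user-side workaround sketch, mirroring the snippet above: converting the Path 
to a string before calling {{to_parquet}} avoids the failing string join inside 
{{write_to_dataset}} (illustrative only, not a fix for the underlying bug):

{code:python}
import pathlib
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': 'C'})

# Passing the path as a str works even with partition_cols, because
# write_to_dataset only joins string paths internally.
df.to_parquet(str(pathlib.Path('tmp_path4.parquet')), partition_cols=['B'])
{code}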



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9864) [Python] pathlib.Path not supported in write_to_dataset with partition columns

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9864.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8064
[https://github.com/apache/arrow/pull/8064]

> [Python] pathlib.Path not supported in write_to_dataset with partition columns
> --
>
> Key: ARROW-9864
> URL: https://issues.apache.org/jira/browse/ARROW-9864
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Copying over from https://github.com/pandas-dev/pandas/issues/35902
> {code:python}
> import pathlib
> df = pd.DataFrame({'A':[1,2,3,4], 'B':'C'})
> df.to_parquet('tmp_path1.parquet')  # OK
> df.to_parquet(pathlib.Path('tmp_path2.parquet'))  # OK
> df.to_parquet('tmp_path3.parquet', partition_cols=['B'])  # OK
> df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B'])  # 
> TypeError
> {code}
> The {{to_parquet}} method raises a TypeError when a {{pathlib.Path()}} is 
> passed and the {{partition_cols}} argument is not None. If no partition 
> cols are provided, then {{pathlib.Path()}} is accepted properly.
> {code}
> ---
> TypeError Traceback (most recent call last)
>  in 
>   3 
>   4 df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK
> > 5 df.to_parquet(pathlib.Path('tmp_path4.parquet'), 
> partition_cols=['B'])  # TypeError
> ...
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py in 
> write_to_dataset(table, root_path, partition_cols, partition_filename_cb, 
> filesystem, **kwargs)
>1790 subtable = pa.Table.from_pandas(subgroup, 
> schema=subschema,
>1791 safe=False)
> -> 1792 _mkdir_if_not_exists(fs, '/'.join([root_path, subdir]))
>1793 if partition_filename_cb:
>1794 outfile = partition_filename_cb(keys)
> TypeError: sequence item 0: expected str instance, PosixPath found
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders

2020-09-08 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9935:

Summary: [Python] New filesystem API unable to read empty S3 folders  (was: 
New filesystem API unable to read empty S3 folders)

> [Python] New filesystem API unable to read empty S3 folders
> ---
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console then it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders

2020-09-08 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9935:

Component/s: Python

> [Python] New filesystem API unable to read empty S3 folders
> ---
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console then it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9814) [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9814:
--
Labels: pull-request-available  (was: )

> [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs
> ---
>
> Key: ARROW-9814
> URL: https://issues.apache.org/jira/browse/ARROW-9814
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This seems to happen with some Minio versions, but is definitely a problem in 
> Arrow.
> The crash message says:
> {code}
> pyarrow/tests/test_parquet.py::test_read_partitioned_directory_s3fs[False] 
> ../src/arrow/dataset/discovery.cc:188:  Check failed: relative.has_value() 
> GetFileInfo() yielded path outside selector.base_dir
> {code}
> The underlying problem is that we pass a full URI for the selector base_dir 
> (such as "s3://bucket/path.") and the S3 filesystem implementation then 
> returns regular paths (such as "bucket/path/foo/bar").
> I think we should do two things:
> 1) error out rather than crash (and include the path strings in the error 
> message), which would be more user-friendly
> 2) fix the issue that full URIs are passed in base_dir



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9944) [Rust] Implement TO_TIMESTAMP function

2020-09-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9944:
-
Summary: [Rust] Implement TO_TIMESTAMP function  (was: Implement 
TO_TIMESTAMP function)

> [Rust] Implement TO_TIMESTAMP function
> --
>
> Key: ARROW-9944
> URL: https://issues.apache.org/jira/browse/ARROW-9944
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>
> Implement the TO_TIMESTAMP function, as described in 
> https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9944) [Rust] Implement TO_TIMESTAMP function

2020-09-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9944:
-
Component/s: Rust - DataFusion

> [Rust] Implement TO_TIMESTAMP function
> --
>
> Key: ARROW-9944
> URL: https://issues.apache.org/jira/browse/ARROW-9944
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>
> Implement the TO_TIMESTAMP function, as described in 
> https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9923) [R] arrow R package build error: illegal instruction

2020-09-08 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192279#comment-17192279
 ] 

Wes McKinney commented on ARROW-9923:
-

We could probably use an R-specific environment variable (like 
{{ARROW_R_NO_SSE4=1}}) to toggle it off when building, or similar

> [R] arrow R package build error: illegal instruction
> 
>
> Key: ARROW-9923
> URL: https://issues.apache.org/jira/browse/ARROW-9923
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: Platform: Linux node 5.8.5-arch1-1 #1 SMP PREEMPT Thu, 
> 27 Aug 2020 18:53:02 + x86_64 GNU/Linux
> CPU: AMD Athlon(tm) II X4 651 Quad-Core Processor (does not support SSE4, 
> AVX/AVX2)
>Reporter: Maxim Terpilowski
>Priority: Major
>  Labels: build
>
> Installing the arrow R package (v1.0.1) from CRAN results in an error.
> Build log: [https://pastebin.com/Zq1iMTzB]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9814) [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9814:
-

Assignee: Antoine Pitrou  (was: Ben Kietzman)

> [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs
> ---
>
> Key: ARROW-9814
> URL: https://issues.apache.org/jira/browse/ARROW-9814
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>
> This seems to happen with some Minio versions, but is definitely a problem in 
> Arrow.
> The crash message says:
> {code}
> pyarrow/tests/test_parquet.py::test_read_partitioned_directory_s3fs[False] 
> ../src/arrow/dataset/discovery.cc:188:  Check failed: relative.has_value() 
> GetFileInfo() yielded path outside selector.base_dir
> {code}
> The underlying problem is that we pass a full URI for the selector base_dir 
> (such as "s3://bucket/path.") and the S3 filesystem implementation then 
> returns regular paths (such as "bucket/path/foo/bar").
> I think we should do two things:
> 1) error out rather than crash (and include the path strings in the error 
> message), which would be more user-friendly
> 2) fix the issue that full URIs are passed in base_dir



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9923) [R] arrow R package build error: illegal instruction

2020-09-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192274#comment-17192274
 ] 

Neal Richardson commented on ARROW-9923:


https://github.com/apache/arrow/blob/master/cpp/cmake_modules/SetupCxxFlags.cmake
 seems to be where these flags are set. It looks like SSE4.2 is assumed unless 
you're on one of a set of non-x86 processors. 

> [R] arrow R package build error: illegal instruction
> 
>
> Key: ARROW-9923
> URL: https://issues.apache.org/jira/browse/ARROW-9923
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: Platform: Linux node 5.8.5-arch1-1 #1 SMP PREEMPT Thu, 
> 27 Aug 2020 18:53:02 + x86_64 GNU/Linux
> CPU: AMD Athlon(tm) II X4 651 Quad-Core Processor (does not support SSE4, 
> AVX/AVX2)
>Reporter: Maxim Terpilowski
>Priority: Major
>  Labels: build
>
> Installing the arrow R package (v1.0.1) from CRAN results in an error.
> Build log: [https://pastebin.com/Zq1iMTzB]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9932) Arrow 1.0.1 R package fails to install on R3.4 over linux

2020-09-08 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9932.

Fix Version/s: 2.0.0
 Assignee: Neal Richardson
   Resolution: Duplicate

Thanks for the report. This was recently fixed (though after 1.0.1 was 
released). 

> Arrow 1.0.1 R package fails to install on R3.4 over linux
> -
>
> Key: ARROW-9932
> URL: https://issues.apache.org/jira/browse/ARROW-9932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R version 3.4.0 (2015-04-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.5 LTS
>Reporter: Ofek Shilon
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> 1. From R (3.4) prompt, we run
> {{> install.packages("arrow")}}
> and it seems to succeed.
> 2. Next we run:
> {{> arrow::install_arrow()}}
> This is the full output:
> {{Installing package into '/opt/R-3.4.0.mkl/library'}}
>  {{(as 'lib' is unspecified)}}
>  {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}}
>  {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}}
>  {{==}}
>  {{downloaded 268 KB}}
> {{installing *source* package 'arrow' ...}}
>  {{** package 'arrow' successfully unpacked and MD5 sums checked}}
>  {{*** No C++ binaries found for ubuntu-14.04}}
>  {{*** Successfully retrieved C++ source}}
>  {{*** Building C++ libraries}}
>  {{ cmake}}
>  {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument 
> (FALSE)}}*{color}
>  {color:#ff} *{{Calls: build_libarrow -> paste}}*{color}
>  {color:#ff} *{{Execution halted}}*{color}
>  {{- NOTE ---}}
>  {{After installation, please run arrow::install_arrow()}}
>  {{for help installing required runtime libraries}}
>  {{-}}
>  {{** libs}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array.cpp -o array.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c buffer.cpp -o buffer.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compression.cpp -o compression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compute.cpp -o compute.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c csv.cpp -o csv.o}}
>  {{g++ -std=gnu++0x 
> -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c dataset.cpp -o dataset.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c datatype.cpp -o datatype.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c expression.cpp -o expression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c feather.cpp -o feather.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> 

[jira] [Commented] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?

2020-09-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192267#comment-17192267
 ] 

Neal Richardson commented on ARROW-9938:


FTR I'm doing this in R in ARROW-9854, in case you want to see what this looks 
like in practice (https://github.com/apache/arrow/pull/8058)

> [Python] Add filesystem capabilities to other IO formats (feather, csv, json, 
> ..)?
> --
>
> Key: ARROW-9938
> URL: https://issues.apache.org/jira/browse/ARROW-9938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: filesystem
>
> In the parquet IO functions, we support reading/writing files from non-local 
> filesystems directly (in addition to passing a buffer) by:
> - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
> - specifying the filesystem keyword (eg 
> {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 
> On the other hand, for other file formats such as feather, we only support 
> local files or buffers. So for those, you need to take a more manual approach (I 
> _suppose_ this works?):
> {code:python}
> from pyarrow import fs, feather
> s3 = fs.S3FileSystem()
> with s3.open_input_file("bucket/data.arrow") as file:
> table = feather.read_table(file)
> {code}
> So I think the question comes up: do we want to extend this filesystem 
> support to other file formats (feather, csv, json) and make this more uniform 
> across pyarrow, or do we prefer to keep the plain readers more low-level (and 
> people can use the datasets API for more convenience)?
> cc [~apitrou] [~kszucs]
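
As a point of reference (an illustration added here, not text from the issue): {{FileSystem.from_uri}} already resolves a URI into a filesystem plus a path, so a small helper (the name below is made up) can emulate the requested convenience today:

{code:python}
from pyarrow import fs, feather

def read_feather_uri(uri):
    # FileSystem.from_uri returns a (FileSystem, path) pair for e.g. "s3://bucket/data.arrow"
    filesystem, path = fs.FileSystem.from_uri(uri)
    with filesystem.open_input_file(path) as f:
        return feather.read_table(f)

# table = read_feather_uri("s3://bucket/data.arrow")
{code}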



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9944) Implement TO_TIMESTAMP function

2020-09-08 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9944:
--

 Summary: Implement TO_TIMESTAMP function
 Key: ARROW-9944
 URL: https://issues.apache.org/jira/browse/ARROW-9944
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andrew Lamb


Implement the TO_TIMESTAMP function, as described in 
https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9588) [C++] clang/win: Copy constructor of ParquetInvalidOrCorruptedFileException not correctly triggered

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9588.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8114
[https://github.com/apache/arrow/pull/8114]

> [C++] clang/win: Copy constructor of ParquetInvalidOrCorruptedFileException 
> not correctly triggered
> ---
>
> Key: ARROW-9588
> URL: https://issues.apache.org/jira/browse/ARROW-9588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The copy constructor of ParquetInvalidOrCorruptedFileException doesn't seem 
> to be taken correctly when building with clang 9.0.1 on Windows in a MSVC 
> toolchain.
> Adding {{ParquetInvalidOrCorruptedFileException(const 
> ParquetInvalidOrCorruptedFileException&) = default;}} as an explicit copy 
> constructor didn't help.
> Happy to hear any ideas here, probably a long shot as there are other clang-msvc 
> problems.
> {code}
> [49/62] Building CXX object 
> src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj
> FAILED: src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj
> C:\Users\Administrator\miniconda3\conda-bld\arrow-cpp-ext_1595962790058\_build_env\Library\bin\clang++.exe
>   -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
> -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_S
> SE4_2 -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 
> -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
> -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DAWS_COMMON_USE_IMPORT_EXPORT -DAWS_EVE
> NT_STREAM_USE_IMPORT_EXPORT -DAWS_SDK_VERSION_MAJOR=1 
> -DAWS_SDK_VERSION_MINOR=7 -DAWS_SDK_VERSION_PATCH=164 -DHAVE_INTTYPES_H 
> -DHAVE_NETDB_H -DNOMINMAX -DPARQUET_EXPORTING -DUSE_IMPORT_EXPORT -DUSE_IMPORT
> _EXPORT=1 -DUSE_WINDOWS_DLL_SEMANTICS -D_CRT_SECURE_NO_WARNINGS 
> -Dparquet_shared_EXPORTS -Isrc -I../src -I../src/generated -isystem 
> ../thirdparty/flatbuffers/include -isystem C:/Users/Administrator/minico
> nda3/conda-bld/arrow-cpp-ext_1595962790058/_h_env/Library/include -isystem 
> ../thirdparty/hadoop/include -fvisibility-inlines-hidden -std=c++14 
> -fmessage-length=0 -march=k8 -mtune=haswell -ftree-vectorize
> -fstack-protector-strong -O2 -ffunction-sections -pipe 
> -D_CRT_SECURE_NO_WARNINGS -D_MT -D_DLL -nostdlib -Xclang 
> --dependent-lib=msvcrt -fuse-ld=lld -fno-aligned-allocation 
> -Qunused-arguments -fcolor-diagn
> ostics -O3 -DNDEBUG  -Wa,-mbig-obj -Wall -Wno-unknown-warning-option 
> -Wno-pass-failed -msse4.2  -O3 -DNDEBUG -D_DLL -D_MT -Xclang 
> --dependent-lib=msvcrt   -std=c++14 -MD -MT src/parquet/CMakeFiles/parquet
> _shared.dir/Unity/unity_1_cxx.cxx.obj -MF 
> src\parquet\CMakeFiles\parquet_shared.dir\Unity\unity_1_cxx.cxx.obj.d -o 
> src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj -c 
> src/parquet/CMakeF
> iles/parquet_shared.dir/Unity/unity_1_cxx.cxx
> In file included from 
> src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx:3:
> In file included from 
> C:/Users/Administrator/miniconda3/conda-bld/arrow-cpp-ext_1595962790058/work/cpp/src/parquet/column_scanner.cc:18:
> In file included from ../src\parquet/column_scanner.h:29:
> In file included from ../src\parquet/column_reader.h:25:
> In file included from ../src\parquet/exception.h:26:
> In file included from ../src\parquet/platform.h:23:
> In file included from ../src\arrow/buffer.h:28:
> In file included from ../src\arrow/status.h:25:
> ../src\arrow/util/string_builder.h:49:10: error: invalid operands to binary 
> expression ('std::ostream' (aka 'basic_ostream >') 
> and 'parquet::ParquetInvalidOrCorruptedFileException'
> )
>   stream << head;
>   ~~ ^  
> ../src\arrow/util/string_builder.h:61:3: note: in instantiation of function 
> template specialization 
> 'arrow::util::StringBuilderRecursive  &>' requested here
>   StringBuilderRecursive(ss.stream(), std::forward(args)...);
>   ^
> ../src\arrow/status.h:160:31: note: in instantiation of function template 
> specialization 
> 'arrow::util::StringBuilder &>' requested here
> return Status(code, util::StringBuilder(std::forward(args)...));
>   ^
> ../src\arrow/status.h:204:20: note: in instantiation of function template 
> specialization 
> 'arrow::Status::FromArgs' 
> requested here
> return Status::FromArgs(StatusCode::Invalid, std::forward(args)...);
>^
> ../src\parquet/exception.h:129:49: note: in instantiation of function 
> template specialization 
> 

[jira] [Updated] (ARROW-9078) [C++] Parquet writing of extension type with nested storage type fails

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9078:
--
Labels: parquet pull-request-available  (was: parquet)

> [C++] Parquet writing of extension type with nested storage type fails
> --
>
> Key: ARROW-9078
> URL: https://issues.apache.org/jira/browse/ARROW-9078
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A reproducer in Python:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> class MyStructType(pa.PyExtensionType):
>
>     def __init__(self):
>         pa.PyExtensionType.__init__(
>             self, pa.struct([('left', pa.int64()), ('right', pa.int64())]))
>
>     def __reduce__(self):
>         return MyStructType, ()
>
> struct_array = pa.StructArray.from_arrays(
>     [
>         pa.array([0, 1], type="int64", from_pandas=True),
>         pa.array([1, 2], type="int64", from_pandas=True),
>     ],
>     names=["left", "right"],
> )
> # works
> table = pa.table({'a': struct_array})
> pq.write_table(table, "test_struct.parquet")
> # doesn't work
> mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
> table = pa.table({'a': mystruct_array})
> pq.write_table(table, "test_struct.parquet")
> {code}
> Writing the simple StructArray nowadays works (and reading it back in as 
> well). 
> But when the struct array is the storage array of an ExtensionType, it fails 
> with the following error:
> {code}
> ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9943) [C++] Arrow metadata not applied recursively when reading Parquet file

2020-09-08 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9943:
-

 Summary: [C++] Arrow metadata not applied recursively when reading 
Parquet file
 Key: ARROW-9943
 URL: https://issues.apache.org/jira/browse/ARROW-9943
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.0
Reporter: Antoine Pitrou


Currently, {{ApplyOriginalMetadata}} in {{src/parquet/arrow/schema.cc}} is only 
applied for the top-level node of each schema field. Nested metadata (such as 
dicts-inside-lists, etc.) will not be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9932) Arrow 1.0.1 R package fails to install on R3.4 over linux

2020-09-08 Thread Ofek Shilon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192227#comment-17192227
 ] 

Ofek Shilon edited comment on ARROW-9932 at 9/8/20, 2:03 PM:
-

The previous suspicion is *not* the root cause of the installation failure.

 

The signature of dQuote changed in R3.6. It [accepted a single argument 
before|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote] 
but [accepts a second argument since 
3.6|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote]. 
  The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the 
installation error message above seems to indicate that the installation script 
uses R3.6 syntax. 

This usage is (at least) at /r/tools/linuxlibs.R :

env_vars <- paste(
 names(env_var_list), *dQuote(env_var_list, FALSE)*,
 sep = "=", collapse = " "
 )


was (Author: ofek):
The previous suspicion is *not* the root cause of the installation failure.

 

The signature of dQuote changed in R3.6. It [accepted a single argument 
before|[https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote]]
  but [accepts a second argument since 
3.6.|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote] 
  The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the 
installation error message above seems to indicate that the installation script 
uses somewhere R3.6 syntax. 

> Arrow 1.0.1 R package fails to install on R3.4 over linux
> -
>
> Key: ARROW-9932
> URL: https://issues.apache.org/jira/browse/ARROW-9932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R version 3.4.0 (2015-04-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.5 LTS
>Reporter: Ofek Shilon
>Priority: Major
>
> 1. From R (3.4) prompt, we run
> {{> install.packages("arrow")}}
> and it seems to succeed.
> 2. Next we run:
> {{> arrow::install_arrow()}}
> This is the full output:
> {{Installing package into '/opt/R-3.4.0.mkl/library'}}
>  {{(as 'lib' is unspecified)}}
>  {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}}
>  {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}}
>  {{==}}
>  {{downloaded 268 KB}}
> {{installing *source* package 'arrow' ...}}
>  {{** package 'arrow' successfully unpacked and MD5 sums checked}}
>  {{*** No C++ binaries found for ubuntu-14.04}}
>  {{*** Successfully retrieved C++ source}}
>  {{*** Building C++ libraries}}
>  {{ cmake}}
>  {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument 
> (FALSE)}}*{color}
>  {color:#ff} *{{Calls: build_libarrow -> paste}}*{color}
>  {color:#ff} *{{Execution halted}}*{color}
>  {{- NOTE ---}}
>  {{After installation, please run arrow::install_arrow()}}
>  {{for help installing required runtime libraries}}
>  {{-}}
>  {{** libs}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array.cpp -o array.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c buffer.cpp -o buffer.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compression.cpp -o compression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include 

[jira] [Updated] (ARROW-9932) Arrow 1.0.1 R package fails to install on R3.4 over linux

2020-09-08 Thread Ofek Shilon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofek Shilon updated ARROW-9932:
---
Summary: Arrow 1.0.1 R package fails to install on R3.4 over linux  (was: 
Arrow 1.0.1 R package fails to install on R3.4)

> Arrow 1.0.1 R package fails to install on R3.4 over linux
> -
>
> Key: ARROW-9932
> URL: https://issues.apache.org/jira/browse/ARROW-9932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R version 3.4.0 (2015-04-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.5 LTS
>Reporter: Ofek Shilon
>Priority: Major
>
> 1. From R (3.4) prompt, we run
> {{> install.packages("arrow")}}
> and it seems to succeed.
> 2. Next we run:
> {{> arrow::install_arrow()}}
> This is the full output:
> {{Installing package into '/opt/R-3.4.0.mkl/library'}}
>  {{(as 'lib' is unspecified)}}
>  {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}}
>  {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}}
>  {{==}}
>  {{downloaded 268 KB}}
> {{installing *source* package 'arrow' ...}}
>  {{** package 'arrow' successfully unpacked and MD5 sums checked}}
>  {{*** No C++ binaries found for ubuntu-14.04}}
>  {{*** Successfully retrieved C++ source}}
>  {{*** Building C++ libraries}}
>  {{ cmake}}
>  {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument 
> (FALSE)}}*{color}
>  {color:#ff} *{{Calls: build_libarrow -> paste}}*{color}
>  {color:#ff} *{{Execution halted}}*{color}
>  {{- NOTE ---}}
>  {{After installation, please run arrow::install_arrow()}}
>  {{for help installing required runtime libraries}}
>  {{-}}
>  {{** libs}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array.cpp -o array.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c buffer.cpp -o buffer.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compression.cpp -o compression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compute.cpp -o compute.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c csv.cpp -o csv.o}}
>  {{g++ -std=gnu++0x 
> -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c dataset.cpp -o dataset.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c datatype.cpp -o datatype.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c expression.cpp -o expression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c feather.cpp -o feather.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c field.cpp -o field.o}}
>  {{g++ -std=gnu++0x 

[jira] [Updated] (ARROW-9932) Arrow 1.0.1 R package fails to install on R3.4

2020-09-08 Thread Ofek Shilon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofek Shilon updated ARROW-9932:
---
Summary: Arrow 1.0.1 R package fails to install on R3.4  (was: R package 
fails to install on Ubuntu 14)

> Arrow 1.0.1 R package fails to install on R3.4
> --
>
> Key: ARROW-9932
> URL: https://issues.apache.org/jira/browse/ARROW-9932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R version 3.4.0 (2015-04-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.5 LTS
>Reporter: Ofek Shilon
>Priority: Major
>
> 1. From R (3.4) prompt, we run
> {{> install.packages("arrow")}}
> and it seems to succeed.
> 2. Next we run:
> {{> arrow::install_arrow()}}
> This is the full output:
> {{Installing package into '/opt/R-3.4.0.mkl/library'}}
>  {{(as 'lib' is unspecified)}}
>  {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}}
>  {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}}
>  {{==}}
>  {{downloaded 268 KB}}
> {{installing *source* package 'arrow' ...}}
>  {{** package 'arrow' successfully unpacked and MD5 sums checked}}
>  {{*** No C++ binaries found for ubuntu-14.04}}
>  {{*** Successfully retrieved C++ source}}
>  {{*** Building C++ libraries}}
>  {{ cmake}}
>  {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument 
> (FALSE)}}*{color}
>  {color:#ff} *{{Calls: build_libarrow -> paste}}*{color}
>  {color:#ff} *{{Execution halted}}*{color}
>  {{- NOTE ---}}
>  {{After installation, please run arrow::install_arrow()}}
>  {{for help installing required runtime libraries}}
>  {{-}}
>  {{** libs}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array.cpp -o array.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c buffer.cpp -o buffer.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compression.cpp -o compression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compute.cpp -o compute.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c csv.cpp -o csv.o}}
>  {{g++ -std=gnu++0x 
> -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c dataset.cpp -o dataset.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c datatype.cpp -o datatype.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c expression.cpp -o expression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c feather.cpp -o feather.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c field.cpp -o field.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include 

[jira] [Comment Edited] (ARROW-9932) R package fails to install on Ubuntu 14

2020-09-08 Thread Ofek Shilon (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192227#comment-17192227
 ] 

Ofek Shilon edited comment on ARROW-9932 at 9/8/20, 1:55 PM:
-

The previous suspicion is *not* the root cause of the installation failure.

 

The signature of dQuote changed in R3.6. It [accepted a single argument 
before|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote] 
but [accepts a second argument since 
3.6|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote]. 
The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the 
installation error message above seems to indicate that the installation script 
uses R3.6 syntax somewhere. 


was (Author: ofek):
The previous suspicion is *not* the root cause of the installation failure.

 

The signature of dQuote changed in R3.6. It [accepted a single argument 
before|[https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote],]][,|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote],]
 but [accepts a second argument since 
3.6.|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote] 
  The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the 
installation error message above seems to indicate that the installation script 
uses somewhere R3.6 syntax. 

> R package fails to install on Ubuntu 14
> ---
>
> Key: ARROW-9932
> URL: https://issues.apache.org/jira/browse/ARROW-9932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R version 3.4.0 (2015-04-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.5 LTS
>Reporter: Ofek Shilon
>Priority: Major
>
> 1. From R (3.4) prompt, we run
> {{> install.packages("arrow")}}
> and it seems to succeed.
> 2. Next we run:
> {{> arrow::install_arrow()}}
> This is the full output:
> {{Installing package into '/opt/R-3.4.0.mkl/library'}}
>  {{(as 'lib' is unspecified)}}
>  {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}}
>  {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}}
>  {{==}}
>  {{downloaded 268 KB}}
> {{installing *source* package 'arrow' ...}}
>  {{** package 'arrow' successfully unpacked and MD5 sums checked}}
>  {{*** No C++ binaries found for ubuntu-14.04}}
>  {{*** Successfully retrieved C++ source}}
>  {{*** Building C++ libraries}}
>  {{ cmake}}
>  {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument 
> (FALSE)}}*{color}
>  {color:#ff} *{{Calls: build_libarrow -> paste}}*{color}
>  {color:#ff} *{{Execution halted}}*{color}
>  {{- NOTE ---}}
>  {{After installation, please run arrow::install_arrow()}}
>  {{for help installing required runtime libraries}}
>  {{-}}
>  {{** libs}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array.cpp -o array.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c buffer.cpp -o buffer.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compression.cpp -o compression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 

[jira] [Reopened] (ARROW-9932) R package fails to install on Ubuntu 14

2020-09-08 Thread Ofek Shilon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofek Shilon reopened ARROW-9932:


The previous suspicion is *not* the root cause of the installation failure.

 

The signature of dQuote changed in R3.6. It [accepted a single argument 
before|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote] 
but [accepts a second argument since 
3.6|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote]. 
The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the 
installation error message above seems to indicate that the installation script 
uses R3.6 syntax somewhere. 

> R package fails to install on Ubuntu 14
> ---
>
> Key: ARROW-9932
> URL: https://issues.apache.org/jira/browse/ARROW-9932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R version 3.4.0 (2015-04-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.5 LTS
>Reporter: Ofek Shilon
>Priority: Major
>
> 1. From R (3.4) prompt, we run
> {{> install.packages("arrow")}}
> and it seems to succeed.
> 2. Next we run:
> {{> arrow::install_arrow()}}
> This is the full output:
> {{Installing package into '/opt/R-3.4.0.mkl/library'}}
>  {{(as 'lib' is unspecified)}}
>  {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}}
>  {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}}
>  {{==}}
>  {{downloaded 268 KB}}
> {{installing *source* package 'arrow' ...}}
>  {{** package 'arrow' successfully unpacked and MD5 sums checked}}
>  {{*** No C++ binaries found for ubuntu-14.04}}
>  {{*** Successfully retrieved C++ source}}
>  {{*** Building C++ libraries}}
>  {{ cmake}}
>  {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument 
> (FALSE)}}*{color}
>  {color:#ff} *{{Calls: build_libarrow -> paste}}*{color}
>  {color:#ff} *{{Execution halted}}*{color}
>  {{- NOTE ---}}
>  {{After installation, please run arrow::install_arrow()}}
>  {{for help installing required runtime libraries}}
>  {{-}}
>  {{** libs}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array.cpp -o array.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c buffer.cpp -o buffer.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compression.cpp -o compression.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c compute.cpp -o compute.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c csv.cpp -o csv.o}}
>  {{g++ -std=gnu++0x 
> -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c dataset.cpp -o dataset.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG 
> -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic 
> -march=x86-64 -O3 -c datatype.cpp -o datatype.o}}
>  {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include 

[jira] [Resolved] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API

2020-09-08 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9821.
---
Resolution: Fixed

Issue resolved by pull request 8097
[https://github.com/apache/arrow/pull/8097]

> [Rust][DataFusion] User Defined PlanNode / Operator API
> ---
>
> Key: ARROW-9821
> URL: https://issues.apache.org/jira/browse/ARROW-9821
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> The basic goal is to allow users to implement their own PlanNodes. I will 
> provide a Google doc, open for comments, shortly.
> Proposal: 
> https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#
> See also mailing list discussion here: 
> https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9942) [Python] Schema Evolution - Add new Field

2020-09-08 Thread Daniel Figus (Jira)
Daniel Figus created ARROW-9942:
---

 Summary: [Python] Schema Evolution - Add new Field
 Key: ARROW-9942
 URL: https://issues.apache.org/jira/browse/ARROW-9942
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 1.0.0
 Environment: pandas==1.1.1
pyarrow==1.0.0
Reporter: Daniel Figus


We are trying to leverage the new Dataset implementation and specifically rely 
on the schema evolution feature there. However when adding a new field in a 
later parquet file, the schemas don't seem to be merged and the new field is 
not available. 

Simple example:
{code:python}
import pandas as pd
from pyarrow import parquet as pq
from pyarrow import dataset as ds
import pyarrow as pa

path = "data/sample/"

df1 = pd.DataFrame({"field1": ["a", "b", "c"]})
df2 = pd.DataFrame({"field1": ["d", "e", "f"],
"field2": ["x", "y", "z"]})

df1.to_parquet(path + "df1.parquet", coerce_timestamps=None, version="2.0", 
index=False)
df2.to_parquet(path + "df2.parquet", coerce_timestamps=None, version="2.0", 
index=False)

# read via pandas
df = pd.read_parquet(path)
print(df.head())
print(df.info())
{code}
Output:
{noformat}
  field1
0  a
1  b
2  c
3  d
4  e

RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  --  --  - 
 0   field1  6 non-null  object
dtypes: object(1)
memory usage: 176.0+ bytes
None
{noformat}
My expectation was to get field2 as well, based on what I have understood of 
the new Dataset implementation from ARROW-8039.

When using the Dataset API with a schema created from the second dataframe I'm 
able to read the field2:
{code:python}
# write metadata
schema = pa.Schema.from_pandas(df2, preserve_index=False)
pq.write_metadata(schema, path + "_common_metadata", version="2.0", 
coerce_timestamps=None)

# read with new dataset and schema
schema = pq.read_schema(path + "_common_metadata")
df = ds.dataset(path, schema=schema, format="parquet").to_table().to_pandas()
print(df.head())
print(df.info())
{code}
Output:
{noformat}
  field1 field2
0  a   None
1  b   None
2  c   None
3  d  x
4  e  y

RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  --  --  - 
 0   field1  6 non-null  object
 1   field2  3 non-null  object
dtypes: object(2)
memory usage: 224.0+ bytes
None
{noformat}
This works, however I want to avoid writing a {{_common_metadata}} file if 
possible. Is there a way to get the schema merge without passing an explicit 
schema? Or is this yet to be implemented?
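
As an illustration (not part of the original report), one workaround sketch, assuming {{pa.unify_schemas}} is available in the installed pyarrow, is to merge the per-file schemas by hand instead of writing a {{_common_metadata}} file:

{code:python}
import glob

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "data/sample/"
files = sorted(glob.glob(path + "*.parquet"))

# Merge the per-file schemas and pass the result to ds.dataset() explicitly.
merged_schema = pa.unify_schemas([pq.read_schema(f) for f in files])
table = ds.dataset(files, schema=merged_schema, format="parquet").to_table()
print(table.schema)
{code}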



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9941) [Python] Better string representation for extension types

2020-09-08 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-9941:
-

 Summary: [Python] Better string representation for extension types
 Key: ARROW-9941
 URL: https://issues.apache.org/jira/browse/ARROW-9941
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++, Python
Reporter: Antoine Pitrou


When one defines an extension type in Python (by subclassing 
{{PyExtensionType}}) and uses that type e.g. in an Arrow table, the printed 
schema looks like this:
{code}
pyarrow.Table
a: extension
b: extension
{code}
... which isn't very informative. PyExtensionType could perhaps override 
ToString() and call {{str}} on the type instance.
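
A minimal reproduction sketch (the type name is made up and the exact printed form may differ):

{code:python}
import pyarrow as pa

class MyExtType(pa.PyExtensionType):
    """Bare-bones extension type wrapping int64 storage."""

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.int64())

    def __reduce__(self):
        return MyExtType, ()

storage = pa.array([1, 2, 3], type=pa.int64())
table = pa.table({"a": pa.ExtensionArray.from_storage(MyExtType(), storage)})
print(table.schema)  # the field currently prints only as a generic 'extension' entry
{code}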




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9941) [Python] Better string representation for extension types

2020-09-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192202#comment-17192202
 ] 

Antoine Pitrou commented on ARROW-9941:
---

cc [~jorisvandenbossche]

> [Python] Better string representation for extension types
> -
>
> Key: ARROW-9941
> URL: https://issues.apache.org/jira/browse/ARROW-9941
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> When one defines an extension type in Python (by subclassing 
> {{PyExtensionType}}) and uses that type e.g. in an Arrow table, the printed 
> schema looks like this:
> {code}
> pyarrow.Table
> a: extension
> b: extension
> {code}
> ... which isn't very informative. PyExtensionType could perhaps override 
> ToString() and call {{str}} on the type instance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9941) [Python] Better string representation for extension types

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9941:
--
Fix Version/s: 2.0.0

> [Python] Better string representation for extension types
> -
>
> Key: ARROW-9941
> URL: https://issues.apache.org/jira/browse/ARROW-9941
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 2.0.0
>
>
> When one defines an extension type in Python (by subclassing 
> {{PyExtensionType}}) and uses that type e.g. in an Arrow table, the printed 
> schema looks like this:
> {code}
> pyarrow.Table
> a: extension
> b: extension
> {code}
> ... which isn't very informative. PyExtensionType could perhaps override 
> ToString() and call {{str}} on the type instance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-08 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192188#comment-17192188
 ] 

Joris Van den Bossche commented on ARROW-9924:
--

There was one other issue about a performance regression (ARROW-9827), for 
which I have an open PR (fix to not parse statistics when there is no filter 
specified). Now, I tried a release build of that branch compared to master, and 
that doesn't seem to make a difference for this case.

bq. IMHO we should not continue to use the Dataset interface for reading single 
files by default until the perf regression has been eliminated. 

That came up before, and we can certainly still use the old ParquetFile reader 
if there is eg no {{filter}} specified (we shouldn't use ParquetDataset for 
this case, though, as was done before 1.0)

---

I did a quick profile (with py-spy), and it _seems_ that the dataset version 
has a bit more overhead in all kinds of iteration (it uses the 
RecordBatchReader, and not the {{FileReader::ReadTable}}, which is specifically 
designed to read the whole parquet file at once).

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> 
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})  
>   
>   
> In [28]: pq.write_table(pa.table(df), 'test.parquet') 
>   
>   
> In [29]: timeit pq.read_table('test.parquet') 
>   
>   
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
>   
>   
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9940) [Rust][DataFusion] Generic "extension package" mechanism

2020-09-08 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9940:
--

 Summary: [Rust][DataFusion] Generic "extension package" mechanism
 Key: ARROW-9940
 URL: https://issues.apache.org/jira/browse/ARROW-9940
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Andrew Lamb


This came from [~jorgecarleitao]'s suggestion on this PR: 
 https://github.com/apache/arrow/pull/8097/files#r482968858

The high-level idea is to design and implement an upgrade/improvement to the 
DataFusion APIs which allows registering composable sets of 
UserDefinedLogicalNode, logical planning rules, and physical planning rules for 
some functionality.

h2. The use case:

You publish the TopK extension as a (library) crate called datafusion-topk, and 
I publish a crate datafusion-s3 with another extension.

A user wants to use both extensions. They install them by:

# adding each crate to Cargo.toml
# initialize the default planner with both of them
# plan them
# execute them
I.e. freaking easy!

Broadly speaking, this allows the existence of an ecosystem of 
extensions/user-defined plans: people can share hand-crafted plans and plans 
can be added as dependencies to the crate and registered to the planner to be 
used by other people.

This also reduces the pressure of placing everything in DataFusion's codebase: 
if we offer an API to extend DataFusion in this way, people can just distribute 
libraries with the extension/user-defined plan without having to go through the 
decision process of whether X is part of DataFusion's core or not (e.g. a scan 
of format Y, or a scan over protocol Z).

For me, this use case does require an easy way to achieve step 2 (initializing 
the default planner with both of them). But again, this PR is definitely a major 
step in this direction!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9939) [Rust][DataFusion] Rename inputs --> child consistently

2020-09-08 Thread Andrew Lamb (Jira)
Andrew Lamb created ARROW-9939:
--

 Summary: [Rust][DataFusion] Rename inputs --> child consistently
 Key: ARROW-9939
 URL: https://issues.apache.org/jira/browse/ARROW-9939
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Andrew Lamb
Assignee: Andrew Lamb


As suggested by [~andygrove]  on 
https://github.com/apache/arrow/pull/8097/files#r484556394

> I've been thinking lately that we should start standardizing on children 
> rather than inputs. 

I think `children` is a more standard term, and having consistent terminology 
across the DataFusion code base will be valuable.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?

2020-09-08 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192130#comment-17192130
 ] 

Krisztian Szucs commented on ARROW-9938:


Supporting remote URIs sounds like a nice feature.

> [Python] Add filesystem capabilities to other IO formats (feather, csv, json, 
> ..)?
> --
>
> Key: ARROW-9938
> URL: https://issues.apache.org/jira/browse/ARROW-9938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: filesystem
>
> In the parquet IO functions, we support reading/writing files from non-local 
> filesystems directly (in addition to passing a buffer) by:
> - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
> - specifying the filesystem keyword (eg 
> {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 
> On the other hand, for other file formats such as feather, we only support 
> local files or buffers. So for those, you need to take a more manual approach 
> (I _suppose_ this works?):
> {code:python}
> from pyarrow import fs, feather
> s3 = fs.S3FileSystem()
> with s3.open_input_file("bucket/data.arrow") as file:
> table = feather.read_table(file)
> {code}
> So I think the question comes up: do we want to extend this filesystem 
> support to other file formats (feather, csv, json) and make this more uniform 
> across pyarrow, or do we prefer to keep the plain readers more low-level (and 
> people can use the datasets API for more convenience)?
> cc [~apitrou] [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9775) [C++] Automatic S3 region selection

2020-09-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192096#comment-17192096
 ] 

Antoine Pitrou commented on ARROW-9775:
---

It seems it can be determined through a HEAD request on a bucket:
https://github.com/aws/aws-cli/issues/2431

This is how boto does it:
https://github.com/boto/botocore/pull/936/files

An S3Client is bound to a region, so some care will be needed in the 
implementation.
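
For illustration only (plain Python rather than Arrow code, and the helper name is made up): S3 appears to report a bucket's region in the {{x-amz-bucket-region}} header of such a HEAD response, which is what the linked botocore change relies on:

{code:python}
import urllib.error
import urllib.request

def detect_bucket_region(bucket):
    """Best-effort region lookup via a HEAD request on the bucket endpoint."""
    req = urllib.request.Request(f"https://{bucket}.s3.amazonaws.com", method="HEAD")
    try:
        resp = urllib.request.urlopen(req)
        return resp.headers.get("x-amz-bucket-region")
    except urllib.error.HTTPError as exc:
        # S3 includes the region header even on 403/404 responses.
        return exc.headers.get("x-amz-bucket-region")
{code}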

> [C++] Automatic S3 region selection
> ---
>
> Key: ARROW-9775
> URL: https://issues.apache.org/jira/browse/ARROW-9775
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
> Environment: macOS, Linux.
>Reporter: Sahil Gupta
>Priority: Major
>  Labels: filesystem
> Fix For: 2.0.0
>
>
> Currently, PyArrow and ArrowCpp need to be provided the region of the S3 
> file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and 
> ArrowCpp can automatically detect the region and get the files, etc. For 
> instance, s3fs and boto3 can read and write files without having to specify 
> the region explicitly. Similar functionality to auto-detect the region would 
> be great to have in PyArrow and ArrowCpp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9935) New filesystem API unable to read empty S3 folders

2020-09-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192088#comment-17192088
 ] 

Antoine Pitrou commented on ARROW-9935:
---

Have you tried using Arrow's own S3 filesystem implementation?

{code:python}
>>> from pyarrow.fs import S3FileSystem
>>> fs = S3FileSystem()
>>> fs.get_file_info("pyarrow-s3-empty-folder-file/mydataset")
{code}

(there may be more S3 configuration to do because this doesn't seem to work 
here: bad region perhaps?)

> New filesystem API unable to read empty S3 folders
> --
>
> Key: ARROW-9935
> URL: https://issues.apache.org/jira/browse/ARROW-9935
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Weston Pace
>Priority: Minor
> Attachments: arrow_9935.py
>
>
> When an empty "folder" is created in S3 using the online bucket explorer tool 
> on the management console, it creates a special empty file with the same 
> name as the folder.
> (Some more details here: 
> [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html])
> If parquet files are later loaded into one of these directories (with or 
> without partitioning subdirectories) then this dataset cannot be read by the 
> new dataset API.  The underlying s3fs `find` method returns a "file" object 
> with size 0 that pyarrow then attempts to read.  Since this file doesn't 
> truly exist, a FileNotFoundError is thrown.
> Would it be safe to simply ignore all files with size 0?
> As a workaround I can wrap s3fs' find method and strip out these objects with 
> size 0 myself.
> I've attached a script showing the issue and a workaround.  It uses a public 
> bucket that I'll leave up for a few months.
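
A rough sketch of that workaround (an illustration, assuming s3fs's fsspec-style {{find(..., detail=True)}}; the bucket path is the one used in the comment above):

{code:python}
import s3fs

fs = s3fs.S3FileSystem(anon=True)
# detail=True returns {path: info}; dropping zero-byte entries filters out the
# placeholder objects that the S3 console creates for empty "folders".
infos = fs.find("pyarrow-s3-empty-folder-file/mydataset", detail=True)
data_files = [p for p, info in infos.items() if info.get("size", 0) > 0]
{code}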



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array

2020-09-08 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn resolved ARROW-9920.
-
Resolution: Fixed

Issue resolved by pull request 8132
[https://github.com/apache/arrow/pull/8132]

> [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
> 
>
> Key: ARROW-9920
> URL: https://issues.apache.org/jira/browse/ARROW-9920
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it 
> the list of chunks:
> {code}
> In [1]: arr = pa.chunked_array([[0, 1], [3, 4]])
> In [2]: pa.concat_arrays(arr.chunks)
> Out[2]: 
> 
> [
>   0,
>   1,
>   3,
>   4
> ]
> {code}
> but if passing the chunked array itself, you get a segfault:
> {code}
> In [4]: pa.concat_arrays(arr)
> Segmentation fault (core dumped)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array

2020-09-08 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn reassigned ARROW-9920:
---

Assignee: Joris Van den Bossche

> [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
> 
>
> Key: ARROW-9920
> URL: https://issues.apache.org/jira/browse/ARROW-9920
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it 
> the list of chunks:
> {code}
> In [1]: arr = pa.chunked_array([[0, 1], [3, 4]])
> In [2]: pa.concat_arrays(arr.chunks)
> Out[2]: 
> 
> [
>   0,
>   1,
>   3,
>   4
> ]
> {code}
> but if passing the chunked array itself, you get a segfault:
> {code}
> In [4]: pa.concat_arrays(arr)
> Segmentation fault (core dumped)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?

2020-09-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192070#comment-17192070
 ] 

Antoine Pitrou commented on ARROW-9938:
---

On the C++ side they will definitely stay more low-level. On the Python side, I 
have no preference. I suppose it could be useful to write 
{{open_csv("s3://...")}}.

> [Python] Add filesystem capabilities to other IO formats (feather, csv, json, 
> ..)?
> --
>
> Key: ARROW-9938
> URL: https://issues.apache.org/jira/browse/ARROW-9938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: filesystem
>
> In the parquet IO functions, we support reading/writing files from non-local 
> filesystems directly (in addition to passing a buffer) by:
> - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
> - specifying the filesystem keyword (eg 
> {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 
> On the other hand, for other file formats such as feather, we only support 
> local files or buffers. So for those, you need to take a more manual approach 
> (I _suppose_ this works?):
> {code:python}
> from pyarrow import fs, feather
> s3 = fs.S3FileSystem()
> with s3.open_input_file("bucket/data.arrow") as file:
> table = feather.read_table(file)
> {code}
> So I think the question comes up: do we want to extend this filesystem 
> support to other file formats (feather, csv, json) and make this more uniform 
> across pyarrow, or do we prefer to keep the plain readers more low-level (and 
> people can use the datasets API for more convenience)?
> cc [~apitrou] [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192069#comment-17192069
 ] 

Antoine Pitrou commented on ARROW-9104:
---

I've added [~revit13] to the contributors and assigned the Jira to her. Thank 
you!

> [C++] Parquet encryption tests should write files to a temporary directory 
> instead of the testing submodule's directory
> ---
>
> Key: ARROW-9104
> URL: https://issues.apache.org/jira/browse/ARROW-9104
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Revital Sur
>Priority: Major
> Fix For: 2.0.0
>
>
> If the source directory is not writable the test raises permission denied 
> error:
> [ RUN  ] TestEncryptionConfiguration.UniformEncryption
> unknown file: Failure
> C++ exception with description "IOError: Failed to open local file 
> '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'.
>  Detail: [errno 13] Permission denied" thrown in the test body.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory

2020-09-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-9104:
-

Assignee: Revital Sur  (was: Gidon Gershinsky)

> [C++] Parquet encryption tests should write files to a temporary directory 
> instead of the testing submodule's directory
> ---
>
> Key: ARROW-9104
> URL: https://issues.apache.org/jira/browse/ARROW-9104
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Assignee: Revital Sur
>Priority: Major
> Fix For: 2.0.0
>
>
> If the source directory is not writable the test raises permission denied 
> error:
> [ RUN  ] TestEncryptionConfiguration.UniformEncryption
> unknown file: Failure
> C++ exception with description "IOError: Failed to open local file 
> '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'.
>  Detail: [errno 13] Permission denied" thrown in the test body.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9846) [Rust] Master branch broken build

2020-09-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9846.
---
Resolution: Not A Problem

> [Rust] Master branch broken build
> -
>
> Key: ARROW-9846
> URL: https://issues.apache.org/jira/browse/ARROW-9846
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> Master branch is failing to build in CI. It fails to compile 
> "tower-balance-0.3.0". I cannot reproduce locally.
> {code:java}
> error[E0502]: cannot borrow `self` as immutable because it is also borrowed 
> as mutable
>--> 
> /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/pool/mod.rs:381:21
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?

2020-09-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9938:
-
Description: 
In the parquet IO functions, we support reading/writing files from non-local 
filesystems directly (in addition to passing a buffer) by:

- passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
- specifying the filesystem keyword (eg 
{{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 

On the other hand, for other file formats such as feather, we only support 
local files or buffers. So for those, you need to do the more manual (I 
_suppose_ this works?):

{code:python}
from pyarrow import fs, feather

s3 = fs.S3FileSystem()

with s3.open_input_file("bucket/data.arrow") as file:
table = feather.read_table(file)
{code}

So I think the question comes up: do we want to extend this filesystem support 
to other file formats (feather, csv, json) and make this more uniform across 
pyarrow, or do we prefer to keep the plain readers more low-level (and people 
can use the datasets API for more convenience)?

cc [~apitrou] [~kszucs]


  was:
In the parquet IO functions, we support reading/writing files from non-local 
filesystems directly (in addition to passing a buffer) by:

- passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
- specifying the filesystem keyword (eg 
{{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 

On the other hand, for other file formats such as feather, we only support 
local files or buffers. So for those, you need to do the more manual (I 
_suppose_ this works?):

{code:python}
from pyarrow import fs, feather

s3 = fs.S3FileSystem()

with s3.open_input_file("bucket/data.arrow") as file:
  table = feather.read_table(file)
{code}

So I think the question comes up: do we want to extend this filesystem support 
to other file formats (feather, csv, json) and make this more uniform across 
pyarrow, or do we prefer to keep the plain readers more low-level (and people 
can use the datasets API for more convenience)?

cc [~apitrou] [~kszucs]



> [Python] Add filesystem capabilities to other IO formats (feather, csv, json, 
> ..)?
> --
>
> Key: ARROW-9938
> URL: https://issues.apache.org/jira/browse/ARROW-9938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: filesystem
>
> In the parquet IO functions, we support reading/writing files from non-local 
> filesystems directly (in addition to passing a buffer) by:
> - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
> - specifying the filesystem keyword (eg 
> {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 
> On the other hand, for other file formats such as feather, we only support 
> local files or buffers. So for those, you need to do the more manual (I 
> _suppose_ this works?):
> {code:python}
> from pyarrow import fs, feather
> s3 = fs.S3FileSystem()
> with s3.open_input_file("bucket/data.arrow") as file:
> table = feather.read_table(file)
> {code}
> So I think the question comes up: do we want to extend this filesystem 
> support to other file formats (feather, csv, json) and make this more uniform 
> across pyarrow, or do we prefer to keep the plain readers more low-level (and 
> people can use the datasets API for more convenience)?
> cc [~apitrou] [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?

2020-09-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9938:
-
Description: 
In the parquet IO functions, we support reading/writing files from non-local 
filesystems directly (in addition to passing a buffer) by:

- passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
- specifying the filesystem keyword (eg 
{{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 

On the other hand, for other file formats such as feather, we only support 
local files or buffers. So for those, you need to do the more manual (I 
_suppose_ this works?):

{code:python}
from pyarrow import fs, feather

s3 = fs.S3FileSystem()

with s3.open_input_file("bucket/data.arrow") as file:
  table = feather.read_table(file)
{code}

So I think the question comes up: do we want to extend this filesystem support 
to other file formats (feather, csv, json) and make this more uniform across 
pyarrow, or do we prefer to keep the plain readers more low-level (and people 
can use the datasets API for more convenience)?

cc [~apitrou] [~kszucs]


  was:
In the parquet IO functions, we support reading/writing files from non-local 
filesystems directly (in addition to passing a buffer) by:

- passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
- specifying the filesystem keyword (eg 
{{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 

On the other hand, for other file formats such as feather, we only support 
local files. So for those, you need to do the more manual (I _suppose_ this 
works?):

{code:python}
from pyarrow import fs, feather

s3 = fs.S3FileSystem()

with s3.open_input_file("bucket/data.arrow") as file:
  table = feather.read_table(file)
{code}

So I think the question comes up: do we want to extend this filesystem support 
to other file formats (feather, csv, json) and make this more uniform across 
pyarrow, or do we prefer to keep the plain readers more low-level (and people 
can use the datasets API for more convenience)?

cc [~apitrou] [~kszucs]



> [Python] Add filesystem capabilities to other IO formats (feather, csv, json, 
> ..)?
> --
>
> Key: ARROW-9938
> URL: https://issues.apache.org/jira/browse/ARROW-9938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: filesystem
>
> In the parquet IO functions, we support reading/writing files from non-local 
> filesystems directly (in addition to passing a buffer) by:
> - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
> - specifying the filesystem keyword (eg 
> {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 
> On the other hand, for other file formats such as feather, we only support 
> local files or buffers. So for those, you need to do the more manual (I 
> _suppose_ this works?):
> {code:python}
> from pyarrow import fs, feather
> s3 = fs.S3FileSystem()
> with s3.open_input_file("bucket/data.arrow") as file:
>   table = feather.read_table(file)
> {code}
> So I think the question comes up: do we want to extend this filesystem 
> support to other file formats (feather, csv, json) and make this more uniform 
> across pyarrow, or do we prefer to keep the plain readers more low-level (and 
> people can use the datasets API for more convenience)?
> cc [~apitrou] [~kszucs]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?

2020-09-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9938:


 Summary: [Python] Add filesystem capabilities to other IO formats 
(feather, csv, json, ..)?
 Key: ARROW-9938
 URL: https://issues.apache.org/jira/browse/ARROW-9938
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


In the parquet IO functions, we support reading/writing files from non-local 
filesystems directly (in addition to passing a buffer) by:

- passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}})
- specifying the filesystem keyword (eg 
{{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) 

On the other hand, for other file formats such as feather, we only support 
local files. So for those, you need to do the more manual (I _suppose_ this 
works?):

{code:python}
from pyarrow import fs, feather

s3 = fs.S3FileSystem()

with s3.open_input_file("bucket/data.arrow") as file:
  table = feather.read_table(file)
{code}

So I think the question comes up: do we want to extend this filesystem support 
to other file formats (feather, csv, json) and make this more uniform across 
pyarrow, or do we prefer to keep the plain readers more low-level (and people 
can use the datasets API for more convenience)?

cc [~apitrou] [~kszucs]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9919:
--
Component/s: Rust - DataFusion

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> See main issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9919:
--
Affects Version/s: 1.0.0

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Affects Versions: 1.0.0
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> See main issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9919.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8116
[https://github.com/apache/arrow/pull/8116]

> [Rust] [DataFusion] Math functions
> --
>
> Key: ARROW-9919
> URL: https://issues.apache.org/jira/browse/ARROW-9919
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> See main issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9917) [Python][Compute] Add bindings for mode kernel

2020-09-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-9917.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8115
[https://github.com/apache/arrow/pull/8115]

> [Python][Compute] Add bindings for mode kernel
> --
>
> Key: ARROW-9917
> URL: https://issues.apache.org/jira/browse/ARROW-9917
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Andrew Wieteska
>Assignee: Andrew Wieteska
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9920:
--
Labels: pull-request-available  (was: )

> [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
> 
>
> Key: ARROW-9920
> URL: https://issues.apache.org/jira/browse/ARROW-9920
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it 
> the list of chunks:
> {code}
> In [1]: arr = pa.chunked_array([[0, 1], [3, 4]])
> In [2]: pa.concat_arrays(arr.chunks)
> Out[2]: 
> 
> [
>   0,
>   1,
>   3,
>   4
> ]
> {code}
> but if passing the chunked array itself, you get a segfault:
> {code}
> In [4]: pa.concat_arrays(arr)
> Segmentation fault (core dumped)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9936) [Python] Fix / test relative file paths in pyarrow.parquet

2020-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9936:
--
Labels: pull-request-available  (was: )

> [Python] Fix / test relative file paths in pyarrow.parquet
> --
>
> Key: ARROW-9936
> URL: https://issues.apache.org/jira/browse/ARROW-9936
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It seems that I broke writing parquet to relative file paths in ARROW-9718 
> (again, something similar happened in the pyarrow.dataset reading), so I should 
> fix that and properly test it.
> {code}
> In [3]: pq.write_table(table, "test_relative.parquet")
> ...
> ~/scipy/repos/arrow/python/pyarrow/_fs.pyx in 
> pyarrow._fs.FileSystem.from_uri()
> ArrowInvalid: URI has empty scheme: 'test_relative.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9826) [Rust] add set function to PrimitiveArray

2020-09-08 Thread Francesco Gadaleta (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191998#comment-17191998
 ] 

Francesco Gadaleta commented on ARROW-9826:
---

But that can be extremely inefficient. If one needs to change a dozen values in 
a column of millions of elements, that can become prohibitive.
In-place value changes are quite a common operation in data science.

> [Rust] add set function to PrimitiveArray
> -
>
> Key: ARROW-9826
> URL: https://issues.apache.org/jira/browse/ARROW-9826
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Francesco Gadaleta
>Priority: Major
>
> For in-place value replacement in Array, a `set()` function (maybe unsafe?) 
> would be required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-9826) [Rust] add set function to PrimitiveArray

2020-09-08 Thread Francesco Gadaleta (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco Gadaleta updated ARROW-9826:
--
Comment: was deleted

(was: But that can be extremely inefficient. If one needs to change a dozen 
values in a column of millions of elements, that can become prohibitive.
In-place value changes are quite a common operation in data science.)

> [Rust] add set function to PrimitiveArray
> -
>
> Key: ARROW-9826
> URL: https://issues.apache.org/jira/browse/ARROW-9826
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Francesco Gadaleta
>Priority: Major
>
> For in-place value replacement in Array, a `set()` function (maybe unsafe?) 
> would be required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9826) [Rust] add set function to PrimitiveArray

2020-09-08 Thread Francesco Gadaleta (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191997#comment-17191997
 ] 

Francesco Gadaleta commented on ARROW-9826:
---

But that can be extremely inefficient. If one needs to change a dozen values in 
a column of millions of elements, that can become prohibitive.
In-place value changes are quite a common operation in data science.



> [Rust] add set function to PrimitiveArray
> -
>
> Key: ARROW-9826
> URL: https://issues.apache.org/jira/browse/ARROW-9826
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Francesco Gadaleta
>Priority: Major
>
> For in-place value replacement in Array, a `set()` function (maybe unsafe?) 
> would be required.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9937) [Rust] [DataFusion] Average is not correct

2020-09-08 Thread Jorge (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191993#comment-17191993
 ] 

Jorge commented on ARROW-9937:
--

[~andygrove], I remember that you wanted to work on this. If not, let me know and 
I'll take a shot at it.

Looking at [Ballista's source code for 
this|https://github.com/ballista-compute/ballista/blob/main/rust/ballista/src/execution/operators/hash_aggregate.rs]
 , I think that we have the same issue there. :/


> [Rust] [DataFusion] Average is not correct
> --
>
> Key: ARROW-9937
> URL: https://issues.apache.org/jira/browse/ARROW-9937
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>
> The current design of aggregates makes the calculation of the average 
> incorrect.
> It also makes it impossible to compute the [geometric 
> mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
> operations. 
> The central issue is that Accumulator returns a `ScalarValue` during partial 
> aggregations via {{get_value}}, but very often a `ScalarValue` is not 
> sufficient information to perform the full aggregation.
> A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
> distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current 
> calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
> reduces them using another average, i.e.
> {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}
> which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.
> I believe that our Accumulators need to pass more information from the 
> partial aggregations to the final aggregation.
> We could consider taking an API equivalent to 
> [spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), 
> i.e. have an `update`, a `merge` and an `evaluate`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9937) [Rust] [DataFusion] Average is not correct

2020-09-08 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-9937:
-
Description: 
The current design of aggregates makes the calculation of the average incorrect.
It also makes it impossible to compute the [geometric 
mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
operations. 

The central issue is that Accumulator returns a `ScalarValue` during partial 
aggregations via {{get_value}}, but very often a `ScalarValue` is not 
sufficient information to perform the full aggregation.

A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current 
calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
reduces them using another average, i.e.

{{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}

which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.

I believe that our Accumulators need to pass more information from the partial 
aggregations to the final aggregation.

We could consider taking an API equivalent to 
[spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), 
i.e. have an `update`, a `merge` and an `evaluate`.

  was:
The current design of aggregates makes the calculation of the average incorrect.
It also makes it impossible to compute the [geometric 
mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
operations. 

The central issue is that Accumulator returns a `ScalarValue` during partial 
aggregations via {{get_value}}, but very often a `ScalarValue` is not 
sufficient information to perform the full aggregation.

A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
distributed in in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current 
calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
reduces them using another average, i.e.

{{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}

which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.

I believe that our Accumulators need to pass more information from the partial 
aggregations to the final aggregation.

We could consider taking an API equivalent to 
[spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), 
i.e. have an `update`, a `merge` and an `evaluate`.


> [Rust] [DataFusion] Average is not correct
> --
>
> Key: ARROW-9937
> URL: https://issues.apache.org/jira/browse/ARROW-9937
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>
> The current design of aggregates makes the calculation of the average 
> incorrect.
> It also makes it impossible to compute the [geometric 
> mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
> operations. 
> The central issue is that Accumulator returns a `ScalarValue` during partial 
> aggregations via {{get_value}}, but very often a `ScalarValue` is not 
> sufficient information to perform the full aggregation.
> A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
> distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current 
> calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
> reduces them using another average, i.e.
> {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}
> which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.
> I believe that our Accumulators need to pass more information from the 
> partial aggregations to the final aggregation.
> We could consider taking an API equivalent to 
> [spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), 
> i.e. have an `update`, a `merge` and an `evaluate`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9937) [Rust] [DataFusion] Average is not correct

2020-09-08 Thread Jorge (Jira)
Jorge created ARROW-9937:


 Summary: [Rust] [DataFusion] Average is not correct
 Key: ARROW-9937
 URL: https://issues.apache.org/jira/browse/ARROW-9937
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Jorge


The current design of aggregates makes the calculation of the average incorrect.
It also makes it impossible to compute the [geometric 
mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
operations. 

The central issue is that Accumulator returns a `ScalarValue` during partial 
aggregations via {{get_value}}, but very often a `ScalarValue` is not 
sufficient information to perform the full aggregation.

A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current 
calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
reduces them using another average, i.e.

{{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}

which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.

I believe that our Accumulators need to pass more information from the partial 
aggregations to the final aggregation.

We could consider taking an API equivalent to 
[spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), 
i.e. have an `update`, a `merge` and an `evaluate`.
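
As a concrete illustration, a minimal numeric sketch (plain Python rather than 
DataFusion code) of the drift, and of how carrying (sum, count) pairs through an 
update/merge/evaluate-style flow keeps the result exact:

{code:python}
# Mean of [1, 2, 3, 4, 5] split into batches of 2.
batches = [[1, 2], [3, 4], [5]]

# Current scheme: each partial aggregation emits a scalar mean,
# and the final step averages the partial means.
partial_means = [sum(b) / len(b) for b in batches]        # [1.5, 3.5, 5.0]
mean_of_means = sum(partial_means) / len(partial_means)   # 3.333... (wrong)

# update/merge/evaluate scheme: partials carry (sum, count), merge adds them,
# and evaluate divides once at the end.
partials = [(sum(b), len(b)) for b in batches]             # update
merged_sum = sum(s for s, _ in partials)                   # merge
merged_count = sum(c for _, c in partials)
true_mean = merged_sum / merged_count                      # evaluate -> 3.0

assert true_mean == sum(sum(b) for b in batches) / sum(len(b) for b in batches)
{code}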



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9936) [Python] Fix / test relative file paths in pyarrow.parquet

2020-09-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9936:


 Summary: [Python] Fix / test relative file paths in pyarrow.parquet
 Key: ARROW-9936
 URL: https://issues.apache.org/jira/browse/ARROW-9936
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 2.0.0


It seems that I broke writing parquet to relative file paths in ARROW-9718 
(again, something similar happened in the pyarrow.dataset reading), so I should 
fix that and properly test it.

{code}
In [3]: pq.write_table(table, "test_relative.parquet")
...
~/scipy/repos/arrow/python/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.from_uri()

ArrowInvalid: URI has empty scheme: 'test_relative.parquet'
{code}
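
Until that fix lands, a possible workaround sketch (assuming the failure is only 
in the scheme inference for relative paths) is to resolve the path to an absolute 
one before writing:

{code:python}
import pathlib
import pyarrow.parquet as pq

# Absolute local paths are accepted by the filesystem resolution, so resolving
# the relative path up front sidesteps the "URI has empty scheme" error.
pq.write_table(table, str(pathlib.Path("test_relative.parquet").resolve()))
{code}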




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array

2020-09-08 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9920:
-
Fix Version/s: 2.0.0

> [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
> 
>
> Key: ARROW-9920
> URL: https://issues.apache.org/jira/browse/ARROW-9920
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>
> One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it 
> the list of chunks:
> {code}
> In [1]: arr = pa.chunked_array([[0, 1], [3, 4]])
> In [2]: pa.concat_arrays(arr.chunks)
> Out[2]: 
> 
> [
>   0,
>   1,
>   3,
>   4
> ]
> {code}
> but if passing the chunked array itself, you get a segfault:
> {code}
> In [4]: pa.concat_arrays(arr)
> Segmentation fault (core dumped)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9927) [R] Add dplyr group_by, summarise and mutate support in function open_dataset R arrow package

2020-09-08 Thread Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pal updated ARROW-9927:
---
Issue Type: Improvement  (was: Bug)
  Priority: Critical  (was: Major)

> [R] Add dplyr group_by, summarise and mutate support in function open_dataset 
> R arrow package  
> ---
>
> Key: ARROW-9927
> URL: https://issues.apache.org/jira/browse/ARROW-9927
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Pal
>Priority: Critical
>
> Hi, 
>  
> The open_dataset() function in the R arrow package already includes support 
> for the dplyr filter, select and rename functions. However, it would be a 
> huge improvement if it could also include other functions such as group_by, 
> summarise and mutate before calling collect(). Is there any idea or project 
> under way to do so? Would it be possible to include these features 
> (compatible also with dplyr versions < 1)?
> Many thanks for this excellent work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)