[jira] [Created] (ARROW-9951) [C#] ArrowStreamWriter implement sync WriteRecordBatch
Steve Suh created ARROW-9951: Summary: [C#] ArrowStreamWriter implement sync WriteRecordBatch Key: ARROW-9951 URL: https://issues.apache.org/jira/browse/ARROW-9951 Project: Apache Arrow Issue Type: New Feature Components: C# Reporter: Steve Suh Currently ArrowStreamWriter only supports writing record batches asynchronously. We are currently using this in .NET for Apache Spark when we write Arrow records [here|https://github.com/dotnet/spark/blob/aed9214c10470dba8831726251fb2ed171189ecc/src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs#L396]. However, we would prefer to use a sync version instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7871) [Python] Expose more compute kernels
[ https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7871: -- Labels: pull-request-available (was: ) > [Python] Expose more compute kernels > > > Key: ARROW-7871 > URL: https://issues.apache.org/jira/browse/ARROW-7871 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Assignee: Andrew Wieteska >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently only the sum kernel is exposed. > Or consider deprecating/removing the pyarrow.compute module, and binding the > compute kernels as methods instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry
[ https://issues.apache.org/jira/browse/ARROW-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9950: -- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Allow UDF usage without registry > > > Key: ARROW-9950 > URL: https://issues.apache.org/jira/browse/ARROW-9950 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This is a functionality relevant only for the DataFrame API. > Sometimes a UDF declaration happens during planning, and it makes it very > expressive when the user does not have to access the registry at all to plan > the UDF. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry
Jorge created ARROW-9950: Summary: [Rust] [DataFusion] Allow UDF usage without registry Key: ARROW-9950 URL: https://issues.apache.org/jira/browse/ARROW-9950 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge This functionality is relevant only to the DataFrame API. Sometimes a UDF declaration happens during planning, and the API is much more expressive when the user does not have to access the registry at all to plan the UDF. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9895) [RUST] Improve sort kernels
[ https://issues.apache.org/jira/browse/ARROW-9895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9895. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8092 [https://github.com/apache/arrow/pull/8092] > [RUST] Improve sort kernels > --- > > Key: ARROW-9895 > URL: https://issues.apache.org/jira/browse/ARROW-9895 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.0 >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Follow-up from my mailing list post: > {quote}1. When sorting by multiple columns (lexsort_to_indices) the Float32 > and Float64 data types are not supported because the implementation > relies on the OrdArray trait. This trait is not implemented because > f64/f32 only implement PartialOrd. The sort function for a single > column (sort_to_indices) has some special logic which looks like it > wants to treat NaN the same as null, but I'm also not convinced this > is the correct way. For example Postgres does the following > (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT) > "In order to allow floating-point values to be sorted and used in > tree-based indexes, PostgreSQL treats NaN values as equal, and greater > than all non-NaN values." > I propose to do the same in an OrdArray impl for > Float64Array/Float32Array and then simplify the sort_to_indices > function accordingly. > 2. Sorting for dictionary encoded strings. The problem here is that > DictionaryArray does not have a generic parameter for the value type > so it is not currently possible to only implement OrdArray for string > dictionaries. Again for the single column case, the value data type > could be checked and a sort could be implemented by looking up each > key in the dictionary. 
An optimization could be to check the is_sorted > flag of DictionaryArray (which does not seem to be used really) and > then directly sort by the keys. For the general case I see roughly two > options: > - Somehow implement an OrdArray view of the dictionary array. This > could be easier if OrdArray did not extend Array but was a completely > separate trait. > - Change the lexicographic sort impl to not use dynamic calls but > instead sort multiple times. So for a query `ORDER BY a, b`, first > sort by b and afterwards sort again by a. With a stable sort > implementation this should result in the same ordering. I'm curious > about the performance; it could avoid dynamic method calls for each > comparison, but it would process the indices vector multiple times. > {quote} > My plan is to open a draft PR with the following changes: > - {{sort_to_indices}} further splits up float64/float32 inputs into > nulls/non-nan/nan, sorts the non-nan values and then concats those 3 slices > according to the sort options. NaNs are distinct from null and sort greater > than any other valid value > - implement a sort method for dictionary arrays with string values. this > kernel checks the {{is_ordered}} flag and sorts just by the keys if it is > set, it will look up the string values otherwise > - for the lexical sort use case the above kernels are not used; instead the > {{OrdArray}} trait is used. To make that more flexible and allow wrapping > arrays with different ordering behavior I will make it no longer extend > {{Array}} and instead only contain the {{cmp_value}} method > - string dictionary sorting can then be implemented with a wrapper struct > {{StringDictionaryArrayAsOrdArray}} which implements {{OrdArray}} > - NaN-aware sorting of floats can also be implemented with a wrapper struct > and trait implementation -- This message was sent by Atlassian Jira (v8.3.4#803005)
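The two ideas in this plan, NaN values sorting greater than every non-NaN value (as in PostgreSQL) and lexicographic multi-column ordering via repeated stable sorts, can be sketched in Python. This is an illustration only; the actual kernels are implemented in Rust and are not shown here.

```python
import math

def float_sort_key(x):
    # NaNs compare equal to each other and greater than all non-NaN values,
    # matching the PostgreSQL behavior quoted in the thread.
    return (math.isnan(x), 0.0 if math.isnan(x) else x)

def lexsort(rows, keys):
    # ORDER BY a, b via repeated stable sorts: sort by the last key first,
    # then re-sort by earlier keys; stability preserves the later ordering.
    for key in reversed(keys):
        rows = sorted(rows, key=key)
    return rows

print(sorted([3.0, float("nan"), 1.0], key=float_sort_key))  # NaN sorts last
print(lexsort([(1, "b"), (0, "z"), (1, "a")],
              [lambda r: r[0], lambda r: r[1]]))
# -> [(0, 'z'), (1, 'a'), (1, 'b')]
```

The second function mirrors the "sort multiple times" option: it trades per-comparison dynamic dispatch for multiple passes over the indices.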
[jira] [Resolved] (ARROW-9751) [Rust] [DataFusion] Extend UDFs to accept more than one type per argument
[ https://issues.apache.org/jira/browse/ARROW-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9751. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 7967 [https://github.com/apache/arrow/pull/7967] > [Rust] [DataFusion] Extend UDFs to accept more than one type per argument > - > > Key: ARROW-9751 > URL: https://issues.apache.org/jira/browse/ARROW-9751 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10.5h > Remaining Estimate: 0h > > Most math functions accept float32 and float64, `length` will accept Utf8 and > lists soon, etc. > The goal of this story is to allow UDFs to accept more than one datatype. > Design: the accepted datatypes should be a vector ordered by "faster/smaller" > to "slower/larger" (cpu/memory). When the plan reaches a UDF, we try to cast > the input expression like before, from "faster/smaller" to "slower/larger". -- This message was sent by Atlassian Jira (v8.3.4#803005)
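The accepted-type resolution described in this issue can be sketched as follows. The function and cast table are hypothetical illustrations of the design, not DataFusion's actual API: each argument lists its accepted types ordered from "faster/smaller" to "slower/larger", and the planner picks the first accepted type the input can be cast to.

```python
# Which types each input type may be cast to (illustrative, not exhaustive).
CASTS = {
    "float32": ["float32", "float64"],  # float32 can widen to float64
    "float64": ["float64"],
    "utf8": ["utf8"],
}

def resolve_argument_type(input_type, accepted):
    # `accepted` is ordered from faster/smaller to slower/larger; return the
    # first candidate the input can be cast to, as in the design above.
    for candidate in accepted:
        if candidate in CASTS.get(input_type, []):
            return candidate
    raise TypeError(f"cannot coerce {input_type} to any of {accepted}")

# A float32 input needs no widening; a float64 input keeps full precision.
print(resolve_argument_type("float32", ["float32", "float64"]))  # float32
print(resolve_argument_type("float64", ["float32", "float64"]))  # float64
```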
[jira] [Updated] (ARROW-9949) [C++] Generalize Decimal128::FromString for reuse in Decimal256
[ https://issues.apache.org/jira/browse/ARROW-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9949: -- Labels: pull-request-available (was: ) > [C++] Generalize Decimal128::FromString for reuse in Decimal256 > --- > > Key: ARROW-9949 > URL: https://issues.apache.org/jira/browse/ARROW-9949 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Mingyu Zhong >Assignee: Mingyu Zhong >Priority: Blocker > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9948) [C++] Decimal128 does not check scale range when rescaling; can cause buffer overflow
[ https://issues.apache.org/jira/browse/ARROW-9948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingyu Zhong updated ARROW-9948: Summary: [C++] Decimal128 does not check scale range when rescaling; can cause buffer overflow (was: Decimal128 does not check scale range when rescaling; can cause buffer overflow) > [C++] Decimal128 does not check scale range when rescaling; can cause buffer > overflow > - > > Key: ARROW-9948 > URL: https://issues.apache.org/jira/browse/ARROW-9948 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Mingyu Zhong >Priority: Major > > BasicDecimal128::GetScaleMultiplier has a DCHECK on the scale, but the scale > can come from users. For example, Decimal128::FromString("1e100") will cause > an out-of-bound read. > BasicDecimal128::Rescale and BasicDecimal128::GetWholeAndFraction have the > same problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9948) Decimal128 does not check scale range when rescaling; can cause buffer overflow
Mingyu Zhong created ARROW-9948: --- Summary: Decimal128 does not check scale range when rescaling; can cause buffer overflow Key: ARROW-9948 URL: https://issues.apache.org/jira/browse/ARROW-9948 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Mingyu Zhong BasicDecimal128::GetScaleMultiplier has a DCHECK on the scale, but the scale can come from users. For example, Decimal128::FromString("1e100") will cause an out-of-bound read. BasicDecimal128::Rescale and BasicDecimal128::GetWholeAndFraction have the same problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
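The missing check can be sketched in Python. The names mimic the C++ code for illustration only; the point is that a user-controlled scale must be validated at runtime, because a DCHECK is compiled out in release builds.

```python
# Decimal128 scale multipliers are only defined for scales 0..38.
MAX_PRECISION = 38
SCALE_MULTIPLIERS = [10 ** i for i in range(MAX_PRECISION + 1)]

def get_scale_multiplier(scale):
    # Validate instead of asserting: e.g. a scale derived from parsing
    # "1e100" would otherwise read past the end of the table.
    if not 0 <= scale <= MAX_PRECISION:
        raise ValueError(f"scale {scale} out of range [0, {MAX_PRECISION}]")
    return SCALE_MULTIPLIERS[scale]

print(get_scale_multiplier(3))  # 1000
```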
[jira] [Updated] (ARROW-9944) [Rust] Implement TO_TIMESTAMP function
[ https://issues.apache.org/jira/browse/ARROW-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9944: -- Labels: pull-request-available (was: ) > [Rust] Implement TO_TIMESTAMP function > -- > > Key: ARROW-9944 > URL: https://issues.apache.org/jira/browse/ARROW-9944 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Implement the TO_TIMESTAMP function, as described in > https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-9935) [Python] datasets unable to read empty S3 folders with fsspec' s3fs
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-9935. - Resolution: Not A Problem > [Python] datasets unable to read empty S3 folders with fsspec' s3fs > --- > > Key: ARROW-9935 > URL: https://issues.apache.org/jira/browse/ARROW-9935 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0 >Reporter: Weston Pace >Priority: Minor > Attachments: arrow_9935.py > > > When an empty "folder" is created in S3 using the online bucket explorer tool > on the management console then it creates a special empty file with the same > name as the folder. > (Some more details here: > [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html]) > If parquet files are later loaded into one of these directories (with or > without partitioning subdirectories) then this dataset cannot be read by the > new dataset API. The underlying s3fs `find` method returns a "file" object > with size 0 that pyarrow then attempts to read. Since this file doesn't > truly exist a FileNotFoundError is thrown. > Would it be safe to simply ignore all files with size 0? > As a workaround I can wrap s3fs' find method and strip out these objects with > size 0 myself. > I've attached a script showing the issue and a workaround. It uses a public > bucket that I'll leave up for a few months. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9935) [Python] datasets unable to read empty S3 folders with fsspec' s3fs
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-9935: -- Summary: [Python] datasets unable to read empty S3 folders with fsspec' s3fs (was: [Python] New filesystem API unable to read empty S3 folders) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reopened ARROW-9935: --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-9935. - Resolution: Not A Problem -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192483#comment-17192483 ] Antoine Pitrou commented on ARROW-9935: --- Thanks for the feedback. I'll close this issue then. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192481#comment-17192481 ] Weston Pace commented on ARROW-9935: I tried that out and Arrow's own S3 implementation does not run into this issue. This would only affect the s3fs implementation. Either way, this is not a big problem for me since I have the workaround and I might switch to the builtin implementation anyways if the performance is significantly different. -- This message was sent by Atlassian Jira (v8.3.4#803005)
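The size-0 workaround discussed in this thread can be sketched as a small filter over a file listing. The dict shape imitates what fsspec's find(detail=True) returns; the wrapper name is hypothetical, not a pyarrow or s3fs API.

```python
def strip_folder_placeholders(listing):
    # Drop the zero-size placeholder "files" the S3 console creates for
    # empty folders, so the dataset reader never tries to open them.
    return {path: info for path, info in listing.items()
            if info.get("size", 0) > 0}

listing = {
    "bucket/data/": {"size": 0},                    # folder placeholder
    "bucket/data/part-0.parquet": {"size": 1234},   # real file
}
print(strip_folder_placeholders(listing))
# -> {'bucket/data/part-0.parquet': {'size': 1234}}
```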
[jira] [Updated] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?
[ https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Scott Wilson updated ARROW-8199: Attachment: DataFrame.h Hey Wes, I hope you and yours are doing well in this strange time. I'm just writing to thank you for all the work you did on Arrow and the various discussions you've posted about the design decisions that drove this development, post-pandas. I've largely completed my C++ DataFrame and replaced Python/pandas code that we use for our ML pipeline. Using the Arrow framework, I've been able to create a DataFrame object that wraps one or more Arrow tables. The implementation supports no-copy subsets, joins and concatenations, and STL-like iterators. Also supported are transforms using in-place lambda functions. The net is that a ~1 TB data processing step that used to take 13 hours now requires 15 minutes. The only kluge I put into place has to do with support for null values. I allow in-place editing of values, but no changes to array sizes or types. This is possible because the typed arrays offer access to the underlying raw values. To offer the same for null values I had to create derived classes for Array and ChunkedArray that offer access to the cached null_counts. I've attached the DataFrame header in case it's of interest. Thanks again, Scott -- Scott B. Wilson Chairman and Chief Scientist Persyst Development Corporation 420 Stevens Avenue, Suite 210 Solana Beach, CA 92075 > [C++] Guidance for creating multi-column sort on Table example? > --- > > Key: ARROW-8199 > URL: https://issues.apache.org/jira/browse/ARROW-8199 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.16.0 >Reporter: Scott Wilson >Priority: Minor > Labels: c++, newbie > Attachments: ArrowCsv.cpp, DataFrame.h > > > I'm just coming up to speed with Arrow and am noticing a dearth of examples > ... maybe I can help here. 
> I'd like to implement multi-column sorting for Tables and just want to ensure > that I'm not duplicating existing work or proposing a bad design. > My thought was to create a Table-specific version of SortToIndices() where > you can specify the columns and sort order. > Then I'd create Array "views" that use the indices to remap from the original > Array values to the values in sorted order. (Original data is not sorted, but > could be as a second step.) I noticed some of the array list variants keep > offsets, but didn't see anything that supports remapping per a list of > indices; this may just be my oversight. > Thanks in advance, Scott -- This message was sent by Atlassian Jira (v8.3.4#803005)
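The proposed design, a table-level SortToIndices that returns a row permutation plus index-remapping "views", can be sketched as a toy Python model. All names here are illustrative, not the Arrow C++ API.

```python
def sort_to_indices(table, sort_keys):
    # table: dict of column name -> list of values
    # sort_keys: list of (column name, ascending) pairs
    n = len(next(iter(table.values())))
    indices = list(range(n))
    # Repeated stable sorts, last key first, yield lexicographic order.
    for name, ascending in reversed(sort_keys):
        column = table[name]
        indices.sort(key=lambda i: column[i], reverse=not ascending)
    return indices

def take(column, indices):
    # The "view" step: the original data stays unsorted; reads are remapped
    # through the index permutation.
    return [column[i] for i in indices]

table = {"a": [1, 0, 1], "b": ["b", "z", "a"]}
idx = sort_to_indices(table, [("a", True), ("b", True)])
print(idx)                     # [1, 2, 0]
print(take(table["b"], idx))   # ['z', 'a', 'b']
```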
[jira] [Updated] (ARROW-9909) [C++] Provide a (FileSystem, path) pair to locate files across filesystems
[ https://issues.apache.org/jira/browse/ARROW-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-9909: --- Fix Version/s: 2.0.0 > [C++] Provide a (FileSystem, path) pair to locate files across filesystems > -- > > Key: ARROW-9909 > URL: https://issues.apache.org/jira/browse/ARROW-9909 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 1.0.0 >Reporter: Ben Kietzman >Priority: Major > Fix For: 2.0.0 > > > https://github.com/apache/arrow/pull/8101#discussion_r482921953 > Paths are sufficient to locate files within a known filesystem, but APIs (for > example datasets) do not always have a single known filesystem and in such > contexts a (fs, path) pair would be useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9947) [Python][Parquet] Python API for Parquet encryption
Itamar Turner-Trauring created ARROW-9947: - Summary: [Python][Parquet] Python API for Parquet encryption Key: ARROW-9947 URL: https://issues.apache.org/jira/browse/ARROW-9947 Project: Apache Arrow Issue Type: Improvement Reporter: Itamar Turner-Trauring Python API wrapper for ARROW-9318. Design document will eventually live at https://docs.google.com/document/d/1i1M5f5azLEmASj9XQZ_aQLl5Fr5F0CvnyPPVu1xaD9U/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9931) [C++] Fix undefined behaviour on invalid IPC (OSS-Fuzz)
[ https://issues.apache.org/jira/browse/ARROW-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9931. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8126 [https://github.com/apache/arrow/pull/8126] > [C++] Fix undefined behaviour on invalid IPC (OSS-Fuzz) > --- > > Key: ARROW-9931 > URL: https://issues.apache.org/jira/browse/ARROW-9931 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)
[ https://issues.apache.org/jira/browse/ARROW-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9906: -- Labels: filesystem pull-request-available (was: filesystem) > [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri > (closing NativeFile from S3FileSystem) > --- > > Key: ARROW-9906 > URL: https://issues.apache.org/jira/browse/ARROW-9906 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > Labels: filesystem, pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > See https://github.com/apache/arrow/pull/7991#discussion_r481247263 and the > commented out test added in that PR. > It doesn't give any clarifying traceback or crash message, but it segfaults > when closing the {{NativeFile}} returned from > {{S3FileSystem.open_output_stream}} in {{ParquetWriter.close()}}. > With {{gdb}} I get a bit more context: > {code} > Thread 1 "python" received signal SIGSEGV, Segmentation fault. > 0x7fa1a39df8f2 in arrow::fs::(anonymous > namespace)::ObjectOutputStream::UploadPart (this=0x5619a95ce820, > data=0x7fa197641ec0, nbytes=15671, owned_buffer=...) at > ../src/arrow/filesystem/s3fs.cc:806 > 806 client_->UploadPartAsync(req, handler); > {code} > Another S3 crash in the parquet tests: ARROW-9814 (although it doesn't seem > fully related) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9906) [Python] Crash in test_parquet.py::test_parquet_writer_filesystem_s3_uri (closing NativeFile from S3FileSystem)
[ https://issues.apache.org/jira/browse/ARROW-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-9906: - Assignee: Antoine Pitrou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9946) ParquetFileWriter segfaults when `sink` is a string
Karl Dunkle Werner created ARROW-9946: - Summary: ParquetFileWriter segfaults when `sink` is a string Key: ARROW-9946 URL: https://issues.apache.org/jira/browse/ARROW-9946 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 1.0.1 Environment: Ubuntu 20.04 Reporter: Karl Dunkle Werner Hello again! I have another minor R arrow issue. The {{ParquetFileWriter}} docs say that the {{sink}} argument can be a "string which is interpreted as a file path". However, when I try to use a string, I get a segfault because the memory isn't mapped. Maybe this is a separate request, but it would also be helpful to have documentation for the methods of the writer created by {{ParquetFileWriter$create()}}. Docs link: [https://arrow.apache.org/docs/r/reference/ParquetFileWriter.html] {code:r} library(arrow) sch = schema(a = float32()) writer = ParquetFileWriter$create(schema = sch, sink = "test.parquet") #> *** caught segfault *** #> address 0x1417d, cause 'memory not mapped' #> #> Traceback: #> 1: parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, arrow_properties) #> 2: shared_ptr_is_null(xp) #> 3: shared_ptr(ParquetFileWriter, parquet___arrow___ParquetFileWriter__Open(schema, sink, properties, arrow_properties)) #> 4: ParquetFileWriter$create(schema = sch, sink = "test.parquet") # This works as expected: sink = FileOutputStream$create("test.parquet") writer = ParquetFileWriter$create(schema = sch, sink = sink) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9893) [Python] Bindings for writing datasets to Parquet
[ https://issues.apache.org/jira/browse/ARROW-9893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9893. --- Resolution: Fixed Issue resolved by pull request 8138 [https://github.com/apache/arrow/pull/8138] > [Python] Bindings for writing datasets to Parquet > - > > Key: ARROW-9893 > URL: https://issues.apache.org/jira/browse/ARROW-9893 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Added to C++ in ARROW-9646, follow-up on Python bindings of ARROW-9658 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8718) [R] Add str() methods to objects
[ https://issues.apache.org/jira/browse/ARROW-8718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8718: --- Fix Version/s: (was: 2.0.0) > [R] Add str() methods to objects > > > Key: ARROW-8718 > URL: https://issues.apache.org/jira/browse/ARROW-8718 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Major > > Apparently this will make the RStudio IDE show useful things in the > environment panel. Probably this is most useful for Table, RecordBatch, and > Dataset (and maybe Schema, which would look similar). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9387) [R] Use new C++ table select method
[ https://issues.apache.org/jira/browse/ARROW-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-9387: -- Assignee: Romain Francois (was: Neal Richardson) > [R] Use new C++ table select method > --- > > Key: ARROW-9387 > URL: https://issues.apache.org/jira/browse/ARROW-9387 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Romain Francois >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > ARROW-8314 adds it so we can use it instead of the one we wrote in the R > package. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-3812) [R] union support
[ https://issues.apache.org/jira/browse/ARROW-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-3812: --- Fix Version/s: (was: 2.0.0) > [R] union support > - > > Key: ARROW-3812 > URL: https://issues.apache.org/jira/browse/ARROW-3812 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Romain Francois >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9827) [Python] pandas.read_parquet fails for wide parquet files and pyarrow 1.0.X
[ https://issues.apache.org/jira/browse/ARROW-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman resolved ARROW-9827. - Resolution: Fixed Issue resolved by pull request 8037 [https://github.com/apache/arrow/pull/8037] > [Python] pandas.read_parquet fails for wide parquet files and pyarrow 1.0.X > --- > > Key: ARROW-9827 > URL: https://issues.apache.org/jira/browse/ARROW-9827 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0 >Reporter: Kyle Beauchamp >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > I recently tried to update my pyarrow from 0.17.1 to 1.0.0 and I'm > encountering a serious bug where wide DataFrames fail during > pandas.read_parquet. Small parquet files (m=1) read correctly, medium > files (m=4) fail with a "Bus Error: 10", and large files (m=10) > completely hang. I've tried python 3.8.5, pandas 1.0.5, pyarrow 1.0.0, and > OSX 10.14. > The driver code and output is below: > {code:python} > import pandas as pd > import numpy as np > import sys > filename = "test.parquet" > n = 10 > m = int(sys.argv[1]) > print(m) > x = np.zeros((n, m)) > x = pd.DataFrame(x, columns=[f"A_{i}" for i in range(m)]) > x.to_parquet(filename) > y = pd.read_parquet(filename, engine='pyarrow') > {code} > {code:java} > time python test_pyarrow.py 1 > real 0m4.018s user 0m5.286s sys 0m0.514s > time python test_pyarrow.py 4 > 4 > Bus error: 10 > {code} > > On a pyarrow 0.17.1 environment, the 40,000 case completes in 8 seconds. > This was cross-posted on the pandas tracker as well: > [https://github.com/pandas-dev/pandas/issues/35846] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface
[ https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192289#comment-17192289 ] Wes McKinney commented on ARROW-9924: - I'd prefer to fix the Datasets implementation rather than kicking the can down the road. It doesn't seem reasonable to pay an extra price when using the interface to read a single file. > [Python] Performance regression reading individual Parquet files using > Dataset interface > > > Key: ARROW-9924 > URL: https://issues.apache.org/jira/browse/ARROW-9924 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Critical > Fix For: 2.0.0 > > > I haven't investigated very deeply but this seems symptomatic of a problem: > {code} > In [27]: df = pd.DataFrame({'A': np.random.randn(1000)}) > > > In [28]: pq.write_table(pa.table(df), 'test.parquet') > > > In [29]: timeit pq.read_table('test.parquet') > > > 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True) > > > 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9945) [C++][Dataset] Refactor Expression::Assume to return a Result
Ben Kietzman created ARROW-9945: --- Summary: [C++][Dataset] Refactor Expression::Assume to return a Result Key: ARROW-9945 URL: https://issues.apache.org/jira/browse/ARROW-9945 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 1.0.0 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 2.0.0 Expression::Assume can abort if the two expressions are not valid against a single schema. This is not ideal since a schema is not always easily available. The method should be able to fail gracefully in the case of a best-effort simplification where validation against a schema is not desired. https://github.com/apache/arrow/pull/8037#discussion_r475594117 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9864) [Python] pathlib.Path not supported in write_to_dataset with partition columns
[ https://issues.apache.org/jira/browse/ARROW-9864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-9864: - Assignee: Joris Van den Bossche > [Python] pathlib.Path not supported in write_to_dataset with partition columns > -- > > Key: ARROW-9864 > URL: https://issues.apache.org/jira/browse/ARROW-9864 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 2.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Copying over from https://github.com/pandas-dev/pandas/issues/35902 > {code:python} > import pathlib > df = pd.DataFrame({'A':[1,2,3,4], 'B':'C'}) > df.to_parquet('tmp_path1.parquet') # OK > df.to_parquet(pathlib.Path('tmp_path2.parquet')) # OK > df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK > df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B']) # > TypeError > {code} > {{to_parquet}} method raises TypeError when using {{pathlib.Path()}} as an > argument in case when `partition_cols` argument is not None. If no partition > cols are provided, then {{pathlib.Path()}} is properly accepted > {code} > --- > TypeError Traceback (most recent call last) > in > 3 > 4 df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK > > 5 df.to_parquet(pathlib.Path('tmp_path4.parquet'), > partition_cols=['B']) # TypeError > ... > ~/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py in > write_to_dataset(table, root_path, partition_cols, partition_filename_cb, > filesystem, **kwargs) >1790 subtable = pa.Table.from_pandas(subgroup, > schema=subschema, >1791 safe=False) > -> 1792 _mkdir_if_not_exists(fs, '/'.join([root_path, subdir])) >1793 if partition_filename_cb: >1794 outfile = partition_filename_cb(keys) > TypeError: sequence item 0: expected str instance, PosixPath found > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
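The TypeError in the traceback above comes from {{str.join}} itself, which rejects non-str items, so it can be reproduced without pyarrow at all. A minimal stdlib-only sketch (the partition subdirectory name {{B=C}} is illustrative, and the {{str()}} coercion is only one possible fix, not necessarily the patch that was merged):

```python
import pathlib

root_path = pathlib.Path("tmp_path4.parquet")
subdir = "B=C"  # illustrative partition directory name

# str.join() requires every element to be a str, so a PosixPath root fails
# exactly as in the traceback above.
try:
    "/".join([root_path, subdir])
except TypeError as exc:
    print(exc)  # sequence item 0: expected str instance, PosixPath found

# Coercing the path up front avoids the error (sketch of one possible fix).
print("/".join([str(root_path), subdir]))  # tmp_path4.parquet/B=C
```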
[jira] [Resolved] (ARROW-9864) [Python] pathlib.Path not supported in write_to_dataset with partition columns
[ https://issues.apache.org/jira/browse/ARROW-9864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9864. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8064 [https://github.com/apache/arrow/pull/8064] > [Python] pathlib.Path not supported in write_to_dataset with partition columns > -- > > Key: ARROW-9864 > URL: https://issues.apache.org/jira/browse/ARROW-9864 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 2.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Copying over from https://github.com/pandas-dev/pandas/issues/35902 > {code:python} > import pathlib > df = pd.DataFrame({'A':[1,2,3,4], 'B':'C'}) > df.to_parquet('tmp_path1.parquet') # OK > df.to_parquet(pathlib.Path('tmp_path2.parquet')) # OK > df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK > df.to_parquet(pathlib.Path('tmp_path4.parquet'), partition_cols=['B']) # > TypeError > {code} > {{to_parquet}} method raises TypeError when using {{pathlib.Path()}} as an > argument in case when `partition_cols` argument is not None. If no partition > cols are provided, then {{pathlib.Path()}} is properly accepted > {code} > --- > TypeError Traceback (most recent call last) > in > 3 > 4 df.to_parquet('tmp_path3.parquet', partition_cols=['B']) # OK > > 5 df.to_parquet(pathlib.Path('tmp_path4.parquet'), > partition_cols=['B']) # TypeError > ... 
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/parquet.py in > write_to_dataset(table, root_path, partition_cols, partition_filename_cb, > filesystem, **kwargs) >1790 subtable = pa.Table.from_pandas(subgroup, > schema=subschema, >1791 safe=False) > -> 1792 _mkdir_if_not_exists(fs, '/'.join([root_path, subdir])) >1793 if partition_filename_cb: >1794 outfile = partition_filename_cb(keys) > TypeError: sequence item 0: expected str instance, PosixPath found > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-9935: Summary: [Python] New filesystem API unable to read empty S3 folders (was: New filesystem API unable to read empty S3 folders) > [Python] New filesystem API unable to read empty S3 folders > --- > > Key: ARROW-9935 > URL: https://issues.apache.org/jira/browse/ARROW-9935 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Weston Pace >Priority: Minor > Attachments: arrow_9935.py > > > When an empty "folder" is created in S3 using the online bucket explorer tool > on the management console then it creates a special empty file with the same > name as the folder. > (Some more details here: > [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html)] > If parquet files are later loaded into one of these directories (with or > without partitioning subdirectories) then this dataset cannot be read by the > new dataset API. The underlying s3fs `find` method returns a "file" object > with size 0 that pyarrow then attempts to read. Since this file doesn't > truly exist a FileNotFoundError is thrown. > Would it be safe to simply ignore all files with size 0? > As a workaround I can wrap s3fs' find method and strip out these objects with > size 0 myself. > I've attached a script showing the issue and a workaround. It uses a public > bucket that I'll leave up for a few months. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9935) [Python] New filesystem API unable to read empty S3 folders
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-9935: Component/s: Python > [Python] New filesystem API unable to read empty S3 folders > --- > > Key: ARROW-9935 > URL: https://issues.apache.org/jira/browse/ARROW-9935 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0 >Reporter: Weston Pace >Priority: Minor > Attachments: arrow_9935.py > > > When an empty "folder" is created in S3 using the online bucket explorer tool > on the management console then it creates a special empty file with the same > name as the folder. > (Some more details here: > [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html)] > If parquet files are later loaded into one of these directories (with or > without partitioning subdirectories) then this dataset cannot be read by the > new dataset API. The underlying s3fs `find` method returns a "file" object > with size 0 that pyarrow then attempts to read. Since this file doesn't > truly exist a FileNotFoundError is thrown. > Would it be safe to simply ignore all files with size 0? > As a workaround I can wrap s3fs' find method and strip out these objects with > size 0 myself. > I've attached a script showing the issue and a workaround. It uses a public > bucket that I'll leave up for a few months. -- This message was sent by Atlassian Jira (v8.3.4#803005)
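The workaround described above can be sketched as a thin wrapper that drops the zero-byte "folder" placeholder objects from an s3fs-style {{find}} listing. This is a hypothetical sketch, not the attached script: {{FakeFS}} stands in for an {{s3fs.S3FileSystem}}, and the {{detail=True}} keyword follows the fsspec {{find()}} convention of returning a dict of path-to-info mappings.

```python
def filtered_find(fs, path, **kwargs):
    """Wrap an fsspec-style find(), dropping 0-byte 'folder' placeholders."""
    listing = fs.find(path, detail=True, **kwargs)  # {path: info_dict}
    return {p: info for p, info in listing.items() if info.get("size", 0) > 0}

# Stand-in for an s3fs.S3FileSystem pointed at a console-created "folder".
class FakeFS:
    def find(self, path, detail=False, **kwargs):
        return {
            "bucket/data/": {"size": 0, "type": "file"},  # placeholder object
            "bucket/data/part-0.parquet": {"size": 123, "type": "file"},
        }

print(list(filtered_find(FakeFS(), "bucket/data")))
# ['bucket/data/part-0.parquet']
```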
[jira] [Updated] (ARROW-9814) [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs
[ https://issues.apache.org/jira/browse/ARROW-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9814: -- Labels: pull-request-available (was: ) > [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs > --- > > Key: ARROW-9814 > URL: https://issues.apache.org/jira/browse/ARROW-9814 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Critical > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This seems to happen with some Minio versions, but is definitely a problem in > Arrow. > The crash message says: > {code} > pyarrow/tests/test_parquet.py::test_read_partitioned_directory_s3fs[False] > ../src/arrow/dataset/discovery.cc:188: Check failed: relative.has_value() > GetFileInfo() yielded path outside selector.base_dir > {code} > The underlying problem is that we pass a full URI for the selector base_dir > (such as "s3://bucket/path.") and the S3 filesystem implementation then > returns regular paths (such as "bucket/path/foo/bar"). > I think we should do two things: > 1) error out rather than crash (and include the path strings in the error > message), which would be more user-friendly > 2) fix the issue that full URIs are passed in base_dir -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9944) [Rust] Implement TO_TIMESTAMP function
[ https://issues.apache.org/jira/browse/ARROW-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9944: - Summary: [Rust] Implement TO_TIMESTAMP function (was: Implement TO_TIMESTAMP function) > [Rust] Implement TO_TIMESTAMP function > -- > > Key: ARROW-9944 > URL: https://issues.apache.org/jira/browse/ARROW-9944 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > > Implement the TO_TIMESTAMP function, as described in > https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9944) [Rust] Implement TO_TIMESTAMP function
[ https://issues.apache.org/jira/browse/ARROW-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9944: - Component/s: Rust - DataFusion > [Rust] Implement TO_TIMESTAMP function > -- > > Key: ARROW-9944 > URL: https://issues.apache.org/jira/browse/ARROW-9944 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > > Implement the TO_TIMESTAMP function, as described in > https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9923) [R] arrow R package build error: illegal instruction
[ https://issues.apache.org/jira/browse/ARROW-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192279#comment-17192279 ] Wes McKinney commented on ARROW-9923: - We could probably use an R-specific environment variable (like {{ARROW_R_NO_SSE4=1}}) to toggle it off when building, or similar > [R] arrow R package build error: illegal instruction > > > Key: ARROW-9923 > URL: https://issues.apache.org/jira/browse/ARROW-9923 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.1 > Environment: Platform: Linux node 5.8.5-arch1-1 #1 SMP PREEMPT Thu, > 27 Aug 2020 18:53:02 + x86_64 GNU/Linux > CPU: AMD Athlon(tm) II X4 651 Quad-Core Processor (does not support SSE4, > AVX/AVX2) >Reporter: Maxim Terpilowski >Priority: Major > Labels: build > > arrow R package (v1.0.1) installing from CRAN results in an error. > Build log: [https://pastebin.com/Zq1iMTzB] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9814) [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs
[ https://issues.apache.org/jira/browse/ARROW-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-9814: - Assignee: Antoine Pitrou (was: Ben Kietzman) > [Python] Crash in test_parquet.py::test_read_partitioned_directory_s3fs > --- > > Key: ARROW-9814 > URL: https://issues.apache.org/jira/browse/ARROW-9814 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Critical > > This seems to happen with some Minio versions, but is definitely a problem in > Arrow. > The crash message says: > {code} > pyarrow/tests/test_parquet.py::test_read_partitioned_directory_s3fs[False] > ../src/arrow/dataset/discovery.cc:188: Check failed: relative.has_value() > GetFileInfo() yielded path outside selector.base_dir > {code} > The underlying problem is that we pass a full URI for the selector base_dir > (such as "s3://bucket/path.") and the S3 filesystem implementation then > returns regular paths (such as "bucket/path/foo/bar"). > I think we should do two things: > 1) error out rather than crash (and include the path strings in the error > message), which would be more user-friendly > 2) fix the issue that full URIs are passed in base_dir -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9923) [R] arrow R package build error: illegal instruction
[ https://issues.apache.org/jira/browse/ARROW-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192274#comment-17192274 ] Neal Richardson commented on ARROW-9923: https://github.com/apache/arrow/blob/master/cpp/cmake_modules/SetupCxxFlags.cmake seems to be where these flags are set. It looks like SSE4.2 is assumed unless you're on one of a set of non-x86 processors. > [R] arrow R package build error: illegal instruction > > > Key: ARROW-9923 > URL: https://issues.apache.org/jira/browse/ARROW-9923 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.1 > Environment: Platform: Linux node 5.8.5-arch1-1 #1 SMP PREEMPT Thu, > 27 Aug 2020 18:53:02 + x86_64 GNU/Linux > CPU: AMD Athlon(tm) II X4 651 Quad-Core Processor (does not support SSE4, > AVX/AVX2) >Reporter: Maxim Terpilowski >Priority: Major > Labels: build > > arrow R package (v1.0.1) installing from CRAN results in an error. > Build log: [https://pastebin.com/Zq1iMTzB] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9932) Arrow 1.0.1 R package fails to install on R3.4 over linux
[ https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-9932. Fix Version/s: 2.0.0 Assignee: Neal Richardson Resolution: Duplicate Thanks for the report. This was recently fixed (though after 1.0.1 was released). > Arrow 1.0.1 R package fails to install on R3.4 over linux > - > > Key: ARROW-9932 > URL: https://issues.apache.org/jira/browse/ARROW-9932 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.1 > Environment: R version 3.4.0 (2015-04-16) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 14.04.5 LTS >Reporter: Ofek Shilon >Assignee: Neal Richardson >Priority: Major > Fix For: 2.0.0 > > > 1. From R (3.4) prompt, we run > {{> install.packages("arrow")}} > and it seems to succeed. > 2. Next we run: > {{> arrow::install_arrow()}} > This is the full output: > {{Installing package into '/opt/R-3.4.0.mkl/library'}} > {{(as 'lib' is unspecified)}} > {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}} > {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}} > {{==}} > {{downloaded 268 KB}} > {{installing *source* package 'arrow' ...}} > {{** package 'arrow' successfully unpacked and MD5 sums checked}} > {{*** No C++ binaries found for ubuntu-14.04}} > {{*** Successfully retrieved C++ source}} > {{*** Building C++ libraries}} > {{ cmake}} > {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument > (FALSE)}}*{color} > {color:#ff} *{{Calls: build_libarrow -> paste}}*{color} > {color:#ff} *{{Execution halted}}*{color} > {{- NOTE ---}} > {{After installation, please run arrow::install_arrow()}} > {{for help installing required runtime libraries}} > {{-}} > {{** libs}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c array.cpp -o array.o}} > {{g++ -std=gnu++0x 
-I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c buffer.cpp -o buffer.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c compression.cpp -o compression.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c compute.cpp -o compute.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c csv.cpp -o csv.o}} > {{g++ -std=gnu++0x > -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c dataset.cpp -o dataset.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > 
-I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c datatype.cpp -o datatype.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c expression.cpp -o expression.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c feather.cpp -o feather.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG >
[jira] [Commented] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?
[ https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192267#comment-17192267 ] Neal Richardson commented on ARROW-9938: FTR I'm doing this in R in ARROW-9854, in case you want to see what this looks like in practice (https://github.com/apache/arrow/pull/8058) > [Python] Add filesystem capabilities to other IO formats (feather, csv, json, > ..)? > -- > > Key: ARROW-9938 > URL: https://issues.apache.org/jira/browse/ARROW-9938 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: filesystem > > In the parquet IO functions, we support reading/writing files from non-local > filesystems directly (in addition to passing a buffer) by: > - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) > - specifying the filesystem keyword (eg > {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) > On the other hand, for other file formats such as feather, we only support > local files or buffers. So for those, you need to do the more manual (I > _suppose_ this works?): > {code:python} > from pyarrow import fs, feather > s3 = fs.S3FileSystem() > with s3.open_input_file("bucket/data.arrow") as file: > table = feather.read_table(file) > {code} > So I think the question comes up: do we want to extend this filesystem > support to other file formats (feather, csv, json) and make this more uniform > across pyarrow, or do we prefer to keep the plain readers more low-level (and > people can use the datasets API for more convenience)? > cc [~apitrou] [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
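For reference, a uniform reader along those lines would first have to distinguish URIs from plain local paths. A minimal stdlib-only sketch of that dispatch step (this is not pyarrow's actual resolution logic; the one-letter-scheme check is a common heuristic for Windows drive letters):

```python
from urllib.parse import urlparse

def resolve_source(source):
    """Return (scheme, path); an empty scheme means a local path."""
    parsed = urlparse(str(source))
    # "C:\data.arrow" parses with scheme "c", so treat one-letter
    # schemes as local paths rather than URIs.
    if len(parsed.scheme) < 2:
        return "", str(source)
    return parsed.scheme, parsed.netloc + parsed.path

print(resolve_source("s3://bucket/data.arrow"))  # ('s3', 'bucket/data.arrow')
print(resolve_source("data.arrow"))              # ('', 'data.arrow')
```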
[jira] [Created] (ARROW-9944) Implement TO_TIMESTAMP function
Andrew Lamb created ARROW-9944: -- Summary: Implement TO_TIMESTAMP function Key: ARROW-9944 URL: https://issues.apache.org/jira/browse/ARROW-9944 Project: Apache Arrow Issue Type: Sub-task Reporter: Andrew Lamb Implement the TO_TIMESTAMP function, as described in https://docs.google.com/document/d/18O9YPRyJ3u7-58J02NtNVYb6TDWBzi3mIQC58VhwxUk/edit -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9588) [C++] clang/win: Copy constructor of ParquetInvalidOrCorruptedFileException not correctly triggered
[ https://issues.apache.org/jira/browse/ARROW-9588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9588. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8114 [https://github.com/apache/arrow/pull/8114] > [C++] clang/win: Copy constructor of ParquetInvalidOrCorruptedFileException > not correctly triggered > --- > > Key: ARROW-9588 > URL: https://issues.apache.org/jira/browse/ARROW-9588 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 1.0.0 >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The copy constructor of ParquetInvalidOrCorruptedFileException doesn't seem > to be taken correctly when building with clang 9.0.1 on Windows in a MSVC > toolchain. > Adding {{ParquetInvalidOrCorruptedFileException(const > ParquetInvalidOrCorruptedFileException&) = default;}} as an explicit copy > constructor didn't help. > Happy to any ideas here, probably a long shot as there are other clang-msvc > problems. 
> {code} > [49/62] Building CXX object > src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj > FAILED: src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj > C:\Users\Administrator\miniconda3\conda-bld\arrow-cpp-ext_1595962790058\_build_env\Library\bin\clang++.exe > -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 > -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_S > SE4_2 -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 > -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC > -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DAWS_COMMON_USE_IMPORT_EXPORT -DAWS_EVE > NT_STREAM_USE_IMPORT_EXPORT -DAWS_SDK_VERSION_MAJOR=1 > -DAWS_SDK_VERSION_MINOR=7 -DAWS_SDK_VERSION_PATCH=164 -DHAVE_INTTYPES_H > -DHAVE_NETDB_H -DNOMINMAX -DPARQUET_EXPORTING -DUSE_IMPORT_EXPORT -DUSE_IMPORT > _EXPORT=1 -DUSE_WINDOWS_DLL_SEMANTICS -D_CRT_SECURE_NO_WARNINGS > -Dparquet_shared_EXPORTS -Isrc -I../src -I../src/generated -isystem > ../thirdparty/flatbuffers/include -isystem C:/Users/Administrator/minico > nda3/conda-bld/arrow-cpp-ext_1595962790058/_h_env/Library/include -isystem > ../thirdparty/hadoop/include -fvisibility-inlines-hidden -std=c++14 > -fmessage-length=0 -march=k8 -mtune=haswell -ftree-vectorize > -fstack-protector-strong -O2 -ffunction-sections -pipe > -D_CRT_SECURE_NO_WARNINGS -D_MT -D_DLL -nostdlib -Xclang > --dependent-lib=msvcrt -fuse-ld=lld -fno-aligned-allocation > -Qunused-arguments -fcolor-diagn > ostics -O3 -DNDEBUG -Wa,-mbig-obj -Wall -Wno-unknown-warning-option > -Wno-pass-failed -msse4.2 -O3 -DNDEBUG -D_DLL -D_MT -Xclang > --dependent-lib=msvcrt -std=c++14 -MD -MT src/parquet/CMakeFiles/parquet > _shared.dir/Unity/unity_1_cxx.cxx.obj -MF > src\parquet\CMakeFiles\parquet_shared.dir\Unity\unity_1_cxx.cxx.obj.d -o > src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj -c > src/parquet/CMakeF > iles/parquet_shared.dir/Unity/unity_1_cxx.cxx > In file included from > 
src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx:3: > In file included from > C:/Users/Administrator/miniconda3/conda-bld/arrow-cpp-ext_1595962790058/work/cpp/src/parquet/column_scanner.cc:18: > In file included from ../src\parquet/column_scanner.h:29: > In file included from ../src\parquet/column_reader.h:25: > In file included from ../src\parquet/exception.h:26: > In file included from ../src\parquet/platform.h:23: > In file included from ../src\arrow/buffer.h:28: > In file included from ../src\arrow/status.h:25: > ../src\arrow/util/string_builder.h:49:10: error: invalid operands to binary > expression ('std::ostream' (aka 'basic_ostream >') > and 'parquet::ParquetInvalidOrCorruptedFileException' > ) > stream << head; > ~~ ^ > ../src\arrow/util/string_builder.h:61:3: note: in instantiation of function > template specialization > 'arrow::util::StringBuilderRecursive &>' requested here > StringBuilderRecursive(ss.stream(), std::forward(args)...); > ^ > ../src\arrow/status.h:160:31: note: in instantiation of function template > specialization > 'arrow::util::StringBuilder &>' requested here > return Status(code, util::StringBuilder(std::forward(args)...)); > ^ > ../src\arrow/status.h:204:20: note: in instantiation of function template > specialization > 'arrow::Status::FromArgs' > requested here > return Status::FromArgs(StatusCode::Invalid, std::forward(args)...); >^ > ../src\parquet/exception.h:129:49: note: in instantiation of function > template specialization >
[jira] [Updated] (ARROW-9078) [C++] Parquet writing of extension type with nested storage type fails
[ https://issues.apache.org/jira/browse/ARROW-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9078: -- Labels: parquet pull-request-available (was: parquet) > [C++] Parquet writing of extension type with nested storage type fails > -- > > Key: ARROW-9078 > URL: https://issues.apache.org/jira/browse/ARROW-9078 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > Labels: parquet, pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > A reproducer in Python: > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > class MyStructType(pa.PyExtensionType): > > def __init__(self): > pa.PyExtensionType.__init__(self, pa.struct([('left', pa.int64()), > ('right', pa.int64())])) > > def __reduce__(self): > return MyStructType, () > struct_array = pa.StructArray.from_arrays( > [ > pa.array([0, 1], type="int64", from_pandas=True), > pa.array([1, 2], type="int64", from_pandas=True), > ], > names=["left", "right"], > ) > # works > table = pa.table({'a': struct_array}) > pq.write_table(table, "test_struct.parquet") > # doesn't work > mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array) > table = pa.table({'a': mystruct_array}) > pq.write_table(table, "test_struct.parquet") > {code} > Writing the simple StructArray nowadays works (and reading it back in as > well). > But when the struct array is the storage array of an ExtensionType, it fails > with the following error: > {code} > ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9943) [C++] Arrow metadata not applied recursively when reading Parquet file
Antoine Pitrou created ARROW-9943: - Summary: [C++] Arrow metadata not applied recursively when reading Parquet file Key: ARROW-9943 URL: https://issues.apache.org/jira/browse/ARROW-9943 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 1.0.0 Reporter: Antoine Pitrou Currently, {{ApplyOriginalMetadata}} in {{src/parquet/arrow/schema.cc}} is only applied for the top-level node of each schema field. Nested metadata (such as dicts-inside-lists, etc.) will not be applied. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-9932) Arrow 1.0.1 R package fails to install on R3.4 over linux
[ https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192227#comment-17192227 ] Ofek Shilon edited comment on ARROW-9932 at 9/8/20, 2:03 PM: - The previous suspicion is *not* the root cause of the installation failure. The signature of dQuote changed in R3.6. It [accepted a single argument before|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote] but [accepts a second argument since 3.6|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote]. The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the installation error message above seems to indicate that the installation script uses R3.6 syntax. This usage is (at least) at /r/tools/linuxlibs.R : env_vars <- paste( names(env_var_list), *dQuote(env_var_list, FALSE)*, sep = "=", collapse = " " ) was (Author: ofek): The previous suspicion is *not* the root cause of the installation failure. The signature of dQuote changed in R3.6. It [accepted a single argument before|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote] but [accepts a second argument since 3.6|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote]. The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the installation error message above seems to indicate that the installation script uses R3.6 syntax somewhere. > Arrow 1.0.1 R package fails to install on R3.4 over linux > - > > Key: ARROW-9932 > URL: https://issues.apache.org/jira/browse/ARROW-9932 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.1 > Environment: R version 3.4.0 (2015-04-16) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 14.04.5 LTS >Reporter: Ofek Shilon >Priority: Major > > 1. From R (3.4) prompt, we run > {{> install.packages("arrow")}} > and it seems to succeed. > 2. 
Next we run: > {{> arrow::install_arrow()}} > This is the full output: > {{Installing package into '/opt/R-3.4.0.mkl/library'}} > {{(as 'lib' is unspecified)}} > {{trying URL 'https://cloud.r-project.org/src/contrib/arrow_1.0.1.tar.gz'}} > {{Content type 'application/x-gzip' length 274865 bytes (268 KB)}} > {{==}} > {{downloaded 268 KB}} > {{installing *source* package 'arrow' ...}} > {{** package 'arrow' successfully unpacked and MD5 sums checked}} > {{*** No C++ binaries found for ubuntu-14.04}} > {{*** Successfully retrieved C++ source}} > {{*** Building C++ libraries}} > {{ cmake}} > {color:#ff}*{{Error in dQuote(env_var_list, FALSE) : unused argument > (FALSE)}}*{color} > {color:#ff} *{{Calls: build_libarrow -> paste}}*{color} > {color:#ff} *{{Execution halted}}*{color} > {{- NOTE ---}} > {{After installation, please run arrow::install_arrow()}} > {{for help installing required runtime libraries}} > {{-}} > {{** libs}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c array.cpp -o array.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c array_from_vector.cpp -o array_from_vector.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c array_to_vector.cpp -o array_to_vector.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c arraydata.cpp -o arraydata.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c arrowExports.cpp -o arrowExports.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > 
-I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c buffer.cpp -o buffer.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c chunkedarray.cpp -o chunkedarray.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include -fpic > -march=x86-64 -O3 -c compression.cpp -o compression.o}} > {{g++ -std=gnu++0x -I/opt/R-3.4.0.mkl/lib64/R/include -DNDEBUG > -I"/opt/R-3.4.0.mkl/library/Rcpp/include" -I/usr/local/include
[jira] [Updated] (ARROW-9932) Arrow 1.0.1 R package fails to install on R3.4 over linux
[ https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ofek Shilon updated ARROW-9932: --- Summary: Arrow 1.0.1 R package fails to install on R3.4 over linux (was: Arrow 1.0.1 R package fails to install on R3.4)
[jira] [Updated] (ARROW-9932) Arrow 1.0.1 R package fails to install on R3.4
[ https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ofek Shilon updated ARROW-9932: --- Summary: Arrow 1.0.1 R package fails to install on R3.4 (was: R package fails to install on Ubuntu 14)
[jira] [Comment Edited] (ARROW-9932) R package fails to install on Ubuntu 14
[ https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192227#comment-17192227 ] Ofek Shilon edited comment on ARROW-9932 at 9/8/20, 1:55 PM: - The previous suspicion is *not* the root cause of the installation failure. The signature of dQuote changed in R3.6. It [accepted a single argument before|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote] but [accepts a second argument since 3.6|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote]. The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the installation error message above seems to indicate that the installation script uses R3.6 syntax somewhere. was (Author: ofek): The previous suspicion is *not* the root cause of the installation failure. The signature of dQuote changed in R3.6. It [accepted a single argument before|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote] but [accepts a second argument since 3.6|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote]. The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the installation error message above seems to indicate that the installation script uses R3.6 syntax somewhere.
[jira] [Reopened] (ARROW-9932) R package fails to install on Ubuntu 14
[ https://issues.apache.org/jira/browse/ARROW-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ofek Shilon reopened ARROW-9932: The previous suspicion is *not* the root cause of the installation failure. The signature of dQuote changed in R3.6. It [accepted a single argument before|https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/sQuote] but [accepts a second argument since 3.6|https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/sQuote]. The arrow 1.0.1 R package is marked as dependent on R>=3.1, but the installation error message above seems to indicate that the installation script uses R3.6 syntax somewhere.
[jira] [Resolved] (ARROW-9821) [Rust][DataFusion] User Defined PlanNode / Operator API
[ https://issues.apache.org/jira/browse/ARROW-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-9821. --- Resolution: Fixed Issue resolved by pull request 8097 [https://github.com/apache/arrow/pull/8097] > [Rust][DataFusion] User Defined PlanNode / Operator API > --- > > Key: ARROW-9821 > URL: https://issues.apache.org/jira/browse/ARROW-9821 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 6.5h > Remaining Estimate: 0h > > The basic goal is to allow users to implement their own PlanNodes. I will > provide a google doc opened for comments shortly. > Proposal: > https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit# > See also mailing list discussion here: > https://lists.apache.org/thread.html/rf8ae7d1147e93e3f6172bc2e4fa50a38abcb35f046cc5830e09da6cc%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9942) [Python] Schema Evolution - Add new Field
Daniel Figus created ARROW-9942: --- Summary: [Python] Schema Evolution - Add new Field Key: ARROW-9942 URL: https://issues.apache.org/jira/browse/ARROW-9942 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 1.0.0 Environment: pandas==1.1.1 pyarrow==1.0.0 Reporter: Daniel Figus We are trying to leverage the new Dataset implementation and specifically rely on the schema evolution feature there. However when adding a new field in a later parquet file, the schemas don't seem to be merged and the new field is not available. Simple example: {code:python} import pandas as pd from pyarrow import parquet as pq from pyarrow import dataset as ds import pyarrow as pa path = "data/sample/" df1 = pd.DataFrame({"field1": ["a", "b", "c"]}) df2 = pd.DataFrame({"field1": ["d", "e", "f"], "field2": ["x", "y", "z"]}) df1.to_parquet(path + "df1.parquet", coerce_timestamps=None, version="2.0", index=False) df2.to_parquet(path + "df2.parquet", coerce_timestamps=None, version="2.0", index=False) # read via pandas df = pd.read_parquet(path) print(df.head()) print(df.info()) {code} Output: {noformat} field1 0 a 1 b 2 c 3 d 4 e RangeIndex: 6 entries, 0 to 5 Data columns (total 1 columns): # Column Non-Null Count Dtype --- -- -- - 0 field1 6 non-null object dtypes: object(1) memory usage: 176.0+ bytes None {noformat} My expectation was to get the field2 as well based on what I have understood with the new Dataset implementation from ARROW-8039. 
When using the Dataset API with a schema created from the second dataframe I'm able to read the field2: {code:python} # write metadata schema = pa.Schema.from_pandas(df2, preserve_index=False) pq.write_metadata(schema, path + "_common_metadata", version="2.0", coerce_timestamps=None) # read with new dataset and schema schema = pq.read_schema(path + "_common_metadata") df = ds.dataset(path, schema=schema, format="parquet").to_table().to_pandas() print(df.head()) print(df.info()) {code} Output: {noformat} field1 field2 0 a None 1 b None 2 c None 3 d x 4 e y RangeIndex: 6 entries, 0 to 5 Data columns (total 2 columns): # Column Non-Null Count Dtype --- -- -- - 0 field1 6 non-null object 1 field2 3 non-null object dtypes: object(2) memory usage: 224.0+ bytes None {noformat} This works; however, I want to avoid writing a {{_common_metadata}} file if possible. Is there a way to get the schema merge without passing an explicit schema? Or is this yet to be implemented? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9941) [Python] Better string representation for extension types
Antoine Pitrou created ARROW-9941: - Summary: [Python] Better string representation for extension types Key: ARROW-9941 URL: https://issues.apache.org/jira/browse/ARROW-9941 Project: Apache Arrow Issue Type: Wish Components: C++, Python Reporter: Antoine Pitrou When one defines an extension type in Python (by subclassing {{PyExtensionType}}) and uses that type e.g. in an Arrow table, the printed schema looks like this: {code} pyarrow.Table a: extension b: extension {code} ... which isn't very informative. PyExtensionType could perhaps override ToString() and call {{str}} on the type instance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9941) [Python] Better string representation for extension types
[ https://issues.apache.org/jira/browse/ARROW-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192202#comment-17192202 ] Antoine Pitrou commented on ARROW-9941: --- cc [~jorisvandenbossche] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9941) [Python] Better string representation for extension types
[ https://issues.apache.org/jira/browse/ARROW-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-9941: -- Fix Version/s: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface
[ https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192188#comment-17192188 ] Joris Van den Bossche commented on ARROW-9924: -- There was one other issue about a performance regression (ARROW-9827), for which I have an open PR (fix to not parse statistics when there is no filter specified). Now, I tried a release build of that branch compared to master, and that doesn't seem to make a difference for this case. bq. IMHO we should not continue to use the Dataset interface for reading single files by default until the perf regression has been eliminated. That came up before, and we can certainly still use the old ParquetFile reader if there is eg no {{filter}} specified (we shouldn't use ParquetDataset for this case, though, as was done before 1.0) --- I did a quick profile (with py-spy), and it _seems_ that the dataset version has a bit more overhead in all kinds of iteration (it uses the RecordBatchReader, and not the {{FileReader::ReadTable}} which is specifically to read the whole parquet file at once) > [Python] Performance regression reading individual Parquet files using > Dataset interface > > > Key: ARROW-9924 > URL: https://issues.apache.org/jira/browse/ARROW-9924 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Critical > Fix For: 2.0.0 > > > I haven't investigated very deeply but this seems symptomatic of a problem: > {code} > In [27]: df = pd.DataFrame({'A': np.random.randn(1000)}) > > > In [28]: pq.write_table(pa.table(df), 'test.parquet') > > > In [29]: timeit pq.read_table('test.parquet') > > > 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True) > > > 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9940) [Rust][DataFusion] Generic "extension package" mechanism
Andrew Lamb created ARROW-9940: -- Summary: [Rust][DataFusion] Generic "extension package" mechanism Key: ARROW-9940 URL: https://issues.apache.org/jira/browse/ARROW-9940 Project: Apache Arrow Issue Type: New Feature Reporter: Andrew Lamb This came from [~jorgecarleitao]'s suggestion on this PR: https://github.com/apache/arrow/pull/8097/files#r482968858 The high-level idea is to design and implement an upgrade/improvement to the DataFusion APIs which allows registering composable sets of UserDefinedLogicalNode, logical planning rules, and physical planning rules for some functionality. h2. The use case: You publish the TopK extension as a (library) crate called datafusion-topk, and I publish a crate datafusion-s3 with another extension. A user who wants to use both extensions installs them by: # adding each crate to Cargo.toml # initializing the default planner with both of them # planning # executing I.e. freaking easy! Broadly speaking, this allows an ecosystem of extensions/user-defined plans to exist: people can share hand-crafted plans, and plans can be added as dependencies to a crate and registered with the planner for use by other people. This also reduces the pressure of placing everything in DataFusion's codebase: if we offer an API to extend DataFusion in this way, people can just distribute libraries with the extension/user-defined plan without having to go through the decision process of whether X is part of DataFusion's core or not (e.g. a scan of format Y, or a scan over protocol Z). For me, this use case does require an easy way to achieve step 2, initializing the default planner with both extensions. But again, this PR is definitely a major step in this direction! -- This message was sent by Atlassian Jira (v8.3.4#803005)
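Step 1 of the use case above is just ordinary Cargo dependency management. A sketch of what the user's Cargo.toml might look like, assuming the hypothetical extension crates named in the ticket (the version numbers are placeholders):

```toml
[dependencies]
datafusion = "2.0"        # placeholder version
datafusion-topk = "0.1"   # hypothetical extension crate from the ticket
datafusion-s3 = "0.1"     # hypothetical extension crate from the ticket
```

Steps 2-4 (initializing the default planner with both extensions, planning, executing) are the part the proposed API would have to make easy.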
[jira] [Created] (ARROW-9939) [Rust][DataFusion] Rename inputs --> child consistently
Andrew Lamb created ARROW-9939: -- Summary: [Rust][DataFusion] Rename inputs --> child consistently Key: ARROW-9939 URL: https://issues.apache.org/jira/browse/ARROW-9939 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb As suggested by [~andygrove] on https://github.com/apache/arrow/pull/8097/files#r484556394 > I've been thinking lately that we should start standardizing on children > rather than inputs. I think `children` is a more standard term and having consistent terminology across the datafusion code base will be valuable -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?
[ https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192130#comment-17192130 ] Krisztian Szucs commented on ARROW-9938: Supporting remote URIs sounds like a nice feature. > [Python] Add filesystem capabilities to other IO formats (feather, csv, json, > ..)? > -- > > Key: ARROW-9938 > URL: https://issues.apache.org/jira/browse/ARROW-9938 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: filesystem > > In the parquet IO functions, we support reading/writing files from non-local > filesystems directly (in addition to passing a buffer) by: > - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) > - specifying the filesystem keyword (eg > {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) > On the other hand, for other file formats such as feather, we only support > local files or buffers. So for those, you need to do the more manual (I > _suppose_ this works?): > {code:python} > from pyarrow import fs, feather > s3 = fs.S3FileSystem() > with s3.open_input_file("bucket/data.arrow") as file: > table = feather.read_table(file) > {code} > So I think the question comes up: do we want to extend this filesystem > support to other file formats (feather, csv, json) and make this more uniform > across pyarrow, or do we prefer to keep the plain readers more low-level (and > people can use the datasets API for more convenience)? > cc [~apitrou] [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
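The URI option in the proposal amounts to dispatching on the URI scheme before opening the file. A minimal, hypothetical sketch of that dispatch (this helper is not pyarrow API; it only illustrates the split a reader like {{feather.read_table}} would have to do):

```python
from urllib.parse import urlparse

def split_uri(uri):
    """Split a path or URI into (scheme, path).

    Plain paths (no scheme) are treated as local files, so existing
    callers passing "data.arrow" keep working unchanged."""
    parsed = urlparse(uri)
    if parsed.scheme == "":
        return "file", uri                      # relative/absolute local path
    if parsed.scheme == "file":
        return "file", parsed.path              # file:///tmp/x -> /tmp/x
    # e.g. "s3://bucket/data.arrow" -> ("s3", "bucket/data.arrow")
    return parsed.scheme, parsed.netloc + parsed.path

print(split_uri("s3://bucket/data.arrow"))   # -> ('s3', 'bucket/data.arrow')
print(split_uri("data.arrow"))               # -> ('file', 'data.arrow')
```

One caveat with this naive approach: Windows drive letters ("C:\data.arrow") parse as a one-letter scheme, which real URI-dispatching code (like pyarrow's own {{FileSystem.from_uri}}) has to special-case.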
[jira] [Commented] (ARROW-9775) [C++] Automatic S3 region selection
[ https://issues.apache.org/jira/browse/ARROW-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192096#comment-17192096 ] Antoine Pitrou commented on ARROW-9775: --- It seems it can be determined through a HEAD request on a bucket: https://github.com/aws/aws-cli/issues/2431 This is how boto does it: https://github.com/boto/botocore/pull/936/files A S3Client is bound to a region, so some care will be needed in the implementation. > [C++] Automatic S3 region selection > --- > > Key: ARROW-9775 > URL: https://issues.apache.org/jira/browse/ARROW-9775 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python > Environment: macOS, Linux. >Reporter: Sahil Gupta >Priority: Major > Labels: filesystem > Fix For: 2.0.0 > > > Currently, PyArrow and ArrowCpp need to be provided the region of the S3 > file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and > ArrowCpp can automatically detect the region and get the files, etc. For > instance, s3fs and boto3 can read and write files without having to specify > the region explicitly. Similar functionality to auto-detect the region would > be great to have in PyArrow and ArrowCpp. -- This message was sent by Atlassian Jira (v8.3.4#803005)
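The HEAD-request trick works because S3 returns the bucket's region in the {{x-amz-bucket-region}} response header, even on a 403 or 301 from the wrong region. A sketch of the header-parsing side (the function is illustrative, not Arrow API; making the actual HEAD request and rebinding the S3Client are the parts that need care in C++):

```python
def region_from_head_response(headers, default="us-east-1"):
    """Extract the bucket region from a HEAD-bucket response's headers,
    using a case-insensitive lookup of 'x-amz-bucket-region'.
    Falls back to the default region if the header is absent."""
    for key, value in headers.items():
        if key.lower() == "x-amz-bucket-region":
            return value
    return default

# e.g. HEAD https://mybucket.s3.amazonaws.com returns this header
print(region_from_head_response({"x-amz-bucket-region": "eu-west-1"}))
```
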
[jira] [Commented] (ARROW-9935) New filesystem API unable to read empty S3 folders
[ https://issues.apache.org/jira/browse/ARROW-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192088#comment-17192088 ] Antoine Pitrou commented on ARROW-9935: --- Have you tried using Arrow's own S3 filesystem implementation? {code:python} >>> from pyarrow.fs import S3FileSystem >>> fs = S3FileSystem() >>> fs.get_file_info("pyarrow-s3-empty-folder-file/mydataset") {code} (there may be more S3 configuration to do because this doesn't seem to work here: bad region perhaps?) > New filesystem API unable to read empty S3 folders > -- > > Key: ARROW-9935 > URL: https://issues.apache.org/jira/browse/ARROW-9935 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Weston Pace >Priority: Minor > Attachments: arrow_9935.py > > > When an empty "folder" is created in S3 using the online bucket explorer tool > on the management console then it creates a special empty file with the same > name as the folder. > (Some more details here: > [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html)] > If parquet files are later loaded into one of these directories (with or > without partitioning subdirectories) then this dataset cannot be read by the > new dataset API. The underlying s3fs `find` method returns a "file" object > with size 0 that pyarrow then attempts to read. Since this file doesn't > truly exist a FileNotFoundError is thrown. > Would it be safe to simply ignore all files with size 0? > As a workaround I can wrap s3fs' find method and strip out these objects with > size 0 myself. > I've attached a script showing the issue and a workaround. It uses a public > bucket that I'll leave up for a few months. -- This message was sent by Atlassian Jira (v8.3.4#803005)
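The workaround mentioned in the report (wrapping s3fs's {{find}} and stripping the zero-size placeholder objects) can be sketched as a filter over the listing. This is the reporter's workaround shape, not Arrow's fix:

```python
def strip_empty_placeholders(listing):
    """Drop zero-byte objects from an S3 listing: the management console's
    empty-folder markers are zero-size keys that are not real files."""
    return [obj for obj in listing if obj["size"] > 0]

files = [
    {"key": "mydataset/", "size": 0},                    # folder placeholder
    {"key": "mydataset/part-0.parquet", "size": 1024},   # real data file
]
print(strip_empty_placeholders(files))
```

Note the open question from the report applies here too: this also drops legitimately empty files, so ignoring *all* size-0 objects may not be safe in general.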
[jira] [Resolved] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
[ https://issues.apache.org/jira/browse/ARROW-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-9920. - Resolution: Fixed Issue resolved by pull request 8132 [https://github.com/apache/arrow/pull/8132] > [Python] pyarrow.concat_arrays segfaults when passing it a chunked array > > > Key: ARROW-9920 > URL: https://issues.apache.org/jira/browse/ARROW-9920 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it > the list of chunks: > {code} > In [1]: arr = pa.chunked_array([[0, 1], [3, 4]]) > In [2]: pa.concat_arrays(arr.chunks) > Out[2]: > > [ > 0, > 1, > 3, > 4 > ] > {code} > but if passing the chunked array itself, you get a segfault: > {code} > In [4]: pa.concat_arrays(arr) > Segmentation fault (core dumped) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
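One defensive pattern on the caller side (not necessarily how PR 8132 fixed it inside pyarrow) is to coerce a chunked array to its list of chunks before calling {{concat_arrays}}. A duck-typed sketch, with a stand-in class instead of a real {{pa.ChunkedArray}} so it runs without pyarrow:

```python
def as_arrays(obj):
    """Accept either a sequence of arrays or a chunked array (anything with
    a .chunks attribute) and return a plain list of arrays, so the concat
    function never receives a ChunkedArray directly."""
    chunks = getattr(obj, "chunks", None)
    return list(chunks) if chunks is not None else list(obj)

class FakeChunked:                  # stand-in for pa.ChunkedArray in this sketch
    chunks = [[0, 1], [3, 4]]

# With pyarrow this would be: pa.concat_arrays(as_arrays(arr))
print(as_arrays(FakeChunked()))     # [[0, 1], [3, 4]]
print(as_arrays([[0, 1], [3, 4]]))  # [[0, 1], [3, 4]]
```
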
[jira] [Assigned] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
[ https://issues.apache.org/jira/browse/ARROW-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn reassigned ARROW-9920: --- Assignee: Joris Van den Bossche > [Python] pyarrow.concat_arrays segfaults when passing it a chunked array > > > Key: ARROW-9920 > URL: https://issues.apache.org/jira/browse/ARROW-9920 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it > the list of chunks: > {code} > In [1]: arr = pa.chunked_array([[0, 1], [3, 4]]) > In [2]: pa.concat_arrays(arr.chunks) > Out[2]: > > [ > 0, > 1, > 3, > 4 > ] > {code} > but if passing the chunked array itself, you get a segfault: > {code} > In [4]: pa.concat_arrays(arr) > Segmentation fault (core dumped) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?
[ https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192070#comment-17192070 ] Antoine Pitrou commented on ARROW-9938: --- On the C++ side they will definitely stay more low-level. On the Python side, I have no preference. I suppose it could be useful to write {{open_csv("s3://...")}}. > [Python] Add filesystem capabilities to other IO formats (feather, csv, json, > ..)? > -- > > Key: ARROW-9938 > URL: https://issues.apache.org/jira/browse/ARROW-9938 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: filesystem > > In the parquet IO functions, we support reading/writing files from non-local > filesystems directly (in addition to passing a buffer) by: > - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) > - specifying the filesystem keyword (eg > {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) > On the other hand, for other file formats such as feather, we only support > local files or buffers. So for those, you need to do the more manual (I > _suppose_ this works?): > {code:python} > from pyarrow import fs, feather > s3 = fs.S3FileSystem() > with s3.open_input_file("bucket/data.arrow") as file: > table = feather.read_table(file) > {code} > So I think the question comes up: do we want to extend this filesystem > support to other file formats (feather, csv, json) and make this more uniform > across pyarrow, or do we prefer to keep the plain readers more low-level (and > people can use the datasets API for more convenience)? > cc [~apitrou] [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory
[ https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192069#comment-17192069 ] Antoine Pitrou commented on ARROW-9104: --- I've added [~revit13] to the contributors and assigned the Jira to her. Thank you! > [C++] Parquet encryption tests should write files to a temporary directory > instead of the testing submodule's directory > --- > > Key: ARROW-9104 > URL: https://issues.apache.org/jira/browse/ARROW-9104 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Revital Sur >Priority: Major > Fix For: 2.0.0 > > > If the source directory is not writable the test raises permission denied > error: > [ RUN ] TestEncryptionConfiguration.UniformEncryption > 1632 > unknown file: Failure > 1633 > C++ exception with description "IOError: Failed to open local file > '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'. > Detail: [errno 13] Permission denied" thrown in the test body. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9104) [C++] Parquet encryption tests should write files to a temporary directory instead of the testing submodule's directory
[ https://issues.apache.org/jira/browse/ARROW-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-9104: - Assignee: Revital Sur (was: Gidon Gershinsky) > [C++] Parquet encryption tests should write files to a temporary directory > instead of the testing submodule's directory > --- > > Key: ARROW-9104 > URL: https://issues.apache.org/jira/browse/ARROW-9104 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Revital Sur >Priority: Major > Fix For: 2.0.0 > > > If the source directory is not writable the test raises permission denied > error: > [ RUN ] TestEncryptionConfiguration.UniformEncryption > 1632 > unknown file: Failure > 1633 > C++ exception with description "IOError: Failed to open local file > '/arrow/cpp/submodules/parquet-testing/data/tmp_uniform_encryption.parquet.encrypted'. > Detail: [errno 13] Permission denied" thrown in the test body. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9846) [Rust] Master branch broken build
[ https://issues.apache.org/jira/browse/ARROW-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9846. --- Resolution: Not A Problem > [Rust] Master branch broken build > - > > Key: ARROW-9846 > URL: https://issues.apache.org/jira/browse/ARROW-9846 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > Master branch is failing to build in CI. It fails to compile > "tower-balance-0.3.0". I cannot reproduce locally. > {code:java} > error[E0502]: cannot borrow `self` as immutable because it is also borrowed > as mutable >--> > /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-balance-0.3.0/src/pool/mod.rs:381:21 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?
[ https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9938: - Description: In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by: - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) - specifying the filesystem keyword (eg {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) On the other hand, for other file formats such as feather, we only support local files or buffers. So for those, you need to do the more manual (I _suppose_ this works?): {code:python} from pyarrow import fs, feather s3 = fs.S3FileSystem() with s3.open_input_file("bucket/data.arrow") as file: table = feather.read_table(file) {code} So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)? cc [~apitrou] [~kszucs] was: In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by: - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) - specifying the filesystem keyword (eg {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) On the other hand, for other file formats such as feather, we only support local files or buffers. 
So for those, you need to do the more manual (I _suppose_ this works?): {code:python} from pyarrow import fs, feather s3 = fs.S3FileSystem() with s3.open_input_file("bucket/data.arrow") as file: table = feather.read_table(file) {code} So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)? cc [~apitrou] [~kszucs] > [Python] Add filesystem capabilities to other IO formats (feather, csv, json, > ..)? > -- > > Key: ARROW-9938 > URL: https://issues.apache.org/jira/browse/ARROW-9938 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: filesystem > > In the parquet IO functions, we support reading/writing files from non-local > filesystems directly (in addition to passing a buffer) by: > - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) > - specifying the filesystem keyword (eg > {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) > On the other hand, for other file formats such as feather, we only support > local files or buffers. So for those, you need to do the more manual (I > _suppose_ this works?): > {code:python} > from pyarrow import fs, feather > s3 = fs.S3FileSystem() > with s3.open_input_file("bucket/data.arrow") as file: > table = feather.read_table(file) > {code} > So I think the question comes up: do we want to extend this filesystem > support to other file formats (feather, csv, json) and make this more uniform > across pyarrow, or do we prefer to keep the plain readers more low-level (and > people can use the datasets API for more convenience)? > cc [~apitrou] [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?
[ https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9938: - Description: In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by: - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) - specifying the filesystem keyword (eg {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) On the other hand, for other file formats such as feather, we only support local files or buffers. So for those, you need to do the more manual (I _suppose_ this works?): {code:python} from pyarrow import fs, feather s3 = fs.S3FileSystem() with s3.open_input_file("bucket/data.arrow") as file: table = feather.read_table(file) {code} So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)? cc [~apitrou] [~kszucs] was: In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by: - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) - specifying the filesystem keyword (eg {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) On the other hand, for other file formats such as feather, we only support local files. 
So for those, you need to do the more manual (I _suppose_ this works?): {code:python} from pyarrow import fs, feather s3 = fs.S3FileSystem() with s3.open_input_file("bucket/data.arrow") as file: table = feather.read_table(file) {code} So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)? cc [~apitrou] [~kszucs] > [Python] Add filesystem capabilities to other IO formats (feather, csv, json, > ..)? > -- > > Key: ARROW-9938 > URL: https://issues.apache.org/jira/browse/ARROW-9938 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: filesystem > > In the parquet IO functions, we support reading/writing files from non-local > filesystems directly (in addition to passing a buffer) by: > - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) > - specifying the filesystem keyword (eg > {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) > On the other hand, for other file formats such as feather, we only support > local files or buffers. So for those, you need to do the more manual (I > _suppose_ this works?): > {code:python} > from pyarrow import fs, feather > s3 = fs.S3FileSystem() > with s3.open_input_file("bucket/data.arrow") as file: > table = feather.read_table(file) > {code} > So I think the question comes up: do we want to extend this filesystem > support to other file formats (feather, csv, json) and make this more uniform > across pyarrow, or do we prefer to keep the plain readers more low-level (and > people can use the datasets API for more convenience)? > cc [~apitrou] [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9938) [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)?
Joris Van den Bossche created ARROW-9938: Summary: [Python] Add filesystem capabilities to other IO formats (feather, csv, json, ..)? Key: ARROW-9938 URL: https://issues.apache.org/jira/browse/ARROW-9938 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche In the parquet IO functions, we support reading/writing files from non-local filesystems directly (in addition to passing a buffer) by: - passing a URI (eg {{pq.read_parquet("s3://bucket/data.parquet")}}) - specifying the filesystem keyword (eg {{pq.read_parquet("bucket/data.parquet", filesystem=S3FileSystem(...))}}) On the other hand, for other file formats such as feather, we only support local files. So for those, you need to do the more manual (I _suppose_ this works?): {code:python} from pyarrow import fs, feather s3 = fs.S3FileSystem() with s3.open_input_file("bucket/data.arrow") as file: table = feather.read_table(file) {code} So I think the question comes up: do we want to extend this filesystem support to other file formats (feather, csv, json) and make this more uniform across pyarrow, or do we prefer to keep the plain readers more low-level (and people can use the datasets API for more convenience)? cc [~apitrou] [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9919: -- Component/s: Rust - DataFusion > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-9919: -- Affects Version/s: 1.0.0 > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust - DataFusion >Affects Versions: 1.0.0 >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9919) [Rust] [DataFusion] Math functions
[ https://issues.apache.org/jira/browse/ARROW-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9919. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8116 [https://github.com/apache/arrow/pull/8116] > [Rust] [DataFusion] Math functions > -- > > Key: ARROW-9919 > URL: https://issues.apache.org/jira/browse/ARROW-9919 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9917) [Python][Compute] Add bindings for mode kernel
[ https://issues.apache.org/jira/browse/ARROW-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-9917. -- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8115 [https://github.com/apache/arrow/pull/8115] > [Python][Compute] Add bindings for mode kernel > -- > > Key: ARROW-9917 > URL: https://issues.apache.org/jira/browse/ARROW-9917 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Andrew Wieteska >Assignee: Andrew Wieteska >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
[ https://issues.apache.org/jira/browse/ARROW-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9920: -- Labels: pull-request-available (was: ) > [Python] pyarrow.concat_arrays segfaults when passing it a chunked array > > > Key: ARROW-9920 > URL: https://issues.apache.org/jira/browse/ARROW-9920 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it > the list of chunks: > {code} > In [1]: arr = pa.chunked_array([[0, 1], [3, 4]]) > In [2]: pa.concat_arrays(arr.chunks) > Out[2]: > > [ > 0, > 1, > 3, > 4 > ] > {code} > but if passing the chunked array itself, you get a segfault: > {code} > In [4]: pa.concat_arrays(arr) > Segmentation fault (core dumped) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9936) [Python] Fix / test relative file paths in pyarrow.parquet
[ https://issues.apache.org/jira/browse/ARROW-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9936: -- Labels: pull-request-available (was: ) > [Python] Fix / test relative file paths in pyarrow.parquet > -- > > Key: ARROW-9936 > URL: https://issues.apache.org/jira/browse/ARROW-9936 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > It seems that I broke writing parquet to relative file paths in ARROW-9718 > (again, something similar happened in the pyarrow.dataset reading), so should > fix that and properly test this. > {code} > In [3]: pq.write_table(table, "test_relative.parquet") > ... > ~/scipy/repos/arrow/python/pyarrow/_fs.pyx in > pyarrow._fs.FileSystem.from_uri() > ArrowInvalid: URI has empty scheme: 'test_relative.parquet' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9826) [Rust] add set function to PrimitiveArray
[ https://issues.apache.org/jira/browse/ARROW-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191998#comment-17191998 ] Francesco Gadaleta commented on ARROW-9826: --- But that can be extremely inefficient. If one needs to change a dozen values in a column of millions of elements, that can become prohibitive. In-place value changes are quite a common operation in data science. > [Rust] add set function to PrimitiveArray > - > > Key: ARROW-9826 > URL: https://issues.apache.org/jira/browse/ARROW-9826 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Affects Versions: 1.0.0 >Reporter: Francesco Gadaleta >Priority: Major > > For in-place value replacement in Array, a `set()` function (maybe unsafe?) > would be required. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (ARROW-9826) [Rust] add set function to PrimitiveArray
[ https://issues.apache.org/jira/browse/ARROW-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francesco Gadaleta updated ARROW-9826: -- Comment: was deleted (was: But that can be extremely inefficient. If one needs to change a dozen values in a column of millions of elements, that can become prohibitive. In-place value changes are quite a common operation in data science.) > [Rust] add set function to PrimitiveArray > - > > Key: ARROW-9826 > URL: https://issues.apache.org/jira/browse/ARROW-9826 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Affects Versions: 1.0.0 >Reporter: Francesco Gadaleta >Priority: Major > > For in-place value replacement in Array, a `set()` function (maybe unsafe?) > would be required. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9826) [Rust] add set function to PrimitiveArray
[ https://issues.apache.org/jira/browse/ARROW-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191997#comment-17191997 ] Francesco Gadaleta commented on ARROW-9826: --- But that can be extremely inefficient. If one needs to change a dozen values in a column of millions of elements, that can become prohibitive. In-place value changes are quite a common operation in data science. > [Rust] add set function to PrimitiveArray > - > > Key: ARROW-9826 > URL: https://issues.apache.org/jira/browse/ARROW-9826 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Affects Versions: 1.0.0 >Reporter: Francesco Gadaleta >Priority: Major > > For in-place value replacement in Array, a `set()` function (maybe unsafe?) > would be required. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9937) [Rust] [DataFusion] Average is not correct
[ https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191993#comment-17191993 ] Jorge commented on ARROW-9937: -- [~andygrove], I remember that you wanted to touch this. If not, let me know and I'll take a shot at it. Looking at [Ballista's source code for this|https://github.com/ballista-compute/ballista/blob/main/rust/ballista/src/execution/operators/hash_aggregate.rs], I think that we have the same issue there. :/ > [Rust] [DataFusion] Average is not correct > -- > > Key: ARROW-9937 > URL: https://issues.apache.org/jira/browse/ARROW-9937 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Jorge >Priority: Major > > The current design of aggregates makes the calculation of the average > incorrect. > It also makes it impossible to compute the [geometric > mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other > operations. > The central issue is that Accumulator returns a `ScalarValue` during partial > aggregations via {{get_value}}, but very often a `ScalarValue` is not > sufficient information to perform the full aggregation. > A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are > distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current > calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then > reduces them using another average, i.e. > {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}} > which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}. > I believe that our Accumulators need to pass more information from the > partial aggregations to the final aggregation. > We could consider taking an API equivalent to > [spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), > i.e. have an `update`, a `merge` and an `evaluate`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
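The bug in the description can be demonstrated numerically: averaging the per-batch means gives the wrong answer, while carrying (sum, count) state through an update/merge/evaluate shape, as the ticket proposes, gives the right one. A small Python illustration (not DataFusion code):

```python
# Five numbers split into batches of 2, as in the ticket's example.
batches = [[1.0, 2.0], [3.0, 4.0], [5.0]]

# Wrong: the current design's mean-of-partial-means.
partial_means = [sum(b) / len(b) for b in batches]
wrong = sum(partial_means) / len(partial_means)   # ((1+2)/2 + (3+4)/2 + 5)/3

# Right: update produces (sum, count) per batch; merge adds the pairs;
# evaluate divides once at the end.
partials = [(sum(b), len(b)) for b in batches]    # update
total, count = 0.0, 0
for s, n in partials:                             # merge
    total += s
    count += n
right = total / count                             # evaluate

print(wrong, right)                               # 3.333... vs 3.0
```
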
[jira] [Updated] (ARROW-9937) [Rust] [DataFusion] Average is not correct
[ https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-9937: - Description: The current design of aggregates makes the calculation of the average incorrect. It also makes it impossible to compute the [geometric mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other operations. The central issue is that Accumulator returns a `ScalarValue` during partial aggregations via {{get_value}}, but very often a `ScalarValue` is not sufficient information to perform the full aggregation. A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then reduces them using another average, i.e. {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}} which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}. I believe that our Accumulators need to pass more information from the partial aggregations to the final aggregation. We could consider taking an API equivalent to [spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), i.e. have an `update`, a `merge` and an `evaluate`. was: The current design of aggregates makes the calculation of the average incorrect. It also makes it impossible to compute the [geometric mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other operations. The central issue is that Accumulator returns a `ScalarValue` during partial aggregations via {{get_value}}, but very often a `ScalarValue` is not sufficient information to perform the full aggregation. A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are distributed in in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then reduces them using another average, i.e. 
{{((x1+x2)/2 + (x3+x4)/2 + x5)/3}} which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}. I believe that our Accumulators need to pass more information from the partial aggregations to the final aggregation. We could consider taking an API equivalent to [spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), i.e. have an `update`, a `merge` and an `evaluate`. > [Rust] [DataFusion] Average is not correct > -- > > Key: ARROW-9937 > URL: https://issues.apache.org/jira/browse/ARROW-9937 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Jorge >Priority: Major > > The current design of aggregates makes the calculation of the average > incorrect. > It also makes it impossible to compute the [geometric > mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other > operations. > The central issue is that Accumulator returns a `ScalarValue` during partial > aggregations via {{get_value}}, but very often a `ScalarValue` is not > sufficient information to perform the full aggregation. > A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are > distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current > calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then > reduces them using another average, i.e. > {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}} > which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}. > I believe that our Accumulators need to pass more information from the > partial aggregations to the final aggregation. > We could consider taking an API equivalent to > [spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), > i.e. have an `update`, a `merge` and an `evaluate`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9937) [Rust] [DataFusion] Average is not correct
Jorge created ARROW-9937: Summary: [Rust] [DataFusion] Average is not correct Key: ARROW-9937 URL: https://issues.apache.org/jira/browse/ARROW-9937 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Jorge The current design of aggregates makes the calculation of the average incorrect. It also makes it impossible to compute the [geometric mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other operations. The central issue is that Accumulator returns a `ScalarValue` during partial aggregations via {{get_value}}, but very often a `ScalarValue` is not sufficient information to perform the full aggregation. A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then reduces them using another average, i.e. {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}} which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}. I believe that our Accumulators need to pass more information from the partial aggregations to the final aggregation. We could consider taking an API equivalent to [spark](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), i.e. have an `update`, a `merge` and an `evaluate`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
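The flaw described in the issue can be reproduced with plain arithmetic. The sketch below (Python for brevity; the DataFusion accumulators themselves are Rust) contrasts the current average-of-partial-averages reduction with a merge that carries (sum, count) pairs, i.e. the richer partial state an `update`/`merge`/`evaluate` API would allow:

```python
# Why averaging partial averages is wrong, and why carrying (sum, count)
# state through the merge step fixes it. Pure-Python sketch of the issue.

batches = [[1.0, 2.0], [3.0, 4.0], [5.0]]  # x1..x5 split as {[x1,x2],[x3,x4],[x5]}

# Current behaviour: each partial aggregation emits a single scalar
# (the batch mean), and the final step averages those scalars.
partial_means = [sum(b) / len(b) for b in batches]
wrong = sum(partial_means) / len(partial_means)   # (1.5 + 3.5 + 5.0) / 3

# Spark-style behaviour: update emits (sum, count) per batch, merge combines
# the states, and evaluate divides once at the end.
partials = [(sum(b), len(b)) for b in batches]                      # update
total = sum(s for s, _ in partials)                                 # merge
count = sum(c for _, c in partials)                                 # merge
right = total / count                                               # evaluate

assert wrong != right
assert right == 3.0   # (1 + 2 + 3 + 4 + 5) / 5
```

The same (state, merge, evaluate) decomposition is what makes geometric mean and distinct sum expressible: the partial state just becomes a product/count pair or a set of seen values instead of a single scalar.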
[jira] [Created] (ARROW-9936) [Python] Fix / test relative file paths in pyarrow.parquet
Joris Van den Bossche created ARROW-9936: Summary: [Python] Fix / test relative file paths in pyarrow.parquet Key: ARROW-9936 URL: https://issues.apache.org/jira/browse/ARROW-9936 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 2.0.0 It seems that I broke writing parquet to relative file paths in ARROW-9718 (again, something similar happened in the pyarrow.dataset reading), so should fix that and properly test this. {code} In [3]: pq.write_table(table, "test_relative.parquet") ... ~/scipy/repos/arrow/python/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.from_uri() ArrowInvalid: URI has empty scheme: 'test_relative.parquet' {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
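Until the regression is fixed, one possible workaround (an assumption on my part, not a confirmed fix) is to resolve the path before handing it to pyarrow, so that `FileSystem.from_uri` never sees a bare relative name it could mistake for a scheme-less URI. Only the path resolution is runnable below; the pyarrow call is shown as a comment:

```python
import os

# The failing call treats "test_relative.parquet" as a URI with an empty
# scheme. Resolving to an absolute filesystem path first may sidestep that
# (workaround assumption, not a confirmed upstream fix):
path = os.path.abspath("test_relative.parquet")
assert os.path.isabs(path)

# pq.write_table(table, path)   # hypothetical usage with the resolved path
```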
[jira] [Updated] (ARROW-9920) [Python] pyarrow.concat_arrays segfaults when passing it a chunked array
[ https://issues.apache.org/jira/browse/ARROW-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-9920: - Fix Version/s: 2.0.0 > [Python] pyarrow.concat_arrays segfaults when passing it a chunked array > > > Key: ARROW-9920 > URL: https://issues.apache.org/jira/browse/ARROW-9920 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 2.0.0 > > > One can concat the chunks of a ChunkedArray with {{concat_arrays}} passing it > the list of chunks: > {code} > In [1]: arr = pa.chunked_array([[0, 1], [3, 4]]) > In [2]: pa.concat_arrays(arr.chunks) > Out[2]: > > [ > 0, > 1, > 3, > 4 > ] > {code} > but if passing the chunked array itself, you get a segfault: > {code} > In [4]: pa.concat_arrays(arr) > Segmentation fault (core dumped) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9927) [R] Add dplyr group_by, summarise and mutate support in function open_dataset R arrow package
[ https://issues.apache.org/jira/browse/ARROW-9927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pal updated ARROW-9927: --- Issue Type: Improvement (was: Bug) Priority: Critical (was: Major) > [R] Add dplyr group_by, summarise and mutate support in function open_dataset > R arrow package > --- > > Key: ARROW-9927 > URL: https://issues.apache.org/jira/browse/ARROW-9927 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 1.0.1 >Reporter: Pal >Priority: Critical > > Hi, > > The open_dataset() function in the R arrow package already includes > support for the dplyr filter, select and rename functions. However, it would be a > huge improvement if it could also include other functions such as group_by, > summarise and mutate before calling collect(). Is there any idea or project > going on to do so? Would it be possible to include those features > (also compatible with dplyr versions < 1)? > Many thanks for this excellent job. -- This message was sent by Atlassian Jira (v8.3.4#803005)