[jira] [Updated] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef
[ https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10215:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Rename "Source" typedef
> -------------------------------------------
>
>                 Key: ARROW-10215
>                 URL: https://issues.apache.org/jira/browse/ARROW-10215
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>            Reporter: Andy Grove
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The name "Source" for this type doesn't make sense to me. I would like to discuss alternate names for it.
> {code:java}
> type Source = Box; {code}
> My first thoughts are:
> * RecordBatchIterator
> * RecordBatchStream
> * SendableRecordBatchReader

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
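For context, a minimal sketch of what such a rename could look like. The `RecordBatchReader` trait below is a stand-in for illustration, not the real arrow API (whose generic parameters were stripped from the quoted code), and the alias names are the ones proposed in the issue:

```rust
// Stand-in trait for illustration only; the real arrow RecordBatchReader
// yields RecordBatch values, not integers.
trait RecordBatchReader {
    fn next_batch(&mut self) -> Option<u32>;
}

// The alias roughly as it exists today (per the issue):
type Source = Box<dyn RecordBatchReader + Send>;

// One of the proposed replacements; a pure rename, so call sites only
// change the type name:
type SendableRecordBatchReader = Box<dyn RecordBatchReader + Send>;

struct Counter {
    remaining: u32,
}

impl RecordBatchReader for Counter {
    fn next_batch(&mut self) -> Option<u32> {
        if self.remaining == 0 {
            None
        } else {
            self.remaining -= 1;
            Some(self.remaining)
        }
    }
}

fn main() {
    let mut reader: SendableRecordBatchReader = Box::new(Counter { remaining: 2 });
    assert_eq!(reader.next_batch(), Some(1));
    assert_eq!(reader.next_batch(), Some(0));
    assert_eq!(reader.next_batch(), None);
}
```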
[jira] [Commented] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef
[ https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210625#comment-17210625 ]

Jorge Leitão commented on ARROW-10215:
--------------------------------------
I agree. I used it for iterating during the PR, but it was not intended to remain like that. Any of your suggestions is fine by me.
[jira] [Assigned] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version
[ https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs reassigned ARROW-9553:
--------------------------------------
    Assignee: Krisztian Szucs  (was: Andy Grove)

> [Rust] Release script doesn't bump parquet crate's arrow dependency version
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-9553
>                 URL: https://issues.apache.org/jira/browse/ARROW-9553
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>    Affects Versions: 1.0.0
>            Reporter: Krisztian Szucs
>            Assignee: Krisztian Szucs
>            Priority: Major
>             Fix For: 2.0.0
>
>
> After rebasing on master, the Rust builds have started to fail. The solution is to bump a version number here:
> https://github.com/apache/arrow/pull/7829
[jira] [Commented] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")
[ https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210605#comment-17210605 ]

Josh Taylor commented on ARROW-10242:
-------------------------------------
Hi [~andygrove], I'm not sure if I'm using a nested type; they should all be pretty primitive types. I'll start by removing all the fields and field types and adding one at a time to see what causes it to explode. Thanks for the swift response!

> Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10242
>                 URL: https://issues.apache.org/jira/browse/ARROW-10242
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>    Affects Versions: 2.0.0
>            Reporter: Josh Taylor
>            Assignee: Andy Grove
>            Priority: Major
>
> *Running the latest code from GitHub for DataFusion & Parquet.*
> When trying to read a directory of around ~210 parquet files (3.2 GB total, each file around 13-18 MB), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>     "something",
>     "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>     "select * from something",
> )?;
> let results = df.collect().await?;
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel"){code}
[jira] [Comment Edited] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")
[ https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210605#comment-17210605 ]

Josh Taylor edited comment on ARROW-10242 at 10/9/20, 4:21 AM:
---------------------------------------------------------------
Hi [~andygrove], I'm not sure if I'm using a nested type; they should all be pretty primitive types. I'll start by removing all the fields and field types and adding one at a time to see what causes it to explode. I'm not seeing any other errors. Thanks for the swift response!

was (Author: joshx):
Hi [~andygrove], I'm not sure if I'm using a nested type; they should all be pretty primitive types. I'll start by removing all the fields and field types and adding one at a time to see what causes it to explode. Thanks for the swift response!
[jira] [Commented] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version
[ https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210603#comment-17210603 ]

Andy Grove commented on ARROW-9553:
-----------------------------------
Actually it has two separate dependencies on arrow, in [dependencies] and [dev-dependencies], with a different format in each.
[jira] [Commented] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version
[ https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210602#comment-17210602 ]

Andy Grove commented on ARROW-9553:
-----------------------------------
The release-test script is looking for this pattern:
{code:java}
["-arrow = { path = \"../arrow\", version = \"#{@snapshot_version}\" }", "+arrow = { path = \"../arrow\", version = \"#{@release_version}\" }"]
{code}
The parquet Cargo.toml does not match:
{code:java}
arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true }
{code}
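To make the mismatch concrete: the script matches the whole dependency line verbatim, so any extra key such as `optional = true` defeats it. A hedged sketch (plain Rust, not the actual release script, which is Ruby) of a bump that replaces only the version substring and therefore tolerates both formats:

```rust
// Sketch: replacing just the version substring handles both dependency
// formats, unlike matching the entire line verbatim.
fn bump_version(line: &str, snapshot: &str, release: &str) -> String {
    line.replace(snapshot, release)
}

fn main() {
    let plain = r#"arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }"#;
    let optional = r#"arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true }"#;
    assert_eq!(
        bump_version(plain, "2.0.0-SNAPSHOT", "2.0.0"),
        r#"arrow = { path = "../arrow", version = "2.0.0" }"#
    );
    // The `optional = true` form that the verbatim pattern misses:
    assert_eq!(
        bump_version(optional, "2.0.0-SNAPSHOT", "2.0.0"),
        r#"arrow = { path = "../arrow", version = "2.0.0", optional = true }"#
    );
}
```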
[jira] [Updated] (ARROW-9847) [Rust] Inconsistent use of import arrow:: vs crate::arrow::
[ https://issues.apache.org/jira/browse/ARROW-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-9847:
------------------------------
    Fix Version/s:     (was: 2.0.0)
                   3.0.0

> [Rust] Inconsistent use of import arrow:: vs crate::arrow::
> ------------------------------------------------------------
>
>                 Key: ARROW-9847
>                 URL: https://issues.apache.org/jira/browse/ARROW-9847
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Both the DataFusion and Parquet crates have a mix of "import arrow::" and "import crate::arrow::", and we should standardize on one or the other.
>
> Whichever standard we use should be enforced in build.rs so CI fails on PRs that do not follow the standard.
[jira] [Updated] (ARROW-10186) [Rust] Tests fail when following instructions in README
[ https://issues.apache.org/jira/browse/ARROW-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-10186:
-------------------------------
    Fix Version/s:     (was: 2.0.0)
                   3.0.0

> [Rust] Tests fail when following instructions in README
> --------------------------------------------------------
>
>                 Key: ARROW-10186
>                 URL: https://issues.apache.org/jira/browse/ARROW-10186
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>             Fix For: 3.0.0
>
>
> If I follow the instructions from the README and set the test paths as follows, some of the IPC tests fail with "no such file or directory".
> {code:java}
> export PARQUET_TEST_DATA=../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=../testing/data {code}
> If I change them to absolute paths as follows then the tests pass:
> {code:java}
> export PARQUET_TEST_DATA=`pwd`/../cpp/submodules/parquet-testing/data
> export ARROW_TEST_DATA=`pwd`/../testing/data {code}
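A hedged sketch of the underlying issue: a relative path resolves against whatever working directory each test binary happens to run in, so resolving it against the current directory up front (which is what the `pwd`-based workaround does) makes it stable. The function name below is illustrative, not Arrow code:

```rust
use std::env;
use std::path::PathBuf;

// Resolve a possibly-relative test-data path against the current working
// directory, mirroring the `pwd`-based workaround from the issue.
fn resolve_test_data(path: &str) -> PathBuf {
    let p = PathBuf::from(path);
    if p.is_absolute() {
        p
    } else {
        env::current_dir().expect("no current dir").join(p)
    }
}

fn main() {
    // Whatever the cwd is, the result is now absolute and stable.
    let abs = resolve_test_data("../testing/data");
    assert!(abs.is_absolute());
    // Already-absolute paths pass through unchanged.
    assert_eq!(resolve_test_data("/tmp"), PathBuf::from("/tmp"));
}
```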
[jira] [Commented] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")
[ https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210600#comment-17210600 ]

Andy Grove commented on ARROW-10242:
------------------------------------
Hi [~joshx], and thanks for the bug report. I was unable to reproduce the issue on any of the parquet data sets that I usually test with, but they are simple data sets containing primitive types. My first guess here is that there is something in the files that DataFusion doesn't support and the error message is being suppressed, but this is just a guess. Do your files contain nested types? Do you see any other errors before the disconnected channel error?
[jira] [Assigned] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")
[ https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove reassigned ARROW-10242:
----------------------------------
    Assignee: Andy Grove
[jira] [Created] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")
Josh Taylor created ARROW-10242:
-----------------------------------
             Summary: Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")
                 Key: ARROW-10242
                 URL: https://issues.apache.org/jira/browse/ARROW-10242
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust, Rust - DataFusion
    Affects Versions: 2.0.0
            Reporter: Josh Taylor

*Running the latest code from GitHub for DataFusion & Parquet.*

When trying to read a directory of around ~210 parquet files (3.2 GB total, each file around 13-18 MB), doing the following:
{code:java}
let mut ctx = ExecutionContext::new();
// register parquet file with the execution context
ctx.register_parquet(
    "something",
    "/home/josh/dev/pat/fff/"
)?;
// execute the query
let df = ctx.sql(
    "select * from something",
)?;
let results = df.collect().await?;
{code}
I get the following error shown ~204 times:
{code:java}
Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel"){code}
[jira] [Resolved] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too
[ https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou resolved ARROW-10239.
----------------------------------
    Resolution: Fixed

Issue resolved by pull request 8406
https://github.com/apache/arrow/pull/8406

> [C++] aws-sdk-cpp apparently requires zlib too
> -----------------------------------------------
>
>                 Key: ARROW-10239
>                 URL: https://issues.apache.org/jira/browse/ARROW-10239
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Packaging
>            Reporter: Neal Richardson
>            Assignee: Kouhei Sutou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9
> If you happen to be building on a bare system without zlib, the bundled aws-sdk-cpp build fails:
> https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930
[jira] [Resolved] (ARROW-10237) [C++] Duplicate values in a dictionary result in corrupted parquet
[ https://issues.apache.org/jira/browse/ARROW-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Kietzman resolved ARROW-10237.
----------------------------------
    Resolution: Fixed

Issue resolved by pull request 8403
https://github.com/apache/arrow/pull/8403

> [C++] Duplicate values in a dictionary result in corrupted parquet
> -------------------------------------------------------------------
>
>                 Key: ARROW-10237
>                 URL: https://issues.apache.org/jira/browse/ARROW-10237
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.0.1
>            Reporter: Ben Kietzman
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Initial discussion:
> https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E
[jira] [Commented] (ARROW-10153) [Java] Adding values to VarCharVector beyond 2GB results in IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/ARROW-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210562#comment-17210562 ]

Liya Fan commented on ARROW-10153:
----------------------------------
[~samarthjain] Thanks for reporting the problem. As indicated by [~emkornfi...@gmail.com], a LargeVarCharVector employs an 8-byte offset buffer, so the data locality can be worse, as less data can be loaded into cache. You can check the vector capacity by calling the {{BaseVariableWidthVector#getValueCapacity}} API. Finally, it could be expensive to copy the data from a VarCharVector to a LargeVarCharVector, so if the data size may exceed 2GB, maybe you should consider LargeVarCharVector from the very beginning?

> [Java] Adding values to VarCharVector beyond 2GB results in IndexOutOfBoundsException
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-10153
>                 URL: https://issues.apache.org/jira/browse/ARROW-10153
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.0.0
>            Reporter: Samarth Jain
>            Priority: Major
>
> On executing the below test case, one can see that on adding the 2049th string of size 1MB, it fails.
> {code:java}
> int length = 1024 * 1024;
> StringBuilder sb = new StringBuilder(length);
> for (int i = 0; i < length; i++) {
>     sb.append("a");
> }
> byte[] str = sb.toString().getBytes();
> VarCharVector vector = new VarCharVector("v", new RootAllocator(Long.MAX_VALUE));
> vector.allocateNew(3000);
> for (int i = 0; i < 3000; i++) {
>     vector.setSafe(i, str);
> }{code}
>
> {code:java}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: -2147483648, length: 1048576 (expected: range(0, 2147483648))
>     at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
>     at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:762)
>     at org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1212)
>     at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1011)
> {code}
> Stepping through the code,
> https://github.com/apache/arrow/blob/master/java/memory/memory-core/src/main/java/org/apache/arrow/memory/ArrowBuf.java#L425
> returns the negative index `-2147483648`
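As a quick sanity check on the numbers in the stack trace (a sketch, not Arrow code): 2048 values of 1 MiB fill exactly 2 GiB, so the 2049th write needs a 32-bit offset that no longer fits, which is where the negative index comes from:

```rust
fn main() {
    // 2048 values of 1 MiB fill exactly 2 GiB...
    let bytes_written: i64 = 2048 * 1024 * 1024;
    assert_eq!(bytes_written, 2_147_483_648);
    // ...which no longer fits in a signed 32-bit offset: the cast wraps
    // to i32::MIN, matching the "index: -2147483648" in the exception.
    let offset = bytes_written as i32;
    assert_eq!(offset, i32::MIN);
}
```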
[jira] [Resolved] (ARROW-10238) [C#] List is broken
[ https://issues.apache.org/jira/browse/ARROW-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Erhardt resolved ARROW-10238.
----------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8404
https://github.com/apache/arrow/pull/8404

> [C#] List is broken
> --------------------
>
>                 Key: ARROW-10238
>                 URL: https://issues.apache.org/jira/browse/ARROW-10238
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C#
>    Affects Versions: 2.0.0
>            Reporter: Eric Erhardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This code is currently broken:
> https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs#L147
>
> We need to use the `childFields` parameter when creating the ListType, so that if there are recursive nested types, like List, the correct types get passed down.
[jira] [Updated] (ARROW-10241) [C++][Compute] Add variance kernel benchmark
[ https://issues.apache.org/jira/browse/ARROW-10241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10241:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++][Compute] Add variance kernel benchmark
> ---------------------------------------------
>
>                 Key: ARROW-10241
>                 URL: https://issues.apache.org/jira/browse/ARROW-10241
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yibo Cai
>            Assignee: Yibo Cai
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
[jira] [Created] (ARROW-10241) [C++][Compute] Add variance kernel benchmark
Yibo Cai created ARROW-10241:
--------------------------------
             Summary: [C++][Compute] Add variance kernel benchmark
                 Key: ARROW-10241
                 URL: https://issues.apache.org/jira/browse/ARROW-10241
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Yibo Cai
            Assignee: Yibo Cai
[jira] [Updated] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too
[ https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10239:
-----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too
[ https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210538#comment-17210538 ]

Kouhei Sutou commented on ARROW-10239:
--------------------------------------
OK. I'll take a look at this.
[jira] [Commented] (ARROW-1614) [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType
[ https://issues.apache.org/jira/browse/ARROW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210537#comment-17210537 ]

Rok Mihevc commented on ARROW-1614:
-----------------------------------
[~bryanc] Text Extensions for Pandas looks like a good start for the Python part. We'd probably want to base it on pyarrow.Tensor instead of np.ndarray? Would you like to do this, or shall I start and ask you for review?

> [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1614
>                 URL: https://issues.apache.org/jira/browse/ARROW-1614
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Format
>            Reporter: Wes McKinney
>            Priority: Major
>
> In an Arrow table, we would like to add support for a column that has value cells, each containing a tensor value, with all tensors having the same dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.
[jira] [Assigned] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too
[ https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou reassigned ARROW-10239:
------------------------------------
    Assignee: Kouhei Sutou
[jira] [Commented] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query
[ https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210531#comment-17210531 ]

Andy Grove commented on ARROW-10240:
------------------------------------
On second thoughts, I might not be able to get to this right away.

> [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10240
>                 URL: https://issues.apache.org/jira/browse/ARROW-10240
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>            Reporter: Jörn Horstmann
>            Priority: Minor
>
> The tpch benchmark runtime seems to be dominated by CSV parsing code, and it is really difficult to see any performance hotspots related to actual query execution in a flamegraph.
> With the data in memory and more iterations it should be easier to profile and find bottlenecks.
[jira] [Resolved] (ARROW-10164) [Rust] Add support for DictionaryArray types to cast kernels
[ https://issues.apache.org/jira/browse/ARROW-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove resolved ARROW-10164.
--------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8346
https://github.com/apache/arrow/pull/8346

> [Rust] Add support for DictionaryArray types to cast kernels
> -------------------------------------------------------------
>
>                 Key: ARROW-10164
>                 URL: https://issues.apache.org/jira/browse/ARROW-10164
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Rust
>            Reporter: Andrew Lamb
>            Assignee: Andrew Lamb
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> This ticket tracks the work to support casting to/from DictionaryArrays (my use case is DictionaryArrays with a Utf8 dictionary).
> There is prototype work on
> https://github.com/alamb/arrow/tree/alamb/datafusion-string-dictionary
[jira] [Updated] (ARROW-10164) [Rust] Add support for DictionaryArray types to cast kernels
[ https://issues.apache.org/jira/browse/ARROW-10164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-10164:
-------------------------------
    Component/s: Rust
[jira] [Resolved] (ARROW-10043) [Rust] [DataFusion] Introduce support for DISTINCT by partially implementing COUNT(DISTINCT)
[ https://issues.apache.org/jira/browse/ARROW-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove resolved ARROW-10043.
--------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8222
https://github.com/apache/arrow/pull/8222

> [Rust] [DataFusion] Introduce support for DISTINCT by partially implementing COUNT(DISTINCT)
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10043
>                 URL: https://issues.apache.org/jira/browse/ARROW-10043
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Rust, Rust - DataFusion
>            Reporter: Daniel Russo
>            Assignee: Daniel Russo
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> I am unsure where support for {{DISTINCT}} may be on the DataFusion roadmap, so I've filed this with the "Wish" type and "Minor" priority to reflect that this is a proposal:
> Introduce {{DISTINCT}} into DataFusion by partially implementing {{COUNT(DISTINCT)}}. The ultimate goal is to fully support the {{DISTINCT}} keyword, but to get implementation started, limit the scope of this work to:
> * the {{COUNT()}} aggregate function
> * a single expression in {{COUNT()}}, i.e., {{COUNT(DISTINCT c1)}}, but not {{COUNT(DISTINCT c1, c2)}}
> * only queries with a {{GROUP BY}} clause
> * integer types
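To make the proposed semantics concrete, a hedged, self-contained sketch (plain Rust with std collections, not the DataFusion implementation; function and type names are illustrative) of {{COUNT(DISTINCT c1)}} with a {{GROUP BY}} over integer values:

```rust
use std::collections::{HashMap, HashSet};

// COUNT(DISTINCT value) grouped by key: one HashSet per group collects
// the distinct values, and the final count is each set's size.
fn count_distinct_by_group(rows: &[(&str, i64)]) -> HashMap<String, usize> {
    let mut sets: HashMap<String, HashSet<i64>> = HashMap::new();
    for (key, value) in rows {
        sets.entry((*key).to_string()).or_default().insert(*value);
    }
    sets.into_iter().map(|(k, s)| (k, s.len())).collect()
}

fn main() {
    // SELECT key, COUNT(DISTINCT value) FROM rows GROUP BY key
    let rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)];
    let counts = count_distinct_by_group(&rows);
    assert_eq!(counts["a"], 2); // distinct values {1, 2}
    assert_eq!(counts["b"], 1); // distinct values {3}
}
```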
[jira] [Resolved] (ARROW-10015) [Rust] Implement SIMD for aggregate kernel sum
[ https://issues.apache.org/jira/browse/ARROW-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove resolved ARROW-10015.
--------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8370
https://github.com/apache/arrow/pull/8370

> [Rust] Implement SIMD for aggregate kernel sum
> -----------------------------------------------
>
>                 Key: ARROW-10015
>                 URL: https://issues.apache.org/jira/browse/ARROW-10015
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Jorge Leitão
>            Assignee: Jörn Horstmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, our aggregations are made in a simple loop. However, as described [here|https://rust-lang.github.io/packed_simd/perf-guide/vert-hor-ops.html], horizontal operations can also be SIMDed, with reports of 2.7x speedups.
> The goal of this improvement is to support SIMD for "sum" for primitive types.
> The code to modify is in [here|https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/aggregate.rs].
> A good indication that this issue is completed is when the script
> {{cargo bench --bench aggregate_kernels && cargo bench --bench aggregate_kernels --features simd}}
> yields a speed-up.
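The vertical-then-horizontal pattern the linked guide describes can be sketched in plain Rust (illustrative only, not the actual kernel): accumulate into a fixed number of independent lanes so the compiler can vectorize the inner loop, and do a single horizontal reduction only at the end:

```rust
// Sum with LANES independent accumulators; the inner loop has no
// cross-iteration dependency on a single accumulator, which is what
// allows the compiler to vectorize it.
fn lane_sum(values: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let chunks = values.chunks_exact(LANES);
    // `remainder()` returns a slice borrowed from `values`, so it stays
    // usable after the iterator is consumed below.
    let remainder = chunks.remainder();
    for chunk in chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }
    // One horizontal reduction at the very end, plus the tail elements.
    acc.iter().sum::<f32>() + remainder.iter().sum::<f32>()
}

fn main() {
    let values: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    assert_eq!(lane_sum(&values), 55.0);
}
```

Note that lane-wise float summation reorders additions, so results can differ from a strict left-to-right sum in the last bits; the actual kernel has to decide whether that is acceptable.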
[jira] [Resolved] (ARROW-9414) [C++] apt package includes headers for S3 interface, but no support
[ https://issues.apache.org/jira/browse/ARROW-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou resolved ARROW-9414.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 8394
https://github.com/apache/arrow/pull/8394

> [C++] apt package includes headers for S3 interface, but no support
> --------------------------------------------------------------------
>
>                 Key: ARROW-9414
>                 URL: https://issues.apache.org/jira/browse/ARROW-9414
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.17.1
>         Environment: Ubuntu 18.04.04 LTS
>            Reporter: Simon Bertron
>            Assignee: Kouhei Sutou
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>         Attachments: test.cpp
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I believe that the apt package is built without S3 support, but s3fs.h is exported in filesystem/api.h anyway. This creates undefined reference errors when trying to link to the package.
[jira] [Commented] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query
[ https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210516#comment-17210516 ] Andy Grove commented on ARROW-10240: Great idea [~jhorstmann]. Do you want me to take care of this or are you planning on working on it? I could do this tonight. > [Rust] [Datafusion] Optionally load tpch data into memory before running > benchmark query > > > Key: ARROW-10240 > URL: https://issues.apache.org/jira/browse/ARROW-10240 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Priority: Minor > > The tpch benchmark runtime seems to be dominated by csv parsing code and it > is really difficult to see any performance hotspots related to actual query > execution in a flamegraph. > With the data in memory and more iterations it should be easier to profile > and find bottlenecks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too
[ https://issues.apache.org/jira/browse/ARROW-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210500#comment-17210500 ] Neal Richardson commented on ARROW-10239: - [~kou] I'm not going to get to this today, maybe you can during your day? If not, I can take a look tomorrow. It might be as simple as adding {{if ARROW_S3 then ARROW_WITH_ZLIB}} and then registering zlib as a dependency in build_awssdk in case both are being bundled. > [C++] aws-sdk-cpp apparently requires zlib too > -- > > Key: ARROW-10239 > URL: https://issues.apache.org/jira/browse/ARROW-10239 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Packaging >Reporter: Neal Richardson >Priority: Major > Fix For: 2.0.0 > > > https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9 > If you happen to be building on a bare system without zlib, the bundled > aws-sdk-cpp build fails: > https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1614) [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType
[ https://issues.apache.org/jira/browse/ARROW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210489#comment-17210489 ] Bryan Cutler commented on ARROW-1614: - I just wanted to let you all know I have been working on a similar Tensor extension type. I currently have a Pandas extension type for a tensor with conversion to/from an Arrow extension type, just for Python/PyArrow right now, and zero-copy conversion with numpy.ndarrays. It's part of the project [Text Extensions for Pandas|https://github.com/CODAIT/text-extensions-for-pandas] where we use it for NLP feature vectors, but it's really general purpose. You can check it out at [https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/tensor.py] [https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/arrow_conversion.py] Or install the package if you like via {{pip install text-extensions-for-pandas}} (it's currently in alpha) We would love to help out with this effort and contribute what we have to Arrow, if it fits the bill! > [C++] Add a Tensor logical value type with constant dimensions, implemented > using ExtensionType > --- > > Key: ARROW-1614 > URL: https://issues.apache.org/jira/browse/ARROW-1614 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Format >Reporter: Wes McKinney >Priority: Major > > In an Arrow table, we would like to add support for a column that has values > cells each containing a tensor value, with all tensors having the same > dimensions. These would be stored as a binary value, plus some metadata to > store type and shape/strides. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query
Jörn Horstmann created ARROW-10240: -- Summary: [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query Key: ARROW-10240 URL: https://issues.apache.org/jira/browse/ARROW-10240 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jörn Horstmann The tpch benchmark runtime seems to be dominated by csv parsing code and it is really difficult to see any performance hotspots related to actual query execution in a flamegraph. With the data in memory and more iterations it should be easier to profile and find bottlenecks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10109) [Rust] Add support to produce a C Data interface
[ https://issues.apache.org/jira/browse/ARROW-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-10109: Assignee: Jorge Leitão > [Rust] Add support to produce a C Data interface > > > Key: ARROW-10109 > URL: https://issues.apache.org/jira/browse/ARROW-10109 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Jorge Leitão >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > The goal of this issue is to support producing C Data arrays of Rust. > The use-case that motivated this issue was the possibility of running > DataFusion from Python and support moving arrays from DataFusion to > Python/Pyarray and vice-versa. > In particular, so that users can write Python UDFs that expect arrow arrays > and return arrow arrays, in the same spirit as pandas-udfs in Spark work for > Pandas. > The brute-force way of writing these arrays is by converting element by > element from and to native types. The efficient way of doing it to pass the > memory address from and to each implementation, which is zero-copy. > To support the latter, we need an FFI implementation in Rust that produces > and consumes [C's Data > interface|https://arrow.apache.org/docs/format/CDataInterface.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
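The C Data Interface linked above centers on two small C structs (ArrowArray and ArrowSchema) that producer and consumer share by pointer. A minimal {{#[repr(C)]}} mirror of ArrowArray in Rust, with the field layout taken from the format specification, could look like the sketch below; the release callback body and the actual producer logic are omitted.

```rust
use std::os::raw::c_void;

/// Mirror of the ArrowArray struct from the Arrow C Data Interface spec.
/// Field order and widths must match the C definition exactly, hence repr(C).
#[repr(C)]
pub struct FFI_ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut FFI_ArrowArray,
    pub dictionary: *mut FFI_ArrowArray,
    // Option<extern fn> is FFI-safe: None maps to a NULL function pointer,
    // which the spec uses to mark a released (moved-out) array.
    pub release: Option<unsafe extern "C" fn(*mut FFI_ArrowArray)>,
    pub private_data: *mut c_void,
}
```

Moving an array across the boundary then amounts to handing over a pointer to this struct; ownership transfers by the consumer eventually invoking `release`, which is what makes the exchange zero-copy.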
[jira] [Updated] (ARROW-10238) [C#] List is broken
[ https://issues.apache.org/jira/browse/ARROW-10238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10238: --- Labels: pull-request-available (was: ) > [C#] List is broken > --- > > Key: ARROW-10238 > URL: https://issues.apache.org/jira/browse/ARROW-10238 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 2.0.0 >Reporter: Eric Erhardt >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This code is currently broken: > [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs#L147] > > We need to use the `childFields` parameter when creating the ListType, that > way if there are recursive nested types, like List, the correct types > flow down. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10237) [C++] Duplicate values in a dictionary result in corrupted parquet
[ https://issues.apache.org/jira/browse/ARROW-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10237: --- Labels: pull-request-available (was: ) > [C++] Duplicate values in a dictionary result in corrupted parquet > -- > > Key: ARROW-10237 > URL: https://issues.apache.org/jira/browse/ARROW-10237 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 1.0.1 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Initial discussion: > https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10237) [C++] Duplicate values in a dictionary result in corrupted parquet
[ https://issues.apache.org/jira/browse/ARROW-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-10237: Assignee: Ben Kietzman (was: Antoine Pitrou) > [C++] Duplicate values in a dictionary result in corrupted parquet > -- > > Key: ARROW-10237 > URL: https://issues.apache.org/jira/browse/ARROW-10237 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 1.0.1 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Fix For: 2.0.0 > > > Initial discussion: > https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10239) [C++] aws-sdk-cpp apparently requires zlib too
Neal Richardson created ARROW-10239: --- Summary: [C++] aws-sdk-cpp apparently requires zlib too Key: ARROW-10239 URL: https://issues.apache.org/jira/browse/ARROW-10239 Project: Apache Arrow Issue Type: Bug Components: C++, Packaging Reporter: Neal Richardson Fix For: 2.0.0 https://github.com/aws/aws-sdk-cpp/blob/master/cmake/external_dependencies.cmake#L9 If you happen to be building on a bare system without zlib, the bundled aws-sdk-cpp build fails: https://github.com/ursa-labs/arrow-r-nightly/runs/1227805500?check_suite_focus=true#step:4:930 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10238) [C#] List is broken
Eric Erhardt created ARROW-10238: Summary: [C#] List is broken Key: ARROW-10238 URL: https://issues.apache.org/jira/browse/ARROW-10238 Project: Apache Arrow Issue Type: Bug Components: C# Affects Versions: 2.0.0 Reporter: Eric Erhardt This code is currently broken: [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs#L147] We need to use the `childFields` parameter when creating the ListType, that way if there are recursive nested types, like List, the correct types flow down. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10237) [C++] Duplicate values in a dictionary result in corrupted parquet
Ben Kietzman created ARROW-10237: Summary: [C++] Duplicate values in a dictionary result in corrupted parquet Key: ARROW-10237 URL: https://issues.apache.org/jira/browse/ARROW-10237 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 1.0.1 Reporter: Ben Kietzman Assignee: Antoine Pitrou Fix For: 2.0.0 Initial discussion: https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10220) Cache javascript utf-8 dictionary keys?
[ https://issues.apache.org/jira/browse/ARROW-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10220: --- Labels: pull-request-available (was: ) > Cache javascript utf-8 dictionary keys? > --- > > Key: ARROW-10220 > URL: https://issues.apache.org/jira/browse/ARROW-10220 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: 1.0.1 >Reporter: Ben Schmidt >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > String decoding from arrow tables is a major bottleneck in using arrow in > Javascript–it can take a second to decode a million rows. For utf-8 types, > I'm not sure what could be done; but some memoization would help utf-8 > dictionary types. > Currently, the javascript implementation decodes a utf-8 string every time > you request an item from a dictionary with utf-8 data. If arrow cached the > decoded strings to a native js Map, routine operations like looping over all > the entries in a text column might be on the order of 10x faster. Here's an > observable notebook [benchmarking that and a couple other > strategies|https://observablehq.com/@bmschmidt/faster-arrow-dictionary-unpacking]. > I would file a pull request, but 1) I would have to learn some typescript to > do so, and 2) this idea may be undesirable because it creates new objects > that will increase the memory footprint of a table, rather than just using > the typed arrays. > Some discussion of how the real-world issues here affect the arquero project > is [here|https://github.com/uwdata/arquero/issues/1]. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
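The memoization strategy benchmarked in the notebook above is language-agnostic: decode each dictionary slot from UTF-8 at most once, and serve later lookups from a cache. A sketch of the idea in Rust follows; the types here are illustrative placeholders, not the actual JavaScript Vector implementation.

```rust
// Hypothetical dictionary wrapper that lazily caches decoded strings,
// trading extra memory for far fewer UTF-8 decodes on repeated lookups.
struct CachedDictionary {
    raw: Vec<Vec<u8>>,          // UTF-8 encoded dictionary values
    cache: Vec<Option<String>>, // lazily filled decoded strings
}

impl CachedDictionary {
    fn new(raw: Vec<Vec<u8>>) -> Self {
        let n = raw.len();
        CachedDictionary { raw, cache: vec![None; n] }
    }

    fn get(&mut self, key: usize) -> &str {
        if self.cache[key].is_none() {
            // Decode once; every later call for this key is a cheap borrow.
            let s = String::from_utf8(self.raw[key].clone()).expect("valid utf-8");
            self.cache[key] = Some(s);
        }
        self.cache[key].as_deref().unwrap()
    }
}
```

This mirrors the trade-off raised in the issue: looping over a text column touches each dictionary slot many times, so caching wins, but the cache does grow the table's memory footprint beyond the typed arrays themselves.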
[jira] [Commented] (ARROW-10175) [CI] Nightly hdfs integration test job fails
[ https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210379#comment-17210379 ] Neal Richardson commented on ARROW-10175: - In the link Antoine shared,
{code}
FAILED opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py::TestLibHdfs::test_read_multiple_parquet_files_with_uri
FAILED opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py::TestLibHdfs::test_read_write_parquet_files_with_uri
{code}
[~jorisvandenbossche] can you take a look? > [CI] Nightly hdfs integration test job fails > > > Key: ARROW-10175 > URL: https://issues.apache.org/jira/browse/ARROW-10175 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Neal Richardson >Priority: Major > Fix For: 2.0.0 > > > Two tests fail: > https://github.com/ursa-labs/crossbow/runs/1204680589 > [removed bogus investigation] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10175) [CI] Nightly hdfs integration test job fails
[ https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-10175: --- Assignee: Joris Van den Bossche > [CI] Nightly hdfs integration test job fails > > > Key: ARROW-10175 > URL: https://issues.apache.org/jira/browse/ARROW-10175 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Neal Richardson >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 2.0.0 > > > Two tests fail: > https://github.com/ursa-labs/crossbow/runs/1204680589 > [removed bogus investigation] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9414) [C++] apt package includes headers for S3 interface, but no support
[ https://issues.apache.org/jira/browse/ARROW-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-9414: -- Assignee: Kouhei Sutou (was: Neal Richardson) > [C++] apt package includes headers for S3 interface, but no support > --- > > Key: ARROW-9414 > URL: https://issues.apache.org/jira/browse/ARROW-9414 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04.04 LTS >Reporter: Simon Bertron >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Attachments: test.cpp > > Time Spent: 1h 20m > Remaining Estimate: 0h > > I believe that the apt package is built without S3 support. But s3fs.h is > exported in filesystem/api.h anyway. This creates undefined reference errors > when trying to link to the package. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records
[ https://issues.apache.org/jira/browse/ARROW-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-5845: --- Fix Version/s: (was: 2.0.0) 3.0.0 > [Java] Implement converter between Arrow record batches and Avro records > > > Key: ARROW-5845 > URL: https://issues.apache.org/jira/browse/ARROW-5845 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Fix For: 3.0.0 > > > It would be useful for applications which need to convert Avro data to Arrow > data. > This is an adapter which converts data with an existing API (like the JDBC adapter) > rather than a native reader (like orc). > We implement this function through the Avro java project, receiving params like > Decoder/Schema/DatumReader of Avro and returning a VectorSchemaRoot. For each data > type we have a consumer class as below to get Avro data and write it into a > vector to avoid boxing/unboxing (e.g. GenericRecord#get returns Object)
> {code:java}
> public class AvroIntConsumer implements Consumer {
>     private final IntWriter writer;
>
>     public AvroIntConsumer(IntVector vector) {
>         this.writer = new IntWriterImpl(vector);
>     }
>
>     @Override
>     public void consume(Decoder decoder) throws IOException {
>         writer.writeInt(decoder.readInt());
>         writer.setPosition(writer.getPosition() + 1);
>     }
> }
> {code}
> We intend to support primitive and complex types (null values represented > via a union type with the null type); size limits and field selection could be > optional for users. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9414) [C++] apt package includes headers for S3 interface, but no support
[ https://issues.apache.org/jira/browse/ARROW-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-9414: -- Assignee: Neal Richardson (was: Kouhei Sutou) > [C++] apt package includes headers for S3 interface, but no support > --- > > Key: ARROW-9414 > URL: https://issues.apache.org/jira/browse/ARROW-9414 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04.04 LTS >Reporter: Simon Bertron >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Attachments: test.cpp > > Time Spent: 1h 20m > Remaining Estimate: 0h > > I believe that the apt package is built without S3 support. But s3fs.h is > exported in filesystem/api.h anyway. This creates undefined reference errors > when trying to link to the package. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-10226. -- Resolution: Fixed Although Spark produces the correct result when I run an aggregate query against this parquet file, it too shows bad values when I just query the l_returnflag column so it appears that the files are corrupt and Spark skips the bad rows when building the aggregate? I will keep looking into this but I no longer think this is a bug that we need to spend time on. fyi [~jorgecarleitao] > [Rust] [Parquet] Parquet reader reading wrong columns in some batches within > a parquet file > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210358#comment-17210358 ] Andy Grove commented on ARROW-10226: Here is a test case to reproduce the issue. I uploaded the parquet file to dropbox. It is ~100MB. [https://www.dropbox.com/s/6cpz1h9juxl4c7t/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet?dl=0] [~jorgecarleitao] Thanks for the offer of help. I don't know how much time we should spend on this but if you have the time to take a look at least to confirm the test also fails for you, that would be an extra data point.
{code:java}
#[test]
fn foo() {
    use arrow::array::Array;
    use crate::arrow::arrow_reader::ArrowReader;
    let file = std::fs::File::open(
        "/mnt/tpch/debug/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet").unwrap();
    let file_reader = Rc::new(SerializedFileReader::new(file).unwrap());
    let metadata = file_reader
        .metadata
        .file_metadata()
        .key_value_metadata()
        .as_ref()
        .unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
    let schema = arrow_reader.get_schema().unwrap();
    let projection = vec![4, 5, 6, 7, 8, 9, 10];
    let mut batch_reader = arrow_reader.get_record_reader_by_columns(projection, 40960).unwrap();
    while let Some(batch) = batch_reader.next() {
        let batch = batch.unwrap();
        let mut n = 0;
        match batch.column(4).as_any().downcast_ref::() {
            Some(l_returnflag) => {
                for i in 0..batch.num_rows() {
                    if l_returnflag.is_valid(i) {
                        if l_returnflag.value(i).len() > 1 {
                            n = n + 1;
                        }
                    }
                }
            }
            None => println!("l_returnflag is not a string")
        }
        println!("{} bad values in batch", n);
        assert_eq!(n, 0);
    }
}
{code}
> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within > a parquet file > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-10226: --- Summary: [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file (was: [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset) > [Rust] [Parquet] Parquet reader reading wrong columns in some batches within > a parquet file > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210341#comment-17210341 ] Andy Grove commented on ARROW-10226:
{code:java}
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49880 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49979 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 374998 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50031 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375002 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50110 bad values in batch
{code}
> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10206) [Python][C++][FlightRPC] Add client option to disable server validation
[ https://issues.apache.org/jira/browse/ARROW-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210329#comment-17210329 ] James Duong commented on ARROW-10206: - In the PR above, we support building against multiple versions of gRPC.
1. Prior to 1.27, the features in gRPC needed to support this don't exist. The option to disable server verification fails at runtime if used.
2. Between 1.27 and 1.31 (inclusive), the needed features are in the grpc_impl::experimental namespace. The Flight client code is compiled using that namespace.
3. From 1.32 onward, the features are in the grpc::experimental namespace. The Flight client code is compiled using that namespace.
> [Python][C++][FlightRPC] Add client option to disable server validation > --- > > Key: ARROW-10206 > URL: https://issues.apache.org/jira/browse/ARROW-10206 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: James Duong >Assignee: James Duong >Priority: Major > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > > Note that this requires using grpc-cpp version 1.25 or higher. > This requires using GRPC's TlsCredentials class, which is in a different > namespace for 1.25-1.31 vs. 1.32+ as well. > This class and its related options provide an option to disable server > certificate checks and require the caller to supply a callback to be used > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210328#comment-17210328 ] Andy Grove commented on ARROW-10226: Just tracking progress with debugging this. The issue is that the projection is behaving differently PER BATCH within these Parquet files. We expect l_returnflag to be a single char but sometimes the parquet reader is returning the contents of the l_comment field instead.
{code:java}
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: A
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: s among the fluffily r
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: eposits a
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: y ironic foxes above t
{code}
> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210317#comment-17210317 ] Neal Richardson commented on ARROW-10226: - Sounds good, thanks. Good luck! > [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-10226: --- Priority: Major (was: Blocker) > [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210314#comment-17210314 ] Andy Grove commented on ARROW-10226: [~npr] Sure, I changed to major, but my plan was to resolve the issue before we release tomorrow. > [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10109) [Rust] Add support to produce a C Data interface
[ https://issues.apache.org/jira/browse/ARROW-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10109: --- Labels: pull-request-available (was: ) > [Rust] Add support to produce a C Data interface > > > Key: ARROW-10109 > URL: https://issues.apache.org/jira/browse/ARROW-10109 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The goal of this issue is to support producing C Data arrays from Rust. > The use-case that motivated this issue was the possibility of running > DataFusion from Python and support moving arrays from DataFusion to > Python/pyarrow and vice-versa. > In particular, so that users can write Python UDFs that expect arrow arrays > and return arrow arrays, in the same spirit as pandas-udfs in Spark work for > Pandas. > The brute-force way of writing these arrays is by converting element by > element from and to native types. The efficient way of doing it is to pass the > memory address from and to each implementation, which is zero-copy. > To support the latter, we need an FFI implementation in Rust that produces > and consumes [C's Data > interface|https://arrow.apache.org/docs/format/CDataInterface.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10228) [Julia] Donate Julia Implementation
[ https://issues.apache.org/jira/browse/ARROW-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-10228: Component/s: Julia > [Julia] Donate Julia Implementation > --- > > Key: ARROW-10228 > URL: https://issues.apache.org/jira/browse/ARROW-10228 > Project: Apache Arrow > Issue Type: New Feature > Components: Julia >Reporter: Jacob Quinn >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > Contribute pure Julia implementation supporting arrow array types and > reading/writing streams/files with the arrow format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10228) [Julia] Donate Julia Implementation
[ https://issues.apache.org/jira/browse/ARROW-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-10228: - Summary: [Julia] Donate Julia Implementation (was: Donate Julia Implementation) > [Julia] Donate Julia Implementation > --- > > Key: ARROW-10228 > URL: https://issues.apache.org/jira/browse/ARROW-10228 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Jacob Quinn >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Contribute pure Julia implementation supporting arrow array types and > reading/writing streams/files with the arrow format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-5440) [Rust][Parquet] Rust Parquet requiring libstd-xxx.so dependency on centos
[ https://issues.apache.org/jira/browse/ARROW-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale closed ARROW-5440. - Resolution: Cannot Reproduce From the comments, it sounds like this is no longer an issue > [Rust][Parquet] Rust Parquet requiring libstd-xxx.so dependency on centos > - > > Key: ARROW-5440 > URL: https://issues.apache.org/jira/browse/ARROW-5440 > Project: Apache Arrow > Issue Type: Bug > Components: Rust > Environment: CentOS Linux release 7.6.1810 (Core) >Reporter: Tenzin Rigden >Priority: Major > Attachments: parquet-test-libstd.tar.gz, serde_json_test.tar.gz > > > Hello, > In the rust parquet implementation ([https://github.com/sunchao/parquet-rs]) > on centos, the binary created has a `libstd-<hash>.so` shared library > dependency that is causing issues since it's a shared library found in the > rustup directory. This `libstd-<hash>.so` dependency isn't there on any other > rust binaries I've made before. This dependency means that I can't run this > binary anywhere where rustup isn't installed with that exact libstd library. > This is not an issue on Mac. > I've attached the rust files and here is the command line output below. 
> {code:java|title=cli-output|borderStyle=solid} > [centos@_ parquet-test]$ cat /etc/centos-release > CentOS Linux release 7.6.1810 (Core) > [centos@_ parquet-test]$ rustc --version > rustc 1.36.0-nightly (e70d5386d 2019-05-27) > [centos@_ parquet-test]$ ldd target/release/parquet-test > linux-vdso.so.1 => (0x7ffd02fee000) > libstd-44988553032616b2.so => not found > librt.so.1 => /lib64/librt.so.1 (0x7f6ecd209000) > libpthread.so.0 => /lib64/libpthread.so.0 (0x7f6eccfed000) > libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f6eccdd7000) > libc.so.6 => /lib64/libc.so.6 (0x7f6ecca0a000) > libm.so.6 => /lib64/libm.so.6 (0x7f6ecc708000) > /lib64/ld-linux-x86-64.so.2 (0x7f6ecd8b1000) > [centos@_ parquet-test]$ ls -l > ~/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so > -rw-r--r--. 1 centos centos 5623568 May 27 21:46 > /home/centos/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/libstd-44988553032616b2.so > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-5352) [Rust] BinaryArray filter replaces nulls with empty strings
[ https://issues.apache.org/jira/browse/ARROW-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale closed ARROW-5352. - Resolution: Duplicate > [Rust] BinaryArray filter replaces nulls with empty strings > --- > > Key: ARROW-5352 > URL: https://issues.apache.org/jira/browse/ARROW-5352 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.13.0 >Reporter: Neville Dipale >Priority: Minor > > The filter implementation for BinaryArray discards nullness of data. > BinaryArrays that are null (seem to) always return an empty string slice when > getting a value, so the way filter works might be a bug depending on what > Arrow developers' or users' intentions are. > I think we should either preserve nulls (and their count) or document this as > intended behaviour. > Below is a test case that reproduces the bug. > {code:java} > #[test] > fn test_filter_binary_array_with_nulls() { > let mut a: BinaryBuilder = BinaryBuilder::new(100); > a.append_null().unwrap(); > a.append_string("a string").unwrap(); > a.append_null().unwrap(); > a.append_string("with nulls").unwrap(); > let array = a.finish(); > let b = BooleanArray::from(vec![true, true, true, true]); > let c = filter(&array, &b).unwrap(); > let d = c.as_any().downcast_ref::<BinaryArray>().unwrap(); > // I didn't expect this behaviour > assert_eq!("", d.get_string(0)); > // fails here > assert!(d.is_null(0)); > assert_eq!(4, d.len()); > // fails here > assert_eq!(2, d.null_count()); > assert_eq!("a string", d.get_string(1)); > // fails here > assert!(d.is_null(2)); > assert_eq!("with nulls", d.get_string(3)); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10199) [Rust][Parquet] Release Parquet at crates.io to remove debug prints
[ https://issues.apache.org/jira/browse/ARROW-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10199. Fix Version/s: 2.0.0 Resolution: Fixed This has been resolved, and will be fixed in the next release in about a week or two > [Rust][Parquet] Release Parquet at crates.io to remove debug prints > --- > > Key: ARROW-10199 > URL: https://issues.apache.org/jira/browse/ARROW-10199 > Project: Apache Arrow > Issue Type: Wish > Components: Rust >Affects Versions: 1.0.1 >Reporter: Krzysztof Stanisławek >Priority: Critical > Fix For: 2.0.0 > > > Version of Parquet released to docs.rs & crates.io has debug prints in > [https://github.com/apache/arrow/blob/886d87bdea78ce80e39a4b5b6fd6ca6042474c5f/rust/parquet/src/column/writer.rs#L60]. > They were pretty hard to track down, so I suggest considering a logging crate > in the future. When is the new version going to be released? Is there some > stable schedule I can expect? > Is it recommended to use the current snapshot straight from github instead of > crates.io? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210280#comment-17210280 ] Neal Richardson commented on ARROW-10226: - [~andygrove] can you explain why this is a release blocker, given that our release target date is tomorrow? It certainly sounds bad, but if this is not due to a recent change, and perhaps something that never worked, I'm curious why this should hold up 2.0. > [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Blocker > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10225) [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests
[ https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10225: --- Summary: [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests (was: [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests) > [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests > --- > > Key: ARROW-10225 > URL: https://issues.apache.org/jira/browse/ARROW-10225 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The Arrow spec makes the null bitmap optional if an array has no nulls > [~carols10cents], so the tests were failing because we were comparing > `None` with a 100% populated bitmap. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10225) [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests
[ https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10225. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8388 [https://github.com/apache/arrow/pull/8388] > [Rust] [Parquet] Fix bull bitmap comparisons in roundtrip tests > --- > > Key: ARROW-10225 > URL: https://issues.apache.org/jira/browse/ARROW-10225 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The Arrow spec makes the null bitmap optional if an array has no nulls > [~carols10cents], so the tests were failing because we were comparing > `None` with a 100% populated bitmap. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10236: --- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel > - > > Key: ARROW-10236 > URL: https://issues.apache.org/jira/browse/ARROW-10236 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > There are plan time checks for valid type casts in DataFusion that are > designed to catch errors early before plan execution. > Sadly the cast types that DataFusion thinks are valid are a significant subset > of what the arrow cast kernel supports. The goal of this ticket is to bring > DataFusion to parity with the type casting supported by arrow and allow > DataFusion to plan all casts that are supported by the arrow cast kernel > (I want this implicitly so when I add support for DictionaryArray casts in > Arrow they also are part of DataFusion) > Previously the notions of coercion and casting were somewhat conflated. I > have tried to clarify them in https://github.com/apache/arrow/pull/8399 as > well > For more detail, see > https://github.com/apache/arrow/pull/8340#discussion_r501257096 from > [~jorgecarleitao] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9164) [C++] Provide APIs for adding "docstrings" to arrow::compute::Function classes that can be accessed by bindings
[ https://issues.apache.org/jira/browse/ARROW-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-9164: - Assignee: Antoine Pitrou > [C++] Provide APIs for adding "docstrings" to arrow::compute::Function > classes that can be accessed by bindings > --- > > Key: ARROW-9164 > URL: https://issues.apache.org/jira/browse/ARROW-9164 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds
[ https://issues.apache.org/jira/browse/ARROW-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão updated ARROW-10233: - Component/s: Rust > [Rust] Make array_value_to_string available in all Arrow builds > --- > > Key: ARROW-10233 > URL: https://issues.apache.org/jira/browse/ARROW-10233 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Make array_value_to_string available in all Arrow builds > Currently the array_value_to_string function is only available if the > `feature = "prettyprint"` is enabled. > The rationale for making this change is that I want to be able to use > `array_value_to_string` to write tests (such as on > https://github.com/apache/arrow/pull/8346) but currently it is only available > when `feature = "prettyprint"` is enabled. > It appears that [~nevi_me] made prettyprint compilation optional so that > arrow could be compiled for wasm in > https://github.com/apache/arrow/pull/7400. > https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to > some dependency of pretty-table; `array_value_to_string` has no needed > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds
[ https://issues.apache.org/jira/browse/ARROW-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão resolved ARROW-10233. -- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8397 [https://github.com/apache/arrow/pull/8397] > [Rust] Make array_value_to_string available in all Arrow builds > --- > > Key: ARROW-10233 > URL: https://issues.apache.org/jira/browse/ARROW-10233 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Make array_value_to_string available in all Arrow builds > Currently the array_value_to_string function is only available if the > `feature = "prettyprint"` is enabled. > The rationale for making this change is that I want to be able to use > `array_value_to_string` to write tests (such as on > https://github.com/apache/arrow/pull/8346) but currently it is only available > when `feature = "prettyprint"` is enabled. > It appears that [~nevi_me] made prettyprint compilation optional so that > arrow could be compiled for wasm in > https://github.com/apache/arrow/pull/7400. > https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to > some dependency of pretty-table; `array_value_to_string` has no needed > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210273#comment-17210273 ] Jorge Leitão commented on ARROW-10226: -- I am really sorry to hear that. Let me know if there is anything I can help with on this ahead of the release. I can take time over the weekend to bootstrap an environment on the cloud to run this and debug it. I can also easily write some Terraform to bootstrap an environment, so that we have a procedure to run these tests on an independent and "immutable" environment. > [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset > --- > > Key: ARROW-10226 > URL: https://issues.apache.org/jira/browse/ARROW-10226 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Blocker > Fix For: 2.0.0 > > > I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and > when I try and run the TPC-H benchmark, it never completes and eventually > uses up all 64 GB RAM. > I can run Spark against the data set and the query completes in 24 seconds, > which IIRC is how long it took before. > It is possible that something is odd on my environment, but it is also > possible/likely that this is a real bug. > I am investigating this and will update the Jira once I know more. > I also went back to old commits that were working for me before and they show > the same issue so I don't think this is related to a recent code change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6537) [R] Pass column_types to CSV reader
[ https://issues.apache.org/jira/browse/ARROW-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6537. Resolution: Fixed Issue resolved by pull request 7807 [https://github.com/apache/arrow/pull/7807] > [R] Pass column_types to CSV reader > --- > > Key: ARROW-6537 > URL: https://issues.apache.org/jira/browse/ARROW-6537 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Neal Richardson >Assignee: Romain Francois >Priority: Major > Labels: csv, dataset, pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > See also ARROW-6536. It may be the case that the csv reader does accept a > Schema now, I think I saw that, but otherwise it takes unordered_map. > {{read_csv_arrow}} should take for {{col_types}} either a Schema, a named > list of Types, or the "compact string representation" that {{readr}} > supports. Per its docs, "c = character, i = integer, n = number, d = double, > l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or _/- > to skip the column." So, c = utf8(), i = int32(), d = float64(), l = bool(), > f = dictionary(int32(), utf8()), D = date32(), T = timestamp(), t = time32(), > etc. I'm not sure if ? and - are supported, and/or what exactly happens if > you don't specify types for all columns, but I guess we'll find out, and we > can make JIRAs if important features are missing. > Following the existing conventions in csv.R, that compact string > representation would be encapsulated in {{read_csv_arrow}}, so CsvTableReader > and the various Csv*Options would only deal with the Arrow C++ interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8712) [R] Expose strptime timestamp parsing in read_csv conversion options
[ https://issues.apache.org/jira/browse/ARROW-8712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8712. Resolution: Fixed > [R] Expose strptime timestamp parsing in read_csv conversion options > > > Key: ARROW-8712 > URL: https://issues.apache.org/jira/browse/ARROW-8712 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Wes McKinney >Assignee: Romain Francois >Priority: Major > Fix For: 2.0.0 > > > Follow up to ARROW-8111 > It appears that CsvConvertOptions has a {{timestamp_converters}} vector: > https://github.com/apache/arrow/pull/6631/files#diff-06f0ffdc5cae9f7e40e1a80b250dce47R95 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel rules
Andrew Lamb created ARROW-10236: --- Summary: [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel rules Key: ARROW-10236 URL: https://issues.apache.org/jira/browse/ARROW-10236 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Assignee: Andrew Lamb There are plan time checks for valid type casts in DataFusion that are designed to catch errors early before plan execution. Sadly the cast types that DataFusion thinks are valid are a significant subset of what the arrow cast kernel supports. The goal of this ticket is to bring DataFusion to parity with the type casting supported by arrow and allow DataFusion to plan all casts that are supported by the arrow cast kernel (I want this implicitly so when I add support for DictionaryArray casts in Arrow they also are part of DataFusion) Previously the notions of coercion and casting were somewhat conflated. I have tried to clarify them in https://github.com/apache/arrow/pull/8399 as well For more detail, see https://github.com/apache/arrow/pull/8340#discussion_r501257096 from [~jorgecarleitao] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb updated ARROW-10236: Summary: [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel (was: [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel rules ) > [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel > - > > Key: ARROW-10236 > URL: https://issues.apache.org/jira/browse/ARROW-10236 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > > There are plan time checks for valid type casts in DataFusion that are > designed to catch errors early before plan execution. > Sadly the cast types that DataFusion thinks are valid are a significant subset > of what the arrow cast kernel supports. The goal of this ticket is to bring > DataFusion to parity with the type casting supported by arrow and allow > DataFusion to plan all casts that are supported by the arrow cast kernel > (I want this implicitly so when I add support for DictionaryArray casts in > Arrow they also are part of DataFusion) > Previously the notions of coercion and casting were somewhat conflated. I > have tried to clarify them in https://github.com/apache/arrow/pull/8399 as > well > For more detail, see > https://github.com/apache/arrow/pull/8340#discussion_r501257096 from > [~jorgecarleitao] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-9930) [C++] Fix undefined behaviour on invalid IPC (OSS-Fuzz)
[ https://issues.apache.org/jira/browse/ARROW-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-9930. - Resolution: Invalid I don't remember why I opened this, probably a duplicate of another issue. > [C++] Fix undefined behaviour on invalid IPC (OSS-Fuzz) > --- > > Key: ARROW-9930 > URL: https://issues.apache.org/jira/browse/ARROW-9930 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10235) [Rust][DataFusion] Improve documentation for type coercion
[ https://issues.apache.org/jira/browse/ARROW-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão resolved ARROW-10235. -- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8399 [https://github.com/apache/arrow/pull/8399] > [Rust][DataFusion] Improve documentation for type coercion > -- > > Key: ARROW-10235 > URL: https://issues.apache.org/jira/browse/ARROW-10235 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The code / comments for type coercion are a little confusing and don't make > the distinction between coercion and casting clear -- we could improve the > documentation to clarify this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10040) [Rust] Create a way to slice unalligned offset buffers
[ https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10040: -- Assignee: Jörn Horstmann (was: Neville Dipale) > [Rust] Create a way to slice unalligned offset buffers > -- > > Key: ARROW-10040 > URL: https://issues.apache.org/jira/browse/ARROW-10040 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Jörn Horstmann >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > We have limitations on the boolean kernels, where we can't apply the kernels > on buffers whose offsets aren't a multiple of 8. This has the potential of > preventing users from applying some computations on arrays whose offsets > aren't divisible by 8. > We could create methods on Buffer that allow slicing buffers and copying them > into aligned buffers. > An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer; -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10040) [Rust] Create a way to slice unalligned offset buffers
[ https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale reassigned ARROW-10040: -- Assignee: Neville Dipale > [Rust] Create a way to slice unalligned offset buffers > -- > > Key: ARROW-10040 > URL: https://issues.apache.org/jira/browse/ARROW-10040 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Assignee: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > We have limitations on the boolean kernels, where we can't apply the kernels > on buffers whose offsets aren't a multiple of 8. This has the potential of > preventing users from applying some computations on arrays whose offsets > aren't divisible by 8. > We could create methods on Buffer that allow slicing buffers and copying them > into aligned buffers. > An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer; -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers
[ https://issues.apache.org/jira/browse/ARROW-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-10040. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8262 [https://github.com/apache/arrow/pull/8262] > [Rust] Create a way to slice unaligned offset buffers > -- > > Key: ARROW-10040 > URL: https://issues.apache.org/jira/browse/ARROW-10040 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h > Remaining Estimate: 0h > > We have limitations on the boolean kernels, where we can't apply the kernels > on buffers whose offsets aren't a multiple of 8. This has the potential of > preventing users from applying some computations on arrays whose offsets > aren't divisible by 8. > We could create methods on Buffer that allow slicing buffers and copying them > into aligned buffers. > An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer; -- This message was sent by Atlassian Jira (v8.3.4#803005)
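The Buffer::slice idea discussed in ARROW-10040 can be illustrated outside Arrow. The following is a pure-Python sketch (the function name `slice_unaligned` and the byte-oriented loop are illustrative, not Arrow's vectorized implementation) of copying a bit range that starts at a non-byte-aligned offset into a fresh, byte-aligned buffer, using the least-significant-bit-first order that Arrow validity bitmaps use:

```python
def slice_unaligned(buf: bytes, offset: int, length: int) -> bytes:
    """Copy `length` bits starting at bit `offset` of `buf` into a new
    byte-aligned buffer, so bit 0 of the result is bit `offset` of `buf`.
    """
    out = bytearray((length + 7) // 8)
    for i in range(length):
        src = offset + i
        if buf[src // 8] & (1 << (src % 8)):  # LSB-first bit numbering,
            out[i // 8] |= 1 << (i % 8)       # as in Arrow bitmaps
    return bytes(out)

# Bits 3..10 of the input end up at bits 0..7 of a fresh buffer, so
# kernels that require offsets divisible by 8 can then be applied.
aligned = slice_unaligned(bytes([0b10110100, 0b01101101]), 3, 8)
```

A real kernel would copy whole bytes with shifts rather than loop bit by bit, but the aligned-copy semantics are the same.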
[jira] [Commented] (ARROW-3122) [C++] Incremental Variance, Standard Deviation aggregators
[ https://issues.apache.org/jira/browse/ARROW-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210226#comment-17210226 ] Antoine Pitrou commented on ARROW-3122: --- Isn't this fixed by ARROW-10070? > [C++] Incremental Variance, Standard Deviation aggregators > -- > > Key: ARROW-3122 > URL: https://issues.apache.org/jira/browse/ARROW-3122 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: analytics > > These must provide for degrees of freedom adjustment when yielding result -- This message was sent by Atlassian Jira (v8.3.4#803005)
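An incremental variance/standard-deviation aggregator of the kind ARROW-3122 describes is commonly built on Welford's online algorithm, which is single-pass and numerically stable. A minimal Python sketch (illustrative only; the ticket concerns a C++ kernel, and the class and method names here are made up) including the degrees-of-freedom adjustment the description calls for:

```python
import math

class IncrementalVariance:
    """Welford's online algorithm: update in O(1) per value."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the *updated* mean

    def variance(self, ddof: int = 0) -> float:
        # ddof=0 -> population variance, ddof=1 -> sample variance
        if self.n - ddof <= 0:
            raise ValueError("need more than ddof observations")
        return self.m2 / (self.n - ddof)

    def stddev(self, ddof: int = 0) -> float:
        return math.sqrt(self.variance(ddof))
```

The `ddof` parameter is the "degrees of freedom adjustment when yielding the result" mentioned in the issue: the denominator is `n - ddof`, so the same accumulated state can yield either the population or the sample statistic.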
[jira] [Resolved] (ARROW-9967) [Python] Add compute module docs
[ https://issues.apache.org/jira/browse/ARROW-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9967. --- Resolution: Fixed Issue resolved by pull request 8145 [https://github.com/apache/arrow/pull/8145] > [Python] Add compute module docs > > > Key: ARROW-9967 > URL: https://issues.apache.org/jira/browse/ARROW-9967 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Python >Reporter: Andrew Wieteska >Assignee: Andrew Wieteska >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10234) [C++][Gandiva] Fix logic of round() for floats/decimals in Gandiva
Sagnik Chakraborty created ARROW-10234: -- Summary: [C++][Gandiva] Fix logic of round() for floats/decimals in Gandiva Key: ARROW-10234 URL: https://issues.apache.org/jira/browse/ARROW-10234 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Sagnik Chakraborty Assignee: Sagnik Chakraborty round() for floats/doubles is returning incorrect results for some edge cases, like round(cast(1.55 as float), 1) gives 1.6, but it should be 1.5, since the result after casting to float comes to 1.5499999523162842, due to inaccurate representation of floating point numbers in memory. Removing an intermediate explicit cast to float statement for a double value, which is used in subsequent computations, minimises the error introduced due to the incorrect representation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10235) [Rust][DataFusion] Improve documentation for type coercion
[ https://issues.apache.org/jira/browse/ARROW-10235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10235: --- Labels: pull-request-available (was: ) > [Rust][DataFusion] Improve documentation for type coercion > -- > > Key: ARROW-10235 > URL: https://issues.apache.org/jira/browse/ARROW-10235 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The code / comments for type coercion are a little confusing and don't make > the distinction between coercion and casting clear -- we could improve the > documentation to clarify this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10234) [C++][Gandiva] Fix logic of round() for floats/decimals in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10234: --- Labels: pull-request-available (was: ) > [C++][Gandiva] Fix logic of round() for floats/decimals in Gandiva > -- > > Key: ARROW-10234 > URL: https://issues.apache.org/jira/browse/ARROW-10234 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Sagnik Chakraborty >Assignee: Sagnik Chakraborty >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > round() for floats/doubles is returning incorrect results for some edge > cases, like round(cast(1.55 as float), 1) gives 1.6, but it should be 1.5, > since the result after casting to float comes to 1.5499999523162842, due to > inaccurate representation of floating point numbers in memory. Removing an > intermediate explicit cast to float statement for a double value, which is > used in subsequent computations, minimises the error introduced due to the > incorrect representation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
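The representation issue described in ARROW-10234 is easy to reproduce: round-tripping 1.55 through IEEE-754 binary32 (the effect of an explicit cast to float) yields a value just below 1.55, so rounding to one decimal place gives 1.5, while the double 1.55 sits just above 1.55 and rounds to 1.6. A small Python demonstration (using `struct` to emulate the cast; this is not Gandiva code):

```python
import struct

def to_float32(x: float) -> float:
    # Round-trip through binary32, emulating an explicit cast to float.
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_float32(1.55))               # 1.5499999523162842
print(round(to_float32(1.55), 1))     # 1.5 -- after the float cast
print(round(1.55, 1))                 # 1.6 -- staying in double
```

This is why dropping the intermediate cast to float and keeping the value as a double through subsequent computations reduces the error.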
[jira] [Resolved] (ARROW-10023) [Gandiva][C++] Implementing Split part function in gandiva
[ https://issues.apache.org/jira/browse/ARROW-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar resolved ARROW-10023. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8231 [https://github.com/apache/arrow/pull/8231] > [Gandiva][C++] Implementing Split part function in gandiva > -- > > Key: ARROW-10023 > URL: https://issues.apache.org/jira/browse/ARROW-10023 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Naman Udasi >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10235) [Rust][DataFusion] Improve documentation for type coercion
Andrew Lamb created ARROW-10235: --- Summary: [Rust][DataFusion] Improve documentation for type coercion Key: ARROW-10235 URL: https://issues.apache.org/jira/browse/ARROW-10235 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Andrew Lamb Assignee: Andrew Lamb The code / comments for type coercion are a little confusing and don't make the distinction between coercion and casting clear -- we could improve the documentation to clarify this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-10165) [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported by Arrow cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb closed ARROW-10165. --- Resolution: Duplicate > [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported > by Arrow cast kernel > - > > Key: ARROW-10165 > URL: https://issues.apache.org/jira/browse/ARROW-10165 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Minor > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > > When the DataFusion planner inserts casts, today it relies on special logic > to determine the valid coded casts. > The actual arrow cast kernels support a much wider range of data types, and > thus DataFusion is artificially limiting the casts it supports for no > particularly good reason I can see. > This ticket tracks the work to remove the extra casting checking in the > datafusion planner and instead simply rely on runtime check of arrow cast > compute kernel > The potential downside of this approach is that the error may be generated > later in the execution process (rather than the planner), and possibly have a > less specific error message, the upside is there is less code and we get > several conversions immediately (like timestamp predicate casting) > I also plan to add DictionaryArray support to the casting kernels and I would > like to avoid having to replicate some part of that logic in DataFusion -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10165) [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported by Arrow cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210151#comment-17210151 ] Andrew Lamb commented on ARROW-10165: - Per comments on the PR, we have decided on a different approach here. I expect the code will be done under the aegis of https://issues.apache.org/jira/browse/ARROW-10163. Closing this one for now > [Rust] [DataFusion] Allow DataFusion to cast all type combinations supported > by Arrow cast kernel > - > > Key: ARROW-10165 > URL: https://issues.apache.org/jira/browse/ARROW-10165 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Minor > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > > When the DataFusion planner inserts casts, today it relies on special logic > to determine the valid coded casts. > The actual arrow cast kernels support a much wider range of data types, and > thus DataFusion is artificially limiting the casts it supports for no > particularly good reason I can see. > This ticket tracks the work to remove the extra casting checking in the > datafusion planner and instead simply rely on runtime check of arrow cast > compute kernel > The potential downside of this approach is that the error may be generated > later in the execution process (rather than the planner), and possibly have a > less specific error message, the upside is there is less code and we get > several conversions immediately (like timestamp predicate casting) > I also plan to add DictionaryArray support to the casting kernels and I would > like to avoid having to replicate some part of that logic in DataFusion -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds
[ https://issues.apache.org/jira/browse/ARROW-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb reassigned ARROW-10233: --- Assignee: Andrew Lamb > [Rust] Make array_value_to_string available in all Arrow builds > --- > > Key: ARROW-10233 > URL: https://issues.apache.org/jira/browse/ARROW-10233 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Make array_value_to_string available in all Arrow builds > Currently the array_value_to_string function it is only available if the > `feature = "prettyprint"` is enabled. > The rationale for making this change is that I want to be able to use > `array_value_to_string` to write tests (such as on > https://github.com/apache/arrow/pull/8346) but currently it is only available > when `feature = "prettyprint"` is enabled. > It appears that [~nevi_me] made prettyprint compilation optional so that > arrow could be compiled for wasm in > https://github.com/apache/arrow/pull/7400. > https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to > some dependency of pretty-table; `array_value_to_string` has no needed > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds
[ https://issues.apache.org/jira/browse/ARROW-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10233: --- Labels: pull-request-available (was: ) > [Rust] Make array_value_to_string available in all Arrow builds > --- > > Key: ARROW-10233 > URL: https://issues.apache.org/jira/browse/ARROW-10233 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Andrew Lamb >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Make array_value_to_string available in all Arrow builds > Currently the array_value_to_string function it is only available if the > `feature = "prettyprint"` is enabled. > The rationale for making this change is that I want to be able to use > `array_value_to_string` to write tests (such as on > https://github.com/apache/arrow/pull/8346) but currently it is only available > when `feature = "prettyprint"` is enabled. > It appears that [~nevi_me] made prettyprint compilation optional so that > arrow could be compiled for wasm in > https://github.com/apache/arrow/pull/7400. > https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to > some dependency of pretty-table; `array_value_to_string` has no needed > dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-10232) FixedSizeListArray is incorrectly written/read to/from parquet
[ https://issues.apache.org/jira/browse/ARROW-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-10232. -- Resolution: Duplicate > FixedSizeListArray is incorrectly written/read to/from parquet > -- > > Key: ARROW-10232 > URL: https://issues.apache.org/jira/browse/ARROW-10232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Simon Perkins >Priority: Major > Fix For: 2.0.0 > > > FixedSizeListArray's seem to be either incorrectly written or read to or from > Parquet files. > > When reading the parquet file, nulls/Nones are returned where the original > values should be. > > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > import numpy as np > np_data = np.arange(20*4).reshape(20, 4).astype(np.float64) > pa_data = pa.FixedSizeListArray.from_arrays(np_data.ravel(), 4) > assert np_data.tolist() == pa_data.tolist() > schema = pa.schema([pa.field("rectangle", pa_data.type)]) > table = pa.table({"rectangle": pa_data}, schema=schema) > pq.write_table(table, "test.parquet") > in_table = pq.read_table("test.parquet") > # rectangle is filled with nulls > assert in_table.column("rectangle").to_pylist() == pa_data.tolist() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10232) FixedSizeListArray is incorrectly written/read to/from parquet
[ https://issues.apache.org/jira/browse/ARROW-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-10232: --- Fix Version/s: 2.0.0 > FixedSizeListArray is incorrectly written/read to/from parquet > -- > > Key: ARROW-10232 > URL: https://issues.apache.org/jira/browse/ARROW-10232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Simon Perkins >Priority: Major > Fix For: 2.0.0 > > > FixedSizeListArray's seem to be either incorrectly written or read to or from > Parquet files. > > When reading the parquet file, nulls/Nones are returned where the original > values should be. > > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > import numpy as np > np_data = np.arange(20*4).reshape(20, 4).astype(np.float64) > pa_data = pa.FixedSizeListArray.from_arrays(np_data.ravel(), 4) > assert np_data.tolist() == pa_data.tolist() > schema = pa.schema([pa.field("rectangle", pa_data.type)]) > table = pa.table({"rectangle": pa_data}, schema=schema) > pq.write_table(table, "test.parquet") > in_table = pq.read_table("test.parquet") > # rectangle is filled with nulls > assert in_table.column("rectangle").to_pylist() == pa_data.tolist() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10232) FixedSizeListArray is incorrectly written/read to/from parquet
[ https://issues.apache.org/jira/browse/ARROW-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210139#comment-17210139 ] Antoine Pitrou commented on ARROW-10232: Thanks for the report. I can confirm this fails on 1.0.1, but it was fixed in git master (we hope to release 2.0.0 in a week or two). > FixedSizeListArray is incorrectly written/read to/from parquet > -- > > Key: ARROW-10232 > URL: https://issues.apache.org/jira/browse/ARROW-10232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: Simon Perkins >Priority: Major > > FixedSizeListArray's seem to be either incorrectly written or read to or from > Parquet files. > > When reading the parquet file, nulls/Nones are returned where the original > values should be. > > {code:python} > import pyarrow as pa > import pyarrow.parquet as pq > import numpy as np > np_data = np.arange(20*4).reshape(20, 4).astype(np.float64) > pa_data = pa.FixedSizeListArray.from_arrays(np_data.ravel(), 4) > assert np_data.tolist() == pa_data.tolist() > schema = pa.schema([pa.field("rectangle", pa_data.type)]) > table = pa.table({"rectangle": pa_data}, schema=schema) > pq.write_table(table, "test.parquet") > in_table = pq.read_table("test.parquet") > # rectangle is filled with nulls > assert in_table.column("rectangle").to_pylist() == pa_data.tolist() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10233) [Rust] Make array_value_to_string available in all Arrow builds
Andrew Lamb created ARROW-10233: --- Summary: [Rust] Make array_value_to_string available in all Arrow builds Key: ARROW-10233 URL: https://issues.apache.org/jira/browse/ARROW-10233 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb Make array_value_to_string available in all Arrow builds Currently the array_value_to_string function is only available if `feature = "prettyprint"` is enabled. The rationale for making this change is that I want to be able to use `array_value_to_string` to write tests (such as on https://github.com/apache/arrow/pull/8346) but currently it is only available when `feature = "prettyprint"` is enabled. It appears that [~nevi_me] made prettyprint compilation optional so that arrow could be compiled for wasm in https://github.com/apache/arrow/pull/7400. https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to some dependency of pretty-table; `array_value_to_string` has no needed dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10232) FixedSizeListArray is incorrectly written/read to/from parquet
Simon Perkins created ARROW-10232: - Summary: FixedSizeListArray is incorrectly written/read to/from parquet Key: ARROW-10232 URL: https://issues.apache.org/jira/browse/ARROW-10232 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 1.0.1 Reporter: Simon Perkins FixedSizeListArrays seem to be either incorrectly written or read to or from Parquet files. When reading the parquet file, nulls/Nones are returned where the original values should be. {code:python} import pyarrow as pa import pyarrow.parquet as pq import numpy as np np_data = np.arange(20*4).reshape(20, 4).astype(np.float64) pa_data = pa.FixedSizeListArray.from_arrays(np_data.ravel(), 4) assert np_data.tolist() == pa_data.tolist() schema = pa.schema([pa.field("rectangle", pa_data.type)]) table = pa.table({"rectangle": pa_data}, schema=schema) pq.write_table(table, "test.parquet") in_table = pq.read_table("test.parquet") # rectangle is filled with nulls assert in_table.column("rectangle").to_pylist() == pa_data.tolist() {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10231) [CI] Unable to download minio in arm32v7 docker image
[ https://issues.apache.org/jira/browse/ARROW-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10231: --- Labels: pull-request-available (was: ) > [CI] Unable to download minio in arm32v7 docker image > - > > Key: ARROW-10231 > URL: https://issues.apache.org/jira/browse/ARROW-10231 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > See build log https://github.com/apache/arrow/runs/1224947766#step:5:2021 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10231) [CI] Unable to download minio in arm32v7 docker image
Krisztian Szucs created ARROW-10231: --- Summary: [CI] Unable to download minio in arm32v7 docker image Key: ARROW-10231 URL: https://issues.apache.org/jira/browse/ARROW-10231 Project: Apache Arrow Issue Type: Improvement Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 2.0.0 See build log https://github.com/apache/arrow/runs/1224947766#step:5:2021 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10230) [JS][Doc] JavaScript documentation fails to build
[ https://issues.apache.org/jira/browse/ARROW-10230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10230: --- Labels: pull-request-available (was: ) > [JS][Doc] JavaScript documentation fails to build > - > > Key: ARROW-10230 > URL: https://issues.apache.org/jira/browse/ARROW-10230 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, JavaScript >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Probably because of typedoc updates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10230) [JS][Doc] JavaScript documentation fails to build
Krisztian Szucs created ARROW-10230: --- Summary: [JS][Doc] JavaScript documentation fails to build Key: ARROW-10230 URL: https://issues.apache.org/jira/browse/ARROW-10230 Project: Apache Arrow Issue Type: Improvement Components: Documentation, JavaScript Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 2.0.0 Probably because of typedoc updates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10230) [JS][Doc] JavaScript documentation fails to build
[ https://issues.apache.org/jira/browse/ARROW-10230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-10230: Issue Type: Bug (was: Improvement) > [JS][Doc] JavaScript documentation fails to build > - > > Key: ARROW-10230 > URL: https://issues.apache.org/jira/browse/ARROW-10230 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, JavaScript >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Fix For: 2.0.0 > > > Probably because of typedoc updates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10229) [C++][Parquet] Remove left over ARROW_LOG statement.
[ https://issues.apache.org/jira/browse/ARROW-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-10229: --- Component/s: C++ > [C++][Parquet] Remove left over ARROW_LOG statement. > > > Key: ARROW-10229 > URL: https://issues.apache.org/jira/browse/ARROW-10229 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10229) [C++][Parquet] Remove left over ARROW_LOG statement.
[ https://issues.apache.org/jira/browse/ARROW-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-10229. Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8392 [https://github.com/apache/arrow/pull/8392] > [C++][Parquet] Remove left over ARROW_LOG statement. > > > Key: ARROW-10229 > URL: https://issues.apache.org/jira/browse/ARROW-10229 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)