[jira] [Commented] (ARROW-10351) [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance
[ https://issues.apache.org/jira/browse/ARROW-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324149#comment-17324149 ] Yibo Cai commented on ARROW-10351: -- Will redo the test on an 8-core desktop. Maybe too many threads (gRPC client, server, compression) are competing for limited cores. > [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously > helps performance > - > > Key: ARROW-10351 > URL: https://issues.apache.org/jira/browse/ARROW-10351 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > > We don't use any asynchronous concepts in the way that Flight is implemented > now, i.e. IPC deconstruction/reconstruction (which may include compression!) > is not performed concurrently with moving FlightData objects through the gRPC > machinery, which may yield suboptimal performance. > It might be better to apply an actor-type approach where a dedicated thread > retrieves and prepares the next raw IPC message (within a Future) while the > current IPC message is being processed -- that way reading/writing to/from > the gRPC stream is not blocked on the IPC code doing its thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
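The actor-type overlap the issue proposes can be sketched as a two-stage pipeline. This is an illustrative Python sketch, not the Flight C++ implementation; `prepare` and `consume` are hypothetical stand-ins for IPC deconstruction/reconstruction and gRPC stream I/O.

```python
# Sketch of overlapping "prepare the next message" with "process the current
# message". A single worker thread prepares message N+1 (the "Future" in the
# issue) while the caller consumes message N, so stream I/O is not blocked
# on the IPC code doing its thing.
from concurrent.futures import ThreadPoolExecutor

def pipeline(messages, prepare, consume):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for msg in messages:
            nxt = pool.submit(prepare, msg)  # prepare next message concurrently
            if pending is not None:
                results.append(consume(pending.result()))
            pending = nxt
        if pending is not None:
            results.append(consume(pending.result()))  # drain the last message
    return results
```

While the caller runs `consume` for one message on its own thread, the worker is already running `prepare` for the next, which is exactly the overlap being asked about.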
[jira] [Updated] (ARROW-12430) [C++] Support LZO compression
[ https://issues.apache.org/jira/browse/ARROW-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haowei Yu updated ARROW-12430: -- Description: I have some code that supports arrow compression with LZO and am willing to contribute. However, I do understand there is a license concern w.r.t using lzo library since it's under GPL2. I am not sure if you can take the change set. (was: I have some code that supports arrow compression with LZO and am willing to contribute. However, I do understand there is a license concern w.r.t using lzo library since it's under GPL2 ) > [C++] Support LZO compression > - > > Key: ARROW-12430 > URL: https://issues.apache.org/jira/browse/ARROW-12430 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Haowei Yu >Priority: Major > > I have some code that supports arrow compression with LZO and am willing to > contribute. However, I do understand there is a license concern w.r.t using > lzo library since it's under GPL2. I am not sure if you can take the change > set. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12430) [C++] Support LZO compression
[ https://issues.apache.org/jira/browse/ARROW-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haowei Yu updated ARROW-12430: -- Description: I have some code that supports arrow compression with LZO and am willing to contribute. However, I do understand there is a license concern w.r.t using lzo library since it's under GPL2 > [C++] Support LZO compression > - > > Key: ARROW-12430 > URL: https://issues.apache.org/jira/browse/ARROW-12430 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Haowei Yu >Priority: Major > > I have some code that supports arrow compression with LZO and am willing to > contribute. However, I do understand there is a license concern w.r.t using > lzo library since it's under GPL2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12430) [C++] Support LZO compression
Haowei Yu created ARROW-12430: - Summary: [C++] Support LZO compression Key: ARROW-12430 URL: https://issues.apache.org/jira/browse/ARROW-12430 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Haowei Yu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
[ https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12429: --- Labels: pull-request-available (was: ) > [C++] MergedGeneratorTestFixture is incorrectly instantiated > > > Key: ARROW-12429 > URL: https://issues.apache.org/jira/browse/ARROW-12429 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] > Looks like the base class was accidentally instantiated instead of the actual > test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
[ https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324121#comment-17324121 ] David Li commented on ARROW-12429: -- Recent versions of Googletest catch this but Googletest stopped releasing after 1.10, instead declaring that master is always usable. Hence it wasn't caught in CI which generally installed the last available "release". I tried the latest master but there are link issues, presumably something has changed in the intervening couple years. > [C++] MergedGeneratorTestFixture is incorrectly instantiated > > > Key: ARROW-12429 > URL: https://issues.apache.org/jira/browse/ARROW-12429 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > > [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] > Looks like the base class was accidentally instantiated instead of the actual > test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
[ https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-12429: - Fix Version/s: (was: 4.0.0) > [C++] MergedGeneratorTestFixture is incorrectly instantiated > > > Key: ARROW-12429 > URL: https://issues.apache.org/jira/browse/ARROW-12429 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > > [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] > Looks like the base class was accidentally instantiated instead of the actual > test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
[ https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-12429: - Fix Version/s: 4.0.0 > [C++] MergedGeneratorTestFixture is incorrectly instantiated > > > Key: ARROW-12429 > URL: https://issues.apache.org/jira/browse/ARROW-12429 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 4.0.0 > > > [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] > Looks like the base class was accidentally instantiated instead of the actual > test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
David Li created ARROW-12429: Summary: [C++] MergedGeneratorTestFixture is incorrectly instantiated Key: ARROW-12429 URL: https://issues.apache.org/jira/browse/ARROW-12429 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: David Li Assignee: David Li [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] Looks like the base class was accidentally instantiated instead of the actual test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12144) [R] wire up exponentiation bindings
[ https://issues.apache.org/jira/browse/ARROW-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12144. -- Resolution: Duplicate > [R] wire up exponentiation bindings > --- > > Key: ARROW-12144 > URL: https://issues.apache.org/jira/browse/ARROW-12144 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > > Once ARROW-11070 is merged, we can remove the R-based workarounds for these. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12372) [R] Developer Docs followups
[ https://issues.apache.org/jira/browse/ARROW-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-12372: -- Assignee: Jonathan Keane > [R] Developer Docs followups > > > Key: ARROW-12372 > URL: https://issues.apache.org/jira/browse/ARROW-12372 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Minor > > To dos to add later: > * Check if {{withr::with_makevars(list(LDFLAGS = ""), > remotes::install_github(...)}} is sufficient instead of > {{withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), > remotes::install_github(...)}} > * add a latest nightly downloader helper function > https://github.com/apache/arrow/pull/9898#discussion_r612891598 > * Add description of docker + crossbow > https://github.com/apache/arrow/pull/9898/files#r613590007 > * Discuss the {{ARROW_PYTHON}} flag: > {{ARROW_PYTHON}} is an alias for: > {code} > set(ARROW_COMPUTE ON) > set(ARROW_CSV ON) > set(ARROW_DATASET ON) > set(ARROW_FILESYSTEM ON) > set(ARROW_HDFS ON) > set(ARROW_JSON ON) > {code} > The only one we don't recommend being on is ARROW_HDFS, should we add that > (at least to the "full" section)? Then builds with the R instructions should > be compatible with python too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12372) [R] Developer Docs followups
[ https://issues.apache.org/jira/browse/ARROW-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12372: --- Description: To dos to add later: * Check if {{withr::with_makevars(list(LDFLAGS = ""), remotes::install_github(...)}} is sufficient instead of {{withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...)}} * add a latest nightly downloader helper function https://github.com/apache/arrow/pull/9898#discussion_r612891598 * Add description of docker + crossbow https://github.com/apache/arrow/pull/9898/files#r613590007 * Discuss the {{ARROW_PYTHON}} flag: {{ARROW_PYTHON}} is an alias for: {code} set(ARROW_COMPUTE ON) set(ARROW_CSV ON) set(ARROW_DATASET ON) set(ARROW_FILESYSTEM ON) set(ARROW_HDFS ON) set(ARROW_JSON ON) {code} The only one we don't recommend being on is ARROW_HDFS, should we add that (at least to the "full" section)? Then builds with the R instructions should be compatible with python too. was: To dos to add later: * Check if {{withr::with_makevars(list(LDFLAGS = ""), remotes::install_github(...)}} is sufficient instead of {{withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...)}} * add a latest nightly downloader helper function https://github.com/apache/arrow/pull/9898#discussion_r612891598 * Add description of docker + crossbow https://github.com/apache/arrow/pull/9898/files#r613590007 > [R] Developer Docs followups > > > Key: ARROW-12372 > URL: https://issues.apache.org/jira/browse/ARROW-12372 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Priority: Minor > > To dos to add later: > * Check if {{withr::with_makevars(list(LDFLAGS = ""), > remotes::install_github(...)}} is sufficient instead of > {{withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), > remotes::install_github(...)}} > * add a latest nightly downloader helper function > 
https://github.com/apache/arrow/pull/9898#discussion_r612891598 > * Add description of docker + crossbow > https://github.com/apache/arrow/pull/9898/files#r613590007 > * Discuss the {{ARROW_PYTHON}} flag: > {{ARROW_PYTHON}} is an alias for: > {code} > set(ARROW_COMPUTE ON) > set(ARROW_CSV ON) > set(ARROW_DATASET ON) > set(ARROW_FILESYSTEM ON) > set(ARROW_HDFS ON) > set(ARROW_JSON ON) > {code} > The only one we don't recommend being on is ARROW_HDFS, should we add that > (at least to the "full" section)? Then builds with the R instructions should > be compatible with python too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324063#comment-17324063 ] David Li commented on ARROW-12428: -- Finally, if we perform column selection, fsspec's readahead is actually extremely detrimental: {noformat} Pandas/S3FS (no pre-buffer): 88.26093492098153 seconds Pandas/S3FS (pre-buffer): 107.76374901900999 seconds PyArrow (no pre-buffer): 55.75352717819624 seconds PyArrow (pre-buffer): 9.941459016874433 seconds {noformat} {code:python} columns = ['vendor_id', 'pickup_latitude', 'pickup_longitude', 'extra'] start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False) duration = time.monotonic() - start print("Pandas/S3FS (no pre-buffer):", duration, "seconds") start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True) duration = time.monotonic() - start print("Pandas/S3FS (pre-buffer):", duration, "seconds") start = time.monotonic() df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False) duration = time.monotonic() - start print("PyArrow (no pre-buffer):", duration, "seconds") start = time.monotonic() df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True) duration = time.monotonic() - start print("PyArrow (pre-buffer):", duration, "seconds") {code} > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. 
The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
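Conceptually, pre-buffering wins on high-latency filesystems because it coalesces the many small column-chunk reads a Parquet file needs into a few large range requests issued up front. The following is an illustrative sketch of that coalescing idea, not Arrow's actual read-cache code; the `hole_size_limit` name and threshold are assumptions for the example.

```python
# Illustrative sketch (not Arrow's implementation) of byte-range coalescing:
# nearby (offset, length) ranges are merged into one large read when the gap
# ("hole") between them is small, trading a little wasted bandwidth for far
# fewer round trips on a high-latency filesystem like S3.
def coalesce_ranges(ranges, hole_size_limit=8192):
    """Merge sorted (offset, length) ranges whose gap is <= hole_size_limit."""
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_off, prev_len = merged[-1]
            if offset - (prev_off + prev_len) <= hole_size_limit:
                # Extend the previous range to cover this one as well.
                new_end = max(prev_off + prev_len, offset + length)
                merged[-1] = (prev_off, new_end - prev_off)
                continue
        merged.append((offset, length))
    return merged
```

On a local NVMe disk the extra buffering neither helps nor hurts much (as the benchmarks above show), because the per-request latency being amortized is already tiny.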
[jira] [Updated] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12428: --- Labels: pull-request-available (was: ) > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324043#comment-17324043 ] David Li edited comment on ARROW-12428 at 4/16/21, 7:41 PM: And for local files, to confirm that pre_buffer isn't a negative: {noformat} Pandas: 14.584974920144305 seconds PyArrow: 6.650648137088865 seconds PyArrow (pre-buffer): 6.587288308190182 seconds {noformat} This is on a system with NVME storage, so results may vary for spinning-rust or SATA SSDs. (Updated results to read once without measuring before taking the measurement, in case disk cache is a factor) was (Author: lidavidm): And for local files, to confirm that pre_buffer isn't a negative: {noformat} Pandas: 14.566267257090658 seconds PyArrow: 6.649410092970356 seconds PyArrow (pre-buffer): 6.627140663098544 seconds {noformat} This is on a system with NVME storage, so results may vary for spinning-rust or SATA SSDs. > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 5.0.0 > > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324043#comment-17324043 ] David Li commented on ARROW-12428: -- And for local files, to confirm that pre_buffer isn't a negative: {noformat} Pandas: 14.566267257090658 seconds PyArrow: 6.649410092970356 seconds PyArrow (pre-buffer): 6.627140663098544 seconds {noformat} This is on a system with NVME storage, so results may vary for spinning-rust or SATA SSDs. > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 5.0.0 > > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324041#comment-17324041 ] David Li commented on ARROW-12428: -- Here's a quick comparison between Pandas/S3FS and PyArrow with a pre_buffer option implemented: {noformat} Python: 3.9.2 Pandas: 1.2.3 PyArrow: 5.0.0 master (9c1e5bd19347635ea9f373bcf93f2cea0231d50a) Pandas/S3FS: 107.31099020410329 seconds Pandas/S3FS (no readahead): 676.9701101030223 seconds PyArrow: 213.81073790509254 seconds PyArrow (pre-buffer): 29.330630503827706 seconds Pandas/S3FS (pre-buffer): 54.61801828909665 seconds Pandas/S3FS (pre-buffer, no readahead): 46.7531590978615 seconds {noformat} {code:python} import time import pandas as pd import pyarrow.parquet as pq start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet") duration = time.monotonic() - start print("Pandas/S3FS:", duration, "seconds") start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={ 'default_block_size': 1, # 0 is ignored 'default_fill_cache': False, }) duration = time.monotonic() - start print("Pandas/S3FS (no readahead):", duration, "seconds") start = time.monotonic() df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet") duration = time.monotonic() - start print("PyArrow:", duration, "seconds") start = time.monotonic() df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True) duration = time.monotonic() - start print("PyArrow (pre-buffer):", duration, "seconds") start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True) duration = time.monotonic() - start print("Pandas/S3FS (pre-buffer):", duration, "seconds") start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={ 'default_block_size': 1, # 0 is ignored 'default_fill_cache': False, }, 
pre_buffer=True) duration = time.monotonic() - start print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds") {code} > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 5.0.0 > > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12392) [C++] Restore asynchronous streaming CSV reader
[ https://issues.apache.org/jira/browse/ARROW-12392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-12392: Summary: [C++] Restore asynchronous streaming CSV reader (was: [C++] Restore asynchronous streaming scanner) > [C++] Restore asynchronous streaming CSV reader > --- > > Key: ARROW-12392 > URL: https://issues.apache.org/jira/browse/ARROW-12392 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > In order to support the AsyncScanner we need the asynchronous streaming CSV > reader back (added in ARROW-11887 but reverted later). However, it will > either need to be implemented as a mirror API (so the sync and async > implementations are side-by-side) or the async-API must be wrapped with > RunInSerialExecutor when called synchronously. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
David Li created ARROW-12428: Summary: [Python] pyarrow.parquet.read_* should use pre_buffer=True Key: ARROW-12428 URL: https://issues.apache.org/jira/browse/ARROW-12428 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: David Li Assignee: David Li Fix For: 5.0.0 If the user is synchronously reading a single file, we should try to read it as fast as possible. The one sticking point might be whether it's beneficial to enable this no matter the filesystem or whether we should try to only enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12426) [Rust] Concatenating dictionaries ignores values
[ https://issues.apache.org/jira/browse/ARROW-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12426: --- Labels: pull-request-available (was: ) > [Rust] Concatenating dictionaries ignores values > > > Key: ARROW-12426 > URL: https://issues.apache.org/jira/browse/ARROW-12426 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Concatenating dictionaries ignores the values array, at best leading to > incorrect data, but often leading to keys with indexes beyond the bounds of > the values array -- This message was sent by Atlassian Jira (v8.3.4#803005)
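For reference, a correct concatenation of dictionary-encoded arrays has to merge the values arrays and remap each input's keys into the merged values, so no key can index past the values bounds. A pure-Python sketch of that invariant (not the Rust kernel; arrays are modeled as plain `(keys, values)` pairs):

```python
# Sketch of correct dictionary-array concatenation: merge the values arrays,
# deduplicating entries, and rewrite each input's keys against the merged
# values. Skipping the remap step is exactly the bug described above: keys
# from later inputs end up pointing into the wrong (or out-of-bounds) slots.
def concat_dictionaries(arrays):
    """arrays: list of (keys, values) pairs; returns a merged (keys, values)."""
    merged_values, index_of = [], {}
    out_keys = []
    for keys, values in arrays:
        remap = []
        for v in values:
            if v not in index_of:
                index_of[v] = len(merged_values)
                merged_values.append(v)
            remap.append(index_of[v])
        out_keys.extend(remap[k] for k in keys)
    return out_keys, merged_values
```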
[jira] [Created] (ARROW-12427) [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition;
Andrew Lamb created ARROW-12427: --- Summary: [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition; Key: ARROW-12427 URL: https://issues.apache.org/jira/browse/ARROW-12427 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb To fix https://issues.apache.org/jira/browse/ARROW-12421 we disabled the physical_optimizer::repartition::Repartition rule in https://github.com/apache/arrow/pull/10069. This ticket tracks finding the root cause of the CI test failure and re-enabling physical_optimizer::repartition::Repartition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12427) [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition;
[ https://issues.apache.org/jira/browse/ARROW-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb updated ARROW-12427: Component/s: Rust - DataFusion > [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition; > - > > Key: ARROW-12427 > URL: https://issues.apache.org/jira/browse/ARROW-12427 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andrew Lamb >Priority: Major > > To fix https://issues.apache.org/jira/browse/ARROW-12421 we disabled the > physical_optimizer::repartition::Repartition rule in > https://github.com/apache/arrow/pull/10069. This ticket tracks finding the > root cause of the CI test failure and re-enabling > physical_optimizer::repartition::Repartition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb reassigned ARROW-12421: --- Assignee: Andy Grove > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
>
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures:
>
> ---- topk_query stdout ----
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expected:
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-12421. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 10069 [https://github.com/apache/arrow/pull/10069] > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
>
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures:
>
> ---- topk_query stdout ----
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expected:
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12426) [Rust] Concatenating dictionaries ignores values
Raphael Taylor-Davies created ARROW-12426: - Summary: [Rust] Concatenating dictionaries ignores values Key: ARROW-12426 URL: https://issues.apache.org/jira/browse/ARROW-12426 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies Concatenating dictionaries ignores the values array, at best leading to incorrect data, but often leading to keys with indexes beyond the bounds of the values array -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12425) [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays
[ https://issues.apache.org/jira/browse/ARROW-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12425: --- Labels: pull-request-available (was: ) > [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays > > > Key: ARROW-12425 > URL: https://issues.apache.org/jira/browse/ARROW-12425 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12425) [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays
Raphael Taylor-Davies created ARROW-12425: - Summary: [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays Key: ARROW-12425 URL: https://issues.apache.org/jira/browse/ARROW-12425 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Deleted] (ARROW-12418) 1Z0-1072 PDF - Become Oracle Certified With The Help Of Prepare4test
[ https://issues.apache.org/jira/browse/ARROW-12418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson deleted ARROW-12418: > 1Z0-1072 PDF - Become Oracle Certified With The Help Of Prepare4test > > > Key: ARROW-12418 > URL: https://issues.apache.org/jira/browse/ARROW-12418 > Project: Apache Arrow > Issue Type: Task >Reporter: Andrew Sharon >Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12424) Add Schema Package
[ https://issues.apache.org/jira/browse/ARROW-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12424: --- Labels: pull-request-available (was: ) > Add Schema Package > -- > > Key: ARROW-12424 > URL: https://issues.apache.org/jira/browse/ARROW-12424 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go, Parquet >Reporter: Matt Topol >Assignee: Matt Topol >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Adding the ported code for the Schema module for Go Parquet library. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12231) [C++][Dataset] Separate datasets backed by readers from InMemoryDataset
[ https://issues.apache.org/jira/browse/ARROW-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12231: --- Labels: dataset datasets pull-request-available (was: dataset datasets) > [C++][Dataset] Separate datasets backed by readers from InMemoryDataset > --- > > Key: ARROW-12231 > URL: https://issues.apache.org/jira/browse/ARROW-12231 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 4.0.0 >Reporter: Weston Pace >Assignee: David Li >Priority: Major > Labels: dataset, datasets, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > From ARROW-10882/[https://github.com/apache/arrow/pull/9802] > * Backing an InMemoryDataset with a reader is misleading. Let's split that > out into a separate class. > * Dataset scanning can then use an I/O thread for the new class. (Note that > for Python, we'll need to be careful to release the GIL before any operations > so that the I/O thread can acquire the GIL to call into the underlying Python > reader/file object.) > * Longer-term, we should interface with Python's async. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12424) Add Schema Package
Matt Topol created ARROW-12424: -- Summary: Add Schema Package Key: ARROW-12424 URL: https://issues.apache.org/jira/browse/ARROW-12424 Project: Apache Arrow Issue Type: Sub-task Components: Go, Parquet Reporter: Matt Topol Assignee: Matt Topol Adding the ported code for the Schema module for the Go Parquet library. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12423) Codecov badge in main Readme only applies to Rust
Dominik Moritz created ARROW-12423: -- Summary: Codecov badge in main Readme only applies to Rust Key: ARROW-12423 URL: https://issues.apache.org/jira/browse/ARROW-12423 Project: Apache Arrow Issue Type: Task Reporter: Dominik Moritz The badge in https://github.com/apache/arrow/blob/master/README.md links to https://app.codecov.io/gh/apache/arrow, which seems to only show the coverage for the Rust code. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12114) [C++] Dataset to table filter expression API change
[ https://issues.apache.org/jira/browse/ARROW-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman closed ARROW-12114. Resolution: Not A Problem > [C++] Dataset to table filter expression API change > --- > > Key: ARROW-12114 > URL: https://issues.apache.org/jira/browse/ARROW-12114 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Diana Clarke >Assignee: Ben Kietzman >Priority: Major > > Ben: > Can you please confirm that we're aware and okay with the following API > change? Thanks! > {code} > import pyarrow.dataset > path_prefix = "ursa-labs-taxi-data-repartitioned-10k/" > paths = [ > > f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet" > for year in range(2009, 2020) > for month in range(1, 13) > for part in range(101) > if not (year == 2019 and month > 6) # Data ends in 2019/06 > and not (year == 2010 and month == 3) # Data is missing in 2010/03 > ] > partitioning = pyarrow.dataset.DirectoryPartitioning.discover( > field_names=["year", "month", "part"], > infer_dictionary=True, > ) > s3 = pyarrow.fs.S3FileSystem(region="us-east-2") > dataset = pyarrow.dataset.dataset( > paths, > format="parquet", > filesystem=s3, > partitioning=partitioning, > partition_base_dir=path_prefix, > ) > year = pyarrow.dataset.field("year") > month = pyarrow.dataset.field("month") > part = pyarrow.dataset.field("part") > filter_expr = (year == "2011") & (month == 1) & (part == 2) > dataset.to_table(filter=filter_expr) > {code} > In arrow 3.0, the above code executes without error. > On head[1], {{year == "2011"}}, which should be {{year == 2011}} (no quotes), > raises the following exception. > {code} > pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching > input types (array[int32], scalar[string]) > {code} > This API change appears to have been introduced in ARROW-8919. Perhaps it was > intentional, just figured we should double check. Thanks again! 
> [1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12114) [C++] Dataset to table filter expression API change
[ https://issues.apache.org/jira/browse/ARROW-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323884#comment-17323884 ] Ben Kietzman commented on ARROW-12114: -- I'll close this for now, then. Thanks all > [C++] Dataset to table filter expression API change > --- > > Key: ARROW-12114 > URL: https://issues.apache.org/jira/browse/ARROW-12114 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Diana Clarke >Assignee: Ben Kietzman >Priority: Major > > Ben: > Can you please confirm that we're aware and okay with the following API > change? Thanks! > {code} > import pyarrow.dataset > path_prefix = "ursa-labs-taxi-data-repartitioned-10k/" > paths = [ > > f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet" > for year in range(2009, 2020) > for month in range(1, 13) > for part in range(101) > if not (year == 2019 and month > 6) # Data ends in 2019/06 > and not (year == 2010 and month == 3) # Data is missing in 2010/03 > ] > partitioning = pyarrow.dataset.DirectoryPartitioning.discover( > field_names=["year", "month", "part"], > infer_dictionary=True, > ) > s3 = pyarrow.fs.S3FileSystem(region="us-east-2") > dataset = pyarrow.dataset.dataset( > paths, > format="parquet", > filesystem=s3, > partitioning=partitioning, > partition_base_dir=path_prefix, > ) > year = pyarrow.dataset.field("year") > month = pyarrow.dataset.field("month") > part = pyarrow.dataset.field("part") > filter_expr = (year == "2011") & (month == 1) & (part == 2) > dataset.to_table(filter=filter_expr) > {code} > In arrow 3.0, the above code executes without error. > On head[1], {{year == "2011"}}, which should be {{year == 2011}} (no quotes), > raises the following exception. > {code} > pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching > input types (array[int32], scalar[string]) > {code} > This API change appears to have been introduced in ARROW-8919. 
Perhaps it was > intentional, just figured we should double check. Thanks again! > [1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323867#comment-17323867 ] Jorge Leitão edited comment on ARROW-12421 at 4/16/21, 2:40 PM: I can’t reproduce this in my small ubuntu vm, even if I use `.with_concurrency(50)` on the test. So, it seems it needs physical units to reproduce. was (Author: jorgecarleitao): I can’t reproduce this in my small vm, even if I use `.with_concurrency(50)` on the test. So, it seems it needs physical units to reproduce. > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {code:java} > Running target/debug/deps/user_defined_plan-6b63acb904117235running 3 > tests > test topk_plan ... ok > test topk_query ... FAILED > test normal_query ... okfailures: topk_query stdout > thread 'topk_query' panicked at 'assertion failed: `(left == right)` > left: `["+-+-+", "| customer_id | revenue |", > "+-+-+", "| paul| 300 |", "| jorge | > 200 |", "| andy| 150 |", "+-+-+"]`, > right: `["++", "||", "++", "++"]`: output mismatch for Topk context. > Expectedn > +-+-+ > | customer_id | revenue | > +-+-+ > | paul| 300 | > | jorge | 200 | > | andy| 150 | > +-+-+Actual: > ++ > || > ++ > ++ > ', datafusion/tests/user_defined_plan.rs:133:5 > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323867#comment-17323867 ] Jorge Leitão commented on ARROW-12421: -- I can’t reproduce this in my small vm, even if I use `.with_concurrency(50)` on the test. So, it seems it needs physical units to reproduce. > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures: topk_query stdout
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context.
> Expected:
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12421: --- Labels: pull-request-available (was: ) > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {code:java} > Running target/debug/deps/user_defined_plan-6b63acb904117235running 3 > tests > test topk_plan ... ok > test topk_query ... FAILED > test normal_query ... okfailures: topk_query stdout > thread 'topk_query' panicked at 'assertion failed: `(left == right)` > left: `["+-+-+", "| customer_id | revenue |", > "+-+-+", "| paul| 300 |", "| jorge | > 200 |", "| andy| 150 |", "+-+-+"]`, > right: `["++", "||", "++", "++"]`: output mismatch for Topk context. > Expectedn > +-+-+ > | customer_id | revenue | > +-+-+ > | paul| 300 | > | jorge | 200 | > | andy| 150 | > +-+-+Actual: > ++ > || > ++ > ++ > ', datafusion/tests/user_defined_plan.rs:133:5 > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10351) [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance
[ https://issues.apache.org/jira/browse/ARROW-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323855#comment-17323855 ] David Li commented on ARROW-10351: -- Hmm, I was unable to replicate the results here. I checked out your branch and current master branch. I'm running on an Intel Comet Lake laptop with 8 physical cores. Current master: {noformat} > env OMP_NUM_THREADS=4 ./release/arrow-flight-benchmark -test_put > -num_perf_runs=4 -num_streams=4 -num_threads=1 Using spawned TCP server Server running with pid 5988 Server host: localhost Server port: 31337 Testing method: DoPut Server host: localhost Server port: 31337 Number of perf runs: 4 Number of concurrent gets/puts: 1 Batch size: 131040 Batches written: 39072 Bytes written: 512000 Nanos: 2655271083 Speed: 1838.91 MB/s Throughput: 14714.9 batches/s Latency mean: 65 us Latency quantile=0.5: 65 us Latency quantile=0.95: 75 us Latency quantile=0.99: 82 us Latency max: 941 us {noformat} This branch: {noformat} > env OMP_NUM_THREADS=4 ./release/arrow-flight-benchmark -test_put > -num_perf_runs=4 -num_streams=1 -num_threads=1 Using spawned TCP server Server running with pid 5921 Server host: localhost Server port: 31337 Testing method: DoPut Server host: localhost Server port: 31337 Number of perf runs: 4 Number of concurrent gets/puts: 1 Batch size: 131040 Batches written: 9768 Bytes written: 128000 Nanos: 686687591 Speed: 1777.67 MB/s Throughput: 14224.8 batches/s Latency mean: 67 us Latency quantile=0.5: 67 us Latency quantile=0.95: 76 us Latency quantile=0.99: 92 us Latency max: 958 us {noformat} > [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously > helps performance > - > > Key: ARROW-10351 > URL: https://issues.apache.org/jira/browse/ARROW-10351 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > > We don't use any asynchronous concepts in the way that Flight is 
implemented > now, i.e. IPC deconstruction/reconstruction (which may include compression!) > is not performed concurrent with moving FlightData objects through the gRPC > machinery, which may yield suboptimal performance. > It might be better to apply an actor-type approach where a dedicated thread > retrieves and prepares the next raw IPC message (within a Future) while the > current IPC message is being processed -- that way reading/writing to/from > the gRPC stream is not blocked on the IPC code doing its thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
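The actor-style overlap proposed in ARROW-10351 can be sketched generically. This is a hedged illustration of the pipelining pattern only, not Flight/gRPC code: the names `prepare` and `pipelined` are hypothetical, and `prepare` merely stands in for IPC deconstruction (and possible decompression) of one message. A background worker prefetches and prepares message i+1 while message i is being consumed.

```python
from concurrent.futures import ThreadPoolExecutor

def prepare(raw):
    # Stand-in for preparing one raw IPC message (decode/decompress).
    return raw * 2

def pipelined(messages):
    """Overlap preparing message i+1 with processing message i."""
    if not messages:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(prepare, messages[0])   # prefetch the first message
        for nxt in messages[1:]:
            current = future.result()                # wait for the prepared message
            future = pool.submit(prepare, nxt)       # kick off the next one early
            results.append(current)                  # "process" the current message
        results.append(future.result())
    return results
```

The key property is that the stream-facing side never blocks on `prepare` for the message it is about to hand over, because that work was started one iteration earlier.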
[jira] [Updated] (ARROW-12422) Add castVARCHAR for milliseconds
[ https://issues.apache.org/jira/browse/ARROW-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12422: --- Labels: pull-request-available (was: ) > Add castVARCHAR for milliseconds > > > Key: ARROW-12422 > URL: https://issues.apache.org/jira/browse/ARROW-12422 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Rodrigo Jacomozzi de Bem >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12422) Add castVARCHAR for milliseconds
Rodrigo Jacomozzi de Bem created ARROW-12422: Summary: Add castVARCHAR for milliseconds Key: ARROW-12422 URL: https://issues.apache.org/jira/browse/ARROW-12422 Project: Apache Arrow Issue Type: New Feature Reporter: Rodrigo Jacomozzi de Bem -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12416) [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li closed ARROW-12416. Resolution: Duplicate Last night's builds passed + the linked log is from a couple days ago, so I think we're ok here. > [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python, R >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323820#comment-17323820 ] Alessandro Molina edited comment on ARROW-11780 at 4/16/21, 1:38 PM: - The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}, thus the {{pyarrow_unwrap_array}} doesn't deal with them. {code:python} [[1, 2, 3]] {code} I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {{pyarrow_unwrap_array}} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug. was (Author: amol-): The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}, thus the{{ pyarrow_unwrap_array}} doesn't deal with them. 
{code:python} [[1, 2, 3]] I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {code} {{pyarrow_unwrap_array}} {code:python} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug.{code} > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323820#comment-17323820 ] Alessandro Molina edited comment on ARROW-11780 at 4/16/21, 1:37 PM: - The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}, thus the{{ pyarrow_unwrap_array}} doesn't deal with them. {code:python} [[1, 2, 3]] I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {code} {{pyarrow_unwrap_array}} {code:python} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug.{code} was (Author: amol-): The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}s, thus the {{pyarrow_unwrap_array}} doesn't deal with them. {code:python} [[1, 2, 3]] {code} I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {{pyarrow_unwrap_array}} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug. 
> [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323820#comment-17323820 ] Alessandro Molina commented on ARROW-11780: --- The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}s, thus the {{pyarrow_unwrap_array}} doesn't deal with them. {code:python} [[1, 2, 3]] {code} I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {{pyarrow_unwrap_array}} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug. > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323818#comment-17323818 ] Andy Grove commented on ARROW-12421: This failure happens consistently on my 24 core Threadripper desktop running Ubuntu but I cannot reproduce it on my MacBook Pro or on my work PC (6 cores, also Ubuntu). > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > > {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures: topk_query stdout
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context.
> Expected:
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12417) [CI] [Python] ERROR: ninja: build stopped: subcommand failed
[ https://issues.apache.org/jira/browse/ARROW-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12417: --- Summary: [CI] [Python] ERROR: ninja: build stopped: subcommand failed (was: [Nightly] [Continuous Integration] [Python] [CUDA] ERROR: ninja: build stopped: subcommand failed) > [CI] [Python] ERROR: ninja: build stopped: subcommand failed > > > Key: ARROW-12417 > URL: https://issues.apache.org/jira/browse/ARROW-12417 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Building unity_0 fails. > This affects Python 3.6 and 3.7, see > * > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3579&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1509 > * > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3576&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1383 > This task has failed 33 times across 209 pipeline runs in the last 14 days > (https://dev.azure.com/ursacomputing/crossbow/_pipeline/analytics/stageawareoutcome?definitionId=1) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12417) [Nightly] [Continuous Integration] [Python] [CUDA] ERROR: ninja: build stopped: subcommand failed
[ https://issues.apache.org/jira/browse/ARROW-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12417: --- Component/s: Python Continuous Integration > [Nightly] [Continuous Integration] [Python] [CUDA] ERROR: ninja: build > stopped: subcommand failed > - > > Key: ARROW-12417 > URL: https://issues.apache.org/jira/browse/ARROW-12417 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Building unity_0 fails. > This affects Python 3.6 and 3.7, see > * > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3579&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1509 > * > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3576&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1383 > This task has failed 33 times across 209 pipeline runs in the last 14 days > (https://dev.azure.com/ursacomputing/crossbow/_pipeline/analytics/stageawareoutcome?definitionId=1) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12416) [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323794#comment-17323794 ] Jonathan Keane commented on ARROW-12416: Did this fail again today? If not (since you've identified a fix) I would say let's close it as a duplicate of ARROW-12383 > [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python, R >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12416) [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12416: --- Summary: [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory (was: [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory) > [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python, R >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12362) [Rust] [DataFusion] topk_query test failure
[ https://issues.apache.org/jira/browse/ARROW-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12362.
--
Resolution: Duplicate

> [Rust] [DataFusion] topk_query test failure
> -------------------------------------------
>
> Key: ARROW-12362
> URL: https://issues.apache.org/jira/browse/ARROW-12362
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Priority: Major
> Fix For: 5.0.0
>
> I'm seeing this locally with latest from master.
> {code:java}
> topk_query stdout
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expected
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12416) [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12416: --- Component/s: R Python > [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: > No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python, R >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12416) [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12416: --- Component/s: Continuous Integration > [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: > No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12415) [CI] [Python] ERROR: Failed building wheel for pygit2
[ https://issues.apache.org/jira/browse/ARROW-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12415: --- Summary: [CI] [Python] ERROR: Failed building wheel for pygit2 (was: [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2) > [CI] [Python] ERROR: Failed building wheel for pygit2 > - > > Key: ARROW-12415 > URL: https://issues.apache.org/jira/browse/ARROW-12415 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Failed to build pygit2 > ERROR: Could not build wheels for pygit2 which use PEP 517 and cannot be > installed directly > This affects Python 3.6 and 3.7, see > * https://cloud.drone.io/ursacomputing/crossbow/458/1/2 > * https://cloud.drone.io/ursacomputing/crossbow/461/1/2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12415) [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2
[ https://issues.apache.org/jira/browse/ARROW-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12415: --- Component/s: Python Continuous Integration > [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2 > --- > > Key: ARROW-12415 > URL: https://issues.apache.org/jira/browse/ARROW-12415 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Failed to build pygit2 > ERROR: Could not build wheels for pygit2 which use PEP 517 and cannot be > installed directly > This affects Python 3.6 and 3.7, see > * https://cloud.drone.io/ursacomputing/crossbow/458/1/2 > * https://cloud.drone.io/ursacomputing/crossbow/461/1/2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12415) [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2
[ https://issues.apache.org/jira/browse/ARROW-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12415: --- Summary: [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2 (was: [Nightly] [Continuous Integration] [Python] [ARM64] ERROR: Failed building wheel for pygit2) > [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2 > --- > > Key: ARROW-12415 > URL: https://issues.apache.org/jira/browse/ARROW-12415 > Project: Apache Arrow > Issue Type: Task >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Failed to build pygit2 > ERROR: Could not build wheels for pygit2 which use PEP 517 and cannot be > installed directly > This affects Python 3.6 and 3.7, see > * https://cloud.drone.io/ursacomputing/crossbow/458/1/2 > * https://cloud.drone.io/ursacomputing/crossbow/461/1/2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
Andy Grove created ARROW-12421:
--
Summary: [Rust] [DataFusion] topk_query test fails in master
Key: ARROW-12421
URL: https://issues.apache.org/jira/browse/ARROW-12421
Project: Apache Arrow
Issue Type: Bug
Components: Rust - DataFusion
Reporter: Andy Grove

{code:java}
Running target/debug/deps/user_defined_plan-6b63acb904117235

running 3 tests
test topk_plan ... ok
test topk_query ... FAILED
test normal_query ... ok

failures:

topk_query stdout
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expected
+-------------+---------+
| customer_id | revenue |
+-------------+---------+
| paul        | 300     |
| jorge       | 200     |
| andy        | 150     |
+-------------+---------+
Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible
[ https://issues.apache.org/jira/browse/ARROW-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323792#comment-17323792 ] Uwe Korn commented on ARROW-12420: -- cc [~bkietz] who wrote the PR that broke it ;) > [C++/Dataset] Reading null columns as dictionary not longer possible > > > Key: ARROW-12420 > URL: https://issues.apache.org/jira/browse/ARROW-12420 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 4.0.0 >Reporter: Uwe Korn >Priority: Major > Fix For: 4.0.0 > > > Reading a dataset with a dictionary column where some of the files don't > contain any data for that column (and thus are typed as null) broke with > https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release > though and thus I would consider this a regression. > This can be reproduced using the following Python snippet: > {code} > import pyarrow as pa > import pyarrow.parquet as pq > import pyarrow.dataset as ds > table = pa.table({"a": [None, None]}) > pq.write_table(table, "test.parquet") > schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))]) > fsds = ds.FileSystemDataset.from_paths( > paths=["test.parquet"], > schema=schema, > format=pa.dataset.ParquetFileFormat(), > filesystem=pa.fs.LocalFileSystem(), > ) > fsds.to_table() > {code} > The exception on master is currently: > {code} > --- > ArrowNotImplementedError Traceback (most recent call last) > in > 6 filesystem=pa.fs.LocalFileSystem(), > 7 ) > > 8 fsds.to_table() > ~/Development/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset.Dataset.to_table() > 456 table : Table instance > 457 """ > --> 458 return self._scanner(**kwargs).to_table() > 459 > 460 def head(self, int num_rows, **kwargs): > ~/Development/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset.Scanner.to_table() >2887 result = self.scanner.ToTable() >2888 > -> 2889 return pyarrow_wrap_table(GetResultValue(result)) >2890 >2891 def take(self, object 
indices): > ~/Development/arrow/python/pyarrow/error.pxi in > pyarrow.lib.pyarrow_internal_check_status() > 139 cdef api int pyarrow_internal_check_status(const CStatus& status) \ > 140 nogil except -1: > --> 141 return check_status(status) > 142 > 143 > ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > 116 raise ArrowKeyError(message) > 117 elif status.IsNotImplemented(): > --> 118 raise ArrowNotImplementedError(message) > 119 elif status.IsTypeError(): > 120 raise ArrowTypeError(message) > ArrowNotImplementedError: Unsupported cast from null to > dictionary (no available cast > function for target type) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible
Uwe Korn created ARROW-12420: Summary: [C++/Dataset] Reading null columns as dictionary not longer possible Key: ARROW-12420 URL: https://issues.apache.org/jira/browse/ARROW-12420 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 4.0.0 Reporter: Uwe Korn Fix For: 4.0.0 Reading a dataset with a dictionary column where some of the files don't contain any data for that column (and thus are typed as null) broke with https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release though and thus I would consider this a regression. This can be reproduced using the following Python snippet: {code} import pyarrow as pa import pyarrow.parquet as pq import pyarrow.dataset as ds table = pa.table({"a": [None, None]}) pq.write_table(table, "test.parquet") schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))]) fsds = ds.FileSystemDataset.from_paths( paths=["test.parquet"], schema=schema, format=pa.dataset.ParquetFileFormat(), filesystem=pa.fs.LocalFileSystem(), ) fsds.to_table() {code} The exception on master is currently: {code} --- ArrowNotImplementedError Traceback (most recent call last) in 6 filesystem=pa.fs.LocalFileSystem(), 7 ) > 8 fsds.to_table() ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table() 456 table : Table instance 457 """ --> 458 return self._scanner(**kwargs).to_table() 459 460 def head(self, int num_rows, **kwargs): ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table() 2887 result = self.scanner.ToTable() 2888 -> 2889 return pyarrow_wrap_table(GetResultValue(result)) 2890 2891 def take(self, object indices): ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status() 139 cdef api int pyarrow_internal_check_status(const CStatus& status) \ 140 nogil except -1: --> 141 return check_status(status) 142 143 ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() 116 raise 
ArrowKeyError(message) 117 elif status.IsNotImplemented(): --> 118 raise ArrowNotImplementedError(message) 119 elif status.IsTypeError(): 120 raise ArrowTypeError(message) ArrowNotImplementedError: Unsupported cast from null to dictionary (no available cast function for target type) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12380) [Rust][Ballista] Add scheduler ui
[ https://issues.apache.org/jira/browse/ARROW-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12380. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 10026 [https://github.com/apache/arrow/pull/10026] > [Rust][Ballista] Add scheduler ui > - > > Key: ARROW-12380 > URL: https://issues.apache.org/jira/browse/ARROW-12380 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Sathis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323776#comment-17323776 ] Alessandro Molina edited comment on ARROW-11780 at 4/16/21, 12:41 PM: -- Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} {code:java} (const std::__1::vector, std::__1::allocator > >) $0 = size=2 { [0] = nullptr { __ptr_ = 0x } [1] = nullptr { __ptr_ = 0x } } {code} The names of the keys are instead correctly propagated {code:java} (const std::__1::vector, std::__1::allocator >, std::__1::allocator, std::__1::allocator > > >) $1 = size=2 { [0] = "foo" [1] = "bar" } {code} was (Author: amol-): Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} {code:java} (const std::__1::vector, std::__1::allocator > >) $0 = size=2 { [0] = nullptr { __ptr_ = 0x } [1] = nullptr { __ptr_ = 0x } } 
{code} > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323776#comment-17323776 ] Alessandro Molina edited comment on ARROW-11780 at 4/16/21, 12:40 PM: -- Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} {code:java} (const std::__1::vector, std::__1::allocator > >) $0 = size=2 { [0] = nullptr { __ptr_ = 0x } [1] = nullptr { __ptr_ = 0x } } {code} was (Author: amol-): Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the 
Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323776#comment-17323776 ] Alessandro Molina commented on ARROW-11780: --- Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12416) [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323753#comment-17323753 ] David Li commented on ARROW-12416: -- This was fixed in [https://github.com/apache/arrow/commit/b5045ed833aaff35e6c8064ac7d908c19a5f48fa] if I'm not mistaken. > [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: > No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12416) [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323753#comment-17323753 ] David Li edited comment on ARROW-12416 at 4/16/21, 11:54 AM: - This was fixed in ARROW-12382 [https://github.com/apache/arrow/commit/b5045ed833aaff35e6c8064ac7d908c19a5f48fa] if I'm not mistaken. was (Author: lidavidm): This was fixed in [https://github.com/apache/arrow/commit/b5045ed833aaff35e6c8064ac7d908c19a5f48fa] if I'm not mistaken. > [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: > No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-11250. --- Fix Version/s: (was: 5.0.0) 3.0.0 Resolution: Fixed This was fixed with a new version of the adlfs library > [Python] Inconsistent behavior calling ds.dataset() > --- > > Key: ARROW-11250 > URL: https://issues.apache.org/jira/browse/ARROW-11250 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > adal 1.2.5 pyh9f0ad1d_0conda-forge > adlfs 0.5.9 pyhd8ed1ab_0conda-forge > apache-airflow1.10.14 pypi_0pypi > azure-common 1.1.24 py_0conda-forge > azure-core1.9.0 pyhd3deb0d_0conda-forge > azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge > azure-identity1.5.0 pyhd8ed1ab_0conda-forge > azure-nspkg 3.0.2 py_0conda-forge > azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge > azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge > fsspec0.8.5 pyhd8ed1ab_0conda-forge > jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge > pandas1.2.0py37ha9443f7_0 > pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge >Reporter: Lance Dacey >Priority: Minor > Labels: azureblob, dataset,, python > Fix For: 3.0.0 > > > In a Jupyter notebook, I have noticed that sometimes I am not able to read a > dataset which certainly exists on Azure Blob. 
> > {code:java} > fs = fsspec.filesystem(protocol="abfs", account_name, account_key) > {code} > > One example of this is reading a dataset in one cell: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code} > > Then in another cell I try to read the same dataset: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(source, (tuple, list)): > 673 if all(_is_path_like(elem) for elem in source): > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _filesystem_dataset(source, schema, filesystem, partitioning, format, > partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) > 426 fs, paths_or_selector = _ensure_multiple_sources(source, > filesystem) > 427 else: > --> 428 fs, paths_or_selector = _ensure_single_source(source, > filesystem) > 429 > 430 options = FileSystemFactoryOptions( > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _ensure_single_source(path, filesystem) > 402 paths_or_selector = [path] > 403 else: > --> 404 raise FileNotFoundError(path) > 405 > 406 return filesystem, paths_or_selector > FileNotFoundError: dev/test-split > {code} > > If I reset the kernel, it works again. 
It also works if I change the path > slightly, like adding a "/" at the end (so basically it just does not work if I > read the same dataset twice): > > {code:java} > ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs) > {code} > > > The other strange behavior I have noticed is that if I read a dataset > inside of my Jupyter notebook, > > {code:java} > %%time > dataset = ds.dataset("dev/test-split", > partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), > flavor="hive"), > filesystem=fs, > exclude_invalid_files=False) > CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code} > > Now, on the exact same server when I try to run the same code against the > same dataset in Airflow it takes over 3 minutes (comparing the timestamps in > my logs between right before I read the dataset, and immediately after the > dataset is available to filter): > {code:java} > [2021-01-14 03:52:04,011] INFO - Reading dev/test-split > [2021-01-14 03:55:17,360] INFO - Processing dat
[jira] [Closed] (ARROW-9682) [Python] Unable to specify the partition style with pq.write_to_dataset
[ https://issues.apache.org/jira/browse/ARROW-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-9682. -- Resolution: Not A Problem This works using ds.write_dataset() > [Python] Unable to specify the partition style with pq.write_to_dataset > --- > > Key: ARROW-9682 > URL: https://issues.apache.org/jira/browse/ARROW-9682 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 1.0.0 > Environment: Ubuntu 18.04 > Python 3.7 >Reporter: Lance Dacey >Priority: Major > Labels: dataset-parquet-write, parquet, parquetWriter > > I am able to import and test DirectoryPartitioning but I am not able to > figure out a way to write a dataset using this feature. It seems like > write_to_dataset defaults to the "hive" style. Is there a way to test this? > {code:java} > from pyarrow.dataset import DirectoryPartitioning > partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), > ("month", pa.int8()), ("day", pa.int8())])) > print(partitioning.parse("/2009/11/3")) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12419) [Java] flatc is not used in mvn
[ https://issues.apache.org/jira/browse/ARROW-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12419: --- Labels: pull-request-available (was: ) > [Java] flatc is not used in mvn > --- > > Key: ARROW-12419 > URL: https://issues.apache.org/jira/browse/ARROW-12419 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 4.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > ARROW-12111 removed the usage of flatc during the build process in mvn. Thus, > it is not necessary to explicitly download flatc for s390x. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12419) [Java] flatc is not used in mvn
Kazuaki Ishizaki created ARROW-12419: Summary: [Java] flatc is not used in mvn Key: ARROW-12419 URL: https://issues.apache.org/jira/browse/ARROW-12419 Project: Apache Arrow Issue Type: Improvement Components: Java Affects Versions: 4.0.0 Reporter: Kazuaki Ishizaki Assignee: Kazuaki Ishizaki ARROW-12111 removed the usage of flatc during the build process in mvn. Thus, it is not necessary to explicitly download flatc for s390x. -- This message was sent by Atlassian Jira (v8.3.4#803005)