[jira] [Commented] (ARROW-10351) [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance
[ https://issues.apache.org/jira/browse/ARROW-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324149#comment-17324149 ] Yibo Cai commented on ARROW-10351: -- Will redo the test on an 8-core desktop. Maybe too many threads (gRPC client, server, compression) are competing for limited cores. > [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously > helps performance > - > > Key: ARROW-10351 > URL: https://issues.apache.org/jira/browse/ARROW-10351 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > > We don't use any asynchronous concepts in the way that Flight is implemented > now, i.e. IPC deconstruction/reconstruction (which may include compression!) > is not performed concurrently with moving FlightData objects through the gRPC > machinery, which may yield suboptimal performance. > It might be better to apply an actor-type approach where a dedicated thread > retrieves and prepares the next raw IPC message (within a Future) while the > current IPC message is being processed -- that way reading/writing to/from > the gRPC stream is not blocked on the IPC code doing its thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
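The actor-type overlap the issue proposes can be sketched as a two-stage pipeline. This is an illustrative Python sketch, not the Flight C++ implementation; `prepare` and `consume` are hypothetical stand-ins for IPC deconstruction/reconstruction and gRPC stream I/O.

```python
# Sketch of overlapping "prepare the next message" with "process the current
# message". A single worker thread prepares message N+1 (the "Future" in the
# issue) while the caller consumes message N, so stream I/O is not blocked
# on the IPC code doing its thing.
from concurrent.futures import ThreadPoolExecutor

def pipeline(messages, prepare, consume):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for msg in messages:
            nxt = pool.submit(prepare, msg)  # prepare next message concurrently
            if pending is not None:
                results.append(consume(pending.result()))
            pending = nxt
        if pending is not None:
            results.append(consume(pending.result()))  # drain the last message
    return results
```

While the caller runs `consume` for one message on its own thread, the worker is already running `prepare` for the next, which is exactly the overlap being asked about.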
[jira] [Updated] (ARROW-12430) [C++] Support LZO compression
[ https://issues.apache.org/jira/browse/ARROW-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haowei Yu updated ARROW-12430: -- Description: I have some code that supports arrow compression with LZO and am willing to contribute. However, I do understand there is a license concern w.r.t using lzo library since it's under GPL2. I am not sure if you can take the change set. (was: I have some code that supports arrow compression with LZO and am willing to contribute. However, I do understand there is a license concern w.r.t using lzo library since it's under GPL2 ) > [C++] Support LZO compression > - > > Key: ARROW-12430 > URL: https://issues.apache.org/jira/browse/ARROW-12430 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Haowei Yu >Priority: Major > > I have some code that supports arrow compression with LZO and am willing to > contribute. However, I do understand there is a license concern w.r.t using > lzo library since it's under GPL2. I am not sure if you can take the change > set. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12430) [C++] Support LZO compression
[ https://issues.apache.org/jira/browse/ARROW-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haowei Yu updated ARROW-12430: -- Description: I have some code that supports arrow compression with LZO and am willing to contribute. However, I do understand there is a license concern w.r.t using lzo library since it's under GPL2 > [C++] Support LZO compression > - > > Key: ARROW-12430 > URL: https://issues.apache.org/jira/browse/ARROW-12430 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Haowei Yu >Priority: Major > > I have some code that supports arrow compression with LZO and am willing to > contribute. However, I do understand there is a license concern w.r.t using > lzo library since it's under GPL2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12430) [C++] Support LZO compression
Haowei Yu created ARROW-12430: - Summary: [C++] Support LZO compression Key: ARROW-12430 URL: https://issues.apache.org/jira/browse/ARROW-12430 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Haowei Yu -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
[ https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12429: --- Labels: pull-request-available (was: ) > [C++] MergedGeneratorTestFixture is incorrectly instantiated > > > Key: ARROW-12429 > URL: https://issues.apache.org/jira/browse/ARROW-12429 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] > Looks like the base class was accidentally instantiated instead of the actual > test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
[ https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324121#comment-17324121 ] David Li commented on ARROW-12429: -- Recent versions of Googletest catch this but Googletest stopped releasing after 1.10, instead declaring that master is always usable. Hence it wasn't caught in CI which generally installed the last available "release". I tried the latest master but there are link issues, presumably something has changed in the intervening couple years. > [C++] MergedGeneratorTestFixture is incorrectly instantiated > > > Key: ARROW-12429 > URL: https://issues.apache.org/jira/browse/ARROW-12429 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > > [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] > Looks like the base class was accidentally instantiated instead of the actual > test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
[ https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-12429: - Fix Version/s: (was: 4.0.0) > [C++] MergedGeneratorTestFixture is incorrectly instantiated > > > Key: ARROW-12429 > URL: https://issues.apache.org/jira/browse/ARROW-12429 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > > [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] > Looks like the base class was accidentally instantiated instead of the actual > test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
[ https://issues.apache.org/jira/browse/ARROW-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-12429: - Fix Version/s: 4.0.0 > [C++] MergedGeneratorTestFixture is incorrectly instantiated > > > Key: ARROW-12429 > URL: https://issues.apache.org/jira/browse/ARROW-12429 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 4.0.0 > > > [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] > Looks like the base class was accidentally instantiated instead of the actual > test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12429) [C++] MergedGeneratorTestFixture is incorrectly instantiated
David Li created ARROW-12429: Summary: [C++] MergedGeneratorTestFixture is incorrectly instantiated Key: ARROW-12429 URL: https://issues.apache.org/jira/browse/ARROW-12429 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: David Li Assignee: David Li [https://gist.github.com/kou/868eaed328b348e45865747044044272#file-source-cpp-txt] Looks like the base class was accidentally instantiated instead of the actual test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12144) [R] wire up exponentiation bindings
[ https://issues.apache.org/jira/browse/ARROW-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12144. -- Resolution: Duplicate > [R] wire up exponentiation bindings > --- > > Key: ARROW-12144 > URL: https://issues.apache.org/jira/browse/ARROW-12144 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > > Once ARROW-11070 is merged, we can remove the R-based workarounds for these. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12372) [R] Developer Docs followups
[ https://issues.apache.org/jira/browse/ARROW-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-12372: -- Assignee: Jonathan Keane > [R] Developer Docs followups > > > Key: ARROW-12372 > URL: https://issues.apache.org/jira/browse/ARROW-12372 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Minor > > To dos to add later: > * Check if {{withr::with_makevars(list(LDFLAGS = ""), > remotes::install_github(...)}} is sufficient instead of > {{withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), > remotes::install_github(...)}} > * add a latest nightly downloader helper function > https://github.com/apache/arrow/pull/9898#discussion_r612891598 > * Add description of docker + crossbow > https://github.com/apache/arrow/pull/9898/files#r613590007 > * Discuss the {{ARROW_PYTHON}} flag: > {{ARROW_PYTHON}} is an alias for: > {code} > set(ARROW_COMPUTE ON) > set(ARROW_CSV ON) > set(ARROW_DATASET ON) > set(ARROW_FILESYSTEM ON) > set(ARROW_HDFS ON) > set(ARROW_JSON ON) > {code} > The only one we don't recommend being on is ARROW_HDFS, should we add that > (at least to the "full" section)? Then builds with the R instructions should > be compatible with python too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12372) [R] Developer Docs followups
[ https://issues.apache.org/jira/browse/ARROW-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12372: --- Description: To dos to add later: * Check if {{withr::with_makevars(list(LDFLAGS = ""), remotes::install_github(...)}} is sufficient instead of {{withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...)}} * add a latest nightly downloader helper function https://github.com/apache/arrow/pull/9898#discussion_r612891598 * Add description of docker + crossbow https://github.com/apache/arrow/pull/9898/files#r613590007 * Discuss the {{ARROW_PYTHON}} flag: {{ARROW_PYTHON}} is an alias for: {code} set(ARROW_COMPUTE ON) set(ARROW_CSV ON) set(ARROW_DATASET ON) set(ARROW_FILESYSTEM ON) set(ARROW_HDFS ON) set(ARROW_JSON ON) {code} The only one we don't recommend being on is ARROW_HDFS, should we add that (at least to the "full" section)? Then builds with the R instructions should be compatible with python too. was: To dos to add later: * Check if {{withr::with_makevars(list(LDFLAGS = ""), remotes::install_github(...)}} is sufficient instead of {{withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...)}} * add a latest nightly downloader helper function https://github.com/apache/arrow/pull/9898#discussion_r612891598 * Add description of docker + crossbow https://github.com/apache/arrow/pull/9898/files#r613590007 > [R] Developer Docs followups > > > Key: ARROW-12372 > URL: https://issues.apache.org/jira/browse/ARROW-12372 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Priority: Minor > > To dos to add later: > * Check if {{withr::with_makevars(list(LDFLAGS = ""), > remotes::install_github(...)}} is sufficient instead of > {{withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), > remotes::install_github(...)}} > * add a latest nightly downloader helper function > 
https://github.com/apache/arrow/pull/9898#discussion_r612891598 > * Add description of docker + crossbow > https://github.com/apache/arrow/pull/9898/files#r613590007 > * Discuss the {{ARROW_PYTHON}} flag: > {{ARROW_PYTHON}} is an alias for: > {code} > set(ARROW_COMPUTE ON) > set(ARROW_CSV ON) > set(ARROW_DATASET ON) > set(ARROW_FILESYSTEM ON) > set(ARROW_HDFS ON) > set(ARROW_JSON ON) > {code} > The only one we don't recommend being on is ARROW_HDFS, should we add that > (at least to the "full" section)? Then builds with the R instructions should > be compatible with python too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324063#comment-17324063 ] David Li commented on ARROW-12428: -- Finally, if we perform column selection, fsspec's readahead is actually extremely detrimental: {noformat} Pandas/S3FS (no pre-buffer): 88.26093492098153 seconds Pandas/S3FS (pre-buffer): 107.76374901900999 seconds PyArrow (no pre-buffer): 55.75352717819624 seconds PyArrow (pre-buffer): 9.941459016874433 seconds {noformat} {code:python} columns = ['vendor_id', 'pickup_latitude', 'pickup_longitude', 'extra'] start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False) duration = time.monotonic() - start print("Pandas/S3FS (no pre-buffer):", duration, "seconds") start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True) duration = time.monotonic() - start print("Pandas/S3FS (pre-buffer):", duration, "seconds") start = time.monotonic() df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=False) duration = time.monotonic() - start print("PyArrow (no pre-buffer):", duration, "seconds") start = time.monotonic() df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", columns=columns, pre_buffer=True) duration = time.monotonic() - start print("PyArrow (pre-buffer):", duration, "seconds") {code} > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. 
The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
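Conceptually, pre-buffering wins on high-latency filesystems because it coalesces the many small column-chunk reads a Parquet file needs into a few large range requests issued up front. The following is an illustrative sketch of that coalescing idea, not Arrow's actual read-cache code; the `hole_size_limit` name and threshold are assumptions for the example.

```python
# Illustrative sketch (not Arrow's implementation) of byte-range coalescing:
# nearby (offset, length) ranges are merged into one large read when the gap
# ("hole") between them is small, trading a little wasted bandwidth for far
# fewer round trips on a high-latency filesystem like S3.
def coalesce_ranges(ranges, hole_size_limit=8192):
    """Merge sorted (offset, length) ranges whose gap is <= hole_size_limit."""
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_off, prev_len = merged[-1]
            if offset - (prev_off + prev_len) <= hole_size_limit:
                # Extend the previous range to cover this one as well.
                new_end = max(prev_off + prev_len, offset + length)
                merged[-1] = (prev_off, new_end - prev_off)
                continue
        merged.append((offset, length))
    return merged
```

On a local NVMe disk the extra buffering neither helps nor hurts much (as the benchmarks above show), because the per-request latency being amortized is already tiny.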
[jira] [Updated] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12428: --- Labels: pull-request-available (was: ) > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324043#comment-17324043 ] David Li edited comment on ARROW-12428 at 4/16/21, 7:41 PM: And for local files, to confirm that pre_buffer isn't a negative: {noformat} Pandas: 14.584974920144305 seconds PyArrow: 6.650648137088865 seconds PyArrow (pre-buffer): 6.587288308190182 seconds {noformat} This is on a system with NVME storage, so results may vary for spinning-rust or SATA SSDs. (Updated results to read once without measuring before taking the measurement, in case disk cache is a factor) was (Author: lidavidm): And for local files, to confirm that pre_buffer isn't a negative: {noformat} Pandas: 14.566267257090658 seconds PyArrow: 6.649410092970356 seconds PyArrow (pre-buffer): 6.627140663098544 seconds {noformat} This is on a system with NVME storage, so results may vary for spinning-rust or SATA SSDs. > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 5.0.0 > > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324043#comment-17324043 ] David Li commented on ARROW-12428: -- And for local files, to confirm that pre_buffer isn't a negative: {noformat} Pandas: 14.566267257090658 seconds PyArrow: 6.649410092970356 seconds PyArrow (pre-buffer): 6.627140663098544 seconds {noformat} This is on a system with NVME storage, so results may vary for spinning-rust or SATA SSDs. > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 5.0.0 > > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
[ https://issues.apache.org/jira/browse/ARROW-12428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324041#comment-17324041 ] David Li commented on ARROW-12428: -- Here's a quick comparison between Pandas/S3FS and PyArrow with a pre_buffer option implemented: {noformat} Python: 3.9.2 Pandas: 1.2.3 PyArrow: 5.0.0 master (9c1e5bd19347635ea9f373bcf93f2cea0231d50a) Pandas/S3FS: 107.31099020410329 seconds Pandas/S3FS (no readahead): 676.9701101030223 seconds PyArrow: 213.81073790509254 seconds PyArrow (pre-buffer): 29.330630503827706 seconds Pandas/S3FS (pre-buffer): 54.61801828909665 seconds Pandas/S3FS (pre-buffer, no readahead): 46.7531590978615 seconds {noformat} {code:python} import time import pandas as pd import pyarrow.parquet as pq start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet") duration = time.monotonic() - start print("Pandas/S3FS:", duration, "seconds") start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={ 'default_block_size': 1, # 0 is ignored 'default_fill_cache': False, }) duration = time.monotonic() - start print("Pandas/S3FS (no readahead):", duration, "seconds") start = time.monotonic() df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet") duration = time.monotonic() - start print("PyArrow:", duration, "seconds") start = time.monotonic() df = pq.read_pandas("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True) duration = time.monotonic() - start print("PyArrow (pre-buffer):", duration, "seconds") start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", pre_buffer=True) duration = time.monotonic() - start print("Pandas/S3FS (pre-buffer):", duration, "seconds") start = time.monotonic() df = pd.read_parquet("s3://ursa-labs-taxi-data/2012/01/data.parquet", storage_options={ 'default_block_size': 1, # 0 is ignored 'default_fill_cache': False, }, 
pre_buffer=True) duration = time.monotonic() - start print("Pandas/S3FS (pre-buffer, no readahead):", duration, "seconds") {code} > [Python] pyarrow.parquet.read_* should use pre_buffer=True > -- > > Key: ARROW-12428 > URL: https://issues.apache.org/jira/browse/ARROW-12428 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Fix For: 5.0.0 > > > If the user is synchronously reading a single file, we should try to read it > as fast as possible. The one sticking point might be whether it's beneficial > to enable this no matter the filesystem or whether we should try to only > enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12392) [C++] Restore asynchronous streaming CSV reader
[ https://issues.apache.org/jira/browse/ARROW-12392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-12392: Summary: [C++] Restore asynchronous streaming CSV reader (was: [C++] Restore asynchronous streaming scanner) > [C++] Restore asynchronous streaming CSV reader > --- > > Key: ARROW-12392 > URL: https://issues.apache.org/jira/browse/ARROW-12392 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > In order to support the AsyncScanner we need the asynchronous streaming CSV > reader back (added in ARROW-11887 but reverted later). However, it will > either need to be implemented as a mirror API (so the sync and async > implementations are side-by-side) or the async-API must be wrapped with > RunInSerialExecutor when called synchronously. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12428) [Python] pyarrow.parquet.read_* should use pre_buffer=True
David Li created ARROW-12428: Summary: [Python] pyarrow.parquet.read_* should use pre_buffer=True Key: ARROW-12428 URL: https://issues.apache.org/jira/browse/ARROW-12428 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: David Li Assignee: David Li Fix For: 5.0.0 If the user is synchronously reading a single file, we should try to read it as fast as possible. The one sticking point might be whether it's beneficial to enable this no matter the filesystem or whether we should try to only enable it on high-latency filesystems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12426) [Rust] Concatenating dictionaries ignores values
[ https://issues.apache.org/jira/browse/ARROW-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12426: --- Labels: pull-request-available (was: ) > [Rust] Concatenating dictionaries ignores values > > > Key: ARROW-12426 > URL: https://issues.apache.org/jira/browse/ARROW-12426 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Concatenating dictionaries ignores the values array, at best leading to > incorrect data, but often leading to keys with indexes beyond the bounds of > the values array -- This message was sent by Atlassian Jira (v8.3.4#803005)
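For reference, a correct concatenation of dictionary-encoded arrays has to merge the values arrays and remap each input's keys into the merged values, so no key can index past the values bounds. A pure-Python sketch of that invariant (not the Rust kernel; arrays are modeled as plain `(keys, values)` pairs):

```python
# Sketch of correct dictionary-array concatenation: merge the values arrays,
# deduplicating entries, and rewrite each input's keys against the merged
# values. Skipping the remap step is exactly the bug described above: keys
# from later inputs end up pointing into the wrong (or out-of-bounds) slots.
def concat_dictionaries(arrays):
    """arrays: list of (keys, values) pairs; returns a merged (keys, values)."""
    merged_values, index_of = [], {}
    out_keys = []
    for keys, values in arrays:
        remap = []
        for v in values:
            if v not in index_of:
                index_of[v] = len(merged_values)
                merged_values.append(v)
            remap.append(index_of[v])
        out_keys.extend(remap[k] for k in keys)
    return out_keys, merged_values
```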
[jira] [Created] (ARROW-12427) [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition;
Andrew Lamb created ARROW-12427: --- Summary: [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition; Key: ARROW-12427 URL: https://issues.apache.org/jira/browse/ARROW-12427 Project: Apache Arrow Issue Type: Improvement Reporter: Andrew Lamb To fix https://issues.apache.org/jira/browse/ARROW-12421 we disabled the physical_optimizer::repartition::Repartition rule in https://github.com/apache/arrow/pull/10069. This ticket tracks finding the root cause of the CI test failure and re-enabling physical_optimizer::repartition::Repartition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12427) [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition;
[ https://issues.apache.org/jira/browse/ARROW-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb updated ARROW-12427: Component/s: Rust - DataFusion > [Rust][DataFusion] Reenable physical_optimizer::repartition::Repartition; > - > > Key: ARROW-12427 > URL: https://issues.apache.org/jira/browse/ARROW-12427 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andrew Lamb >Priority: Major > > To fix https://issues.apache.org/jira/browse/ARROW-12421 we disabled the > physical_optimizer::repartition::Repartition rule in > https://github.com/apache/arrow/pull/10069. This ticket tracks finding the > root cause of the CI test failure and re-enabling > physical_optimizer::repartition::Repartition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb reassigned ARROW-12421: --- Assignee: Andy Grove > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
>
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures:
>
> ---- topk_query stdout ----
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expected:
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Lamb resolved ARROW-12421. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 10069 [https://github.com/apache/arrow/pull/10069] > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
>
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures:
>
> ---- topk_query stdout ----
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expected:
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12426) [Rust] Concatenating dictionaries ignores values
Raphael Taylor-Davies created ARROW-12426: - Summary: [Rust] Concatenating dictionaries ignores values Key: ARROW-12426 URL: https://issues.apache.org/jira/browse/ARROW-12426 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies Concatenating dictionaries ignores the values array, at best leading to incorrect data, but often leading to keys with indexes beyond the bounds of the values array -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12425) [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays
[ https://issues.apache.org/jira/browse/ARROW-12425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12425: --- Labels: pull-request-available (was: ) > [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays > > > Key: ARROW-12425 > URL: https://issues.apache.org/jira/browse/ARROW-12425 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12425) [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays
Raphael Taylor-Davies created ARROW-12425: - Summary: [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays Key: ARROW-12425 URL: https://issues.apache.org/jira/browse/ARROW-12425 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Deleted] (ARROW-12418) 1Z0-1072 PDF - Become Oracle Certified With The Help Of Prepare4test
[ https://issues.apache.org/jira/browse/ARROW-12418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson deleted ARROW-12418: > 1Z0-1072 PDF - Become Oracle Certified With The Help Of Prepare4test > > > Key: ARROW-12418 > URL: https://issues.apache.org/jira/browse/ARROW-12418 > Project: Apache Arrow > Issue Type: Task >Reporter: Andrew Sharon >Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12424) Add Schema Package
[ https://issues.apache.org/jira/browse/ARROW-12424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12424: --- Labels: pull-request-available (was: ) > Add Schema Package > -- > > Key: ARROW-12424 > URL: https://issues.apache.org/jira/browse/ARROW-12424 > Project: Apache Arrow > Issue Type: Sub-task > Components: Go, Parquet >Reporter: Matt Topol >Assignee: Matt Topol >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Adding the ported code for the Schema module for Go Parquet library. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12231) [C++][Dataset] Separate datasets backed by readers from InMemoryDataset
[ https://issues.apache.org/jira/browse/ARROW-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12231: --- Labels: dataset datasets pull-request-available (was: dataset datasets) > [C++][Dataset] Separate datasets backed by readers from InMemoryDataset > --- > > Key: ARROW-12231 > URL: https://issues.apache.org/jira/browse/ARROW-12231 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 4.0.0 >Reporter: Weston Pace >Assignee: David Li >Priority: Major > Labels: dataset, datasets, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > From ARROW-10882/[https://github.com/apache/arrow/pull/9802] > * Backing an InMemoryDataset with a reader is misleading. Let's split that > out into a separate class. > * Dataset scanning can then use an I/O thread for the new class. (Note that > for Python, we'll need to be careful to release the GIL before any operations > so that the I/O thread can acquire the GIL to call into the underlying Python > reader/file object.) > * Longer-term, we should interface with Python's async. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12424) Add Schema Package
Matt Topol created ARROW-12424: -- Summary: Add Schema Package Key: ARROW-12424 URL: https://issues.apache.org/jira/browse/ARROW-12424 Project: Apache Arrow Issue Type: Sub-task Components: Go, Parquet Reporter: Matt Topol Assignee: Matt Topol Adding the ported code for the Schema module for the Go Parquet library. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12423) Codecov badge in main Readme only applies to Rust
Dominik Moritz created ARROW-12423: -- Summary: Codecov badge in main Readme only applies to Rust Key: ARROW-12423 URL: https://issues.apache.org/jira/browse/ARROW-12423 Project: Apache Arrow Issue Type: Task Reporter: Dominik Moritz The badge in https://github.com/apache/arrow/blob/master/README.md links to https://app.codecov.io/gh/apache/arrow, which seems to only show the coverage for the Rust code. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12114) [C++] Dataset to table filter expression API change
[ https://issues.apache.org/jira/browse/ARROW-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman closed ARROW-12114. Resolution: Not A Problem > [C++] Dataset to table filter expression API change > --- > > Key: ARROW-12114 > URL: https://issues.apache.org/jira/browse/ARROW-12114 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Diana Clarke >Assignee: Ben Kietzman >Priority: Major > > Ben: > Can you please confirm that we're aware and okay with the following API > change? Thanks! > {code} > import pyarrow.dataset > path_prefix = "ursa-labs-taxi-data-repartitioned-10k/" > paths = [ > > f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet" > for year in range(2009, 2020) > for month in range(1, 13) > for part in range(101) > if not (year == 2019 and month > 6) # Data ends in 2019/06 > and not (year == 2010 and month == 3) # Data is missing in 2010/03 > ] > partitioning = pyarrow.dataset.DirectoryPartitioning.discover( > field_names=["year", "month", "part"], > infer_dictionary=True, > ) > s3 = pyarrow.fs.S3FileSystem(region="us-east-2") > dataset = pyarrow.dataset.dataset( > paths, > format="parquet", > filesystem=s3, > partitioning=partitioning, > partition_base_dir=path_prefix, > ) > year = pyarrow.dataset.field("year") > month = pyarrow.dataset.field("month") > part = pyarrow.dataset.field("part") > filter_expr = (year == "2011") & (month == 1) & (part == 2) > dataset.to_table(filter=filter_expr) > {code} > In arrow 3.0, the above code executes without error. > On head[1], {{year == "2011"}}, which should be {{year == 2011}} (no quotes), > raises the following exception. > {code} > pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching > input types (array[int32], scalar[string]) > {code} > This API change appears to have been introduced in ARROW-8919. Perhaps it was > intentional, just figured we should double check. Thanks again! 
> [1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12114) [C++] Dataset to table filter expression API change
[ https://issues.apache.org/jira/browse/ARROW-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323884#comment-17323884 ] Ben Kietzman commented on ARROW-12114: -- I'll close this for now, then. Thanks all > [C++] Dataset to table filter expression API change > --- > > Key: ARROW-12114 > URL: https://issues.apache.org/jira/browse/ARROW-12114 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Diana Clarke >Assignee: Ben Kietzman >Priority: Major > > Ben: > Can you please confirm that we're aware and okay with the following API > change? Thanks! > {code} > import pyarrow.dataset > path_prefix = "ursa-labs-taxi-data-repartitioned-10k/" > paths = [ > > f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet" > for year in range(2009, 2020) > for month in range(1, 13) > for part in range(101) > if not (year == 2019 and month > 6) # Data ends in 2019/06 > and not (year == 2010 and month == 3) # Data is missing in 2010/03 > ] > partitioning = pyarrow.dataset.DirectoryPartitioning.discover( > field_names=["year", "month", "part"], > infer_dictionary=True, > ) > s3 = pyarrow.fs.S3FileSystem(region="us-east-2") > dataset = pyarrow.dataset.dataset( > paths, > format="parquet", > filesystem=s3, > partitioning=partitioning, > partition_base_dir=path_prefix, > ) > year = pyarrow.dataset.field("year") > month = pyarrow.dataset.field("month") > part = pyarrow.dataset.field("part") > filter_expr = (year == "2011") & (month == 1) & (part == 2) > dataset.to_table(filter=filter_expr) > {code} > In arrow 3.0, the above code executes without error. > On head[1], {{year == "2011"}}, which should be {{year == 2011}} (no quotes), > raises the following exception. > {code} > pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching > input types (array[int32], scalar[string]) > {code} > This API change appears to have been introduced in ARROW-8919. 
Perhaps it was > intentional, just figured we should double check. Thanks again! > [1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323867#comment-17323867 ] Jorge Leitão edited comment on ARROW-12421 at 4/16/21, 2:40 PM: I can’t reproduce this in my small ubuntu vm, even if I use `.with_concurrency(50)` on the test. So, it seems it needs physical units to reproduce. was (Author: jorgecarleitao): I can’t reproduce this in my small vm, even if I use `.with_concurrency(50)` on the test. So, it seems it needs physical units to reproduce. > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {code:java} > Running target/debug/deps/user_defined_plan-6b63acb904117235running 3 > tests > test topk_plan ... ok > test topk_query ... FAILED > test normal_query ... okfailures: topk_query stdout > thread 'topk_query' panicked at 'assertion failed: `(left == right)` > left: `["+-+-+", "| customer_id | revenue |", > "+-+-+", "| paul| 300 |", "| jorge | > 200 |", "| andy| 150 |", "+-+-+"]`, > right: `["++", "||", "++", "++"]`: output mismatch for Topk context. > Expectedn > +-+-+ > | customer_id | revenue | > +-+-+ > | paul| 300 | > | jorge | 200 | > | andy| 150 | > +-+-+Actual: > ++ > || > ++ > ++ > ', datafusion/tests/user_defined_plan.rs:133:5 > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323867#comment-17323867 ] Jorge Leitão commented on ARROW-12421: -- I can’t reproduce this in my small vm, even if I use `.with_concurrency(50)` on the test. So, it seems it needs physical units to reproduce. > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures: topk_query stdout
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context.
> Expected:
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12421: --- Labels: pull-request-available (was: ) > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {code:java} > Running target/debug/deps/user_defined_plan-6b63acb904117235running 3 > tests > test topk_plan ... ok > test topk_query ... FAILED > test normal_query ... okfailures: topk_query stdout > thread 'topk_query' panicked at 'assertion failed: `(left == right)` > left: `["+-+-+", "| customer_id | revenue |", > "+-+-+", "| paul| 300 |", "| jorge | > 200 |", "| andy| 150 |", "+-+-+"]`, > right: `["++", "||", "++", "++"]`: output mismatch for Topk context. > Expectedn > +-+-+ > | customer_id | revenue | > +-+-+ > | paul| 300 | > | jorge | 200 | > | andy| 150 | > +-+-+Actual: > ++ > || > ++ > ++ > ', datafusion/tests/user_defined_plan.rs:133:5 > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10351) [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance
[ https://issues.apache.org/jira/browse/ARROW-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323855#comment-17323855 ] David Li commented on ARROW-10351: -- Hmm, I was unable to replicate the results here. I checked out your branch and current master branch. I'm running on an Intel Comet Lake laptop with 8 physical cores. Current master: {noformat} > env OMP_NUM_THREADS=4 ./release/arrow-flight-benchmark -test_put > -num_perf_runs=4 -num_streams=4 -num_threads=1 Using spawned TCP server Server running with pid 5988 Server host: localhost Server port: 31337 Testing method: DoPut Server host: localhost Server port: 31337 Number of perf runs: 4 Number of concurrent gets/puts: 1 Batch size: 131040 Batches written: 39072 Bytes written: 512000 Nanos: 2655271083 Speed: 1838.91 MB/s Throughput: 14714.9 batches/s Latency mean: 65 us Latency quantile=0.5: 65 us Latency quantile=0.95: 75 us Latency quantile=0.99: 82 us Latency max: 941 us {noformat} This branch: {noformat} > env OMP_NUM_THREADS=4 ./release/arrow-flight-benchmark -test_put > -num_perf_runs=4 -num_streams=1 -num_threads=1 Using spawned TCP server Server running with pid 5921 Server host: localhost Server port: 31337 Testing method: DoPut Server host: localhost Server port: 31337 Number of perf runs: 4 Number of concurrent gets/puts: 1 Batch size: 131040 Batches written: 9768 Bytes written: 128000 Nanos: 686687591 Speed: 1777.67 MB/s Throughput: 14224.8 batches/s Latency mean: 67 us Latency quantile=0.5: 67 us Latency quantile=0.95: 76 us Latency quantile=0.99: 92 us Latency max: 958 us {noformat} > [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously > helps performance > - > > Key: ARROW-10351 > URL: https://issues.apache.org/jira/browse/ARROW-10351 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: Wes McKinney >Priority: Major > > We don't use any asynchronous concepts in the way that Flight is 
implemented > now, i.e. IPC deconstruction/reconstruction (which may include compression!) > is not performed concurrent with moving FlightData objects through the gRPC > machinery, which may yield suboptimal performance. > It might be better to apply an actor-type approach where a dedicated thread > retrieves and prepares the next raw IPC message (within a Future) while the > current IPC message is being processed -- that way reading/writing to/from > the gRPC stream is not blocked on the IPC code doing its thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
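The actor-style overlap proposed in ARROW-10351 can be sketched generically. This is a hedged illustration of the pipelining pattern only, not Flight/gRPC code: the names `prepare` and `pipelined` are hypothetical, and `prepare` merely stands in for IPC deconstruction (and possible decompression) of one message. A background worker prefetches and prepares message i+1 while message i is being consumed.

```python
from concurrent.futures import ThreadPoolExecutor

def prepare(raw):
    # Stand-in for preparing one raw IPC message (decode/decompress).
    return raw * 2

def pipelined(messages):
    """Overlap preparing message i+1 with processing message i."""
    if not messages:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(prepare, messages[0])   # prefetch the first message
        for nxt in messages[1:]:
            current = future.result()                # wait for the prepared message
            future = pool.submit(prepare, nxt)       # kick off the next one early
            results.append(current)                  # "process" the current message
        results.append(future.result())
    return results
```

The key property is that the stream-facing side never blocks on `prepare` for the message it is about to hand over, because that work was started one iteration earlier.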
[jira] [Updated] (ARROW-12422) Add castVARCHAR for milliseconds
[ https://issues.apache.org/jira/browse/ARROW-12422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12422: --- Labels: pull-request-available (was: ) > Add castVARCHAR for milliseconds > > > Key: ARROW-12422 > URL: https://issues.apache.org/jira/browse/ARROW-12422 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Rodrigo Jacomozzi de Bem >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12422) Add castVARCHAR for milliseconds
Rodrigo Jacomozzi de Bem created ARROW-12422: Summary: Add castVARCHAR for milliseconds Key: ARROW-12422 URL: https://issues.apache.org/jira/browse/ARROW-12422 Project: Apache Arrow Issue Type: New Feature Reporter: Rodrigo Jacomozzi de Bem -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12416) [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li closed ARROW-12416. Resolution: Duplicate Last night's builds passed + the linked log is from a couple days ago, so I think we're ok here. > [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python, R >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323820#comment-17323820 ] Alessandro Molina edited comment on ARROW-11780 at 4/16/21, 1:38 PM: - The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}, thus the {{pyarrow_unwrap_array}} doesn't deal with them. {code:python} [[1, 2, 3]] {code} I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {{pyarrow_unwrap_array}} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug. was (Author: amol-): The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}, thus the{{ pyarrow_unwrap_array}} doesn't deal with them. 
{code:python} [[1, 2, 3]] I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {code} {{pyarrow_unwrap_array}} {code:python} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug.{code} > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323820#comment-17323820 ] Alessandro Molina edited comment on ARROW-11780 at 4/16/21, 1:37 PM: - The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}, thus the{{ pyarrow_unwrap_array}} doesn't deal with them. {code:python} [[1, 2, 3]] I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {code} {{pyarrow_unwrap_array}} {code:python} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug.{code} was (Author: amol-): The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}s, thus the {{pyarrow_unwrap_array}} doesn't deal with them. {code:python} [[1, 2, 3]] {code} I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {{pyarrow_unwrap_array}} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug. 
> [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323820#comment-17323820 ] Alessandro Molina commented on ARROW-11780: --- The issue seems to origin from {{pyarrow_unwrap_array}} not recognising the two values as arrays. An empty shared_ptr is returned when unwrapping something that is not an array, see [https://github.com/apache/arrow/pull/827/files#diff-fd3f36df959d5f57664e7c4ca21a59515d5649679e188e77a220af490ab2b601R126-R132] The two arrays are in fact {{ChunkedArray}}s, thus the {{pyarrow_unwrap_array}} doesn't deal with them. {code:python} [[1, 2, 3]] {code} I'm not sure on the impact over the rest of the codebase, but it seems it would be more robust to have {{pyarrow_unwrap_array}} (and similar methods) throwing an exception when they face unsupported types instead of returning NULL values that might trigger an action at a distance that is then hard to debug. > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
[ https://issues.apache.org/jira/browse/ARROW-12421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323818#comment-17323818 ] Andy Grove commented on ARROW-12421: This failure happens consistently on my 24 core Threadripper desktop running Ubuntu but I cannot reproduce it on my MacBook Pro or on my work PC (6 cores, also Ubuntu). > [Rust] [DataFusion] topk_query test fails in master > --- > > Key: ARROW-12421 > URL: https://issues.apache.org/jira/browse/ARROW-12421 > Project: Apache Arrow > Issue Type: Bug > Components: Rust - DataFusion >Reporter: Andy Grove >Priority: Major > > {code:java}
> Running target/debug/deps/user_defined_plan-6b63acb904117235
> running 3 tests
> test topk_plan ... ok
> test topk_query ... FAILED
> test normal_query ... ok
>
> failures: topk_query stdout
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context.
> Expected:
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12417) [CI] [Python] ERROR: ninja: build stopped: subcommand failed
[ https://issues.apache.org/jira/browse/ARROW-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12417: --- Summary: [CI] [Python] ERROR: ninja: build stopped: subcommand failed (was: [Nightly] [Continuous Integration] [Python] [CUDA] ERROR: ninja: build stopped: subcommand failed) > [CI] [Python] ERROR: ninja: build stopped: subcommand failed > > > Key: ARROW-12417 > URL: https://issues.apache.org/jira/browse/ARROW-12417 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Building unity_0 fails. > This affects Python 3.6 and 3.7, see > * > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3579&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1509 > * > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3576&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1383 > This task has failed 33 times across 209 pipeline runs in the last 14 days > (https://dev.azure.com/ursacomputing/crossbow/_pipeline/analytics/stageawareoutcome?definitionId=1) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12417) [Nightly] [Continuous Integration] [Python] [CUDA] ERROR: ninja: build stopped: subcommand failed
[ https://issues.apache.org/jira/browse/ARROW-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12417: --- Component/s: Python Continuous Integration > [Nightly] [Continuous Integration] [Python] [CUDA] ERROR: ninja: build > stopped: subcommand failed > - > > Key: ARROW-12417 > URL: https://issues.apache.org/jira/browse/ARROW-12417 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Building unity_0 fails. > This affects Python 3.6 and 3.7, see > * > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3579&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1509 > * > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3576&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1383 > This task has failed 33 times across 209 pipeline runs in the last 14 days > (https://dev.azure.com/ursacomputing/crossbow/_pipeline/analytics/stageawareoutcome?definitionId=1) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12416) [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323794#comment-17323794 ] Jonathan Keane commented on ARROW-12416: Did this fail again today? If not (since you've identified a fix) I would say let's close it as a duplicate of ARROW-12383 > [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python, R >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12416) [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12416: --- Summary: [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory (was: [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory) > [CI] [Python] [R] ERROR: xsimd/xsimd.hpp: No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python, R >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12362) [Rust] [DataFusion] topk_query test failure
[ https://issues.apache.org/jira/browse/ARROW-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove closed ARROW-12362.
--
Resolution: Duplicate

> [Rust] [DataFusion] topk_query test failure
> -------------------------------------------
>
> Key: ARROW-12362
> URL: https://issues.apache.org/jira/browse/ARROW-12362
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Priority: Major
> Fix For: 5.0.0
>
> I'm seeing this locally with latest from master.
> {code:java}
> topk_query stdout
> thread 'topk_query' panicked at 'assertion failed: `(left == right)`
>   left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
>  right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expected
> +-------------+---------+
> | customer_id | revenue |
> +-------------+---------+
> | paul        | 300     |
> | jorge       | 200     |
> | andy        | 150     |
> +-------------+---------+
> Actual:
> ++
> ||
> ++
> ++
> ', datafusion/tests/user_defined_plan.rs:133:5
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12416) [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12416: --- Component/s: R Python > [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: > No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python, R >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12416) [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12416: --- Component/s: Continuous Integration > [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: > No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12415) [CI] [Python] ERROR: Failed building wheel for pygit2
[ https://issues.apache.org/jira/browse/ARROW-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12415: --- Summary: [CI] [Python] ERROR: Failed building wheel for pygit2 (was: [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2) > [CI] [Python] ERROR: Failed building wheel for pygit2 > - > > Key: ARROW-12415 > URL: https://issues.apache.org/jira/browse/ARROW-12415 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Failed to build pygit2 > ERROR: Could not build wheels for pygit2 which use PEP 517 and cannot be > installed directly > This affects Python 3.6 and 3.7, see > * https://cloud.drone.io/ursacomputing/crossbow/458/1/2 > * https://cloud.drone.io/ursacomputing/crossbow/461/1/2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12415) [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2
[ https://issues.apache.org/jira/browse/ARROW-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12415: --- Component/s: Python Continuous Integration > [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2 > --- > > Key: ARROW-12415 > URL: https://issues.apache.org/jira/browse/ARROW-12415 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Failed to build pygit2 > ERROR: Could not build wheels for pygit2 which use PEP 517 and cannot be > installed directly > This affects Python 3.6 and 3.7, see > * https://cloud.drone.io/ursacomputing/crossbow/458/1/2 > * https://cloud.drone.io/ursacomputing/crossbow/461/1/2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12415) [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2
[ https://issues.apache.org/jira/browse/ARROW-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12415: --- Summary: [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2 (was: [Nightly] [Continuous Integration] [Python] [ARM64] ERROR: Failed building wheel for pygit2) > [Nightly] [CI] [Python] [ARM64] ERROR: Failed building wheel for pygit2 > --- > > Key: ARROW-12415 > URL: https://issues.apache.org/jira/browse/ARROW-12415 > Project: Apache Arrow > Issue Type: Task >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > Failed to build pygit2 > ERROR: Could not build wheels for pygit2 which use PEP 517 and cannot be > installed directly > This affects Python 3.6 and 3.7, see > * https://cloud.drone.io/ursacomputing/crossbow/458/1/2 > * https://cloud.drone.io/ursacomputing/crossbow/461/1/2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12421) [Rust] [DataFusion] topk_query test fails in master
Andy Grove created ARROW-12421:
--
Summary: [Rust] [DataFusion] topk_query test fails in master
Key: ARROW-12421
URL: https://issues.apache.org/jira/browse/ARROW-12421
Project: Apache Arrow
Issue Type: Bug
Components: Rust - DataFusion
Reporter: Andy Grove

{code:java}
Running target/debug/deps/user_defined_plan-6b63acb904117235

running 3 tests
test topk_plan ... ok
test topk_query ... FAILED
test normal_query ... ok

failures:

topk_query stdout
thread 'topk_query' panicked at 'assertion failed: `(left == right)`
  left: `["+-------------+---------+", "| customer_id | revenue |", "+-------------+---------+", "| paul        | 300     |", "| jorge       | 200     |", "| andy        | 150     |", "+-------------+---------+"]`,
 right: `["++", "||", "++", "++"]`: output mismatch for Topk context. Expected
+-------------+---------+
| customer_id | revenue |
+-------------+---------+
| paul        | 300     |
| jorge       | 200     |
| andy        | 150     |
+-------------+---------+
Actual:
++
||
++
++
', datafusion/tests/user_defined_plan.rs:133:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible
[ https://issues.apache.org/jira/browse/ARROW-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323792#comment-17323792 ] Uwe Korn commented on ARROW-12420: -- cc [~bkietz] who wrote the PR that broke it ;) > [C++/Dataset] Reading null columns as dictionary not longer possible > > > Key: ARROW-12420 > URL: https://issues.apache.org/jira/browse/ARROW-12420 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 4.0.0 >Reporter: Uwe Korn >Priority: Major > Fix For: 4.0.0 > > > Reading a dataset with a dictionary column where some of the files don't > contain any data for that column (and thus are typed as null) broke with > https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release > though and thus I would consider this a regression. > This can be reproduced using the following Python snippet: > {code} > import pyarrow as pa > import pyarrow.parquet as pq > import pyarrow.dataset as ds > table = pa.table({"a": [None, None]}) > pq.write_table(table, "test.parquet") > schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))]) > fsds = ds.FileSystemDataset.from_paths( > paths=["test.parquet"], > schema=schema, > format=pa.dataset.ParquetFileFormat(), > filesystem=pa.fs.LocalFileSystem(), > ) > fsds.to_table() > {code} > The exception on master is currently: > {code} > --- > ArrowNotImplementedError Traceback (most recent call last) > in > 6 filesystem=pa.fs.LocalFileSystem(), > 7 ) > > 8 fsds.to_table() > ~/Development/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset.Dataset.to_table() > 456 table : Table instance > 457 """ > --> 458 return self._scanner(**kwargs).to_table() > 459 > 460 def head(self, int num_rows, **kwargs): > ~/Development/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset.Scanner.to_table() >2887 result = self.scanner.ToTable() >2888 > -> 2889 return pyarrow_wrap_table(GetResultValue(result)) >2890 >2891 def take(self, object 
indices): > ~/Development/arrow/python/pyarrow/error.pxi in > pyarrow.lib.pyarrow_internal_check_status() > 139 cdef api int pyarrow_internal_check_status(const CStatus& status) \ > 140 nogil except -1: > --> 141 return check_status(status) > 142 > 143 > ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > 116 raise ArrowKeyError(message) > 117 elif status.IsNotImplemented(): > --> 118 raise ArrowNotImplementedError(message) > 119 elif status.IsTypeError(): > 120 raise ArrowTypeError(message) > ArrowNotImplementedError: Unsupported cast from null to > dictionary (no available cast > function for target type) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible
Uwe Korn created ARROW-12420: Summary: [C++/Dataset] Reading null columns as dictionary not longer possible Key: ARROW-12420 URL: https://issues.apache.org/jira/browse/ARROW-12420 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 4.0.0 Reporter: Uwe Korn Fix For: 4.0.0 Reading a dataset with a dictionary column where some of the files don't contain any data for that column (and thus are typed as null) broke with https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release though and thus I would consider this a regression. This can be reproduced using the following Python snippet: {code} import pyarrow as pa import pyarrow.parquet as pq import pyarrow.dataset as ds table = pa.table({"a": [None, None]}) pq.write_table(table, "test.parquet") schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))]) fsds = ds.FileSystemDataset.from_paths( paths=["test.parquet"], schema=schema, format=pa.dataset.ParquetFileFormat(), filesystem=pa.fs.LocalFileSystem(), ) fsds.to_table() {code} The exception on master is currently: {code} --- ArrowNotImplementedError Traceback (most recent call last) in 6 filesystem=pa.fs.LocalFileSystem(), 7 ) > 8 fsds.to_table() ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table() 456 table : Table instance 457 """ --> 458 return self._scanner(**kwargs).to_table() 459 460 def head(self, int num_rows, **kwargs): ~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table() 2887 result = self.scanner.ToTable() 2888 -> 2889 return pyarrow_wrap_table(GetResultValue(result)) 2890 2891 def take(self, object indices): ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status() 139 cdef api int pyarrow_internal_check_status(const CStatus& status) \ 140 nogil except -1: --> 141 return check_status(status) 142 143 ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() 116 raise 
ArrowKeyError(message) 117 elif status.IsNotImplemented(): --> 118 raise ArrowNotImplementedError(message) 119 elif status.IsTypeError(): 120 raise ArrowTypeError(message) ArrowNotImplementedError: Unsupported cast from null to dictionary (no available cast function for target type) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-12380) [Rust][Ballista] Add scheduler ui
[ https://issues.apache.org/jira/browse/ARROW-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-12380. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 10026 [https://github.com/apache/arrow/pull/10026] > [Rust][Ballista] Add scheduler ui > - > > Key: ARROW-12380 > URL: https://issues.apache.org/jira/browse/ARROW-12380 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust - Ballista >Reporter: Sathis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323776#comment-17323776 ] Alessandro Molina edited comment on ARROW-11780 at 4/16/21, 12:41 PM: -- Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} {code:java} (const std::__1::vector, std::__1::allocator > >) $0 = size=2 { [0] = nullptr { __ptr_ = 0x } [1] = nullptr { __ptr_ = 0x } } {code} The names of the keys are instead correctly propagated {code:java} (const std::__1::vector, std::__1::allocator >, std::__1::allocator, std::__1::allocator > > >) $1 = size=2 { [0] = "foo" [1] = "bar" } {code} was (Author: amol-): Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} {code:java} (const std::__1::vector, std::__1::allocator > >) $0 = size=2 { [0] = nullptr { __ptr_ = 0x } [1] = nullptr { __ptr_ = 0x } } 
{code} > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323776#comment-17323776 ] Alessandro Molina edited comment on ARROW-11780 at 4/16/21, 12:40 PM: -- Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} {code:java} (const std::__1::vector, std::__1::allocator > >) $0 = size=2 { [0] = nullptr { __ptr_ = 0x } [1] = nullptr { __ptr_ = 0x } } {code} was (Author: amol-): Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the 
Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11780) [C++][Python] StructArray.from_arrays() crashes Python interpreter
[ https://issues.apache.org/jira/browse/ARROW-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323776#comment-17323776 ] Alessandro Molina commented on ARROW-11780: --- Confirmed I was able to reproduce the issue and saw the same behaviour {code:java} * frame #0: 0x00010534d03c libarrow.400.dylib`std::__1::shared_ptr::operator->(this=0x0008) const at memory:3930:56 frame #1: 0x00010534fa27 libarrow.400.dylib`arrow::Array::type(this=0x) const at array_base.h:86:51 frame #2: 0x000105416bea libarrow.400.dylib`arrow::StructArray::Make(children=size=2, field_names=size=2, null_bitmap=nullptr, null_count=-1, offset=0) at array_nested.cc:501:61 frame #3: 0x00010510ed09 lib.cpython-39-darwin.so`__pyx_pf_7pyarrow_3lib_11StructArray_4from_arrays(_object*, _object*, _object*) + 6057 {code} > [C++][Python] StructArray.from_arrays() crashes Python interpreter > -- > > Key: ARROW-11780 > URL: https://issues.apache.org/jira/browse/ARROW-11780 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 3.0.0 >Reporter: ARF >Assignee: Weston Pace >Priority: Major > > {{StructArray.from_arrays()}} crashes the Python interpreter without error > message: > {code:none} > (test_pyarrow) Z:\test_pyarrow>python > Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: > Anaconda, Inc. on win32 > Type "help", "copyright", "credits" or "license" for more information. > >>> import pyarrow as pa > >>> > >>> table = pa.Table.from_pydict({ > ... 'foo': pa.array([1, 2, 3]), > ... 'bar': pa.array([4, 5, 6]) > ... }) > >>> > >>> pa.StructArray.from_arrays([table['foo'], table['bar']], ['foo', 'bar']) > (test_pyarrow) Z:\test_pyarrow> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12416) [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323753#comment-17323753 ] David Li commented on ARROW-12416: -- This was fixed in [https://github.com/apache/arrow/commit/b5045ed833aaff35e6c8064ac7d908c19a5f48fa] if I'm not mistaken. > [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: > No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-12416) [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: No such file or directory
[ https://issues.apache.org/jira/browse/ARROW-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323753#comment-17323753 ] David Li edited comment on ARROW-12416 at 4/16/21, 11:54 AM: - This was fixed in ARROW-12382 [https://github.com/apache/arrow/commit/b5045ed833aaff35e6c8064ac7d908c19a5f48fa] if I'm not mistaken. was (Author: lidavidm): This was fixed in [https://github.com/apache/arrow/commit/b5045ed833aaff35e6c8064ac7d908c19a5f48fa] if I'm not mistaken. > [Nightly] [Continuous Integration] [Python] [R] [CPU] ERROR: xsimd/xsimd.hpp: > No such file or directory > --- > > Key: ARROW-12416 > URL: https://issues.apache.org/jira/browse/ARROW-12416 > Project: Apache Arrow > Issue Type: Task >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Major > > building bpacking_avx2 fails > see > https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=3582&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=1361 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-11250. --- Fix Version/s: (was: 5.0.0) 3.0.0 Resolution: Fixed This was fixed with a new version of the adlfs library > [Python] Inconsistent behavior calling ds.dataset() > --- > > Key: ARROW-11250 > URL: https://issues.apache.org/jira/browse/ARROW-11250 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > adal 1.2.5 pyh9f0ad1d_0conda-forge > adlfs 0.5.9 pyhd8ed1ab_0conda-forge > apache-airflow1.10.14 pypi_0pypi > azure-common 1.1.24 py_0conda-forge > azure-core1.9.0 pyhd3deb0d_0conda-forge > azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge > azure-identity1.5.0 pyhd8ed1ab_0conda-forge > azure-nspkg 3.0.2 py_0conda-forge > azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge > azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge > fsspec0.8.5 pyhd8ed1ab_0conda-forge > jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge > pandas1.2.0py37ha9443f7_0 > pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge >Reporter: Lance Dacey >Priority: Minor > Labels: azureblob, dataset,, python > Fix For: 3.0.0 > > > In a Jupyter notebook, I have noticed that sometimes I am not able to read a > dataset which certainly exists on Azure Blob. 
> > {code:java} > fs = fsspec.filesystem(protocol="abfs", account_name, account_key) > {code} > > One example of this is reading a dataset in one cell: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code} > > Then in another cell I try to read the same dataset: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(source, (tuple, list)): > 673 if all(_is_path_like(elem) for elem in source): > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _filesystem_dataset(source, schema, filesystem, partitioning, format, > partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) > 426 fs, paths_or_selector = _ensure_multiple_sources(source, > filesystem) > 427 else: > --> 428 fs, paths_or_selector = _ensure_single_source(source, > filesystem) > 429 > 430 options = FileSystemFactoryOptions( > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _ensure_single_source(path, filesystem) > 402 paths_or_selector = [path] > 403 else: > --> 404 raise FileNotFoundError(path) > 405 > 406 return filesystem, paths_or_selector > FileNotFoundError: dev/test-split > {code} > > If I reset the kernel, it works again. 
It also works if I change the path > slightly, like adding a "/" at the end (so basically it just does not work if I > read the same dataset twice): > > {code:java} > ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs) > {code} > > > The other strange behavior I have noticed is that if I read a dataset > inside of my Jupyter notebook, > > {code:java} > %%time > dataset = ds.dataset("dev/test-split", > partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), > flavor="hive"), > filesystem=fs, > exclude_invalid_files=False) > CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code} > > Now, on the exact same server when I try to run the same code against the > same dataset in Airflow it takes over 3 minutes (comparing the timestamps in > my logs between right before I read the dataset, and immediately after the > dataset is available to filter): > {code:java} > [2021-01-14 03:52:04,011] INFO - Reading dev/test-split > [2021-01-14 03:55:17,360] INFO - Processing dat
[jira] [Closed] (ARROW-9682) [Python] Unable to specify the partition style with pq.write_to_dataset
[ https://issues.apache.org/jira/browse/ARROW-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-9682. -- Resolution: Not A Problem This works using ds.write_dataset() > [Python] Unable to specify the partition style with pq.write_to_dataset > --- > > Key: ARROW-9682 > URL: https://issues.apache.org/jira/browse/ARROW-9682 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 1.0.0 > Environment: Ubuntu 18.04 > Python 3.7 >Reporter: Lance Dacey >Priority: Major > Labels: dataset-parquet-write, parquet, parquetWriter > > I am able to import and test DirectoryPartitioning but I am not able to > figure out a way to write a dataset using this feature. It seems like > write_to_dataset defaults to the "hive" style. Is there a way to test this? > {code:java} > from pyarrow.dataset import DirectoryPartitioning > partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), > ("month", pa.int8()), ("day", pa.int8())])) > print(partitioning.parse("/2009/11/3")) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12419) [Java] flatc is not used in mvn
[ https://issues.apache.org/jira/browse/ARROW-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-12419: --- Labels: pull-request-available (was: ) > [Java] flatc is not used in mvn > --- > > Key: ARROW-12419 > URL: https://issues.apache.org/jira/browse/ARROW-12419 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Affects Versions: 4.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > ARROW-12111 removed the usage of flatc during the build process in mvn. Thus, > it is not necessary to explicitly download flatc for s390x. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12419) [Java] flatc is not used in mvn
Kazuaki Ishizaki created ARROW-12419: Summary: [Java] flatc is not used in mvn Key: ARROW-12419 URL: https://issues.apache.org/jira/browse/ARROW-12419 Project: Apache Arrow Issue Type: Improvement Components: Java Affects Versions: 4.0.0 Reporter: Kazuaki Ishizaki Assignee: Kazuaki Ishizaki ARROW-12111 removed the usage of flatc during the build process in mvn. Thus, it is not necessary to explicitly download flatc for s390x. -- This message was sent by Atlassian Jira (v8.3.4#803005)