[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-10 Thread Josh Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211833#comment-17211833
 ] 

Josh Taylor commented on ARROW-10226:
-

I'm seeing the same issue described by the initial title, which is that it 
never completes.

Test file: 
[https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]

(This is from Snowflake's example data, exported as a single-file parquet 
file; the same thing happens for many files.)

Code that fails (both the group-by with sum of columns and the builder 
pattern fail):

https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10274) [Rust] arithmetic without SIMD does unnecessary copy

2020-10-10 Thread Ritchie (Jira)
Ritchie created ARROW-10274:
---

 Summary: [Rust] arithmetic without SIMD does unnecessary copy
 Key: ARROW-10274
 URL: https://issues.apache.org/jira/browse/ARROW-10274
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ritchie


The arithmetic kernels that don't use SIMD collect results into a `Vec` in 
memory and later copy that data into a `Buffer`. Maybe we could write the 
arithmetic result directly to a mutable buffer and avoid this redundant copy?

 

 
{code:java}
let values = (0..left.len())
    .map(|i| op(left.value(i), right.value(i)))
    .collect::<Vec<T::Native>>();

let data = ArrayData::new(
    T::get_data_type(),
    left.len(),
    None,
    null_bit_buffer,
    0,
    vec![Buffer::from(values.to_byte_slice())],
    vec![],
);{code}
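To make the proposal concrete, here is a language-agnostic sketch in plain Python (the function names and the `bytearray`-as-`MutableBuffer` analogy are illustrative stand-ins, not the arrow crate's actual API): the current path builds a temporary vector and then copies its bytes, while the proposed path writes each result straight into one preallocated mutable buffer.

```python
import struct

def add_with_copy(left, right):
    # Current approach: collect results into a temporary list ("Vec"),
    # then copy its contents into a final byte buffer ("Buffer").
    values = [l + r for l, r in zip(left, right)]
    return struct.pack("<%di" % len(values), *values)

def add_into_buffer(left, right):
    # Proposed approach: preallocate one mutable buffer and write each
    # result directly into it, skipping the intermediate copy.
    buf = bytearray(4 * len(left))
    for i, (l, r) in enumerate(zip(left, right)):
        struct.pack_into("<i", buf, 4 * i, l + r)
    return bytes(buf)
```

Both produce byte-identical output; only the number of allocations and copies differs.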
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-10 Thread Josh Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211816#comment-17211816
 ] 

Josh Taylor commented on ARROW-10242:
-

I couldn't get this to fail again; I rebuilt everything and the basic querying 
seems to work now.

 

Thanks!

> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel")
> --
>
> Key: ARROW-10242
> URL: https://issues.apache.org/jira/browse/ARROW-10242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Josh Taylor
>Assignee: Andy Grove
>Priority: Major
>
> *Running the latest code from github for datafusion & parquet.*
> When trying to read a directory of ~210 parquet files (3.2 GB total, 
> each file around 13-18 MB), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>  "something",
>  "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>  "select * from something",
> )?;
> let results = df.collect().await?;
>  
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel"){code}
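The error above is what Rust mpsc-style channels report when a producer sends after the receiver has been dropped. A minimal Python sketch of that failure mode (the `Channel` class here is a hypothetical stand-in, not DataFusion's actual implementation): each Parquet reader thread keeps sending batches, and once the consumer goes away every remaining send fails with the same message.

```python
import queue
import threading

class DisconnectedChannel(Exception):
    pass

class Channel:
    """Minimal stand-in for an mpsc channel whose receiver can disconnect."""
    def __init__(self):
        self._q = queue.Queue()
        self._open = True

    def send(self, item):
        if not self._open:
            raise DisconnectedChannel("sending on a disconnected channel")
        self._q.put(item)

    def close_receiver(self):
        # Models the consumer side being dropped (query finished or errored).
        self._open = False

errors = []
ch = Channel()

def reader_thread(n_batches):
    for batch in range(n_batches):
        try:
            ch.send(batch)
        except DisconnectedChannel as e:
            errors.append(str(e))  # what each reader thread would report
            return

ch.send("first batch")   # consumer accepts one batch...
ch.close_receiver()      # ...then disconnects
t = threading.Thread(target=reader_thread, args=(3,))
t.start()
t.join()
```

With ~210 files there are many reader threads, hence the error repeating ~204 times once the receiving side disconnects.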



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10242) Parquet reader thread terminated due to error: ExecutionError("sending on a disconnected channel")

2020-10-10 Thread Josh Taylor (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Taylor closed ARROW-10242.
---
Resolution: Fixed

> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel")
> --
>
> Key: ARROW-10242
> URL: https://issues.apache.org/jira/browse/ARROW-10242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Josh Taylor
>Assignee: Andy Grove
>Priority: Major
>
> *Running the latest code from github for datafusion & parquet.*
> When trying to read a directory of ~210 parquet files (3.2 GB total, 
> each file around 13-18 MB), doing the following:
> {code:java}
> let mut ctx = ExecutionContext::new();
> // register parquet file with the execution context
> ctx.register_parquet(
>  "something",
>  "/home/josh/dev/pat/fff/"
> )?;
> // execute the query
> let df = ctx.sql(
>  "select * from something",
> )?;
> let results = df.collect().await?;
>  
> {code}
> I get the following error shown ~204 times:
> {code:java}
> Parquet reader thread terminated due to error: ExecutionError("sending on a 
> disconnected channel"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10273) [CI][Homebrew] Fix "brew audit" usage

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10273:
---
Labels: pull-request-available  (was: )

> [CI][Homebrew] Fix "brew audit" usage
> -
>
> Key: ARROW-10273
> URL: https://issues.apache.org/jira/browse/ARROW-10273
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10273) [CI][Homebrew] Fix "brew audit" usage

2020-10-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10273:


 Summary: [CI][Homebrew] Fix "brew audit" usage
 Key: ARROW-10273
 URL: https://issues.apache.org/jira/browse/ARROW-10273
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10266) [CI][macOS] Ensure using Python 3.8 with Homebrew

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10266:

Fix Version/s: 2.0.0

> [CI][macOS] Ensure using Python 3.8 with Homebrew
> -
>
> Key: ARROW-10266
> URL: https://issues.apache.org/jira/browse/ARROW-10266
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10266) [CI][macOS] Ensure using Python 3.8 with Homebrew

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10266.
-
Resolution: Fixed

> [CI][macOS] Ensure using Python 3.8 with Homebrew
> -
>
> Key: ARROW-10266
> URL: https://issues.apache.org/jira/browse/ARROW-10266
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10272) [Packaging][Python] Pin newer multibuild version to avoid updating homebrew

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10272:
---
Labels: pull-request-available  (was: )

> [Packaging][Python] Pin newer multibuild version to avoid updating homebrew
> ---
>
> Key: ARROW-10272
> URL: https://issues.apache.org/jira/browse/ARROW-10272
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Build failure: 
> https://travis-ci.org/github/ursa-labs/crossbow/builds/734324594



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10272) [Packaging][Python] Pin newer multibuild version to avoid updating homebrew

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10272.
-
Resolution: Fixed

Issue resolved by pull request 8431
[https://github.com/apache/arrow/pull/8431]

> [Packaging][Python] Pin newer multibuild version to avoid updating homebrew
> ---
>
> Key: ARROW-10272
> URL: https://issues.apache.org/jira/browse/ARROW-10272
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> Build failure: 
> https://travis-ci.org/github/ursa-labs/crossbow/builds/734324594



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10272) [Packaging][Python] Pin newer multibuild version to avoid updating homebrew

2020-10-10 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10272:
---

 Summary: [Packaging][Python] Pin newer multibuild version to avoid 
updating homebrew
 Key: ARROW-10272
 URL: https://issues.apache.org/jira/browse/ARROW-10272
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 2.0.0


Build failure: https://travis-ci.org/github/ursa-labs/crossbow/builds/734324594



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-9553.

Resolution: Fixed

Issue resolved by pull request 8429
[https://github.com/apache/arrow/pull/8429]

> [Rust] Release script doesn't bump parquet crate's arrow dependency version
> ---
>
> Key: ARROW-9553
> URL: https://issues.apache.org/jira/browse/ARROW-9553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After rebasing on master, the Rust builds have started to fail.
> The solution is to bump a version number here: 
> https://github.com/apache/arrow/pull/7829



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10249) [Rust]: Support Dictionary types for ListArrays in arrow json reader

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10249:
---
Labels: pull-request-available  (was: )

> [Rust]: Support Dictionary types for ListArrays in arrow json reader
> 
>
> Key: ARROW-10249
> URL: https://issues.apache.org/jira/browse/ARROW-10249
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, dictionary types for ListArrays are not supported in the Arrow JSON 
> reader. It would be nice to add dictionary type support.
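For context, dictionary encoding replaces repeated values with indices into a small dictionary of distinct values. A hedged sketch in plain Python (not the arrow crate's reader types) of what encoding the values of a list column would produce:

```python
def dictionary_encode(values):
    """Return (dictionary, indices): each value becomes an index into
    the dictionary of distinct values, in first-seen order."""
    dictionary, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

# e.g. the flattened child values of a ListArray of strings:
dictionary, indices = dictionary_encode(["a", "b", "a", "a", "c"])
```

The JSON reader would need to build such indices while parsing, instead of materializing every repeated string.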



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9553) [Rust] Release script doesn't bump parquet crate's arrow dependency version

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9553:
--
Labels: pull-request-available  (was: )

> [Rust] Release script doesn't bump parquet crate's arrow dependency version
> ---
>
> Key: ARROW-9553
> URL: https://issues.apache.org/jira/browse/ARROW-9553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After rebasing on master, the Rust builds have started to fail.
> The solution is to bump a version number here: 
> https://github.com/apache/arrow/pull/7829



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10100) [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10100.
-
Resolution: Fixed

Issue resolved by pull request 8301
[https://github.com/apache/arrow/pull/8301]

> [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of 
> row group ids
> ---
>
> Key: ARROW-10100
> URL: https://issues.apache.org/jira/browse/ARROW-10100
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> From discussion at 
> https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the 
> dataset API in their parquet reader), it might be useful to somehow "subset" 
> or read a subset of a ParquetFileFragment for a specific set of row group ids.
> Use cases:
> * Read only a set of row groups ids (this is similar as 
> {{ParquetFile.read_row_groups}}), eg because you want to control the size of 
> the resulting table by reading subsets of row groups
> * Get a ParquetFileFragment with a subset of row groups (eg based on a 
> filter) to then eg get the statistics of only those row groups
> The first case could for example be solved by adding a {{row_groups}} keyword 
> to {{ParquetFileFragment.to_table}} (but, this is then a keyword specific to 
> the parquet format, and we should then probably also add it to {{scan}} et 
> al).
> The second case is something you can in principle do yourself manually by 
> recreating a fragment with {{fragment.format.make_fragment(fragment.path, 
> ..., row_groups=[...])}}. However, this is a) a bit cumbersome and b) 
> statistics might need to be parsed again?  
> The statistics of a set of filtered row groups could also be obtained by 
> using {{split_by_row_group(filter)}} (and then get the statistics of each of 
> the fragments), but if you then want a single fragment, you need to recreate 
> a fragment with the obtained row group ids.
> So one idea I have now (mostly brainstorming here): would it be useful to 
> have a method to create a "subsetted" ParquetFileFragment, either based on a 
> list of row group ids ({{fragment.subset(row_groups=[...])}}) or based 
> on a filter ({{fragment.subset(filter=...)}}, which would be equivalent to 
> split_by_row_group + recombining into a single fragment)?
> cc [~bkietz] [~rjzamora]
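A mock illustration of the API being brainstormed above; the `Fragment` class and its `subset()` method are hypothetical stand-ins, not pyarrow's actual classes. The point being sketched: a subset reuses the already-parsed row-group metadata rather than re-reading the file footer.

```python
class Fragment:
    """Hypothetical stand-in for a ParquetFileFragment."""
    def __init__(self, path, row_groups):
        self.path = path
        self.row_groups = list(row_groups)  # ids with (pretend) parsed statistics

    def subset(self, row_groups=None, predicate=None):
        # Either keep an explicit list of row group ids...
        if row_groups is not None:
            wanted = set(row_groups)
            kept = [rg for rg in self.row_groups if rg in wanted]
        # ...or keep the row groups matching a filter predicate.
        else:
            kept = [rg for rg in self.row_groups if predicate(rg)]
        # Reuse already-parsed metadata instead of re-reading the footer.
        return Fragment(self.path, kept)

frag = Fragment("part-0.parquet", row_groups=[0, 1, 2, 3])
by_ids = frag.subset(row_groups=[1, 3])
by_filter = frag.subset(predicate=lambda rg: rg % 2 == 0)
```

This is the convenience the manual `fragment.format.make_fragment(...)` route lacks: one call, no re-parsing of statistics.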



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10271) [Rust] packed_simd is broken and continued under a new project

2020-10-10 Thread Ritchie (Jira)
Ritchie created ARROW-10271:
---

 Summary: [Rust] packed_simd is broken and continued under a new 
project
 Key: ARROW-10271
 URL: https://issues.apache.org/jira/browse/ARROW-10271
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Ritchie


The dependency doesn't compile on newer versions of nightly. This is also known 
to the (new) project maintainers. Due to complications, they continued the 
project under a new name: `packed_simd_2`.

 
{code}
packed_simd = { version = "0.3.4", package = "packed_simd_2" }
{code}
 

See:

https://github.com/rust-lang/packed_simd



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7957) [Python] ParquetDataset cannot take HadoopFileSystem as filesystem

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7957.

Resolution: Fixed

Issue resolved by pull request 8414
[https://github.com/apache/arrow/pull/8414]

> [Python] ParquetDataset cannot take HadoopFileSystem as filesystem
> --
>
> Key: ARROW-7957
> URL: https://issues.apache.org/jira/browse/ARROW-7957
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Catherine
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{from pyarrow.fs import HadoopFileSystem}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
>  {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
>  {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
>  
> has error:
>  {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}
>  
> When I tried using the deprecated {{HadoopFileSystem}}:
> {{import pyarrow}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
> {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
> {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}
> {{pa_schema = dataset.schema.to_arrow_schema()}}
> {{pieces = dataset.pieces}}
> {{for piece in pieces: }}
> {{    print(piece.path)}}
>  
> {{piece.path}} loses the {{hdfs://localhost:9000}} prefix.
>  
> I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
> filesystem?
> And {{piece.path}} should have the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6043) [Python] Array equals returns incorrectly if NaNs are in arrays

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-6043:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Python] Array equals returns incorrectly if NaNs are in arrays
> ---
>
> Key: ARROW-6043
> URL: https://issues.apache.org/jira/browse/ARROW-6043
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Keith Kraus
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> {code:python}
> import numpy as np
> import pyarrow as pa
> data = [0, 1, np.nan, None, 4]
> arr1 = pa.array(data)
> arr2 = pa.array(data)
> pa.Array.equals(arr1, arr2)
> {code}
> Unsure if this is expected behavior, but in Arrow 0.12.1 this returned `True` 
> as compared to `False` in 0.14.1.
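The behavior difference likely traces back to IEEE 754 NaN semantics: NaN compares unequal to itself, so a naive element-wise equality over arrays containing NaN yields False unless NaN is special-cased. A plain-Python sketch (not pyarrow's actual comparison code) of the two possible semantics:

```python
import math

nan = float("nan")
data = [0.0, 1.0, nan, None, 4.0]

def naive_equals(a, b):
    # NaN != NaN, so identical arrays containing NaN compare unequal.
    return all(x == y for x, y in zip(a, b))

def nan_aware_equals(a, b):
    # Treat NaN-in-the-same-slot as equal, like 0.12.1 apparently did.
    def eq(x, y):
        if isinstance(x, float) and isinstance(y, float):
            if math.isnan(x) and math.isnan(y):
                return True
        return x == y
    return all(eq(x, y) for x, y in zip(a, b))
```

Which semantics `Array.equals` should use is exactly the question the issue raises.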



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-10 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10240.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8409
[https://github.com/apache/arrow/pull/8409]

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The tpch benchmark runtime seems to be dominated by csv parsing code and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations it should be easier to profile 
> and find bottlenecks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10251:
---
Labels: beginner pull-request-available  (was: beginner)

> [Rust] [DataFusion] MemTable::load() should load partitions in parallel
> ---
>
> Key: ARROW-10251
> URL: https://issues.apache.org/jira/browse/ARROW-10251
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> MemTable::load() should load partitions in parallel using async tasks, rather 
> than loading one partition at a time.
> Also, we should make batch size configurable. It is currently hard-coded to 
> 1024*1024 which can be quite inefficient.
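A sketch of the requested change in plain Python (the `load_partition` stub stands in for reading one partition's record batches; it is not DataFusion's API): load partitions concurrently with a pool instead of one at a time, and thread the batch size through as a parameter rather than hard-coding it.

```python
from concurrent.futures import ThreadPoolExecutor

def load_partition(partition_id, batch_size=8192):
    # Placeholder for the real CSV/Parquet scan of one partition.
    return [f"batch-{partition_id}-{i}" for i in range(2)]

def load_all(partition_ids, batch_size=8192):
    with ThreadPoolExecutor() as pool:
        # Partitions are read concurrently; result order is preserved.
        return list(pool.map(lambda p: load_partition(p, batch_size),
                             partition_ids))

parts = load_all([0, 1, 2])
```

In the Rust implementation the equivalent would be spawning one async task per partition and awaiting them together.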



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10270) [R] Fix CSV timestamp_parsers test on R-devel

2020-10-10 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10270:
---

 Summary: [R] Fix CSV timestamp_parsers test on R-devel
 Key: ARROW-10270
 URL: https://issues.apache.org/jira/browse/ARROW-10270
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 2.0.0


Apparently there is a change in the development version of R with respect to 
timezone handling. I suspect it is this: 
https://github.com/wch/r-source/blob/trunk/doc/NEWS.Rd#L296-L300

It causes this failure:

{code}
── 1. Failure: read_csv_arrow() can read timestamps (@test-csv.R#216)  ─
`tbl` not equal to `df`.
Component "time": 'tzone' attributes are inconsistent ('UTC' and '')

── 2. Failure: read_csv_arrow() can read timestamps (@test-csv.R#219)  ─
`tbl` not equal to `df`.
Component "time": 'tzone' attributes are inconsistent ('UTC' and '')
{code}

This needs to be fixed for the CRAN release because they check on the devel 
version. But it doesn't need to block the 2.0 release candidate because I can 
(at minimum) skip these tests before submitting to CRAN (FYI [~kszucs])

I'll also add a CI job to test on R-devel. I just removed 2 R jobs so we can 
afford to add one back.

cc [~romainfrancois]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10267) [Python] Skip flight test if disable_server_verification feature is not available

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10267.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8427
[https://github.com/apache/arrow/pull/8427]

> [Python] Skip flight test if disable_server_verification feature is not 
> available
> -
>
> Key: ARROW-10267
> URL: https://issues.apache.org/jira/browse/ARROW-10267
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Our nightly builds are failing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10249) [Rust]: Support Dictionary types for ListArrays in arrow json reader

2020-10-10 Thread Mahmut Bulut (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahmut Bulut reassigned ARROW-10249:


Assignee: Mahmut Bulut

> [Rust]: Support Dictionary types for ListArrays in arrow json reader
> 
>
> Key: ARROW-10249
> URL: https://issues.apache.org/jira/browse/ARROW-10249
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> Currently, dictionary types for ListArrays are not supported in the Arrow JSON 
> reader. It would be nice to add dictionary type support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10269) [Rust] Update nightly: Oct 2020 Edition

2020-10-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10269:
--

 Summary: [Rust] Update nightly: Oct 2020 Edition
 Key: ARROW-10269
 URL: https://issues.apache.org/jira/browse/ARROW-10269
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Neville Dipale


We should update to a more recent nightly after the 2.0.0 release. It carries 
some clippy annoyances, which will mean that I have to revert much of what I 
did around float comparisons.

Might also be preferable to do this sooner, so that we can complete the clippy 
integration and throw away the carrot in favour of the stick.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10268) [Rust] Support writing dictionaries to IPC file and stream

2020-10-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10268:
--

 Summary: [Rust] Support writing dictionaries to IPC file and stream
 Key: ARROW-10268
 URL: https://issues.apache.org/jira/browse/ARROW-10268
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


We currently do not support writing dictionary arrays to the IPC file and 
stream format.

When this is supported, we can test the integration with other implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10267) [Python] Skip flight test if disable_server_verification feature is not available

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10267:
---
Labels: pull-request-available  (was: )

> [Python] Skip flight test if disable_server_verification feature is not 
> available
> -
>
> Key: ARROW-10267
> URL: https://issues.apache.org/jira/browse/ARROW-10267
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Our nightly builds are failing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10267) [Python] Skip flight test if disable_server_verification feature is not available

2020-10-10 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10267:
---

 Summary: [Python] Skip flight test if disable_server_verification 
feature is not available
 Key: ARROW-10267
 URL: https://issues.apache.org/jira/browse/ARROW-10267
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs


Our nightly builds are failing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10265) [CI] Use smaller build when cache doesn't exist on Travis CI

2020-10-10 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-10265.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8424
[https://github.com/apache/arrow/pull/8424]

> [CI] Use smaller build when cache doesn't exist on Travis CI
> --
>
> Key: ARROW-10265
> URL: https://issues.apache.org/jira/browse/ARROW-10265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10266) [CI][macOS] Ensure using Python 3.8 with Homebrew

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10266:
---
Labels: pull-request-available  (was: )

> [CI][macOS] Ensure using Python 3.8 with Homebrew
> -
>
> Key: ARROW-10266
> URL: https://issues.apache.org/jira/browse/ARROW-10266
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10266) [CI][macOS] Ensure using Python 3.8 with Homebrew

2020-10-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10266:


 Summary: [CI][macOS] Ensure using Python 3.8 with Homebrew
 Key: ARROW-10266
 URL: https://issues.apache.org/jira/browse/ARROW-10266
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10105) [FlightRPC] Add client option to disable certificate validation with TLS

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10105.
-
Resolution: Fixed

> [FlightRPC] Add client option to disable certificate validation with TLS
> 
>
> Key: ARROW-10105
> URL: https://issues.apache.org/jira/browse/ARROW-10105
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Users of Flight may want to disable certificate validation if they want to 
> only use encryption. A use case might be that the Flight server uses a 
> self-signed certificate and doesn't distribute a certificate for clients to 
> use.
> This feature would be to add an explicit option to FlightClient.Builder to 
> disable certificate validation. Note that this should not happen implicitly 
> if a client uses a TLS location, but does not set a certificate. The client 
> should explicitly set this option so that they are fully aware that they are 
> making a connection with reduced security.





[jira] [Updated] (ARROW-9704) [Java] TestEndianness.testLittleEndian fails on big endian platform

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9704:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Java] TestEndianness.testLittleEndian fails on big endian platform
> ---
>
> Key: ARROW-9704
> URL: https://issues.apache.org/jira/browse/ARROW-9704
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {{TestEndianness.testLittleEndian}} assumes that the data layout of int is 
> little-endian. Thus, this test fails on a big-endian platform.





[jira] [Updated] (ARROW-10265) [CI] Use smaller build when cache doesn't exist on Travis CI

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10265:
---
Labels: pull-request-available  (was: )

> [CI] Use smaller build when cache doesn't exist on Travis CI
> --
>
> Key: ARROW-10265
> URL: https://issues.apache.org/jira/browse/ARROW-10265
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-10265) [CI] Use smaller build when cache doesn't exist on Travis CI

2020-10-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10265:


 Summary: [CI] Use smaller build when cache doesn't exist on Travis CI
 Key: ARROW-10265
 URL: https://issues.apache.org/jira/browse/ARROW-10265
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Resolved] (ARROW-9952) [Python] Use pyarrow.dataset writing for pq.write_to_dataset

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-9952.

Resolution: Fixed

Issue resolved by pull request 8412
[https://github.com/apache/arrow/pull/8412]

> [Python] Use pyarrow.dataset writing for pq.write_to_dataset
> 
>
> Key: ARROW-9952
> URL: https://issues.apache.org/jira/browse/ARROW-9952
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Now ARROW-9658 and ARROW-9893 are in, we can explore using the 
> {{pyarrow.dataset}} writing capabilities in {{parquet.write_to_dataset}}.
> Similarly as was done in {{pq.read_table}}, we could initially have a keyword 
> to switch between both implementations, eventually defaulting to the new 
> datasets one, and to deprecated the old (inefficient) python implementation.





[jira] [Resolved] (ARROW-10256) [C++][Flight] Disable -Werror carefully

2020-10-10 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-10256.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8419
[https://github.com/apache/arrow/pull/8419]

> [C++][Flight] Disable -Werror carefully
> ---
>
> Key: ARROW-10256
> URL: https://issues.apache.org/jira/browse/ARROW-10256
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-10 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211667#comment-17211667
 ] 

Andrew Lamb commented on ARROW-10261:
-

[~nevi_me] -- this proposal makes sense to me and I think it brings the Rust 
implementation closer to the C++ implementation, which is a good thing. 

From my (cursory) reading of the C++ implementation ([source 
link|https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L539-L546]), 
it appears that Lists in C++ use `Field` rather than `DataType` to describe 
each list item's type as well.

> [Rust] [BREAKING] Lists should take Field instead of DataType
> -
>
> Key: ARROW-10261
> URL: https://issues.apache.org/jira/browse/ARROW-10261
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>
> There is currently no way of tracking nested field metadata on lists. For 
> example, if a list's children are nullable, there's no way of telling just by 
> looking at the Field.
> This causes problems with integration testing, and also affects Parquet 
> roundtrips.
> I propose the breaking change of [Large|FixedSize]List taking a Field instead 
> of Box<DataType>, as this will overcome this issue, and ensure that the Rust 
> implementation passes integration tests.
> CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] 
> as this addresses some of the roundtrip failures).
> I'm leaning towards this landing in 3.0.0, as I'd love for us to have 
> completed or made significant traction on the Arrow Parquet writer (and 
> reader), and integration testing, by then.





[jira] [Resolved] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10260.
-
Resolution: Fixed

Issue resolved by pull request 8422
[https://github.com/apache/arrow/pull/8422]

> [Python] Missing MapType to Pandas dtype
> 
>
> Key: ARROW-10260
> URL: https://issues.apache.org/jira/browse/ARROW-10260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Assignee: Derek Marsh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The Map type conversion to Pandas done in ARROW-10151 forgot to add dtype 
> mapping for {{to_pandas_dtype()}}
>  
> {code:java}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
>
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3> in <module>
> ----> 1 d.to_pandas_dtype()
>
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi
>  in pyarrow.lib.DataType.to_pandas_dtype()
>
> NotImplementedError: map<int64, double>{code}





[jira] [Commented] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-10 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211654#comment-17211654
 ] 

Neville Dipale commented on ARROW-10261:


[~jhorstmann] nullability should be determined by the overall field, for 
consistency: you could have 1000 batches of 1000 records, but only have, say, 
5 nulls scattered around.

The main issue is that if I have a non-nullable list, which in turn has a 
nullable struct with various child fields with differing nullability; I won't 
know if the struct is nullable, because I lose that information when only 
taking the field.

Also, in the hypothetical case where the struct has some metadata of its own, 
it gets lost because we would only keep the DataType, and not other attributes 
such as dictionary or metadata (HashMap).

Interestingly, looking at the C++ implementation, it looks like they still use 
List<DataType>, but I can't see how they preserve the extra details that the 
Rust implementation is failing on. [~apitrou] any ideas?
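To make the information loss concrete, here is a minimal Python sketch with hypothetical `DataType`/`Field` stand-ins (not the actual arrow crate types): describing a list child only by its data type discards nullability and metadata, while describing it by a field preserves both.

```python
from dataclasses import dataclass, field


@dataclass
class DataType:
    """Stand-in for an Arrow logical type (e.g. struct, int64)."""
    name: str


@dataclass
class Field:
    """A named schema slot: a type plus nullability and metadata."""
    name: str
    data_type: DataType
    nullable: bool = True
    metadata: dict = field(default_factory=dict)


# A nullable struct child that also carries its own metadata.
child = Field("item", DataType("struct"), nullable=True,
              metadata={"origin": "sensor"})

# List described only by the child's DataType: nullability and metadata vanish.
list_as_datatype = ("list", child.data_type)

# List described by the child Field: everything survives.
list_as_field = ("list", child)

# The DataType-only form cannot even answer "is the child nullable?"
assert not hasattr(list_as_datatype[1], "nullable")
assert list_as_field[1].nullable is True
assert list_as_field[1].metadata == {"origin": "sensor"}
```

This mirrors why round-tripping nested types through integration tests fails when only the child `DataType` is kept.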

> [Rust] [BREAKING] Lists should take Field instead of DataType
> -
>
> Key: ARROW-10261
> URL: https://issues.apache.org/jira/browse/ARROW-10261
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>
> There is currently no way of tracking nested field metadata on lists. For 
> example, if a list's children are nullable, there's no way of telling just by 
> looking at the Field.
> This causes problems with integration testing, and also affects Parquet 
> roundtrips.
> I propose the breaking change of [Large|FixedSize]List taking a Field instead 
> of Box<DataType>, as this will overcome this issue, and ensure that the Rust 
> implementation passes integration tests.
> CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] 
> as this addresses some of the roundtrip failures).
> I'm leaning towards this landing in 3.0.0, as I'd love for us to have 
> completed or made significant traction on the Arrow Parquet writer (and 
> reader), and integration testing, by then.





[jira] [Commented] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-10 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211650#comment-17211650
 ] 

Jörn Horstmann commented on ARROW-10261:


For my understanding, this is about metadata? So even if there are no null 
values in one batch or partition you want to mark the elements as potentially 
nullable?

> [Rust] [BREAKING] Lists should take Field instead of DataType
> -
>
> Key: ARROW-10261
> URL: https://issues.apache.org/jira/browse/ARROW-10261
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>
> There is currently no way of tracking nested field metadata on lists. For 
> example, if a list's children are nullable, there's no way of telling just by 
> looking at the Field.
> This causes problems with integration testing, and also affects Parquet 
> roundtrips.
> I propose the breaking change of [Large|FixedSize]List taking a Field instead 
> of Box<DataType>, as this will overcome this issue, and ensure that the Rust 
> implementation passes integration tests.
> CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] 
> as this addresses some of the roundtrip failures).
> I'm leaning towards this landing in 3.0.0, as I'd love for us to have 
> completed or made significant traction on the Arrow Parquet writer (and 
> reader), and integration testing, by then.





[jira] [Updated] (ARROW-9945) [C++][Dataset] Refactor Expression::Assume to return a Result

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9945:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++][Dataset] Refactor Expression::Assume to return a Result
> -
>
> Key: ARROW-9945
> URL: https://issues.apache.org/jira/browse/ARROW-9945
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
> Fix For: 3.0.0
>
>
> Expression::Assume can abort if the two expressions are not valid against a 
> single schema. This is not ideal since a schema is not always easily 
> available. The method should be able to fail gracefully in the case of a 
> best-effort simplification where validation against a schema is not desired.
> https://github.com/apache/arrow/pull/8037#discussion_r475594117





[jira] [Resolved] (ARROW-10252) [Python] Add option to skip inclusion of Arrow headers in Python installation

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10252.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8416
[https://github.com/apache/arrow/pull/8416]

> [Python] Add option to skip inclusion of Arrow headers in Python installation
> -
>
> Key: ARROW-10252
> URL: https://issues.apache.org/jira/browse/ARROW-10252
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We don't want to have them as part of the conda package as the single source 
> should be {{arrow-cpp}}.





[jira] [Commented] (ARROW-4960) [R] Add crossbow task for r-arrow-feedstock

2020-10-10 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211621#comment-17211621
 ] 

Krisztian Szucs commented on ARROW-4960:


It doesn't seem required for 2.0 so updating the version. 

> [R] Add crossbow task for r-arrow-feedstock
> ---
>
> Key: ARROW-4960
> URL: https://issues.apache.org/jira/browse/ARROW-4960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, R
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> We also have an R package on conda-forge now: 
> [https://github.com/conda-forge/r-arrow-feedstock] This should be tested 
> using crossbow as we do with the other packages.





[jira] [Updated] (ARROW-4960) [R] Add crossbow task for r-arrow-feedstock

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-4960:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [R] Add crossbow task for r-arrow-feedstock
> ---
>
> Key: ARROW-4960
> URL: https://issues.apache.org/jira/browse/ARROW-4960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, R
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> We also have an R package on conda-forge now: 
> [https://github.com/conda-forge/r-arrow-feedstock] This should be tested 
> using crossbow as we do with the other packages.





[jira] [Resolved] (ARROW-10230) [JS][Doc] JavaScript documentation fails to build

2020-10-10 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10230.
-
Resolution: Fixed

Issue resolved by pull request 8395
[https://github.com/apache/arrow/pull/8395]

> [JS][Doc] JavaScript documentation fails to build
> -
>
> Key: ARROW-10230
> URL: https://issues.apache.org/jira/browse/ARROW-10230
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, JavaScript
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Probably because of typedoc updates.





[jira] [Resolved] (ARROW-3080) [Python] Unify Arrow to Python object conversion paths

2020-10-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-3080.
--
Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8349
[https://github.com/apache/arrow/pull/8349]

> [Python] Unify Arrow to Python object conversion paths
> --
>
> Key: ARROW-3080
> URL: https://issues.apache.org/jira/browse/ARROW-3080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> Similar to ARROW-2814, we have inconsistent support for converting Arrow 
> nested types back to object sequences. For example, a list of structs fails 
> when calling {{to_pandas}}





[jira] [Assigned] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-10260:
-

Assignee: Derek Marsh

> [Python] Missing MapType to Pandas dtype
> 
>
> Key: ARROW-10260
> URL: https://issues.apache.org/jira/browse/ARROW-10260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Assignee: Derek Marsh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The Map type conversion to Pandas done in ARROW-10151 forgot to add dtype 
> mapping for {{to_pandas_dtype()}}
>  
> {code:java}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
>
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3> in <module>
> ----> 1 d.to_pandas_dtype()
>
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi
>  in pyarrow.lib.DataType.to_pandas_dtype()
>
> NotImplementedError: map<int64, double>{code}





[jira] [Resolved] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata

2020-10-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-10248.
---
Resolution: Fixed

Issue resolved by pull request 8415
[https://github.com/apache/arrow/pull/8415]

> [C++][Dataset] Dataset writing does not write schema metadata
> -
>
> Key: ARROW-10248
> URL: https://issues.apache.org/jira/browse/ARROW-10248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Not sure if this is related to the writing refactor that landed yesterday, 
> but `write_dataset` does not preserve the schema metadata (eg used for pandas 
> metadata):
> {code}
> In [20]: df = pd.DataFrame({'a': [1, 2, 3]})
> In [21]: table = pa.Table.from_pandas(df)
> In [22]: table.schema
> Out[22]: 
> a: int64
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
> 396
> In [23]: ds.write_dataset(table, "test_write_dataset_pandas", 
> format="parquet")
> In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema
> Out[24]: 
> a: int64
>   -- field metadata --
>   PARQUET:field_id: '1'
> {code}
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
> yet look into how easy it would be to fix.
> cc [~bkietz]





[jira] [Commented] (ARROW-10175) [CI] Nightly hdfs integration test job fails

2020-10-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211606#comment-17211606
 ] 

Joris Van den Bossche commented on ARROW-10175:
---

I opened ARROW-10264 for the failing URI test (which I would think should work)

> [CI] Nightly hdfs integration test job fails
> 
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Two tests fail:
> https://github.com/ursa-labs/crossbow/runs/1204680589
> [removed bogus investigation]





[jira] [Updated] (ARROW-10264) [C++][Python] Parquet test failing with HadoopFileSystem URI

2020-10-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10264:
--
Labels: filesystem hdfs  (was: )

> [C++][Python] Parquet test failing with HadoopFileSystem URI
> 
>
> Key: ARROW-10264
> URL: https://issues.apache.org/jira/browse/ARROW-10264
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: filesystem, hdfs
> Fix For: 3.0.0
>
>
> Follow-up on ARROW-10175. In the HDFS integration tests, there is a test 
> using a URI failing if we use the new filesystem / dataset implementation:
> {code}
> FAILED 
> opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py::TestLibHdfs::test_read_multiple_parquet_files_with_uri
> {code}
> fails with
> {code}
> pyarrow.lib.ArrowInvalid: Path 
> '/tmp/pyarrow-test-838/multi-parquet-uri-48569714efc74397816722c9c6723191/0.parquet'
>  is not relative to '/user/root'
> {code}
> while it is passing a URI (and not a filesystem object) to 
> {{parquet.read_table}}, and the new filesystems/dataset implementation should 
> be able to handle URIs.
> cc [~apitrou]





[jira] [Created] (ARROW-10264) [C++][Python] Parquet test failing with HadoopFileSystem URI

2020-10-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10264:
-

 Summary: [C++][Python] Parquet test failing with HadoopFileSystem 
URI
 Key: ARROW-10264
 URL: https://issues.apache.org/jira/browse/ARROW-10264
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche
 Fix For: 3.0.0


Follow-up on ARROW-10175. In the HDFS integration tests, there is a test using 
a URI failing if we use the new filesystem / dataset implementation:

{code}
FAILED 
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py::TestLibHdfs::test_read_multiple_parquet_files_with_uri
{code}

fails with

{code}
pyarrow.lib.ArrowInvalid: Path 
'/tmp/pyarrow-test-838/multi-parquet-uri-48569714efc74397816722c9c6723191/0.parquet'
 is not relative to '/user/root'
{code}

while it is passing a URI (and not a filesystem object) to 
{{parquet.read_table}}, and the new filesystems/dataset implementation should 
be able to handle URIs.

cc [~apitrou]
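For illustration, a minimal Python sketch (not the actual Arrow C++ filesystem code) of the kind of base-path check that produces this error: the test file lives under an absolute `/tmp` path, which can never be made relative to the `/user/root` working directory derived from the URI. The `make_relative` helper name is hypothetical.

```python
import posixpath


def make_relative(path: str, base: str) -> str:
    """Return `path` relative to `base`, or raise an error shaped like Arrow's."""
    if posixpath.commonpath([path, base]) != base:
        raise ValueError(f"Path '{path}' is not relative to '{base}'")
    return posixpath.relpath(path, base)


# A path under the base resolves fine:
print(make_relative("/user/root/data/0.parquet", "/user/root"))
# prints: data/0.parquet

# The test's /tmp path does not share the base prefix, so the check fails:
try:
    make_relative("/tmp/pyarrow-test-838/0.parquet", "/user/root")
except ValueError as exc:
    print(exc)
    # prints: Path '/tmp/pyarrow-test-838/0.parquet' is not relative to '/user/root'
```

The fix presumably needs the URI handling to anchor paths at the filesystem root rather than the user's working directory.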





[jira] [Updated] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Parquet

2020-10-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9812:
-
Fix Version/s: 3.0.0

> [Python] Map data types doesn't work from Arrow to Parquet
> --
>
> Key: ARROW-9812
> URL: https://issues.apache.org/jira/browse/ARROW-9812
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mayur Srivastava
>Priority: Major
> Fix For: 3.0.0
>
>
> Hi,
> I'm having problems using the 'map' data type in Arrow/Parquet/Pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # -Map data type doesn't work from Arrow -> Pandas-. Fixed in ARROW-10151
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will they be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
> t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', 
> pa.map_(pa.string(), pa.string()))]))
> print('PASSED')
> print(t1)
> except:
> print(f'FAILED')
> tb.print_exc()
> print(f'Arrow -> Pandas')
> try:
> t1.to_pandas()
> print('PASSED')
> except:
> print(f'FAILED')
> tb.print_exc()
> 
> print(f'Arrow -> Parquet')
> fh = io.BytesIO()
> try:
> pq.write_table(t1, fh)
> print('PASSED')
> except:
> print('FAILED')
> tb.print_exc()
> 
> print(f'Parquet -> Arrow')
> try:
> t2 = pq.read_table(source=fh)
> print('PASSED')
> print(t2)
> except:
> print('FAILED')
> tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0 
> Pandas Version = 1.0.5 
> df1 
> a 0 [(b, 2)] 
>  
> Pandas -> Arrow 
> PASSED 
> pyarrow.Table 
> a: map<string, string>
>   child 0, entries: struct<key: string not null, value: string> not null
>       child 0, key: string not null
>       child 1, value: string 
>  
> Arrow -> Pandas 
> FAILED 
> Traceback (most recent call last):
> File "arrowtest.py", line 26, in <module>
> t1.to_pandas() 
> File "pyarrow/array.pxi", line 715, in 
> pyarrow.lib._PandasConvertible.to_pandas 
> File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas File 
> "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in 
> table_to_blockmanager blocks = _table_to_blocks(options, table, categories, 
> ext_columns_dtypes) 
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 
> 1115, in _table_to_blocks list(extension_columns.keys())) 
> File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks File 
> "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status 
> pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for 
> Arrow data of type map<string, string> is known. 
>  
> Arrow -> Parquet 
> PASSED 
>  
> Parquet -> Arrow 
> FAILED 
> Traceback (most recent call last):
> File "arrowtest.py", line 43, in <module>
> t2 = pq.read_table(source=fh) 
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in 
> read_table use_pandas_metadata=use_pandas_metadata) 
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in 
> read use_threads=use_threads 
> File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table 
> File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table 
> File "pyarrow/error.pxi", line 122, in 
> pyarrow.lib.pyarrow_internal_check_status 
> File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status 
> pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet 
> files not yet supported: key_value: list<key_value: struct<key: string not 
> null, value: string> not null> not null
> {code}
> Updated to indicate to Pandas conversion done, but not yet for Parquet.





[jira] [Assigned] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation

2020-10-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann reassigned ARROW-10243:
--

Assignee: Jörn Horstmann

> [Rust] [Datafusion] Optimize literal expression evaluation
> --
>
> Key: ARROW-10243
> URL: https://issues.apache.org/jira/browse/ARROW-10243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
> Attachments: flamegraph.svg
>
>
> While benchmarking the tpch query I noticed that the physical literal 
> expression takes up a sizable amount of time. I think the creation of the 
> corresponding array for numeric literals can be speed up by creating Buffer 
> and ArrayData directly without going through a builder. That also allows to 
> skip building a null bitmap for non-null literals.
> I'm also thinking whether it might be possible to cache the created array. 
> For queries without a WHERE clause, I'd expect all batches except the last to 
> have the same length. I'm not sure though where to store the cached value.
> Another possible optimization could be to cast literals already on the 
> logical plan side. In the tpch query the literal `1` is of type `u64` in the 
> logical plan and then needs to be processed by a cast kernel to convert to 
> `f64` for usage in an arithmetic expression.
> The attached flamegraph is of 10 runs of tpch, with the data being loaded 
> into memory before running the queries (See ARROW-10240).
> {code}
> flamegraph ./target/release/tpch --iterations 10 --path ../tpch-dbgen 
> --format tbl --query 1 --batch-size 4096 -c1 --load
> {code}





[jira] [Resolved] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset

2020-10-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-10244.
---
Resolution: Fixed

Issue resolved by pull request 8410
[https://github.com/apache/arrow/pull/8410]

> [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset
> 
>
> Key: ARROW-10244
> URL: https://issues.apache.org/jira/browse/ARROW-10244
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-10263) [C++][Compute] Improve numerical stability of variances merging

2020-10-10 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai updated ARROW-10263:
-
Description: 
For chunked arrays, the variance kernel needs to merge per-chunk variances.
Tested with two single-value chunks, [400800490] and [400800400]: the merged 
variance is 3872, while treating them as a single array of two values gives 
3904, the same as numpy's output.
So the current merging method is not numerically stable in extreme cases where 
chunks are very short and their means are nearly equal.

  was:
For a chunked array, the variance kernel needs to merge per-chunk variances.
Tested with two single-value chunks, [400800490] and [400800400].
The merged variance is 3872. If treated as a single array with two values, the
variance is 3904, the same as numpy outputs.
So the current merging method is not numerically stable in extreme cases.


> [C++][Compute] Improve numerical stability of variances merging
> ---------------------------------------------------------------
>
> Key: ARROW-10263
> URL: https://issues.apache.org/jira/browse/ARROW-10263
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>
> For a chunked array, the variance kernel needs to merge per-chunk variances.
> Tested with two single-value chunks, [400800490] and [400800400].
> The merged variance is 3872. If treated as a single array with two values, the
> variance is 3904, the same as numpy outputs.
> So the current merging method is not numerically stable in extreme cases where
> chunks are very short and have approximately equal means.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10263) [C++][Compute] Improve numerical stability of variances merging

2020-10-10 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-10263:


 Summary: [C++][Compute] Improve numerical stability of variances 
merging
 Key: ARROW-10263
 URL: https://issues.apache.org/jira/browse/ARROW-10263
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


For a chunked array, the variance kernel needs to merge per-chunk variances.
Tested with two single-value chunks, [400800490] and [400800400].
The merged variance is 3872. If treated as a single array with two values, the
variance is 3904, the same as numpy outputs.
So the current merging method is not numerically stable in extreme cases.
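The failure mode described here is characteristic of merging via per-chunk sums of squares and the textbook identity var = E[x^2] - (E[x])^2, which cancels catastrophically when the mean is huge relative to the spread. The sketch below is illustrative only: it is not Arrow's kernel, and the exact wrong value it produces differs from the 3872/3904 pair in this report, but it reproduces the same class of cancellation and contrasts it with a stable pairwise merge of (count, mean, M2) statistics:

```python
import numpy as np

# Values from the report: two chunks, each holding a single value.
chunks = [np.array([400800490.0]), np.array([400800400.0])]
data = np.concatenate(chunks)

# Unstable route: merge per-chunk sums and sums of squares, then apply
# var = E[x^2] - (E[x])^2.  The squares are near 1.6e17, where float64
# spacing is 32, so the small true variance is swamped by cancellation.
n = sum(len(c) for c in chunks)
total = sum(float(c.sum()) for c in chunks)
total_sq = sum(float((c * c).sum()) for c in chunks)
mean = total / n
naive_var = total_sq / n - mean * mean          # population variance, wrong

# Stable route (pairwise merge of moments): carry (count, mean, M2) per
# chunk, where M2 is the sum of squared deviations from the chunk mean.
def merge(a, b):
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n_ab = n_a + n_b
    delta = mean_b - mean_a
    mean_ab = mean_a + delta * n_b / n_ab
    m2_ab = m2_a + m2_b + delta * delta * n_a * n_b / n_ab
    return n_ab, mean_ab, m2_ab

stats = [(len(c), float(c.mean()), float(((c - c.mean()) ** 2).sum()))
         for c in chunks]
n_m, mean_m, m2_m = merge(stats[0], stats[1])
merged_var = m2_m / n_m                         # population variance

print(naive_var, merged_var, np.var(data))
```

On these inputs the moment-based merge matches numpy's two-pass answer (2025.0 as a population variance) exactly, while the sum-of-squares route is off by tens.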



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9962) [Python] Conversion to pandas with index column using fixed timezone fails

2020-10-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-9962.
--
Resolution: Fixed

Issue resolved by pull request 8162
[https://github.com/apache/arrow/pull/8162]

> [Python] Conversion to pandas with index column using fixed timezone fails
> --
>
> Key: ARROW-9962
> URL: https://issues.apache.org/jira/browse/ARROW-9962
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> From https://github.com/pandas-dev/pandas/issues/35997: it seems we are 
> handling a normal column and index column differently in the conversion to 
> pandas.
> {code}
> In [5]: import pandas as pd
>...: from datetime import datetime, timezone
>...: 
>...: df = pd.DataFrame([[datetime.now(timezone.utc), 
> datetime.now(timezone.utc)]], columns=['date_index', 'date_column'])
>...: table = pa.Table.from_pandas(df.set_index('date_index'))
>...: 
> In [6]: table
> Out[6]: 
> pyarrow.Table
> date_column: timestamp[ns, tz=+00:00]
> date_index: timestamp[ns, tz=+00:00]
> In [7]: table.to_pandas()
> ...
> UnknownTimeZoneError: '+00:00'
> {code}
> So this happens specifically for "fixed offset" timezones, and only for index 
> columns (e.g. {{table.select(["date_column"]).to_pandas()}} works fine).
> It seems this is because for columns we use our helper {{make_tz_aware}} to 
> convert the string "+01:00" to a python timezone, which is then understood by 
> pandas (the string is not handled by pandas). But for the index column we 
> fail to do this.
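The helper step described in the report (done for regular columns, missed for the index) boils down to parsing the fixed-offset string into a real {{datetime.timezone}} object before pandas sees it. A rough sketch of that conversion follows, with a hypothetical function name (pyarrow's actual helper is {{make_tz_aware}} and may parse differently):

```python
from datetime import timedelta, timezone

def fixed_offset_to_tzinfo(tz_string):
    """Turn a fixed-offset string such as '+01:00' into a datetime.timezone.

    pandas rejects the bare string (UnknownTimeZoneError: '+00:00') but
    accepts a real tzinfo object; this mimics the conversion applied to
    ordinary columns in the issue above.
    """
    sign = -1 if tz_string.startswith("-") else 1
    hours, minutes = (int(part) for part in tz_string.lstrip("+-").split(":"))
    return timezone(sign * timedelta(hours=hours, minutes=minutes))

print(fixed_offset_to_tzinfo("+01:00").utcoffset(None))  # -> 1:00:00
```

Applying the same conversion to index columns is what makes {{table.to_pandas()}} succeed for fixed-offset timezones.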



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-10260:


Assignee: (was: Bryan Cutler)

> [Python] Missing MapType to Pandas dtype
> ----------------------------------------
>
> Key: ARROW-10260
> URL: https://issues.apache.org/jira/browse/ARROW-10260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Map type conversion to Pandas added in ARROW-10151 is missing the dtype 
> mapping for {{to_pandas_dtype()}}.
>  
> {code:python}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3-...> in <module>
> ----> 1 d.to_pandas_dtype()
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()
> NotImplementedError: map<int64, double>
> {code}
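For reference, nested Arrow types (lists, structs) convert to NumPy's generic object dtype in pandas, and the fix amounts to giving map types the same fallback. A toy lookup illustrating the idea (hypothetical names; this is not pyarrow's internal table):

```python
import numpy as np

# Hypothetical type-name -> dtype table mirroring how nested Arrow types
# convert to pandas; ARROW-10260 effectively adds the "map" entry.
_nested_type_to_dtype = {
    "list": np.object_,
    "struct": np.object_,
    "map": np.object_,
}

def to_pandas_dtype(arrow_type_name):
    """Toy stand-in for DataType.to_pandas_dtype() on nested types."""
    try:
        return np.dtype(_nested_type_to_dtype[arrow_type_name])
    except KeyError:
        # Mirrors the NotImplementedError the real method raises for
        # types without a registered mapping.
        raise NotImplementedError(arrow_type_name)

print(to_pandas_dtype("map"))  # -> object
```

With the entry present, a map column round-trips to a pandas column of Python objects (lists of key/value tuples) instead of raising.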



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-10 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-10260:


Assignee: Bryan Cutler

> [Python] Missing MapType to Pandas dtype
> ----------------------------------------
>
> Key: ARROW-10260
> URL: https://issues.apache.org/jira/browse/ARROW-10260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Map type conversion to Pandas added in ARROW-10151 is missing the dtype 
> mapping for {{to_pandas_dtype()}}.
>  
> {code:python}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3-...> in <module>
> ----> 1 d.to_pandas_dtype()
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()
> NotImplementedError: map<int64, double>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10262) [C++] Some TypeClass in Scalar classes seem incorrect

2020-10-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10262:
---
Labels: pull-request-available  (was: )

> [C++] Some TypeClass in Scalar classes seem incorrect
> -----------------------------------------------------
>
> Key: ARROW-10262
> URL: https://issues.apache.org/jira/browse/ARROW-10262
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: RUOXI SUN
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The _TypeClass_ aliases in 
> _[BinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L217]_
>  and 
> _[LargeBinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L242]_
>  are defined as _BinaryScalar_ and _LargeBinaryScalar_ themselves. Are they supposed to be 
> _BinaryType_ and _LargeBinaryType_?
> I'm having issues when I use _TypeTrait_ on _ScalarType::TypeClass_ - the 
> compiler complains that the expected members are missing from the specialized 
> _TypeTrait<BinaryScalar>_ and _TypeTrait<LargeBinaryScalar>_.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10262) [C++] Some TypeClass in Scalar classes seem incorrect

2020-10-10 Thread RUOXI SUN (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RUOXI SUN updated ARROW-10262:
--
Description: 
The _TypeClass_ aliases in 
_[BinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L217]_
 and 
_[LargeBinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L242]_
 are defined as _BinaryScalar_ and _LargeBinaryScalar_ themselves. Are they supposed to be 
_BinaryType_ and _LargeBinaryType_?

I'm having issues when I use _TypeTrait_ on _ScalarType::TypeClass_ - the compiler 
complains that the expected members are missing from the specialized 
_TypeTrait<BinaryScalar>_ and _TypeTrait<LargeBinaryScalar>_.

  was:
Alias _TypeClass_ in 
_[BinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L217]_
 and 
_[LargeBinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L242]_
 are being _BinaryScalar_ and _LargeBinaryScalar_. Are they supposed to be 
_BinaryType_ and _LargeBinaryType_?

I'm having issues when I use _TypeTrait_ on _ScalarType::TypeClass_ - compiler 
complains that there are no whatever members in specialized 
_TypeTrait<BinaryScalar>_ class _TypeTrait<LargeBinaryScalar>_.


> [C++] Some TypeClass in Scalar classes seem incorrect
> -----------------------------------------------------
>
> Key: ARROW-10262
> URL: https://issues.apache.org/jira/browse/ARROW-10262
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: RUOXI SUN
>Priority: Minor
>
> Alias _TypeClass_ in 
> _[BinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L217]_
>  and 
> _[LargeBinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L242]_
>  are being _BinaryScalar_ and _LargeBinaryScalar_. Are they supposed to be 
> _BinaryType_ and _LargeBinaryType_?
> I'm having issues when I use _TypeTrait_ on _ScalarType::TypeClass_ - 
> compiler complains that there are no whatever members in specialized 
> _TypeTrait_ and _TypeTrait_.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10262) [C++] Some TypeClass in Scalar classes seem incorrect

2020-10-10 Thread RUOXI SUN (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RUOXI SUN updated ARROW-10262:
--
Description: 
Alias _TypeClass_ in 
_[BinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L217]_
 and 
_[LargeBinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L242]_
 are being _BinaryScalar_ and _LargeBinaryScalar_. Are they supposed to be 
_BinaryType_ and _LargeBinaryType_?

I'm having issues when I use _TypeTrait_ on _ScalarType::TypeClass_ - compiler 
complains that there are no whatever members in specialized 
_TypeTrait<BinaryScalar>_ class _TypeTrait<LargeBinaryScalar>_.

  was:
The `TypeClass` aliases in 
[BinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L217]
 and 
[LargeBinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L242]
 are defined as `BinaryScalar` and `LargeBinaryScalar` themselves. Are they supposed to be 
`BinaryType` and `LargeBinaryType`?

I'm having issues when I use `TypeTrait` on `ScalarType::TypeClass`.


> [C++] Some TypeClass in Scalar classes seem incorrect
> -----------------------------------------------------
>
> Key: ARROW-10262
> URL: https://issues.apache.org/jira/browse/ARROW-10262
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: RUOXI SUN
>Priority: Minor
>
> Alias _TypeClass_ in 
> _[BinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L217]_
>  and 
> _[LargeBinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L242]_
>  are being _BinaryScalar_ and _LargeBinaryScalar_. Are they supposed to be 
> _BinaryType_ and _LargeBinaryType_?
> I'm having issues when I use _TypeTrait_ on _ScalarType::TypeClass_ - 
> compiler complains that there are no whatever members in specialized 
> _TypeTrait<BinaryScalar>_ class _TypeTrait<LargeBinaryScalar>_.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10262) [C++] Some TypeClass in Scalar classes seem incorrect

2020-10-10 Thread RUOXI SUN (Jira)
RUOXI SUN created ARROW-10262:
-

 Summary: [C++] Some TypeClass in Scalar classes seem incorrect
 Key: ARROW-10262
 URL: https://issues.apache.org/jira/browse/ARROW-10262
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 1.0.1
Reporter: RUOXI SUN


The `TypeClass` aliases in 
[BinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L217]
 and 
[LargeBinaryScalar|https://github.com/apache/arrow/blob/master/cpp/src/arrow/scalar.h#L242]
 are defined as `BinaryScalar` and `LargeBinaryScalar` themselves. Are they supposed to be 
`BinaryType` and `LargeBinaryType`?

I'm having issues when I use `TypeTrait` on `ScalarType::TypeClass`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)