[jira] [Updated] (ARROW-8883) [Rust] [Integration Testing] Enable passing tests and update spec doc

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8883:
--
Summary: [Rust] [Integration Testing] Enable passing tests and update spec 
doc  (was: [Rust] [Integration Testing] Disable unsupported tests)

> [Rust] [Integration Testing] Enable passing tests and update spec doc
> -
>
> Key: ARROW-8883
> URL: https://issues.apache.org/jira/browse/ARROW-8883
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 0.17.0
>Reporter: Neville Dipale
>Priority: Major
>
> Some of the integration test failures can be avoided by disabling unsupported 
> tests, like large lists and nested types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9985) [C++][Parquet] Add bitmap based validity bitmap and nested reconstruction

2020-09-12 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-9985:
--

 Summary: [C++][Parquet] Add bitmap based validity bitmap and 
nested reconstruction
 Key: ARROW-9985
 URL: https://issues.apache.org/jira/browse/ARROW-9985
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield


For low levels of nesting this will likely be more performant than the existing 
rep/def level based reconstruction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8494) [C++] Implement vectorized array reassembly logic

2020-09-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-8494:


Assignee: Apache Arrow JIRA Bot  (was: Micah Kornfield)

> [C++] Implement vectorized array reassembly logic
> -
>
> Key: ARROW-8494
> URL: https://issues.apache.org/jira/browse/ARROW-8494
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Apache Arrow JIRA Bot
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This logic would attempt to create the data necessary for each field by 
> passing through the levels once for each field. It is expected that, due to 
> SIMD, this will perform better for nested data with shallow nesting, but due 
> to repetitive computation it might perform worse for deeply nested data that 
> includes List-types.
>  
> At a high level the logic would be structured as:
> {{for each field:}}
> {{   for each rep/def level entry:}}
> {{           update null bitmask and offsets.}}
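
Not the C++ implementation itself, but a minimal Rust sketch of the loop
structure described above may help; the level semantics are simplified
assumptions (a single repetition level, empty and null lists ignored):

{code}
// Sketch only: one pass over the def/rep levels per field, updating that
// field's validity bitmap and list offsets.
struct FieldLevels {
    max_def_level: i16, // definition level at which a value is non-null
}

fn reassemble_field(
    def_levels: &[i16],
    rep_levels: &[i16],
    field: &FieldLevels,
) -> (Vec<bool>, Vec<i32>) {
    let mut validity = Vec::with_capacity(def_levels.len());
    let mut offsets = vec![0i32]; // list offsets for one level of repetition

    for (&def, &rep) in def_levels.iter().zip(rep_levels) {
        if rep == 0 {
            // A repetition level of 0 starts a new row: open a new list.
            let last = *offsets.last().unwrap();
            offsets.push(last);
        }
        // Every level entry is an item slot in this simplified model.
        *offsets.last_mut().unwrap() += 1;
        // The item is non-null only if the definition level reached the maximum.
        validity.push(def >= field.max_def_level);
    }
    (validity, offsets)
}

fn main() {
    // Two rows: [a, null] and [b].
    let field = FieldLevels { max_def_level: 2 };
    let (validity, offsets) = reassemble_field(&[2, 1, 2], &[0, 1, 0], &field);
    assert_eq!(validity, vec![true, false, true]);
    assert_eq!(offsets, vec![0, 2, 3]);
}
{code}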



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8494) [C++] Implement vectorized array reassembly logic

2020-09-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-8494:


Assignee: Micah Kornfield  (was: Apache Arrow JIRA Bot)

> [C++] Implement vectorized array reassembly logic
> -
>
> Key: ARROW-8494
> URL: https://issues.apache.org/jira/browse/ARROW-8494
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This logic would attempt to create the data necessary for each field by 
> passing through the levels once for each field. It is expected that, due to 
> SIMD, this will perform better for nested data with shallow nesting, but due 
> to repetitive computation it might perform worse for deeply nested data that 
> includes List-types.
>  
> At a high level the logic would be structured as:
> {{for each field:}}
> {{   for each rep/def level entry:}}
> {{           update null bitmask and offsets.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8494) [C++] Implement vectorized array reassembly logic

2020-09-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8494:
--
Labels: pull-request-available  (was: )

> [C++] Implement vectorized array reassembly logic
> -
>
> Key: ARROW-8494
> URL: https://issues.apache.org/jira/browse/ARROW-8494
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This logic would attempt to create the data necessary for each field by 
> passing through the levels once for each field. It is expected that, due to 
> SIMD, this will perform better for nested data with shallow nesting, but due 
> to repetitive computation it might perform worse for deeply nested data that 
> includes List-types.
>  
> At a high level the logic would be structured as:
> {{for each field:}}
> {{   for each rep/def level entry:}}
> {{           update null bitmask and offsets.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9984) [Rust] [DataFusion] DRY of function to string

2020-09-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9984:


Assignee: Apache Arrow JIRA Bot  (was: Jorge)

> [Rust] [DataFusion] DRY of function to string
> -
>
> Key: ARROW-9984
> URL: https://issues.apache.org/jira/browse/ARROW-9984
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Apache Arrow JIRA Bot
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9984) [Rust] [DataFusion] DRY of function to string

2020-09-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9984:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] DRY of function to string
> -
>
> Key: ARROW-9984
> URL: https://issues.apache.org/jira/browse/ARROW-9984
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9984) [Rust] [DataFusion] DRY of function to string

2020-09-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9984:


Assignee: Jorge  (was: Apache Arrow JIRA Bot)

> [Rust] [DataFusion] DRY of function to string
> -
>
> Key: ARROW-9984
> URL: https://issues.apache.org/jira/browse/ARROW-9984
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9984) [Rust] [DataFusion] DRY of function to string

2020-09-12 Thread Jorge (Jira)
Jorge created ARROW-9984:


 Summary: [Rust] [DataFusion] DRY of function to string
 Key: ARROW-9984
 URL: https://issues.apache.org/jira/browse/ARROW-9984
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7288) [C++][R] read_parquet() freezes on Windows with Japanese locale

2020-09-12 Thread Hiroaki Yutani (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194918#comment-17194918
 ] 

Hiroaki Yutani commented on ARROW-7288:
---

> To be clear, I believe the issue is in the parquet C++ library as compiled 
> with the Rtools 35 toolchain. I don't recall if I've tried with Rtools 40.

FWIW, I quickly tried the CRAN precompiled binary for R 4.0.2, which I believe 
uses Rtools 40, but the code still seems to freeze with the Japanese locale (I 
haven't used Windows for a while, so there is a chance my setup is not 
correct).

> [C++][R] read_parquet() freezes on Windows with Japanese locale
> ---
>
> Key: ARROW-7288
> URL: https://issues.apache.org/jira/browse/ARROW-7288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.15.1
> Environment: R 3.6.1 on Windows 10
>Reporter: Hiroaki Yutani
>Priority: Critical
>  Labels: parquet
> Fix For: 2.0.0
>
>
> The following example on read_parquet()'s doc freezes (seems to wait for the 
> result forever) on my Windows.
> df <- read_parquet(system.file("v0.7.1.parquet", package="arrow"))
> The CRAN checks are all fine, which means the example is successfully 
> executed on the CRAN Windows. So, I have no idea why it doesn't work on my 
> local.
> [https://cran.r-project.org/web/checks/check_results_arrow.html]
> Here's my session info in case it helps:
> {code:java}
> > sessioninfo::session_info()
> - Session info 
> -
>  setting  value
>  version  R version 3.6.1 (2019-07-05)
>  os   Windows 10 x64
>  system   x86_64, mingw32
>  ui   RStudio
>  language en
>  collate  Japanese_Japan.932
>  ctypeJapanese_Japan.932
>  tz   Asia/Tokyo
>  date 2019-12-01
> - Packages 
> -
>  package     * version  date       lib source
>  arrow       * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.1)
>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 3.6.0)
>  bit           1.1-14   2018-05-29 [1] CRAN (R 3.6.0)
>  bit64         0.9-7    2017-05-08 [1] CRAN (R 3.6.0)
>  cli           1.1.0    2019-03-19 [1] CRAN (R 3.6.0)
>  crayon        1.3.4    2017-09-16 [1] CRAN (R 3.6.0)
>  fs            1.3.1    2019-05-06 [1] CRAN (R 3.6.0)
>  glue          1.3.1    2019-03-12 [1] CRAN (R 3.6.0)
>  magrittr      1.5      2014-11-22 [1] CRAN (R 3.6.0)
>  purrr         0.3.3    2019-10-18 [1] CRAN (R 3.6.1)
>  R6            2.4.1    2019-11-12 [1] CRAN (R 3.6.1)
>  Rcpp          1.0.3    2019-11-08 [1] CRAN (R 3.6.1)
>  reprex        0.3.0    2019-05-16 [1] CRAN (R 3.6.0)
>  rlang         0.4.2    2019-11-23 [1] CRAN (R 3.6.1)
>  rstudioapi    0.10     2019-03-19 [1] CRAN (R 3.6.0)
>  sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 3.6.0)
>  tidyselect    0.2.5    2018-10-11 [1] CRAN (R 3.6.0)
>  withr         2.1.2    2018-03-15 [1] CRAN (R 3.6.0)
> [1] C:/Users/hiroaki-yutani/Documents/R/win-library/3.6
> [2] C:/Program Files/R/R-3.6.1/library
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9937) [Rust] [DataFusion] Average is not correct

2020-09-12 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge reassigned ARROW-9937:


Assignee: Jorge

> [Rust] [DataFusion] Average is not correct
> --
>
> Key: ARROW-9937
> URL: https://issues.apache.org/jira/browse/ARROW-9937
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> The current design of aggregates makes the calculation of the average 
> incorrect.
> Namely, if there are multiple input partitions, the result is the average of 
> the averages. For example, if the input is in two batches {{[1,2]}} and 
> {{[3,4,5]}}, datafusion will say "average=3.25" rather than "average=3".
>  It also makes it impossible to compute the [geometric 
> mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
> operations.
> The central issue is that Accumulator returns a `ScalarValue` during partial 
> aggregations via {{get_value}}, but very often a `ScalarValue` is not 
> sufficient information to perform the full aggregation.
> A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
> distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current 
> calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
> reduces them using another average, i.e. {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}, 
> which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.
> I believe that our Accumulators need to pass more information from the 
> partial aggregations to the final aggregation.
> We could consider taking an API equivalent to 
> [spark]([https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html]),
>  i.e. have an `update`, a `merge` and an `evaluate`.
> Code with a failing test ({{src/execution/context.rs}})
> {code:java}
> #[test]
> fn simple_avg() -> Result<()> {
>     let schema = Schema::new(vec![
>         Field::new("a", DataType::Int32, false),
>     ]);
>     let batch1 = RecordBatch::try_new(
>         Arc::new(schema.clone()),
>         vec![
>             Arc::new(Int32Array::from(vec![1, 2, 3])),
>         ],
>     )?;
>     let batch2 = RecordBatch::try_new(
>         Arc::new(schema.clone()),
>         vec![
>             Arc::new(Int32Array::from(vec![4, 5])),
>         ],
>     )?;
>     let mut ctx = ExecutionContext::new();
>     let provider = MemTable::new(Arc::new(schema), vec![vec![batch1], vec![batch2]])?;
>     ctx.register_table("t", Box::new(provider));
>     let result = collect(&mut ctx, "SELECT AVG(a) FROM t")?;
>     let batch = &result[0];
>     assert_eq!(1, batch.num_columns());
>     assert_eq!(1, batch.num_rows());
>     let values = batch
>         .column(0)
>         .as_any()
>         .downcast_ref::<Float64Array>()
>         .expect("failed to cast version");
>     assert_eq!(values.len(), 1);
>     // avg(1,2,3,4,5) = 3.0
>     assert_eq!(values.value(0), 3.0_f64);
>     Ok(())
> }
> {code}
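
Not DataFusion's actual trait, but a minimal sketch of the proposed
update/merge/evaluate shape for AVG, carrying (sum, count) between the partial
and the final aggregation (all names and types here are assumptions):

{code}
// Sketch only: the accumulator keeps enough state (sum, count) for a correct
// final merge, instead of handing back a single partial average.
#[derive(Default, Clone, Copy)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    // Called once per input value within a partition.
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    // Called once per partial accumulator when combining partitions.
    fn merge(&mut self, other: &AvgAccumulator) {
        self.sum += other.sum;
        self.count += other.count;
    }

    // Called once at the end to produce the final scalar.
    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}

fn main() {
    // Partitions [1, 2] and [3, 4, 5] now merge to the correct average of 3.0.
    let mut parts = [AvgAccumulator::default(), AvgAccumulator::default()];
    for &v in [1.0_f64, 2.0].iter() { parts[0].update(v); }
    for &v in [3.0_f64, 4.0, 5.0].iter() { parts[1].update(v); }

    let mut final_acc = AvgAccumulator::default();
    for p in parts.iter() { final_acc.merge(p); }
    assert_eq!(final_acc.evaluate(), Some(3.0));
}
{code}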



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-9918) [Rust] [DataFusion] Favor "from" to create arrays for improved performance

2020-09-12 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge closed ARROW-9918.

Resolution: Won't Fix

While this is true, in general we should use buffers for operations, as they 
increase performance even further.

> [Rust] [DataFusion] Favor "from" to create arrays for improved performance
> --
>
> Key: ARROW-9918
> URL: https://issues.apache.org/jira/browse/ARROW-9918
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>
> [~yordan-pavlov]  hinted in https://issues.apache.org/jira/browse/ARROW-8908 
> that there is a performance hit when using array builders.
> Indeed, I get about 15% speedup by using {{From}} instead of a builder in 
> DataFusion's math operations, and this number _increases with batch size_, 
> which is detrimental since larger batch sizes leverage SIMD better.
> In {{sqrt_X_Y}} below, X is the log2 of the array's size and Y is the log2 of 
> the batch size:
>  
> {code:java}
> sqrt_20_12  time:   [34.422 ms 34.503 ms 34.584 ms]   
> 
> change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05)
> Performance has improved.
> sqrt_22_12  time:   [150.13 ms 150.79 ms 151.42 ms]   
> 
> change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05)
> Performance has improved.
> sqrt_22_14  time:   [151.45 ms 151.68 ms 151.90 ms]   
> 
> change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05)
> Performance has improved.
> {code}
> See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} 
> (same batch size), but a measurable difference between {{sqrt_22_12}} and 
> {{sqrt_22_14}} (different batch size).
> This issue proposes that we migrate our DataFusion and Rust operations from 
> builder-based array construction to non-builder-based construction whenever 
> possible.
>  
> Most code can easily be migrated to use the {{From}} trait; a migration is 
> along the lines of
>  
> {code:java}
> let mut builder = Float64Builder::new(array.len());
> for i in 0..array.len() {
>     if array.is_null(i) {
>         builder.append_null()?;
>     } else {
>         builder.append_value(array.value(i).$FUNC())?;
>     }
> }
> Ok(Arc::new(builder.finish()))
> {code}
>  to
>  
> {code:java}
> let result: Float64Array = (0..array.len())
>     .map(|i| {
>         if array.is_null(i) {
>             None
>         } else {
>             Some(array.value(i).$FUNC())
>         }
>     })
>     .collect::<Vec<Option<f64>>>()
>     .into();
> Ok(Arc::new(result)){code}
> and this is even simpler as we do not need to use an extra API (Builder).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9980) [Rust] Fix parquet crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9980.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8173
[https://github.com/apache/arrow/pull/8173]

> [Rust] Fix parquet crate clippy lints
> -
>
> Key: ARROW-9980
> URL: https://issues.apache.org/jira/browse/ARROW-9980
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This addresses most clippy lints on the parquet crate. Other remaining lints 
> can be addressed as part of future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194853#comment-17194853
 ] 

Wes McKinney commented on ARROW-9924:
-

Think I found the problem. I expanded the chunk size to 10M so there is a 
single chunk in both cases and:

{code}
In [1]: %time a = pq.read_table('test.parquet', use_legacy_dataset=False)   
  
CPU times: user 1.5 s, sys: 2.59 s, total: 4.08 s
Wall time: 4.09 s

In [2]: %time a = pq.read_table('test.parquet', use_legacy_dataset=True)
  
CPU times: user 3.49 s, sys: 5.28 s, total: 8.77 s
Wall time: 1.64 s
{code}

Digging deeper, another problem is that column decoding is not being 
parallelized when using the Datasets API, whereas it is when you use 
{{FileReader::ReadTable}}. This is likely an artifact of the fact that we have 
not yet tackled the nested parallelism problem in the Datasets API. It's too 
bad that our users are now suffering the consequences of this.

So there are two problems here:

* 32K is too small of a default batch size for quickly reading files into 
memory. I suggest setting it to ~256K or ~1M rows per batch
* Parquet row group deserialization is not being parallelized at the column 
level in {{parquet::arrow::FileReader::GetRecordBatchReader}}

The band-aid solution to this problem will be to use the old code path in 
{{parquet.read_table}} when no special Datasets features are needed, but these 
two issues do need to be fixed. 

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> 
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})  
>   
>   
> In [28]: pq.write_table(pa.table(df), 'test.parquet') 
>   
>   
> In [29]: timeit pq.read_table('test.parquet') 
>   
>   
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
>   
>   
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194843#comment-17194843
 ] 

Wes McKinney commented on ARROW-9924:
-

I found that with a larger dataset that has more columns, the slowdown is much worse.

Setup:

{code}
df = pd.DataFrame({'f{0}'.format(i): np.random.randn(1000) for i in 
range(50)})
pq.write_table(pa.table(df), 'test.parquet')
{code}

So we have:

{code}
In [1]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)   
  
843 ms ± 40 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: timeit pq.read_table('test.parquet', use_legacy_dataset=False)  
  
4.53 s ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> 
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})  
>   
>   
> In [28]: pq.write_table(pa.table(df), 'test.parquet') 
>   
>   
> In [29]: timeit pq.read_table('test.parquet') 
>   
>   
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
>   
>   
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9983) [C++][Dataset][Python] Use larger default batch size than 32K for Datasets API

2020-09-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9983:

Labels: dataset  (was: )

> [C++][Dataset][Python] Use larger default batch size than 32K for Datasets API
> --
>
> Key: ARROW-9983
> URL: https://issues.apache.org/jira/browse/ARROW-9983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset
> Fix For: 2.0.0
>
>
> Dremio uses 64K batch sizes. We could probably get away with even larger 
> batch sizes (e.g. 256K or 1M) and allow memory-constrained users to elect a 
> smaller batch size. 
> See example of some performance issues related to this in ARROW-9924



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9983) [C++][Dataset][Python] Use larger default batch size than 32K for Datasets API

2020-09-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9983:

Summary: [C++][Dataset][Python] Use larger default batch size than 32K for 
Datasets API  (was: [C++][Dataset] Use larger default batch size than 32K for 
Datasets API)

> [C++][Dataset][Python] Use larger default batch size than 32K for Datasets API
> --
>
> Key: ARROW-9983
> URL: https://issues.apache.org/jira/browse/ARROW-9983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> Dremio uses 64K batch sizes. We could probably get away with even larger 
> batch sizes (e.g. 256K or 1M) and allow memory-constrained users to elect a 
> smaller batch size. 
> See example of some performance issues related to this in ARROW-9924



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9983) [C++][Dataset] Use larger default batch size than 32K for Datasets API

2020-09-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9983:
---

 Summary: [C++][Dataset] Use larger default batch size than 32K for 
Datasets API
 Key: ARROW-9983
 URL: https://issues.apache.org/jira/browse/ARROW-9983
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 2.0.0


Dremio uses 64K batch sizes. We could probably get away with even larger batch 
sizes (e.g. 256K or 1M) and allow memory-constrained users to elect a smaller 
batch size. 

See example of some performance issues related to this in ARROW-9924



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194835#comment-17194835
 ] 

Wes McKinney commented on ARROW-8394:
-

Don't think anyone is looking at it. 

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but the TypeScript 
> compiler throws the following errors in some of arrow's .d.ts files:
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9982) IterableArrayLike should support map

2020-09-12 Thread Dominik Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Moritz updated ARROW-9982:
--
Priority: Minor  (was: Major)

> IterableArrayLike should support map
> 
>
> Key: ARROW-9982
> URL: https://issues.apache.org/jira/browse/ARROW-9982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Minor
>
> `table.toArray()` returns an `IterableArrayLike` and I would like to be able 
> to `map` a function to it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9982) IterableArrayLike should support map

2020-09-12 Thread Dominik Moritz (Jira)
Dominik Moritz created ARROW-9982:
-

 Summary: IterableArrayLike should support map
 Key: ARROW-9982
 URL: https://issues.apache.org/jira/browse/ARROW-9982
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Dominik Moritz


`table.toArray()` returns an `IterableArrayLike` and I would like to be able to 
`map` a function to it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194834#comment-17194834
 ] 

Wes McKinney commented on ARROW-9924:
-

I took a look into this since I was curious what's wrong.

So I don't know about this:

{code}
In [10]: a = pq.read_table('test.parquet', use_legacy_dataset=True)

In [11]: b = pq.read_table('test.parquet', use_legacy_dataset=False)

In [12]: a[0].num_chunks
Out[12]: 1

In [13]: b[0].num_chunks
Out[13]: 306
{code}

Looking at the top of the hierarchical perf report for the "new" code, the 
deeply nested layers of iterators strike me as one thing to think more about: 
is that the design we want?

https://gist.github.com/wesm/3e3eeb6b7f5f22650f18e69e206c2eb8

I think the Datasets API may need to make a wiser decision about how to read a 
file based on the declared intent of the user. If the user calls {{ToTable}}, 
then I don't think it makes sense to break the problem up into so many small 
tasks -- perhaps the default chunk size should be larger than it is (so that 
streaming readers who are concerned about memory use can shrink the chunksize 
to something smaller)? 

Another question: why are ProjectRecordBatch and FilterRecordBatch being used? 
Nothing is being projected or filtered. 

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> 
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})  
>   
>   
> In [28]: pq.write_table(pa.table(df), 'test.parquet') 
>   
>   
> In [29]: timeit pq.read_table('test.parquet') 
>   
>   
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
>   
>   
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9974) [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset

2020-09-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9974:

Summary: [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception 
while reading large number of files using ParquetDataset  (was: t)

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Priority: Major
>  Labels: dataset
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 
> 0.13.0 to the latest version 1.0.1 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with the older version. My machine has 256 GB of memory, way more 
> than enough to load the data, which requires < 10 GB. You can use the code 
> below to reproduce the issue on your side.
> {code}
> # create a big dataframe
> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
>     df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> # use a fresh session to read data
> # below code works to read
> import pandas as pd
> df = []
> for i in range(5000):
>     df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> import pyarrow.parquet as pq
> fnames = []
> for i in range(5000):
>     fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9974) [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset

2020-09-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9974:

Priority: Critical  (was: Major)

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Priority: Critical
>  Labels: dataset
> Fix For: 2.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 
> 0.13.0 to the latest version 1.0.1 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with the older version. My machine has 256 GB of memory, way more 
> than enough to load the data, which requires < 10 GB. You can use the code 
> below to reproduce the issue on your side.
> {code}
> # create a big dataframe
> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
>     df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> # use a fresh session to read data
> # below code works to read
> import pandas as pd
> df = []
> for i in range(5000):
>     df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> import pyarrow.parquet as pq
> fnames = []
> for i in range(5000):
>     fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9974) [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset

2020-09-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9974:

Fix Version/s: 2.0.0

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Priority: Major
>  Labels: dataset
> Fix For: 2.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 
> 0.13.0 to the latest version 1.0.1 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with the older version. My machine has 256 GB of memory, way more 
> than enough to load the data, which requires < 10 GB. You can use the code 
> below to reproduce the issue on your side.
> {code}
> # create a big dataframe
> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
>     df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> # use a fresh session to read data
> # below code works to read
> import pandas as pd
> df = []
> for i in range(5000):
>     df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> import pyarrow.parquet as pq
> fnames = []
> for i in range(5000):
>     fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9980) [Rust] Fix parquet crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9980:
-

Assignee: Neville Dipale

> [Rust] Fix parquet crate clippy lints
> -
>
> Key: ARROW-9980
> URL: https://issues.apache.org/jira/browse/ARROW-9980
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This addresses most clippy lints on the parquet crate. Other remaining lints 
> can be addressed as part of future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9981) [Rust] Allow configuring flight IPC with IpcWriteOptions

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9981:
-

 Summary: [Rust] Allow configuring flight IPC with IpcWriteOptions
 Key: ARROW-9981
 URL: https://issues.apache.org/jira/browse/ARROW-9981
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


We have introduced {{IpcWriteOptions}}, but we use the default for the 
arrow-flight crate, which is not ideal. Change this to allow configuring the 
writer options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9961) [Rust][DataFusion] to_timestamp function parses timestamp without timezone offset as UTC rather than local

2020-09-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9961.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8161
[https://github.com/apache/arrow/pull/8161]

> [Rust][DataFusion] to_timestamp function parses timestamp without timezone 
> offset as UTC rather than local
> --
>
> Key: ARROW-9961
> URL: https://issues.apache.org/jira/browse/ARROW-9961
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> ARROW-9944 added a TO_TIMESTAMP function that supports parsing timestamps 
> without a specified timezone, such as {{2020-09-08T13:42:29.190855}}
> Such timestamps are supposed to be interpreted as in the local timezone, but 
> instead are interpreted as UTC. 
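
A minimal chrono-based illustration of the distinction (this is only a sketch
of the semantics, not the DataFusion code path):

{code}
use chrono::{Local, NaiveDateTime, TimeZone, Utc};

fn main() {
    // A timestamp string without any timezone offset, as in the issue.
    let naive = NaiveDateTime::parse_from_str(
        "2020-09-08T13:42:29.190855",
        "%Y-%m-%dT%H:%M:%S%.f",
    )
    .unwrap();

    // Interpreted as UTC: what the current implementation does.
    let as_utc = Utc.from_utc_datetime(&naive);
    // Interpreted as the machine's local timezone: what this issue asks for.
    let as_local = Local.from_local_datetime(&naive).unwrap();

    // The two epoch values differ by the local UTC offset (zero only if the
    // machine's timezone happens to be UTC).
    println!("as UTC:   {}", as_utc.timestamp());
    println!("as local: {}", as_local.timestamp());
}
{code}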



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9954) [Rust] [DataFusion] Simplify code of aggregate planning

2020-09-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9954.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8155
[https://github.com/apache/arrow/pull/8155]

> [Rust] [DataFusion] Simplify code of aggregate planning
> ---
>
> Key: ARROW-9954
> URL: https://issues.apache.org/jira/browse/ARROW-9954
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry

2020-09-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9950.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8144
[https://github.com/apache/arrow/pull/8144]

> [Rust] [DataFusion] Allow UDF usage without registry
> 
>
> Key: ARROW-9950
> URL: https://issues.apache.org/jira/browse/ARROW-9950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This functionality is relevant only for the DataFrame API.
> Sometimes a UDF is declared during planning, and it is much more expressive 
> when the user does not have to access the registry at all to plan the UDF.
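
Not DataFusion's API, but a rough sketch of the idea (all names here are
hypothetical): if the logical expression owns the UDF implementation, planning
and evaluating it needs no lookup in a function registry.

{code}
use std::sync::Arc;

type ScalarFn = Arc<dyn Fn(f64) -> f64 + Send + Sync>;

#[derive(Clone)]
struct UdfExpr {
    name: String,
    fun: ScalarFn, // the implementation travels with the expression
}

impl UdfExpr {
    // Evaluation needs nothing beyond the expression itself.
    fn evaluate(&self, input: &[f64]) -> Vec<f64> {
        input.iter().map(|v| (self.fun)(*v)).collect()
    }
}

fn main() {
    let sqrt_udf = UdfExpr {
        name: "my_sqrt".to_string(),
        fun: Arc::new(|v: f64| v.sqrt()),
    };
    // Used directly in a DataFrame-style expression, no registry access required.
    assert_eq!(sqrt_udf.evaluate(&[1.0, 4.0, 9.0]), vec![1.0, 2.0, 3.0]);
    println!("{} ok", sqrt_udf.name);
}
{code}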



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-9979.
---
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8168
[https://github.com/apache/arrow/pull/8168]

> [Rust] Fix arrow crate clippy lints
> ---
>
> Key: ARROW-9979
> URL: https://issues.apache.org/jira/browse/ARROW-9979
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This fixes many clippy lints, but not all. It takes hours to address lints, 
> and we can work on the remaining ones in future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9974) t

2020-09-12 Thread Ashish Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194764#comment-17194764
 ] 

Ashish Gupta commented on ARROW-9974:
-

1) Please find attached full traceback of both cases [^legacy_false.txt]

 

2) I started removing columns one by one, and below is the smallest dataframe 
where adding the last column F11 causes both scenarios (use_legacy_dataset set 
to True and to False) to throw the error. Interestingly, use_legacy_dataset=True 
starts crashing after the first 5 columns. If you have more than 8 GB of RAM 
you should be able to test it.
{code:java}
# create a big dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.arange(5000)})
df['F1'] = np.random.randn(5000)
df['F2'] = np.random.randn(5000)
df['F3'] = np.random.randn(5000)
df['F4'] = 'ABCDEFGH'
df['F5'] = 'ABCDEFGH'
df['F6'] = 'ABCDEFGH'
df['F7'] = 'ABCDEFGH'
df['F8'] = 'ABCDEFGH'
df['F9'] = 'ABCDEFGH'
df['F10'] = 'ABCDEFGH'
# df['F11'] = 'ABCDEFGH'

{code}
 

 

 

> t
> -
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Priority: Major
>  Labels: dataset
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 
> 0.13.0 to the latest version 1.0.1 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with the older version. My machine has 256 GB of memory, way more 
> than enough to load the data, which requires < 10 GB. You can use the code 
> below to reproduce the issue on your side.
> {code}
> # create a big dataframe
> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
>     df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> # use a fresh session to read data
> # below code works to read
> import pandas as pd
> df = []
> for i in range(5000):
>     df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> import pyarrow.parquet as pq
> fnames = []
> for i in range(5000):
>     fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9974) t

2020-09-12 Thread Ashish Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Gupta updated ARROW-9974:

Attachment: legacy_true.txt
legacy_false.txt

> t
> -
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Priority: Major
>  Labels: dataset
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 
> 0.13.0 to the latest version 1.0.1 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with the older version. My machine has 256 GB of memory, way more 
> than enough to load the data, which requires < 10 GB. You can use the code 
> below to reproduce the issue on your side.
> {code}
> # create a big dataframe
> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
>     df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> # use a fresh session to read data
> # below code works to read
> import pandas as pd
> df = []
> for i in range(5000):
>     df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> import pyarrow.parquet as pq
> fnames = []
> for i in range(5000):
>     fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9848) [Rust] Implement changes to ensure flatbuffer alignment

2020-09-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9848:
--
Labels: pull-request-available  (was: )

> [Rust] Implement changes to ensure flatbuffer alignment
> ---
>
> Key: ARROW-9848
> URL: https://issues.apache.org/jira/browse/ARROW-9848
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See ARROW-6313, changes were made to all IPC implementations except for Rust



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9980) [Rust] Fix parquet crate clippy lints

2020-09-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9980:
--
Labels: pull-request-available  (was: )

> [Rust] Fix parquet crate clippy lints
> -
>
> Key: ARROW-9980
> URL: https://issues.apache.org/jira/browse/ARROW-9980
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This addresses most clippy lints on the parquet crate. Other remaining lints 
> can be addressed as part of future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9980) [Rust] Fix parquet crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9980:
-

 Summary: [Rust] Fix parquet crate clippy lints
 Key: ARROW-9980
 URL: https://issues.apache.org/jira/browse/ARROW-9980
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This addresses most clippy lints on the parquet crate. Other remaining lints 
can be addressed as part of future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-9979:


Assignee: Neville Dipale  (was: Apache Arrow JIRA Bot)

> [Rust] Fix arrow crate clippy lints
> ---
>
> Key: ARROW-9979
> URL: https://issues.apache.org/jira/browse/ARROW-9979
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This fixes many clippy lints, but not all. It takes hours to address lints, 
> and we can work on the remaining ones in future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9979:
--
Labels: pull-request-available  (was: )

> [Rust] Fix arrow crate clippy lints
> ---
>
> Key: ARROW-9979
> URL: https://issues.apache.org/jira/browse/ARROW-9979
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This fixes many clippy lints, but not all. It takes hours to address lints, 
> and we can work on the remaining ones in future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9979:
-

Assignee: Neville Dipale

> [Rust] Fix arrow crate clippy lints
> ---
>
> Key: ARROW-9979
> URL: https://issues.apache.org/jira/browse/ARROW-9979
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This fixes many clippy lints, but not all. It takes hours to address lints, 
> and we can work on the remaining ones in future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9979:
-

 Summary: [Rust] Fix arrow crate clippy lints
 Key: ARROW-9979
 URL: https://issues.apache.org/jira/browse/ARROW-9979
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This fixes many clippy lints, but not all. It takes hours to address lints, 
and we can work on the remaining ones in future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9296) [CI][Rust] Enable more clippy lint checks

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9296:
--
Parent: ARROW-9978
Issue Type: Sub-task  (was: Improvement)

> [CI][Rust] Enable more clippy lint checks
> -
>
> Key: ARROW-9296
> URL: https://issues.apache.org/jira/browse/ARROW-9296
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Continuous Integration, Rust
>Reporter: Krisztian Szucs
>Priority: Major
>
> Currently only {{clippy::redundant_field_names}} is allowed, so we should 
> incrementally extend the list of enabled lints.
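
For reference, a minimal sketch of the two usual knobs for scoping clippy
lints; whether Arrow's CI uses the attribute form or command-line flags is an
assumption here, not taken from this issue:

{code}
// Crate-level inner attribute at the top of lib.rs: allows one lint everywhere.
#![allow(clippy::redundant_field_names)]

fn main() {
    // Equivalent command-line form when invoking clippy, denying everything else:
    //   cargo clippy -- -A clippy::redundant_field_names -D warnings
    println!("clippy lint scoping sketch");
}
{code}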



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9338) [Rust] Add instructions for running clippy locally

2020-09-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9338:
--
Parent: ARROW-9978
Issue Type: Sub-task  (was: Improvement)

> [Rust] Add instructions for running clippy locally
> --
>
> Key: ARROW-9338
> URL: https://issues.apache.org/jira/browse/ARROW-9338
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Paddy Horan
>Priority: Minor
>
> Similar to the "Code Formatting" section in the top-level README, it would be 
> useful to add instructions for running clippy locally to avoid wasted CI time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9978) [Rust] Umbrella issue for clippy integration

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9978:
-

 Summary: [Rust] Umbrella issue for clippy integration
 Key: ARROW-9978
 URL: https://issues.apache.org/jira/browse/ARROW-9978
 Project: Apache Arrow
  Issue Type: New Feature
  Components: CI, Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This is an umbrella issue to collate outstanding and new tasks to enable clippy 
integration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9937) [Rust] [DataFusion] Average is not correct

2020-09-12 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-9937:
---
Description: 
The current design of aggregates makes the calculation of the average incorrect.

Namely, if there are multiple input partitions, the result is the average of the 
averages. For example, if the input is in two batches {{[1,2]}} and {{[3,4,5]}}, 
DataFusion will say "average=3.25" rather than "average=3".

 It also makes it impossible to compute the [geometric 
mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
operations.

The central issue is that Accumulator returns a `ScalarValue` during partial 
aggregations via {{get_value}}, but very often a `ScalarValue` is not 
sufficient information to perform the full aggregation.

A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
distributed in batches of 2, {{[x1, x2], [x3, x4], [x5]}}. Our current 
calculation performs partial means, {{(x1+x2)/2, (x3+x4)/2, x5}}, and then 
reduces them using another average, i.e. {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}, 
which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.

I believe that our Accumulators need to pass more information from the partial 
aggregations to the final aggregation.

We could consider taking an API equivalent to 
[spark]([https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html]), 
i.e. have an `update`, a `merge` and an `evaluate`.
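
As a rough sketch of that idea (hypothetical types and method names, not 
DataFusion's actual Accumulator API), the partial state for AVG would be a 
(sum, count) pair, so merging partitions stays exact:

{code:java}
// Hypothetical update/merge/evaluate accumulator for AVG; a sketch only,
// not DataFusion's actual trait. The partial state is (sum, count), so
// combining partitions does not average the partial averages.
#[derive(Default)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    // Fold one input value from this partition into the partial state.
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    // Combine the partial state produced by another partition.
    fn merge(&mut self, other: &AvgAccumulator) {
        self.sum += other.sum;
        self.count += other.count;
    }

    // Produce the final value once all partial states are merged.
    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 { None } else { Some(self.sum / self.count as f64) }
    }
}

fn main() {
    // Batches [1, 2] and [3, 4, 5]: the merged result is 3.0, not 3.25.
    let mut a = AvgAccumulator::default();
    [1.0, 2.0].iter().for_each(|v| a.update(*v));
    let mut b = AvgAccumulator::default();
    [3.0, 4.0, 5.0].iter().for_each(|v| b.update(*v));
    a.merge(&b);
    assert_eq!(a.evaluate(), Some(3.0));
}
{code}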

Code with a failing test ({{src/execution/context.rs}})
{code:java}
#[test]
fn simple_avg() -> Result<()> {
let schema = Schema::new(vec![
Field::new("a", DataType::Int32, false),
]);

let batch1 = RecordBatch::try_new(
Arc::new(schema.clone()),
vec![
Arc::new(Int32Array::from(vec![1, 2, 3])),
],
)?;
let batch2 = RecordBatch::try_new(
Arc::new(schema.clone()),
vec![
Arc::new(Int32Array::from(vec![4, 5])),
],
)?;

let mut ctx = ExecutionContext::new();

let provider = MemTable::new(Arc::new(schema), vec![vec![batch1], 
vec![batch2]])?;
ctx.register_table("t", Box::new(provider));

let result = collect(&mut ctx, "SELECT AVG(a) FROM t")?;

let batch = &result[0];
assert_eq!(1, batch.num_columns());
assert_eq!(1, batch.num_rows());

let values = batch
.column(0)
.as_any()
.downcast_ref::<Float64Array>()
.expect("failed to cast version");
assert_eq!(values.len(), 1);
// avg(1,2,3,4,5) = 3.0
assert_eq!(values.value(0), 3.0_f64);
Ok(())
}
{code}

  was:
The current design of aggregates makes the calculation of the average incorrect.
 It also makes it impossible to compute the [geometric 
mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
operations.

The central issue is that Accumulator returns a `ScalarValue` during partial 
aggregations via {{get_value}}, but very often a `ScalarValue` is not 
sufficient information to perform the full aggregation.

A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
distributed in batches of 2, {{[x1, x2], [x3, x4], [x5]}}. Our current 
calculation performs partial means, {{(x1+x2)/2, (x3+x4)/2, x5}}, and then 
reduces them using another average, i.e. {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}, 
which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.

I believe that our Accumulators need to pass more information from the partial 
aggregations to the final aggregation.

We could consider taking an API equivalent to 
[spark]([https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html]), 
i.e. have an `update`, a `merge` and an `evaluate`.

Code with a failing test ({{src/execution/context.rs}})
{code:java}
#[test]
fn simple_avg() -> Result<()> {
let schema = Schema::new(vec![
Field::new("a", DataType::Int32, false),
]);

let batch1 = RecordBatch::try_new(
Arc::new(schema.clone()),
vec![
Arc::new(Int32Array::from(vec![1, 2, 3])),
],
)?;
let batch2 = RecordBatch::try_new(
Arc::new(schema.clone()),
vec![
Arc::new(Int32Array::from(vec![4, 5])),
],
)?;

let mut ctx = ExecutionContext::new();

let provider = MemTable::new(Arc::new(schema), vec![vec![batch1], 
vec![batch2]])?;
ctx.register_table("t", Box::new(provider));

let result = collect(&mut ctx, "SELECT AVG(a) FROM t")?;

let batch = &result[0];
assert_eq!(1, batch.num_columns());
assert_eq!(1, batch.num_rows());

let values = batch
.column(0)
.as_any()
.downcast_ref::<Float64Array>()
.expect("failed to cast version");

[jira] [Commented] (ARROW-9937) [Rust] [DataFusion] Average is not correct

2020-09-12 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194706#comment-17194706
 ] 

Andrew Lamb commented on ARROW-9937:


This is a bad correctness bug -- we should definitely fix it.

> [Rust] [DataFusion] Average is not correct
> --
>
> Key: ARROW-9937
> URL: https://issues.apache.org/jira/browse/ARROW-9937
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The current design of aggregates makes the calculation of the average 
> incorrect.
>  It also makes it impossible to compute the [geometric 
> mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
> operations.
> The central issue is that Accumulator returns a `ScalarValue` during partial 
> aggregations via {{get_value}}, but very often a `ScalarValue` is not 
> sufficient information to perform the full aggregation.
> A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are 
> distributed in batches of 2, {{[x1, x2], [x3, x4], [x5]}}. Our current 
> calculation performs partial means, {{(x1+x2)/2, (x3+x4)/2, x5}}, and then 
> reduces them using another average, i.e. {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}, 
> which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.
> I believe that our Accumulators need to pass more information from the 
> partial aggregations to the final aggregation.
> We could consider taking an API equivalent to 
> [spark]([https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html]),
>  i.e. have an `update`, a `merge` and an `evaluate`.
> Code with a failing test ({{src/execution/context.rs}})
> {code:java}
> #[test]
> fn simple_avg() -> Result<()> {
> let schema = Schema::new(vec![
> Field::new("a", DataType::Int32, false),
> ]);
> let batch1 = RecordBatch::try_new(
> Arc::new(schema.clone()),
> vec![
> Arc::new(Int32Array::from(vec![1, 2, 3])),
> ],
> )?;
> let batch2 = RecordBatch::try_new(
> Arc::new(schema.clone()),
> vec![
> Arc::new(Int32Array::from(vec![4, 5])),
> ],
> )?;
> let mut ctx = ExecutionContext::new();
> let provider = MemTable::new(Arc::new(schema), vec![vec![batch1], 
> vec![batch2]])?;
> ctx.register_table("t", Box::new(provider));
> let result = collect(&mut ctx, "SELECT AVG(a) FROM t")?;
> let batch = &result[0];
> assert_eq!(1, batch.num_columns());
> assert_eq!(1, batch.num_rows());
> let values = batch
> .column(0)
> .as_any()
> .downcast_ref::<Float64Array>()
> .expect("failed to cast version");
> assert_eq!(values.len(), 1);
> // avg(1,2,3,4,5) = 3.0
> assert_eq!(values.value(0), 3.0_f64);
> Ok(())
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-12 Thread Joao (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194700#comment-17194700
 ] 

Joao edited comment on ARROW-8394 at 9/12/20, 10:17 AM:


Hi.
 We are also facing issues using apache arrow (@apache-arrow/ts 0.17.0) with 
Typescript 3.9.x and 4.0.x. 
 Is this being looked into?

Thanks


was (Author: costa):
Hi.
We are also facing issues using apache arrow with Typescript 3.9.x and 4.0.x. 
Is this being looked into?

Thanks

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-09-12 Thread Joao (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194700#comment-17194700
 ] 

Joao commented on ARROW-8394:
-

Hi.
We are also facing issues using apache arrow with Typescript 3.9.x and 4.0.x. 
Is this being looked into?

Thanks

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9974) t

2020-09-12 Thread Ashish Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Gupta updated ARROW-9974:

Summary: t  (was: [Python][C++] pyarrow version 1.0.1 throws Out Of Memory 
exception while reading large number of files using ParquetDataset (works fine 
with version 0.13))

> t
> -
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Priority: Major
>  Labels: dataset
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all files. I updated pyarrow from 0.13.0 
> to the latest version, 1.0.1, and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with the older version. My machine has 256 GB of memory, way more 
> than enough to load the data, which requires < 10 GB. You can use the code 
> below to reproduce the issue on your side.
> {code}
> # create a big dataframe
> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
> df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> # use a fresh session to read data
> # below code works to read
> import pandas as pd
> df = []
> for i in range(5000):
> df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine 
> with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> import pyarrow.parquet as pq
> fnames = []
> for i in range(5000):
> fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9809) [Rust] [DataFusion] logical schema = physical schema is not true

2020-09-12 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge reassigned ARROW-9809:


Assignee: Jorge

> [Rust] [DataFusion] logical schema = physical schema is not true
> 
>
> Key: ARROW-9809
> URL: https://issues.apache.org/jira/browse/ARROW-9809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge
>Assignee: Jorge
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In tests/sql.rs, we test that the physical and the optimized schema must 
> match. However, this is not necessarily true for all our queries. An example:
> {code:java}
> #[test]
> fn csv_query_sum_cast() {
> let mut ctx = ExecutionContext::new();
> register_aggregate_csv_by_sql(&mut ctx);
> // c8 = i32; c9 = i64
> let sql = "SELECT c8 + c9 FROM aggregate_test_100";
> // check that the physical and logical schemas are equal
> execute(&mut ctx, sql);
> }
> {code}
> The physical expression (and schema) of this operation, after optimization, 
> is {{CAST(c8 as Int64) Plus c9}} (this test fails).
> AFAIK, the invariant of the optimizer is that the output types and 
> nullability are the same.
> Also, note that the reason the optimized logical schema equals the logical 
> schema is that our type coercer does not change the output names of the 
> schema, even though it re-writes logical expressions. I.e. after the 
> optimization, `.to_field()` of an expression may no longer match the field 
> name nor type in the Plan's schema. IMO this is currently by (implicit?) 
> design, as we do not want our logical schema's column names to change during 
> optimizations, or all column references may point to non-existent columns. 
> This is something that was brought up on the mailing list about polymorphism.
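
As a conceptual sketch of the name drift described above (plain enums and strings, 
not DataFusion's actual Expr or Schema types), a coercion rewrite changes the name 
an expression derives for itself while the plan's schema keeps the original name:

{code:java}
// Conceptual sketch only; hypothetical types, not DataFusion's API.
enum Expr {
    Column(String),
    Cast { expr: Box<Expr>, to: String },
    Plus(Box<Expr>, Box<Expr>),
}

impl Expr {
    // Derive the output field name, the way `.to_field()` would name it.
    fn name(&self) -> String {
        match self {
            Expr::Column(c) => c.clone(),
            Expr::Cast { expr, to } => format!("CAST({} AS {})", expr.name(), to),
            Expr::Plus(l, r) => format!("{} Plus {}", l.name(), r.name()),
        }
    }
}

fn main() {
    // Logical plan for `SELECT c8 + c9`; its schema records the field "c8 Plus c9".
    let logical = Expr::Plus(
        Box::new(Expr::Column("c8".into())),
        Box::new(Expr::Column("c9".into())),
    );
    let schema_field = logical.name();

    // Type coercion rewrites c8 (Int32) to CAST(c8 AS Int64) so the addition
    // type-checks, but the plan's schema keeps the original field name.
    let coerced = Expr::Plus(
        Box::new(Expr::Cast {
            expr: Box::new(Expr::Column("c8".into())),
            to: "Int64".into(),
        }),
        Box::new(Expr::Column("c9".into())),
    );

    // The derived name no longer matches the recorded schema field, which is
    // why "logical schema == physical schema" cannot be a blanket invariant.
    assert_eq!(schema_field, "c8 Plus c9");
    assert_eq!(coerced.name(), "CAST(c8 AS Int64) Plus c9");
}
{code}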



--
This message was sent by Atlassian Jira
(v8.3.4#803005)