[jira] [Updated] (ARROW-8215) [CI][GLib] Meson install fails in the macOS build

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8215:
--
Labels: pull-request-available  (was: )

> [CI][GLib] Meson install fails in the macOS build
> -
>
> Key: ARROW-8215
> URL: https://issues.apache.org/jira/browse/ARROW-8215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, GLib
>Reporter: Krisztian Szucs
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>
> It also happens in the pull request builds, see build log 
> https://github.com/apache/arrow/runs/533168517#step:5:1230
> cc @kou



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8228) [C++][Parquet] Support writing lists that have null elements that are non-empty.

2020-03-25 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-8228:
--

 Summary: [C++][Parquet] Support writing lists that have null 
elements that are non-empty.
 Key: ARROW-8228
 URL: https://issues.apache.org/jira/browse/ARROW-8228
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
 Fix For: 1.0.0


With the new V2 level writing engine we can detect this case but fail as not 
implemented.  Fixing this will require changes to the "core" parquet API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8226) [Go] Add binary builder that uses 64 bit offsets and make binary builders resettable

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8226:
--
Labels: pull-request-available  (was: )

> [Go] Add binary builder that uses 64 bit offsets and make binary builders 
> resettable
> 
>
> Key: ARROW-8226
> URL: https://issues.apache.org/jira/browse/ARROW-8226
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Richard
>Priority: Minor
>  Labels: pull-request-available
>
> I ran into some overflow issues with the existing 32 bit binary builder. My 
> changes add a new binary builder that uses 64-bit offsets + tests.
> I also added a panic for when the 32-bit offset binary builder overflows.
> Finally I made both binary builders resettable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8227) [C++] Propose refining SIMD code framework

2020-03-25 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8227:
---

 Summary: [C++] Propose refining SIMD code framework
 Key: ARROW-8227
 URL: https://issues.apache.org/jira/browse/ARROW-8227
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


Arrow supports a wide range of hardware (x86, arm, ppc?) + OS (linux, windows, macos, 
others?) + compiler (gcc, clang, msvc, others?). Managing platform-dependent code is 
non-trivial. This Jira aims to refine (or mess up) the SIMD-related code framework.
Some goals: move SIMD feature definitions into one place, possibly in CMake, and 
reduce compiler-based ifdefs in source code; manage SIMD code in one place, but 
leave non-SIMD default implementations where they are; don't introduce any 
performance penalty, preferring direct inlining over a runtime dispatcher; keep the 
code easy to maintain and extend, and hard to get wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8226) [Go] Add binary builder that uses 64 bit offsets and make binary builders resettable

2020-03-25 Thread Richard (Jira)
Richard created ARROW-8226:
--

 Summary: [Go] Add binary builder that uses 64 bit offsets and make 
binary builders resettable
 Key: ARROW-8226
 URL: https://issues.apache.org/jira/browse/ARROW-8226
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Richard


I ran into some overflow issues with the existing 32 bit binary builder. My 
changes add a new binary builder that uses 64-bit offsets + tests.

I also added a panic for when the 32-bit offset binary builder overflows.

Finally I made both binary builders resettable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8225) [rust] Rust Arrow IPC reader must respect continuation markers

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8225:
--
Labels: pull-request-available  (was: )

> [rust] Rust Arrow IPC reader must respect continuation markers
> --
>
> Key: ARROW-8225
> URL: https://issues.apache.org/jira/browse/ARROW-8225
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Max Burke
>Priority: Major
>  Labels: pull-request-available
>
> A continuation marker (value of 0xFFFFFFFF) in a message size block is
>  used to align the next block to an 8-byte boundary. This value needs to
>  be skipped over if encountered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8225) [rust] Rust Arrow IPC reader must respect continuation markers

2020-03-25 Thread Max Burke (Jira)
Max Burke created ARROW-8225:


 Summary: [rust] Rust Arrow IPC reader must respect continuation 
markers
 Key: ARROW-8225
 URL: https://issues.apache.org/jira/browse/ARROW-8225
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Max Burke


A continuation marker (value of 0xFFFFFFFF) in a message size block is
 used to align the next block to an 8-byte boundary. This value needs to
 be skipped over if encountered.
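
As an illustration of the skipping rule described above (a minimal Python sketch, not 
the Rust reader itself; the stream handling is hypothetical):

{code:python}
import struct

def read_message_length(stream):
    # Read the 4-byte little-endian length prefix of the next IPC message.
    raw = stream.read(4)
    if len(raw) < 4:
        return None  # end of stream
    (length,) = struct.unpack("<i", raw)
    # 0xFFFFFFFF (-1 as a signed int32) is only a continuation/padding marker
    # that keeps the next block 8-byte aligned; the real length follows it.
    if length == -1:
        (length,) = struct.unpack("<i", stream.read(4))
    return length
{code}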



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8215) [CI][GLib] Meson install fails in the macOS build

2020-03-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-8215:

Summary: [CI][GLib] Meson install fails in the macOS build  (was: 
[CI][Glib] Meson install fails in the macOS build)

> [CI][GLib] Meson install fails in the macOS build
> -
>
> Key: ARROW-8215
> URL: https://issues.apache.org/jira/browse/ARROW-8215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, GLib
>Reporter: Krisztian Szucs
>Assignee: Kouhei Sutou
>Priority: Major
>
> It also happens in the pull request builds, see build log 
> https://github.com/apache/arrow/runs/533168517#step:5:1230
> cc @kou



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7708) [Release] Include PARQUET commits from git changelog in release changelogs

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7708.
-
Resolution: Fixed

Issue resolved by pull request 6722
[https://github.com/apache/arrow/pull/6722]

> [Release] Include PARQUET commits from git changelog in release changelogs
> --
>
> Key: ARROW-7708
> URL: https://issues.apache.org/jira/browse/ARROW-7708
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8215) [CI][Glib] Meson install fails in the macOS build

2020-03-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-8215:
---

Assignee: Kouhei Sutou

> [CI][Glib] Meson install fails in the macOS build
> -
>
> Key: ARROW-8215
> URL: https://issues.apache.org/jira/browse/ARROW-8215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, GLib
>Reporter: Krisztian Szucs
>Assignee: Kouhei Sutou
>Priority: Major
>
> It also happens in the pull request builds, see build log 
> https://github.com/apache/arrow/runs/533168517#step:5:1230
> cc @kou



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8224) [C++] Remove APIs deprecated prior to 0.16.0

2020-03-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8224:
---

 Summary: [C++] Remove APIs deprecated prior to 0.16.0
 Key: ARROW-8224
 URL: https://issues.apache.org/jira/browse/ARROW-8224
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8223) [Python] Schema.from_pandas breaks with pandas nullable integer dtype

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067214#comment-17067214
 ] 

Wes McKinney commented on ARROW-8223:
-

{{Schema.from_pandas}} hasn't been very actively maintained. It hasn't acquired 
support for pandas ExtensionDtype yet.

> [Python] Schema.from_pandas breaks with pandas nullable integer dtype
> -
>
> Key: ARROW-8223
> URL: https://issues.apache.org/jira/browse/ARROW-8223
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0, 0.16.0, 0.15.1
> Environment: pyarrow 0.16
>Reporter: Ged Steponavicius
>Priority: Minor
>  Labels: easyfix
>
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame([{'int_col':1},
>  {'int_col':2}])
> df['int_col'] = df['int_col'].astype(pd.Int64Dtype())
> schema = pa.Schema.from_pandas(df)
> {code}
> produces ArrowTypeError: Did not pass numpy.dtype object
>  
> However, this works fine 
> {code:java}
> schema = pa.Table.from_pandas(df).schema{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8223) [Python] Schema.from_pandas breaks with pandas nullable integer dtype

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8223:

Summary: [Python] Schema.from_pandas breaks with pandas nullable integer 
dtype  (was: Schema.from_pandas breaks with pandas nullable integer dtype)

> [Python] Schema.from_pandas breaks with pandas nullable integer dtype
> -
>
> Key: ARROW-8223
> URL: https://issues.apache.org/jira/browse/ARROW-8223
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0, 0.16.0, 0.15.1
> Environment: pyarrow 0.16
>Reporter: Ged Steponavicius
>Priority: Minor
>  Labels: easyfix
>
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame([{'int_col':1},
>  {'int_col':2}])
> df['int_col'] = df['int_col'].astype(pd.Int64Dtype())
> schema = pa.Schema.from_pandas(df)
> {code}
> produces ArrowTypeError: Did not pass numpy.dtype object
>  
> However, this works fine 
> {code:java}
> schema = pa.Table.from_pandas(df).schema{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8217) [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067211#comment-17067211
 ] 

Wes McKinney commented on ARROW-8217:
-

If the R nightly setup can build a DEBUG package I can try to install that and 
debug myself, too

> [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979
> --
>
> Key: ARROW-8217
> URL: https://issues.apache.org/jira/browse/ARROW-8217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> If we can obtain a gdb backtrace from the failed test in 
> https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8217) [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067209#comment-17067209
 ] 

Wes McKinney commented on ARROW-8217:
-

A debug build of Arrow should tell you more; it's worth a shot.

> [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979
> --
>
> Key: ARROW-8217
> URL: https://issues.apache.org/jira/browse/ARROW-8217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> If we can obtain a gdb backtrace from the failed test in 
> https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8217) [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8217:

Summary: [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows 
from ARROW-7979  (was: [R][C++] Fix crashing data in test-dataset.R on 32-bit 
Windows from ARROW-7979)

> [R][C++] Fix crashing test in test-dataset.R on 32-bit Windows from ARROW-7979
> --
>
> Key: ARROW-8217
> URL: https://issues.apache.org/jira/browse/ARROW-8217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> If we can obtain a gdb backtrace from the failed test in 
> https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8210) [C++][Dataset] Handling of duplicate columns in Dataset factory and scanning

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067208#comment-17067208
 ] 

Wes McKinney commented on ARROW-8210:
-

We should probably relax the code around duplicate column names except in 
scenarios where there is definite ambiguity. 

> [C++][Dataset] Handling of duplicate columns in Dataset factory and scanning
> 
>
> Key: ARROW-8210
> URL: https://issues.apache.org/jira/browse/ARROW-8210
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, C++ - Dataset
>Reporter: Joris Van den Bossche
>Priority: Major
>
> While testing duplicate column names, I ran into multiple issues:
> * Factory fails if there are duplicate columns, even for a single file
> * In addition, we should also fix and/or test that the factory works for 
> duplicate columns if the schemas are equal
> * Once a Dataset with duplicated columns is created, scanning without any 
> column projection fails
> ---
> My python reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> import pyarrow.fs
> # create single parquet file with duplicated column names
> table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 
> 9])], names=['a', 'b', 'a'])
> pq.write_table(table, "data_duplicate_columns.parquet")
> {code}
> Factory fails:
> {code}
> dataset = ds.dataset("data_duplicate_columns.parquet", format="parquet")
> ...
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, 
> filesystem, partitioning, format)
> 346 
> 347 factories = [_ensure_factory(f, **kwargs) for f in 
> paths_or_factories]
> --> 348 return UnionDatasetFactory(factories).finish()
> 349 
> 350 
> ArrowInvalid: Can't unify schema with duplicate field names.
> {code}
> And when creating a Dataset manually:
> {code:python}
> schema = pa.schema([('a', 'int64'), ('b', 'int64'), ('a', 'int64')])
> dataset = ds.FileSystemDataset(
> schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
> [str(basedir / "data_duplicate_columns.parquet")], 
> [ds.ScalarExpression(True)])
> {code}
> then scanning fails:
> {code}
> >>> dataset.to_table()
> ...
> ArrowInvalid: Multiple matches for FieldRef.Name(a) in a: int64
> b: int64
> a: int64
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8219) [Rust] sqlparser crate needs to be bumped to version 0.2.5

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-8219.
---
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6719
[https://github.com/apache/arrow/pull/6719]

> [Rust] sqlparser crate needs to be bumped to version 0.2.5
> --
>
> Key: ARROW-8219
> URL: https://issues.apache.org/jira/browse/ARROW-8219
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.16.0
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8223) Schema.from_pandas breaks with pandas nullable integer dtype

2020-03-25 Thread Ged Steponavicius (Jira)
Ged Steponavicius created ARROW-8223:


 Summary: Schema.from_pandas breaks with pandas nullable integer 
dtype
 Key: ARROW-8223
 URL: https://issues.apache.org/jira/browse/ARROW-8223
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.1, 0.16.0, 0.15.0
 Environment: pyarrow 0.16
Reporter: Ged Steponavicius


 
{code:java}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame([{'int_col':1},
 {'int_col':2}])
df['int_col'] = df['int_col'].astype(pd.Int64Dtype())

schema = pa.Schema.from_pandas(df)
{code}
produces ArrowTypeError: Did not pass numpy.dtype object

 

However, this works fine 
{code:java}
schema = pa.Table.from_pandas(df).schema{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8222) [C++] Use bcp to make a slim boost for bundled build

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067190#comment-17067190
 ] 

Wes McKinney commented on ARROW-8222:
-

> We would need a place to host this tarball

You can host it as an artifact in a GitHub release. So we could put the bcp 
generator script in the Arrow repo and then use it to create the release 
artifact in a dedicated GitHub repo.

> [C++] Use bcp to make a slim boost for bundled build
> 
>
> Key: ARROW-8222
> URL: https://issues.apache.org/jira/browse/ARROW-8222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 0.17.0
>
>
> We don't use much of Boost (just system, filesystem, and regex), but when we 
> do a bundled build, we still download and extract all of boost. The tarball 
> itself is 113mb, expanded is over 700mb. This can be slow, and it requires a 
> lot of free disk space that we don't really need.
> [bcp|https://www.boost.org/doc/libs/1_72_0/tools/bcp/doc/html/index.html] is 
> a boost tool that lets you extract a subset of boost, resolving any of its 
> necessary dependencies across boost. The savings for us could be huge:
> {code}
> mkdir test
> ./bcp system.hpp filesystem.hpp regex.hpp test
> tar -czf test.tar.gz test/
> {code}
> The resulting tarball is 885K (kilobytes!). 
> {{bcp}} also lets you re-namespace, so this would (IIUC) solve ARROW-4286 as 
> well.
> We would need a place to host this tarball, and we would have to update it 
> whenever we (1) bump the boost version or (2) add a new boost library 
> dependency. This patch would of course include a script that would generate 
> the tarball. Given the small size, we could also consider just vendoring it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8222) [C++] Use bcp to make a slim boost for bundled build

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8222:

Fix Version/s: 0.17.0

> [C++] Use bcp to make a slim boost for bundled build
> 
>
> Key: ARROW-8222
> URL: https://issues.apache.org/jira/browse/ARROW-8222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 0.17.0
>
>
> We don't use much of Boost (just system, filesystem, and regex), but when we 
> do a bundled build, we still download and extract all of boost. The tarball 
> itself is 113mb, expanded is over 700mb. This can be slow, and it requires a 
> lot of free disk space that we don't really need.
> [bcp|https://www.boost.org/doc/libs/1_72_0/tools/bcp/doc/html/index.html] is 
> a boost tool that lets you extract a subset of boost, resolving any of its 
> necessary dependencies across boost. The savings for us could be huge:
> {code}
> mkdir test
> ./bcp system.hpp filesystem.hpp regex.hpp test
> tar -czf test.tar.gz test/
> {code}
> The resulting tarball is 885K (kilobytes!). 
> {{bcp}} also lets you re-namespace, so this would (IIUC) solve ARROW-4286 as 
> well.
> We would need a place to host this tarball, and we would have to update it 
> whenever we (1) bump the boost version or (2) add a new boost library 
> dependency. This patch would of course include a script that would generate 
> the tarball. Given the small size, we could also consider just vendoring it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8222) [C++] Use bcp to make a slim boost for bundled build

2020-03-25 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8222:
--

 Summary: [C++] Use bcp to make a slim boost for bundled build
 Key: ARROW-8222
 URL: https://issues.apache.org/jira/browse/ARROW-8222
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson


We don't use much of Boost (just system, filesystem, and regex), but when we do 
a bundled build, we still download and extract all of boost. The tarball itself 
is 113mb, expanded is over 700mb. This can be slow, and it requires a lot of 
free disk space that we don't really need.

[bcp|https://www.boost.org/doc/libs/1_72_0/tools/bcp/doc/html/index.html] is a 
boost tool that lets you extract a subset of boost, resolving any of its 
necessary dependencies across boost. The savings for us could be huge:

{code}
mkdir test
./bcp system.hpp filesystem.hpp regex.hpp test
tar -czf test.tar.gz test/
{code}

The resulting tarball is 885K (kilobytes!). 

{{bcp}} also lets you re-namespace, so this would (IIUC) solve ARROW-4286 as 
well.

We would need a place to host this tarball, and we would have to update it 
whenever we (1) bump the boost version or (2) add a new boost library 
dependency. This patch would of course include a script that would generate the 
tarball. Given the small size, we could also consider just vendoring it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7708) [Release] Include PARQUET commits from git changelog in release changelogs

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7708:
--
Labels: pull-request-available  (was: )

> [Release] Include PARQUET commits from git changelog in release changelogs
> --
>
> Key: ARROW-7708
> URL: https://issues.apache.org/jira/browse/ARROW-7708
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8164) [C++][Dataset] Let datasets be viewable with non-identical schema

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8164:
--
Labels: pull-request-available  (was: )

> [C++][Dataset] Let datasets be viewable with non-identical schema
> -
>
> Key: ARROW-8164
> URL: https://issues.apache.org/jira/browse/ARROW-8164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> It would be useful to allow some schema unification capability after 
> discovery has completed. For example, if a FileSystemDataset is being wrapped 
> into a UnionDataset with another and their schemas are unifiable then there 
> is no reason we can't create the UnionDataset (rather than emitting an error 
> because the schemas are not identical).
> I think this behavior will be most naturally expressed in C++ like so:
> {code}
> virtual Result<std::shared_ptr<Dataset>> Dataset::ReplaceSchema(std::shared_ptr<Schema> 
> schema) const = 0;
> {code}
> which will raise an error if the provided schema is not unifiable with the 
> current dataset schema.
> If this needs to be extended to non trivial projections then this will 
> probably warrant a separate class, {{ProjectedDataset}} or so. Definitely 
> follow up material (if desired)
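
For intuition, the "unifiable" check referred to above is plain schema unification; a 
small pyarrow sketch (assuming {{pyarrow.unify_schemas}} is available in the installed 
version):

{code:python}
import pyarrow as pa

# Two schemas that differ only by an extra field can be unified into one;
# conflicting types for the same field name would raise an error instead.
a = pa.schema([("x", pa.int64())])
b = pa.schema([("x", pa.int64()), ("y", pa.string())])
print(pa.unify_schemas([a, b]))  # x: int64, y: string
{code}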



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8217) [R][C++] Fix crashing data in test-dataset.R on 32-bit Windows from ARROW-7979

2020-03-25 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067094#comment-17067094
 ] 

Neal Richardson commented on ARROW-8217:


I managed to run with gdb (included in Rtools) like this: 

{code}
C:\Program Files\R\R-3.6.0\bin\i386>\Rtools\mingw_32\bin\gdb R.exe
{code}

Unfortunately, there's no backtrace on exit:

{code}
Start test: IPC/Arrow format data
  test-dataset.R#172:1 [success]
  test-dataset.R#173:1 [success]
[Inferior 1 (process 4920) exited with code 0305]
(gdb) bt
No stack.
(gdb)
{code}

Retrying with {{break exit}}, there's nothing useful, presumably because R 
needs to be rebuilt with debug symbols:

{code}
Start test: IPC/Arrow format data
  test-dataset.R#172:1 [success]
  test-dataset.R#173:1 [success]
Breakpoint 1, 0x767866f5 in msvcrt!exit () from C:\Windows\System32\msvcrt.dll
(gdb) bt
#0  0x767866f5 in msvcrt!exit () from C:\Windows\System32\msvcrt.dll
#1  0x00405787 in ?? ()
#2  0x004013e2 in ?? ()
#3  0x74ea6359 in KERNEL32!BaseThreadInitThunk () from 
C:\Windows\System32\kernel32.dll
#4  0x77247b74 in ntdll!RtlGetAppContainerNamedObjectPath () from 
C:\Windows\SYSTEM32\ntdll.dll
#5  0x77247b44 in ntdll!RtlGetAppContainerNamedObjectPath () from 
C:\Windows\SYSTEM32\ntdll.dll
#6  0x in ?? ()
(gdb)
{code}

On the one hand, no backtrace could mean that it's not a segfault, just some 
other hard exit. On the other, googling the exit code comes up with lots of 
hits suggesting that it is a segfault.

If someone wants to tell me other ideas of places to set breakpoints or 
whatever to try to see what's failing, I can try. I can also try making a DEBUG 
build of Arrow C++ (this is a release build). I'd rather not have to build R 
from source to get more debug symbols.


> [R][C++] Fix crashing data in test-dataset.R on 32-bit Windows from ARROW-7979
> --
>
> Key: ARROW-8217
> URL: https://issues.apache.org/jira/browse/ARROW-8217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> If we can obtain a gdb backtrace from the failed test in 
> https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7708) [Release] Include PARQUET commits from git changelog in release changelogs

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7708:
---

Assignee: Wes McKinney

> [Release] Include PARQUET commits from git changelog in release changelogs
> --
>
> Key: ARROW-7708
> URL: https://issues.apache.org/jira/browse/ARROW-7708
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Christophe Clienti (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067087#comment-17067087
 ] 

Christophe Clienti commented on ARROW-8208:
---

Thank you for all the information; it works like a charm. The new dataset API seems 
really interesting. I will do further tests with large datasets and send 
some feedback.
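
(For reference, the usage that "works like a charm" presumably looks roughly like the 
sketch below; the file name and column come from the original report, and {{ds.field}} 
builds the filter expression.)

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("data.parquet", format="parquet")
# Parquet row-group statistics are used to skip row groups that cannot match.
table = dataset.to_table(filter=ds.field("ticker") == "AAPL")
{code}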

> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christophe Clienti
>Priority: Major
>  Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder if it can work on a file as I get an exception with the 
> following code:
> {code:python}
> ParquetDataset('data.parquet',
>filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> dataset. Moreover I read some information in the following JIRA ticket: 
> ARROW-1796
> So I'm not sure that a ParquetDataset can use row_group statistics to filter 
> specific row_group in a file (in a dataset or not)?
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today with pyarrow I'm forced to filter the row_groups in each file manually, 
> which prevents me from using the ParquetDataset partition filtering functionality.
> Row groups are really useful because they prevent filling the filesystem 
> with small files...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8220) [Python] Make dataset FileFormat objects serializable

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8220:
--
Labels: pull-request-available  (was: )

> [Python] Make dataset FileFormat objects serializable
> -
>
> Key: ARROW-8220
> URL: https://issues.apache.org/jira/browse/ARROW-8220
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> Similar to ARROW-8060 and ARROW-8059, the FileFormats also need to be pickleable.
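
Roughly, the requested behaviour is that a round-trip like the following works (a 
minimal sketch of what "pickleable" means here, once the change lands):

{code:python}
import pickle
import pyarrow.dataset as ds

fmt = ds.ParquetFileFormat()
restored = pickle.loads(pickle.dumps(fmt))  # should no longer raise once fixed
assert isinstance(restored, ds.ParquetFileFormat)
{code}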



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067081#comment-17067081
 ] 

Wes McKinney commented on ARROW-3329:
-

Thanks. It's our intention that these instructions work consistently if 
followed faithfully, so if something doesn't work for you it's important that 
we are able to debug it and fix the documentation. 

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue link: https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet, or am I not using it correctly? If it is not implemented 
> yet, is there any workaround to cast columns?
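
(Until such a cast is implemented, one hedged workaround is to truncate the decimals in 
Python and rebuild the column; a sketch with made-up values:)

{code:python}
import pyarrow as pa
from decimal import Decimal

col = pa.array([Decimal("12.3456"), Decimal("7.0000")], type=pa.decimal128(38, 4))
# int() truncates each Decimal toward zero; adjust if different rounding is needed.
as_int64 = pa.array([int(v) for v in col.to_pylist()], type=pa.int64())
{code}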



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8221) [Python][Dataset] Expose schema inference / validation options in the factory

2020-03-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8221:
-
Description: 
ARROW-8058 added options related to schema inference / validation for the 
Dataset factory. We should expose this in Python in the {{dataset(..)}} factory 
function:

- Add ability to pass a user-specified schema with a {{schema}} keyword, 
instead of inferring the schema from (one of) the files (to be passed to the 
factory finish method)
- Add {{validate_schema}} option to toggle whether the schema is validated 
against the actual files or not.
- Expose in some way the number of fragments to be inspected when inferring or 
validating the schema. Not sure yet what the best API for this would be. 

  was:
ARROW-8058 added options related to schema inference / validation for the 
Dataset factory. We should expose this in Python in the {{dataset(..)}} factory 
function:

- Add ability to pass a user-specified schema with a {{schema}} keyword, 
instead of inferring the schema from (one of) the files (to be passed to the 
factory finish method)
- Add {{validate_schema}} option to toggle whether the schema is validated 
against the actual files or not.
- Expose in some way the number of fragments to be inspected when inferring the 
schema. Not sure yet what the best API for this would be. 


> [Python][Dataset] Expose schema inference / validation options in the factory
> -
>
> Key: ARROW-8221
> URL: https://issues.apache.org/jira/browse/ARROW-8221
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 0.17.0
>
>
> ARROW-8058 added options related to schema inference / validation for the 
> Dataset factory. We should expose this in Python in the {{dataset(..)}} 
> factory function:
> - Add ability to pass a user-specified schema with a {{schema}} keyword, 
> instead of inferring the schema from (one of) the files (to be passed to the 
> factory finish method)
> - Add {{validate_schema}} option to toggle whether the schema is validated 
> against the actual files or not.
> - Expose in some way the number of fragments to be inspected when inferring 
> or validating the schema. Not sure yet what the best API for this would be. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8221) [Python][Dataset] Expose schema inference / validation options in the factory

2020-03-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8221:


 Summary: [Python][Dataset] Expose schema inference / validation 
options in the factory
 Key: ARROW-8221
 URL: https://issues.apache.org/jira/browse/ARROW-8221
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


ARROW-8058 added options related to schema inference / validation for the 
Dataset factory. We should expose this in Python in the {{dataset(..)}} factory 
function:

- Add ability to pass a user-specified schema with a {{schema}} keyword, 
instead of inferring the schema from (one of) the files (to be passed to the 
factory finish method)
- Add {{validate_schema}} option to toggle whether the schema is validated 
against the actual files or not.
- Expose in some way the number of fragments to be inspected when inferring the 
schema. Not sure yet what the best API for this would be. 
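
A rough sketch of how the first two items might surface in the Python {{dataset(..)}} 
factory (hypothetical usage following the keyword names proposed above; the path and 
schema are made up):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

user_schema = pa.schema([("int_col", pa.int64()), ("ticker", pa.string())])

dataset = ds.dataset(
    "path/to/files",        # made-up location
    format="parquet",
    schema=user_schema,     # user-specified schema instead of inferring from a file
    # validate_schema=True, # proposed toggle for checking it against the actual files
)
{code}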



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067041#comment-17067041
 ] 

Jacek Pliszka commented on ARROW-3329:
--

I will follow them again tomorrow.

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue link: https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet, or am I not using it correctly? If it is not implemented 
> yet, is there any workaround to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Jacek Pliszka (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Pliszka updated ARROW-3329:
-
Comment: was deleted

(was: I can have another shot at compilation during the weekend but I may fail 
again.

If you can add something minimal just to get into 0.17 then, once it is in pip, 
I can work on more without needing to compile.)

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue link: https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet, or am I not using it correctly? If it is not implemented 
> yet, is there any workaround to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7688) [Java] Bump checkstyle from 8.18 to 8.29

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7688:
--
Fix Version/s: (was: 0.17.0)

> [Java] Bump checkstyle from 8.18 to 8.29
> 
>
> Key: ARROW-7688
> URL: https://issues.apache.org/jira/browse/ARROW-7688
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.15.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7688) [Java] Bump checkstyle from 8.18 to 8.29

2020-03-25 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067037#comment-17067037
 ] 

Andy Grove commented on ARROW-7688:
---

Hi [~fokko], do you still want to try this upgrade?

> [Java] Bump checkstyle from 8.18 to 8.29
> 
>
> Key: ARROW-7688
> URL: https://issues.apache.org/jira/browse/ARROW-7688
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.15.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067035#comment-17067035
 ] 

Jacek Pliszka commented on ARROW-3329:
--

I can have another shot at compilation during the weekend but I may fail again.

If you can add something minimal just to get into 0.17 then, once it is in pip, 
I can work on more without needing to compile.

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue link: https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet, or am I not using it correctly? If it is not implemented 
> yet, is there any workaround to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7763) [Java] Update README with instructions for IntelliJ users

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7763:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Java] Update README with instructions for IntelliJ users
> -
>
> Key: ARROW-7763
> URL: https://issues.apache.org/jira/browse/ARROW-7763
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Minor
> Fix For: 1.0.0
>
>
> IntelliJ needs to be configured to use the errorprone compiler and this is 
> not currently documented, making it hard for new contributors to build/test 
> the project. We can pretty much just link to the instructions at 
> https://errorprone.info/docs/installation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7756) [Java] Explore using Avatica as basis for Flight JDBC Driver

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7756:
--
Fix Version/s: (was: 0.17.0)

> [Java] Explore using Avatica as basis for Flight JDBC Driver
> 
>
> Key: ARROW-7756
> URL: https://issues.apache.org/jira/browse/ARROW-7756
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Explore using Avatica as basis for Flight JDBC Driver to see how suitable it 
> is compared to building the JDBC driver from the ground up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067021#comment-17067021
 ] 

Joris Van den Bossche commented on ARROW-8208:
--

And also related to ARROW-8063 and ARROW-8047 to ensure we document all this!

> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christophe Clienti
>Priority: Major
>  Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder if it can work on a file as I get an exception with the 
> following code:
> {code:python}
> ParquetDataset('data.parquet',
>filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> dataset. Moreover I read some information in the following JIRA ticket: 
> ARROW-1796
> So I'm not sure that a ParquetDataset can use row_group statistics to filter 
> specific row_group in a file (in a dataset or not)?
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today with pyarrow I'm forced to filter the row_groups in each file manually, 
> which prevents me from using the ParquetDataset partition filtering functionality.
> Row groups are really useful because they prevent filling the filesystem 
> with small files...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-8208.

Resolution: Implemented

> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christophe Clienti
>Priority: Major
>  Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder if it can work on a file as I get an exception with the 
> following code:
> {code:python}
> ParquetDataset('data.parquet',
>filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> dataset. Moreover I read some information in the following JIRA ticket: 
> ARROW-1796
> So I'm not sure that a ParquetDataset can use row_group statistics to filter 
> specific row_group in a file (in a dataset or not)?
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today with pyarrow I'm forced to filter the row_groups in each file manually, 
> which prevents me from using the ParquetDataset partition filtering functionality.
> Row groups are really useful because they prevent filling the filesystem 
> with small files...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7705) [Rust] Initial sort implementation

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7705:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Initial sort implementation
> --
>
> Key: ARROW-7705
> URL: https://issues.apache.org/jira/browse/ARROW-7705
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> An initial sort implementation that allows sorting an array by various 
> options (e.g. sort order). This is mainly to iterate on the design and inner 
> workings of a sort algorithm.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6482) [Rust] Investigate enabling features in regex crate to reduce compile times

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-6482:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Investigate enabling features in regex crate to reduce compile times
> ---
>
> Key: ARROW-6482
> URL: https://issues.apache.org/jira/browse/ARROW-6482
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Priority: Minor
>  Labels: beginner
> Fix For: 1.0.0
>
>
> The regex crate recently added a feature flag to reduce compile times and 
> binary size if certain unicode related features are not needed.  We should 
> investigate using this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3688) [Rust] Implement PrimitiveArrayBuilder.push_values

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-3688:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Implement PrimitiveArrayBuilder.push_values
> -
>
> Key: ARROW-3688
> URL: https://issues.apache.org/jira/browse/ARROW-3688
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow-up of https://github.com/apache/arrow/pull/2858
> See discussion https://github.com/apache/arrow/pull/2858/files#r228808948



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5350) [Rust] Support filtering on nested array types

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-5350:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Support filtering on nested array types
> --
>
> Key: ARROW-5350
> URL: https://issues.apache.org/jira/browse/ARROW-5350
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 1.0.0
>
>
> We currently filter only on primitive types, not on lists and structs. 
> Add the ability to filter on nested array types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5408) [Rust] Create struct array builder that creates null buffers

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-5408:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Create struct array builder that creates null buffers
> 
>
> Key: ARROW-5408
> URL: https://issues.apache.org/jira/browse/ARROW-5408
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 1.0.0
>
>
> We currently have a way of creating a struct array from a list of (field, 
> array) tuples. This does not create null buffers for the struct (because no 
> index is null). While this works fine for Rust, it often leads to data that is 
> incompatible with IPC data and kernel function outputs.
> Having a function that caters for nulls, or expanding the current one, would 
> alleviate this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067018#comment-17067018
 ] 

Joris Van den Bossche commented on ARROW-8208:
--

[~cclienti] feedback on those new functionalities is very welcome!

But since this is already possible, and using it in ParquetDataset is covered 
by other issues, I'm going to close this one.

> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christophe Clienti
>Priority: Major
>  Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder if it can work on a file as I get an exception with the 
> following code:
> {code:python}
> ParquetDataset('data.parquet',
>filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> dataset. Moreover I read some information in the following JIRA ticket: 
> ARROW-1796
> So I'm not sure that a ParquetDataset can use row_group statistics to filter 
> specific row_group in a file (in a dataset or not)?
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today with pyarrow I'm forced to filter the row_groups in each file manually, 
> which prevents me from using the ParquetDataset partition filtering functionality.
> Row groups are really useful because they prevent filling the filesystem 
> with small files...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7364) [Rust] Add cast options to cast kernel

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7364:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Add cast options to cast kernel
> --
>
> Key: ARROW-7364
> URL: https://issues.apache.org/jira/browse/ARROW-7364
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 1.0.0
>
>
> The cast kernels currently do not take explicit options, but instead convert 
> overflows and invalid UTF-8 to nulls. We can create options that customise the 
> behaviour, similarly to CastOptions in CPP 
> ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.h#L38])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5352) [Rust] BinaryArray filter replaces nulls with empty strings

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-5352:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] BinaryArray filter replaces nulls with empty strings
> ---
>
> Key: ARROW-5352
> URL: https://issues.apache.org/jira/browse/ARROW-5352
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.13.0
>Reporter: Neville Dipale
>Priority: Minor
> Fix For: 1.0.0
>
>
> The filter implementation for BinaryArray discards nullness of data. 
> BinaryArrays that are null (seem to) always return an empty string slice when 
> getting a value, so the way filter works might be a bug depending on what 
> Arrow developers' or users' intentions are.
> I think we should either preserve nulls (and their count) or document this as 
> intended behaviour.
> Below is a test case that reproduces the bug.
> {code:java}
> #[test]
> fn test_filter_binary_array_with_nulls() {
> let mut a: BinaryBuilder = BinaryBuilder::new(100);
> a.append_null().unwrap();
> a.append_string("a string").unwrap();
> a.append_null().unwrap();
> a.append_string("with nulls").unwrap();
> let array = a.finish();
> let b = BooleanArray::from(vec![true, true, true, true]);
> let c = filter(&array, &b).unwrap();
> let d: &BinaryArray = c.as_any().downcast_ref::<BinaryArray>().unwrap();
> // I didn't expect this behaviour
> assert_eq!("", d.get_string(0));
> // fails here
> assert!(d.is_null(0));
> assert_eq!(4, d.len());
> // fails here
> assert_eq!(2, d.null_count());
> assert_eq!("a string", d.get_string(1));
> // fails here
> assert!(d.is_null(2));
> assert_eq!("with nulls", d.get_string(3));
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067015#comment-17067015
 ] 

Wes McKinney commented on ARROW-3329:
-

[~jacek.pliszka] can you show us what is going wrong when you are trying to 
follow the source build instructions in 

https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst

? 

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067016#comment-17067016
 ] 

Joris Van den Bossche commented on ARROW-8208:
--

This is now implemented, and also already available in the Python bindings to 
the new Datasets framework.

In released pyarrow 0.16.0, you can use the datasets API to use filtering on 
non-partition key columns, and this looks like (example I did locally with NYC 
taxi data):

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("nyc-taxi-data/", format="parquet", partitioning="hive")

dataset.to_table(filter=ds.field("passenger_count") > 8)
{code}

So the above is already possible with pyarrow 0.16. In the upcoming pyarrow 
0.17, we will also provide this functionality through the existing 
{{ParquetDataset}} API as you were using. But this is work in progress right 
now (ARROW-8039, https://github.com/apache/arrow/pull/6303)
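
In the meantime, the filter from the original report can already be expressed with 
the 0.16 dataset API in the same way; a minimal sketch, assuming a local 
{{data.parquet}} file with a {{ticker}} column:

{code:python}
import pyarrow.dataset as ds

# a single Parquet file treated as a dataset; the filter applies to a
# non-partition column, as described above
dataset = ds.dataset("data.parquet", format="parquet")
df = dataset.to_table(filter=ds.field("ticker") == "AAPL").to_pandas()
{code}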

> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christophe Clienti
>Priority: Major
>  Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder if it can work on a file as I get an exception with the 
> following code:
> {code:python}
> ParquetDataset('data.parquet',
>filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> dataset. Moreover I read some information in the following JIRA ticket: 
> ARROW-1796
> So I'm not sure that a ParquetDataset can use row_group statistics to filter 
> specific row_group in a file (in a dataset or not)?
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today I'm forced to filter the row_groups in each file manually with pyarrow, 
> which prevents me from using the ParquetDataset partition filtering functionality.
> The row groups are really useful because they prevent filling the filesystem 
> with small files...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067015#comment-17067015
 ] 

Wes McKinney edited comment on ARROW-3329 at 3/25/20, 7:23 PM:
---

[~jacek.pliszka] can you show us what is going wrong when you are trying to 
follow the source build instructions in 

https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst

? Please be sure to clean out all temporary files from the git repository with 
{{git clean -fdx .}}


was (Author: wesmckinn):
[~jacek.pliszka] can you show us what is going wrong when you are trying to 
follow the source build instructions in 

https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst

? 

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Jacek Pliszka (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Pliszka updated ARROW-3329:
-
Comment: was deleted

(was: Actually, if you agree to it, we can have it in 0.17.0 - then I can work 
on tests without compiling it and add tests in the next release.)

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7642) [Rust] Create build.rs to generate flatbuffers files

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-7642:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Create build.rs to generate flatbuffers files
> 
>
> Key: ARROW-7642
> URL: https://issues.apache.org/jira/browse/ARROW-7642
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> We should take the logic from the regen.sh [1] bash script and convert it 
> into a Rust build.rs script that can run in CI. This would require flatc to 
> be installed to be able to build the project.
>  
> [1] https://github.com/apache/arrow/blob/master/rust/arrow/regen.sh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8213) [Python][Dataset] Opening a dataset with a local incorrect path gives confusing error message

2020-03-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8213:
-
Summary: [Python][Dataset] Opening a dataset with a local incorrect path 
gives confusing error message  (was: [Python][Dataste] Opening a dataset with a 
local incorrect path gives confusing error message)

> [Python][Dataset] Opening a dataset with a local incorrect path gives 
> confusing error message
> -
>
> Key: ARROW-8213
> URL: https://issues.apache.org/jira/browse/ARROW-8213
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 0.17.0
>
>
> Even after the previous PRs related to local paths 
> (https://github.com/apache/arrow/pull/6643, 
> (https://github.com/apache/arrow/pull/6655), I don't find the user experience 
> optimal in case you are working with local files, and pass a wrong, 
> non-existent path (eg due to a typo).
> Currently, you get this error:
> {code}
> >>> dataset = ds.dataset("data_with_typo.parquet", format="parquet")
> ...
> ArrowInvalid: URI has empty scheme: 'data_with_typo.parquet'
> {code}
> where "URI has empty scheme" is rather confusing for the user in case of a 
> non-existent path.  I think ideally we should raise a "No such file or 
> directory" error.
> I am not fully sure what the best solution is, as {{FileSystem.from_uri}} can 
> also give other errors that we do want to propagate to the user. 
> The most straightforward option I can think of is checking whether "URI has 
> empty scheme" is in the error message and then rewording it, but that's not 
> very clean (a rough sketch of this idea follows below).
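
A rough illustration of that idea (the helper name is hypothetical and this is not 
the actual implementation; it only assumes that {{FileSystem.from_uri}} raises 
{{ArrowInvalid}} with the message shown above):

{code:python}
import pyarrow as pa
from pyarrow.fs import FileSystem

def _filesystem_from_uri_or_path(path):
    # reword the confusing "empty scheme" error for plain local paths
    try:
        return FileSystem.from_uri(path)
    except pa.ArrowInvalid as exc:
        if "URI has empty scheme" in str(exc):
            raise FileNotFoundError(
                "No such file or directory: '{}'".format(path)) from exc
        raise
{code}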



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067012#comment-17067012
 ] 

Wes McKinney commented on ARROW-3329:
-

I think we should have some basic "it works" tests at the Python level
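
One possible shape for such a test, mirroring the cast from this report (a sketch 
only, not the final test):

{code:python}
from decimal import Decimal

import pyarrow as pa

def test_decimal_38_4_to_int64():
    # decimal(38, 4) -> int64, including a null, should preserve the values
    arr = pa.array([Decimal("1234"), None], type=pa.decimal128(38, 4))
    assert arr.cast(pa.int64()).to_pylist() == [1234, None]
{code}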

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4193) [Rust] Add support for decimal data type

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-4193:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] Add support for decimal data type
> 
>
> Key: ARROW-4193
> URL: https://issues.apache.org/jira/browse/ARROW-4193
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Andy Grove
>Priority: Minor
>  Labels: beginner
> Fix For: 1.0.0
>
>
> We should add {{Decimal(usize,usize)}} to DataType and add the corresponding 
> array and builder classes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067013#comment-17067013
 ] 

Jacek Pliszka commented on ARROW-3329:
--

Actually, if you agree to it, we can have it in 0.17.0 - then I can work on 
tests without compiling it and add tests in the next release.

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6689) [Rust] [DataFusion] Query execution enhancements for 1.0.0 release

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-6689:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] [DataFusion] Query execution enhancements for 1.0.0 release
> --
>
> Key: ARROW-6689
> URL: https://issues.apache.org/jira/browse/ARROW-6689
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> There are a number of optimizations that can be made to the new query execution, 
> and this is a top-level story to track them all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6691) [Rust] [DataFusion] Use tokio and Futures instead of spawning threads

2020-03-25 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-6691:
--
Fix Version/s: (was: 0.17.0)
   1.0.0

> [Rust] [DataFusion] Use tokio and Futures instead of spawning threads
> -
>
> Key: ARROW-6691
> URL: https://issues.apache.org/jira/browse/ARROW-6691
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: image-2019-12-07-17-54-57-862.png
>
>
> The current implementation of the physical query plan uses "thread::spawn" 
> which is expensive. We should switch to using Futures, async!/await!, and 
> tokio so that we are launching tasks in a thread pool instead and writing 
> idiomatic Rust code with futures combinators to chain actions together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8220) [Python] Make dataset FileFormat objects serializable

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-8220:
--

Assignee: Krisztian Szucs

> [Python] Make dataset FileFormat objects serializable
> -
>
> Key: ARROW-8220
> URL: https://issues.apache.org/jira/browse/ARROW-8220
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.17.0
>
>
> Similar to ARROW-8060 and ARROW-8059, the FileFormats also need to be pickleable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8208:
-
Description: 
Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround proposed here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder if it can work on a file as I get an exception with the following 
code:
{code:python}
ParquetDataset('data.parquet',
   filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
dataset. Moreover I read some information in the following JIRA ticket: 
ARROW-1796

So I'm not sure that a ParquetDataset can use row_group statistics to filter 
specific row_group in a file (in a dataset or not)?

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today I'm forced to filter the row_groups in each file manually with pyarrow, 
which prevents me from using the ParquetDataset partition filtering functionality.

The row groups are really useful because they prevent filling the filesystem 
with small files...

  was:
Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround proposed here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder if it can work on a file as I get an exception with the following 
code:
{code:python}
ParquetDataset('data.parquet',
   filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
dataset. Moreover I read some information in the following JIRA ticket:
 https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure that a ParquetDataset can use row_group statistics to filter 
specific row_group in a file (in a dataset or not)?

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today I'm forced to filter the row_groups in each file manually with pyarrow, 
which prevents me from using the ParquetDataset partition filtering functionality.

The row groups are really useful because they prevent filling the filesystem 
with small files...


> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christophe Clienti
>Priority: Major
>  Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder if it can work on a file as I get an exception with the 
> following code:
> {code:python}
> ParquetDataset('data.parquet',
>filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> dataset. Moreover I read some information in the following JIRA ticket: 
> ARROW-1796
> So I'm not sure that a ParquetDataset can use row_group statistics to filter 
> specific row_group in a file (in a dataset or not)?
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today I'm forced to filter the row_groups in each file manually with pyarrow, 
> which prevents me from using the ParquetDataset partition filtering functionality.
> The row groups are really useful because they prevent filling the filesystem 
> with small files...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067004#comment-17067004
 ] 

Antoine Pitrou commented on ARROW-3329:
---

{code:python}
>>> arr = pa.array([Decimal("1234"), None], type=pa.decimal128(19,10))
>>> arr
[
  1234.00,
  null
]
>>> arr.cast(pa.int16())
[
  1234,
  null
]
>>> arr.cast(pa.int8())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    arr.cast(pa.int8())
  File "pyarrow/array.pxi", line 674, in pyarrow.lib.Array.cast
    check_status(Cast(_context(), self.ap[0], type.sp_type,
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
ArrowInvalid: Integer value out of bounds

>>> arr.cast(pa.int8(), safe=False)
[
  -46,
  null
]
{code}


> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067002#comment-17067002
 ] 

Jacek Pliszka commented on ARROW-3329:
--

Well, maybe we can close it - tests in C++ do cover all options.

Do nulls work in Python as well? Options?

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066999#comment-17066999
 ] 

Jacek Pliszka edited comment on ARROW-3329 at 3/25/20, 7:09 PM:


Oh, it works, great. 

Some tests would be good - unfortunately I am still not able to compile it.

I've made some progress but it will take me some more time.


was (Author: jacek.pliszka):
i wish to but very unlikely.

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8220) [Python] Make dataset FileFormat objects serializable

2020-03-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8220:


 Summary: [Python] Make dataset FileFormat objects serializable
 Key: ARROW-8220
 URL: https://issues.apache.org/jira/browse/ARROW-8220
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Similar to ARROW-8060 and ARROW-8059, the FileFormats also need to be pickleable.
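
For reference, the round trip this should enable would look roughly like the 
following (a sketch; {{ParquetFileFormat}} is just used as an example):

{code:python}
import pickle

import pyarrow.dataset as ds

fmt = ds.ParquetFileFormat()
restored = pickle.loads(pickle.dumps(fmt))
assert isinstance(restored, ds.ParquetFileFormat)
{code}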



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8219) [Rust] sqlparser crate needs to be bumped to version 0.2.5

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8219:
--
Labels: pull-request-available  (was: )

> [Rust] sqlparser crate needs to be bumped to version 0.2.5
> --
>
> Key: ARROW-8219
> URL: https://issues.apache.org/jira/browse/ARROW-8219
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.16.0
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Jacek Pliszka (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066999#comment-17066999
 ] 

Jacek Pliszka commented on ARROW-3329:
--

i wish to but very unlikely.

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8219) [Rust] sqlparser crate needs to be bumped to version 0.2.5

2020-03-25 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-8219:
--

 Summary: [Rust] sqlparser crate needs to be bumped to version 0.2.5
 Key: ARROW-8219
 URL: https://issues.apache.org/jira/browse/ARROW-8219
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Affects Versions: 0.16.0
Reporter: Paddy Horan
Assignee: Paddy Horan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7771) [Developer] Use ARROW_TMPDIR environment variable in the verification scripts instead of TMPDIR

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7771.
-
Resolution: Fixed

Issue resolved by pull request 6718
[https://github.com/apache/arrow/pull/6718]

> [Developer] Use ARROW_TMPDIR environment variable in the verification scripts 
> instead of TMPDIR
> ---
>
> Key: ARROW-7771
> URL: https://issues.apache.org/jira/browse/ARROW-7771
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See discussion 
> https://github.com/apache/arrow/pull/6344#issuecomment-582128686



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066990#comment-17066990
 ] 

Antoine Pitrou edited comment on ARROW-3329 at 3/25/20, 6:58 PM:
-

It works here:
{code:python}
>>> arr = pa.array([Decimal("1234")], type=pa.decimal128(19,10))
>>> arr
[
  1234.00
]
>>> arr.cast(pa.int16())
[
  1234
]
{code}

[~jacek.pliszka] I assume you want to add some tests? Or is it something else?


was (Author: pitrou):
It works here:
{code}
>>> arr = pa.array([Decimal("1234")], type=pa.decimal128(19,10))
>>> arr
[
  1234.00
]
>>> arr.cast(pa.int16())
[
  1234
]
{code}

[~jacek.pliszka] I assume you want to add some tests? Or is it something else?

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066990#comment-17066990
 ] 

Antoine Pitrou commented on ARROW-3329:
---

It works here:
{code}
>>> arr = pa.array([Decimal("1234")], type=pa.decimal128(19,10))
>>> arr
[
  1234.00
]
>>> arr.cast(pa.int16())
[
  1234
]
{code}

[~jacek.pliszka] I assume you want to add some tests? Or is it something else?

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6941) [C++] Unpin gtest in build environment

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6941:

Fix Version/s: (was: 0.17.0)
   1.0.0

> [C++] Unpin gtest in build environment
> --
>
> Key: ARROW-6941
> URL: https://issues.apache.org/jira/browse/ARROW-6941
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Follow up to failure triaged in ARROW-6834



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8142) [C++] Casting a chunked array with 0 chunks critical failure

2020-03-25 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-8142.
---
Resolution: Fixed

Issue resolved by pull request 6668
[https://github.com/apache/arrow/pull/6668]

> [C++] Casting a chunked array with 0 chunks critical failure
> 
>
> Key: ARROW-8142
> URL: https://issues.apache.org/jira/browse/ARROW-8142
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Florian Jetter
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> When casting the schema of an empty table from a dict-encoded to a non-dict-encoded 
> type, a critical error is raised and not handled, causing the interpreter to 
> shut down.
> This only happens after a parquet roundtrip
>  
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> df = pd.DataFrame({"col": ["a"]}).astype({"col": "category"}).iloc[:0]
> table = pa.Table.from_pandas(df)
> field = table.schema[0]
> new_field = pa.field(field.name, field.type.value_type, field.nullable, 
> field.metadata)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> table = pq.read_table(reader)
> schema = table.schema.remove(0).insert(0, new_field)
> new_table = table.cast(schema)
> assert new_table.schema == schema
>  {code}
>  
> Output
> {code:java}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0318 09:55:14.266649 299722176 table.cc:47] Check failed: (chunks.size()) > 
> (0) cannot construct ChunkedArray from empty vector and omitted type {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7733) [Developer] Install locally a new enough version of Go for release verification script

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7733:
---

Assignee: Wes McKinney

> [Developer] Install locally a new enough version of Go for release 
> verification script
> --
>
> Key: ARROW-7733
> URL: https://issues.apache.org/jira/browse/ARROW-7733
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This will ensure that if a developer has a too-old version of Go installed on 
> their system, the release verification will still work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7733) [Developer] Install locally a new enough version of Go for release verification script

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7733.
-
Resolution: Fixed

Issue resolved by pull request 6710
[https://github.com/apache/arrow/pull/6710]

> [Developer] Install locally a new enough version of Go for release 
> verification script
> --
>
> Key: ARROW-7733
> URL: https://issues.apache.org/jira/browse/ARROW-7733
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This will ensure that if a developer has a too-old version of Go installed on 
> their system, the release verification will still work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6915) [Developer] Do not overwrite minor release version with merge script, even if not specified by committer

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6915.
-
Resolution: Fixed

Issue resolved by pull request 6708
[https://github.com/apache/arrow/pull/6708]

> [Developer] Do not overwrite minor release version with merge script, even if 
> not specified by committer
> 
>
> Key: ARROW-6915
> URL: https://issues.apache.org/jira/browse/ARROW-6915
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Not every committer knows to write "$MAJOR_VERSION,$MINOR_VERSION" for the 
> fix version when merging



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6915) [Developer] Do not overwrite minor release version with merge script, even if not specified by committer

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6915:

Fix Version/s: (was: 0.15.1)

> [Developer] Do not overwrite minor release version with merge script, even if 
> not specified by committer
> 
>
> Key: ARROW-6915
> URL: https://issues.apache.org/jira/browse/ARROW-6915
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Not every committer knows to write "$MAJOR_VERSION,$MINOR_VERSION" for the 
> fix version when merging



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-6915) [Developer] Do not overwrite minor release version with merge script, even if not specified by committer

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-6915:
-

> [Developer] Do not overwrite minor release version with merge script, even if 
> not specified by committer
> 
>
> Key: ARROW-6915
> URL: https://issues.apache.org/jira/browse/ARROW-6915
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.1, 0.17.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Not every committer knows to write "$MAJOR_VERSION,$MINOR_VERSION" for the 
> fix version when merging



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6915) [Developer] Do not overwrite minor release version with merge script, even if not specified by committer

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6915.
-
Resolution: Fixed

Issue resolved by pull request 6708
[https://github.com/apache/arrow/pull/6708]

> [Developer] Do not overwrite minor release version with merge script, even if 
> not specified by committer
> 
>
> Key: ARROW-6915
> URL: https://issues.apache.org/jira/browse/ARROW-6915
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.1, 0.17.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Not every committer knows to write "$MAJOR_VERSION,$MINOR_VERSION" for the 
> fix version when merging



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6915) [Developer] Do not overwrite minor release version with merge script, even if not specified by committer

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6915:

Fix Version/s: 0.15.1

> [Developer] Do not overwrite minor release version with merge script, even if 
> not specified by committer
> 
>
> Key: ARROW-6915
> URL: https://issues.apache.org/jira/browse/ARROW-6915
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.1, 0.17.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Not every committer knows to write "$MAJOR_VERSION,$MINOR_VERSION" for the 
> fix version when merging



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7919) [R] install_arrow() should conda install if appropriate

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7919.
-
Resolution: Fixed

Issue resolved by pull request 6709
[https://github.com/apache/arrow/pull/6709]

> [R] install_arrow() should conda install if appropriate
> ---
>
> Key: ARROW-7919
> URL: https://issues.apache.org/jira/browse/ARROW-7919
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Like, check {{if (grepl("conda", R.Version()$platform))}} and if so then 
> {{system("conda install ...")}}. Error if nightly == TRUE because we don't 
> host conda nightlies yet.
> This would help with issues like https://github.com/apache/arrow/issues/6448



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5845) [Java] Implement converter between Arrow record batches and Avro records

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5845:

Fix Version/s: (was: 0.17.0)
   1.0.0

> [Java] Implement converter between Arrow record batches and Avro records
> 
>
> Key: ARROW-5845
> URL: https://issues.apache.org/jira/browse/ARROW-5845
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> It would be useful for applications which need to convert Avro data to Arrow 
> data.
> This is an adapter which converts data with an existing API (like the JDBC 
> adapter) rather than a native reader (like ORC).
> We implement this through the Avro Java project, receiving params like the Avro 
> Decoder/Schema/DatumReader and returning a VectorSchemaRoot. For each data 
> type we have a consumer class as below to get Avro data and write it into the 
> vector, avoiding boxing/unboxing (e.g. GenericRecord#get returns Object):
> {code:java}
> public class AvroIntConsumer implements Consumer {
>
>   private final IntWriter writer;
>
>   public AvroIntConsumer(IntVector vector) {
>     this.writer = new IntWriterImpl(vector);
>   }
>
>   @Override
>   public void consume(Decoder decoder) throws IOException {
>     writer.writeInt(decoder.readInt());
>     writer.setPosition(writer.getPosition() + 1);
>   }
> }
> {code}
> We intend to support primitive and complex types (null values represented 
> via a union type with a null type); size limits and field selection could be 
> optional for users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3329) [Python] Error casting decimal(38, 4) to int64

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066950#comment-17066950
 ] 

Wes McKinney commented on ARROW-3329:
-

[~jacek.pliszka] do you think you'll be able to tackle this in the next week or 
less?

> [Python] Error casting decimal(38, 4) to int64
> --
>
> Key: ARROW-3329
> URL: https://issues.apache.org/jira/browse/ARROW-3329
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Python version : 3.6.5
> Pyarrow version : 0.10.0
>Reporter: Kavita Sheth
>Assignee: Jacek Pliszka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Git issue LInk : https://github.com/apache/arrow/issues/2627
> I want to cast pyarrow table column from decimal(38,4) to int64.
> col.cast(pa.int64())
> Error:
>  File "pyarrow/table.pxi", line 443, in pyarrow.lib.Column.cast
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
>  pyarrow.lib.ArrowNotImplementedError: No cast implemented from decimal(38, 
> 4) to int64
> Python version : 3.6.5
>  Pyarrow version : 0.10.0
> is it not implemented yet or I am not using it correctly? If not implemented 
> yet, then any work around to cast columns?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8060) [Python] Make dataset Expression objects serializable

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-8060.

Resolution: Fixed

Issue resolved by pull request 6702
[https://github.com/apache/arrow/pull/6702]

> [Python] Make dataset Expression objects serializable
> -
>
> Key: ARROW-8060
> URL: https://issues.apache.org/jira/browse/ARROW-8060
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> It would be good to be able to pickle pyarrow.dataset.Expression objects (eg 
> for use in dask.distributed)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7755) [Python] Windows wheel cannot be installed on Python 3.8

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7755.

Resolution: Fixed

> [Python] Windows wheel cannot be installed on Python 3.8
> 
>
> Key: ARROW-7755
> URL: https://issues.apache.org/jira/browse/ARROW-7755
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Critical
> Fix For: 0.17.0
>
>
> {code}
> λ pip install 
> C:\tmp\arrow-verify-release-wheels\pyarrow-0.16.0-cp38-cp38m-win_amd64.whl 
> ERROR: pyarrow-0.16.0-cp38-cp38m-win_amd64.whl is not a supported wheel on 
> this platform.
> {code}
> The wheel came from
> https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> The "m" ABI tag appears to have been removed in Python 3.8
> https://github.com/pypa/setuptools/pull/1822
> Locally I have pip 20.0.2, wheel 0.34.1, and setuptools 45.1.0.post20200127



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7755) [Python] Windows wheel cannot be installed on Python 3.8

2020-03-25 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066938#comment-17066938
 ] 

Krisztian Szucs commented on ARROW-7755:


I've removed the dirty {{pyarrow-0.16.0-cp38-cp38m-win_amd64.*}} files from the 
0.16 bintray release, but I left them under the 0.16.0-rc2 tag.

> [Python] Windows wheel cannot be installed on Python 3.8
> 
>
> Key: ARROW-7755
> URL: https://issues.apache.org/jira/browse/ARROW-7755
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Critical
> Fix For: 0.17.0
>
>
> {code}
> λ pip install 
> C:\tmp\arrow-verify-release-wheels\pyarrow-0.16.0-cp38-cp38m-win_amd64.whl 
> ERROR: pyarrow-0.16.0-cp38-cp38m-win_amd64.whl is not a supported wheel on 
> this platform.
> {code}
> The wheel came from
> https://bintray.com/apache/arrow/download_file?file_path=python-rc%2F0.16.0-rc2%2Fpyarrow-0.16.0-cp38-cp38m-win_amd64.whl
> The "m" ABI tag appears to have been removed in Python 3.8
> https://github.com/pypa/setuptools/pull/1822
> Locally I have pip 20.0.2, wheel 0.34.1, and setuptools 45.1.0.post20200127



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8209) [Python] Accessing duplicate column of Table by name gives wrong error

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8209:

Fix Version/s: 0.17.0

> [Python] Accessing duplicate column of Table by name gives wrong error
> --
>
> Key: ARROW-8209
> URL: https://issues.apache.org/jira/browse/ARROW-8209
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: beginner
> Fix For: 0.17.0
>
>
> When you have a table with duplicate column names and you try to access this 
> column, you get an error about the column not existing:
> {code}
> >>> table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
> ...                  names=['a', 'b', 'a'])
> >>> table.column('a')
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> <ipython-input-...> in <module>
> ----> 1 table.column('a')
> ~/scipy/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.column()
> KeyError: 'Column a does not exist in table'
> {code}
> It should rather give an error message about the column name being duplicate.
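
In the meantime, a workaround sketch (not the fix this issue asks for) is to fall 
back to positional access when a name is duplicated:

{code:python}
import pyarrow as pa

table = pa.table(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])],
    names=["a", "b", "a"],
)
# Table.column also accepts an integer index, which avoids the ambiguous name
indices = [i for i, name in enumerate(table.schema.names) if name == "a"]
columns = [table.column(i) for i in indices]
{code}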



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8216) [R] filter method for Dataset doesn't distinguish between empty strings and NAs

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8216:

Summary: [R] filter method for Dataset doesn't distinguish between empty 
strings and NAs  (was: filter method for Dataset doesn't distinguish between 
empty strings and NAs)

> [R] filter method for Dataset doesn't distinguish between empty strings and 
> NAs
> ---
>
> Key: ARROW-8216
> URL: https://issues.apache.org/jira/browse/ARROW-8216
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.16.0
> Environment: R 3.6.3, Windows 10
>Reporter: Sam Albers
>Priority: Minor
>
>  
> I have just noticed some slightly odd behaviour with the filter method for 
> Dataset. 
>  
> {code:java}
> library(arrow)
> library(dplyr)
> packageVersion("arrow")
> #> [1] '0.16.0.20200323'
> ## Make sample parquet
> starwars$hair_color[starwars$hair_color == "brown"] <- ""
> dir <- tempdir()
> fpath <- file.path(dir, "data.parquet")
> write_parquet(starwars, fpath)
> ## df in memory
> df_mem <- starwars %>%
>  filter(hair_color == "")
> ## reading from the parquet
> df_parquet <- read_parquet(fpath) %>%
>  filter(hair_color == "")
> ## using open_dataset
> df_dataset <- open_dataset(dir) %>%
>  filter(hair_color == "") %>%
>  collect()
> identical(df_mem, df_parquet)
> #> [1] TRUE
> identical(df_mem, df_dataset)
> #> [1] FALSE
> {code}
>  
>  
> I'm pretty sure all these should return the same data.frame. Am I missing 
> something?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2672) [Python] Build ORC extension in manylinux1 wheels

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066936#comment-17066936
 ] 

Wes McKinney commented on ARROW-2672:
-

The issue is ARROW-7811 so no need to create a new one

> [Python] Build ORC extension in manylinux1 wheels
> -
>
> Key: ARROW-2672
> URL: https://issues.apache.org/jira/browse/ARROW-2672
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See discussion in https://github.com/apache/arrow/issues/2100. We just need 
> to set {{export PYARROW_WITH_ORC=1}} in the manylinux1 build



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066933#comment-17066933
 ] 

Wes McKinney commented on ARROW-8208:
-

This functionality is being reimplemented using the new C++ Datasets framework. 
I don't know what the timeline is for that yet though. See ARROW-3764

> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christophe Clienti
>Priority: Major
>  Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder if it can work on a file as I get an exception with the 
> following code:
> {code:python}
> ParquetDataset('data.parquet',
>filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> datasets. Moreover, I read some information in the following JIRA ticket:
>  https://issues.apache.org/jira/browse/ARROW-1796
> So I'm not sure whether a ParquetDataset can use row_group statistics to filter 
> specific row_groups in a file (whether in a dataset or not).
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today, with pyarrow, I'm forced to filter the row_groups in each file manually, 
> which prevents me from using the ParquetDataset partition filtering functionality.
> Row groups are really useful because they prevent filling the filesystem 
> with small files...
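> A rough sketch of the per-file workaround described above (this is not a built-in
> pyarrow API; it only uses the publicly exposed row-group metadata, and
> {{read_matching_row_groups}} is a hypothetical helper):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def read_matching_row_groups(path, column, value):
>     # Scan row-group statistics and read only the row groups whose
>     # [min, max] range for `column` can contain `value`.
>     pf = pq.ParquetFile(path)
>     tables = []
>     for i in range(pf.metadata.num_row_groups):
>         rg = pf.metadata.row_group(i)
>         keep = True
>         for j in range(rg.num_columns):
>             col = rg.column(j)
>             if col.path_in_schema == column and col.statistics is not None:
>                 st = col.statistics
>                 if st.has_min_max and not (st.min <= value <= st.max):
>                     keep = False  # statistics exclude this row group
>                     break
>         if keep:
>             tables.append(pf.read_row_group(i))
>     # Note: raises if no row group matches; handle the empty case as needed.
>     return pa.concat_tables(tables)
>
> # e.g. read_matching_row_groups('data.parquet', 'ticker', 'AAPL').to_pandas()
> {code}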



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8208:

Labels: dataset dataset-parquet-read  (was: )

> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christophe Clienti
>Priority: Major
>  Labels: dataset, dataset-parquet-read
>
> Hello,
> I tried to use the row_group filtering at the file level with an instance of 
> ParquetDataset without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder if it can work on a file as I get an exception with the 
> following code:
> {code:python}
> ParquetDataset('data.parquet',
>filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> datasets. Moreover, I read some information in the following JIRA ticket:
>  https://issues.apache.org/jira/browse/ARROW-1796
> So I'm not sure whether a ParquetDataset can use row_group statistics to filter 
> specific row_groups in a file (whether in a dataset or not).
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row_group filtering.
> Today, with pyarrow, I'm forced to filter the row_groups in each file manually, 
> which prevents me from using the ParquetDataset partition filtering functionality.
> Row groups are really useful because they prevent filling the filesystem 
> with small files...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7771) [Developer] Use ARROW_TMPDIR environment variable in the verification scripts instead of TMPDIR

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-7771:
---
Summary: [Developer] Use ARROW_TMPDIR environment variable in the 
verification scripts instead of TMPDIR  (was: [Release] Use ARROW_TMPDIR 
environment variable in the verification scripts instead of TMPDIR)

> [Developer] Use ARROW_TMPDIR environment variable in the verification scripts 
> instead of TMPDIR
> ---
>
> Key: ARROW-7771
> URL: https://issues.apache.org/jira/browse/ARROW-7771
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See discussion 
> https://github.com/apache/arrow/pull/6344#issuecomment-582128686



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7850) [Packaging][Python] Document how to install nightly built wheels

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-7850:
---
Fix Version/s: (was: 0.17.0)

> [Packaging][Python] Document how to install nightly built wheels
> 
>
> Key: ARROW-7850
> URL: https://issues.apache.org/jira/browse/ARROW-7850
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> Follow-up work on https://github.com/apache/arrow/pull/6366#issue-371626256
> As per comment 
> https://github.com/apache/arrow/pull/6366#issuecomment-585750794
> It'd also be nice to resolve the version selection issue described in the 
> comments above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-7850) [Packaging][Python] Document how to install nightly built wheels

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reopened ARROW-7850:


> [Packaging][Python] Document how to install nightly built wheels
> 
>
> Key: ARROW-7850
> URL: https://issues.apache.org/jira/browse/ARROW-7850
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.17.0
>
>
> Follow-up work on https://github.com/apache/arrow/pull/6366#issue-371626256
> As per comment 
> https://github.com/apache/arrow/pull/6366#issuecomment-585750794
> It'd also be nice to resolve the version selection issue described in the 
> comments above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7850) [Packaging][Python] Document how to install nightly built wheels

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs closed ARROW-7850.
--
Resolution: Duplicate

> [Packaging][Python] Document how to install nightly built wheels
> 
>
> Key: ARROW-7850
> URL: https://issues.apache.org/jira/browse/ARROW-7850
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> Follow-up work on https://github.com/apache/arrow/pull/6366#issue-371626256
> As per comment 
> https://github.com/apache/arrow/pull/6366#issuecomment-585750794
> It'd also be nice to resolve the version selection issue described in the 
> comments above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7850) [Packaging][Python] Document how to install nightly built wheels

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-7850.

Resolution: Duplicate

> [Packaging][Python] Document how to install nightly built wheels
> 
>
> Key: ARROW-7850
> URL: https://issues.apache.org/jira/browse/ARROW-7850
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.17.0
>
>
> Follow-up work on https://github.com/apache/arrow/pull/6366#issue-371626256
> As per comment 
> https://github.com/apache/arrow/pull/6366#issuecomment-585750794
> It'd be also nice to resolve the version selection issue described in the 
> comments above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7850) [Packaging][Python] Document how to install nightly built wheels

2020-03-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs closed ARROW-7850.
--

> [Packaging][Python] Document how to install nightly built wheels
> 
>
> Key: ARROW-7850
> URL: https://issues.apache.org/jira/browse/ARROW-7850
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.17.0
>
>
> Follow-up work on https://github.com/apache/arrow/pull/6366#issue-371626256
> As per comment 
> https://github.com/apache/arrow/pull/6366#issuecomment-585750794
> It'd also be nice to resolve the version selection issue described in the 
> comments above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7771) [Release] Use ARROW_TMPDIR environment variable in the verification scripts instead of TMPDIR

2020-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7771:
--
Labels: pull-request-available  (was: )

> [Release] Use ARROW_TMPDIR environment variable in the verification scripts 
> instead of TMPDIR
> -
>
> Key: ARROW-7771
> URL: https://issues.apache.org/jira/browse/ARROW-7771
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> See discussion 
> https://github.com/apache/arrow/pull/6344#issuecomment-582128686



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8218) [C++] Parallelize decompression at field level in experimental IPC compression code

2020-03-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8218:
---

 Summary: [C++] Parallelize decompression at field level in 
experimental IPC compression code
 Key: ARROW-8218
 URL: https://issues.apache.org/jira/browse/ARROW-8218
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


This is follow-up work to ARROW-7979; a minor amount of refactoring will be 
required to move the decompression step out of {{ArrayLoader}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8217) [R][C++] Fix crashing data in test-dataset.R on 32-bit Windows from ARROW-7979

2020-03-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8217:
---

 Summary: [R][C++] Fix crashing data in test-dataset.R on 32-bit 
Windows from ARROW-7979
 Key: ARROW-8217
 URL: https://issues.apache.org/jira/browse/ARROW-8217
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Reporter: Wes McKinney
 Fix For: 0.17.0


If we can obtain a gdb backtrace from the failed test in 
https://github.com/apache/arrow/pull/6638 then we can sort out what's wrong. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

