[jira] [Updated] (ARROW-10228) Donate Julia Implementation

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10228:
---
Labels: pull-request-available  (was: )

> Donate Julia Implementation
> ---
>
> Key: ARROW-10228
> URL: https://issues.apache.org/jira/browse/ARROW-10228
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Jacob Quinn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Contribute pure Julia implementation supporting arrow array types and 
> reading/writing streams/files with the arrow format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10229) [C++][Parquet] Remove left over ARROW_LOG statement.

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10229:
---
Labels: pull-request-available  (was: )

> [C++][Parquet] Remove left over ARROW_LOG statement.
> 
>
> Key: ARROW-10229
> URL: https://issues.apache.org/jira/browse/ARROW-10229
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10229) [C++][Parquet] Remove left over ARROW_LOG statement.

2020-10-07 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-10229:
---

 Summary: [C++][Parquet] Remove left over ARROW_LOG statement.
 Key: ARROW-10229
 URL: https://issues.apache.org/jira/browse/ARROW-10229
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield
Assignee: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10228) Donate Julia Implementation

2020-10-07 Thread Jacob Quinn (Jira)
Jacob Quinn created ARROW-10228:
---

 Summary: Donate Julia Implementation
 Key: ARROW-10228
 URL: https://issues.apache.org/jira/browse/ARROW-10228
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Jacob Quinn


Contribute pure Julia implementation supporting arrow array types and 
reading/writing streams/files with the arrow format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10227) [Ruby] Use a table size as the default for parquet chunk_size

2020-10-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-10227.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8391
[https://github.com/apache/arrow/pull/8391]

> [Ruby] Use a table size as the default for parquet chunk_size
> -
>
> Key: ARROW-10227
> URL: https://issues.apache.org/jira/browse/ARROW-10227
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Shoichi Kagawa
>Assignee: Shoichi Kagawa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10227) [Ruby] Use a table size as the default for parquet chunk_size

2020-10-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-10227:


Assignee: Shoichi Kagawa

> [Ruby] Use a table size as the default for parquet chunk_size
> -
>
> Key: ARROW-10227
> URL: https://issues.apache.org/jira/browse/ARROW-10227
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Shoichi Kagawa
>Assignee: Shoichi Kagawa
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10227) [Ruby] Use a table size as the default for parquet chunk_size

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10227:
---
Labels: pull-request-available  (was: )

> [Ruby] Use a table size as the default for parquet chunk_size
> -
>
> Key: ARROW-10227
> URL: https://issues.apache.org/jira/browse/ARROW-10227
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Shoichi Kagawa
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8518) [Python] Create tools to enable optional components (like Gandiva, Flight) to be built and deployed as separate Python packages

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8518:
--
Labels: pull-request-available  (was: )

> [Python] Create tools to enable optional components (like Gandiva, Flight) to 
> be built and deployed as separate Python packages
> ---
>
> Key: ARROW-8518
> URL: https://issues.apache.org/jira/browse/ARROW-8518
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Our current monolithic approach to Python packaging isn't likely to be 
> sustainable long-term.
> At a high level, I would propose a structure like this:
> {code}
> pip install pyarrow  # core package containing libarrow, libarrow_python, and 
> any other common bundled C++ library dependencies
> pip install pyarrow-flight  # installs pyarrow, pyarrow_flight
> pip install pyarrow-gandiva # installs pyarrow, pyarrow_gandiva
> {code}
> We can maintain the semantic appearance of a single {{pyarrow}} package by 
> having thin API modules that would look like
> {code}
> CONTENTS OF pyarrow/flight.py
> from pyarrow_flight import *
> {code}
> Obviously, this is more difficult to build and package:
> * CMake and setup.py files must be refactored a bit so that we can reuse code 
> between the parent and child packages
> * Separate conda and wheel packages must be produced. With conda this seems 
> more straightforward but since the child wheels depend on the parent core 
> wheel, the build process seems more complicated
> In any case, I don't think these challenges are insurmountable. This will 
> have several benefits:
> * Smaller installation footprint for simple use cases (though note we are 
> STILL duplicating shared libraries in the wheels, which is quite bad)
> * Less developer anxiety about expanding the scope of what Python code is 
> shipped from apache/arrow. If in 5 years we are shipping 5 different Python 
> wheels with each Apache Arrow release, that sounds completely fine to me. 
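As a rough illustration of the thin API module idea above, a minimal sketch of what such a shim could look like (the pyarrow_flight module name and the error message are hypothetical, following the proposed package layout):

{code:python}
# pyarrow/flight.py -- hypothetical thin shim re-exporting an optional component.
# Assumes a separately installed "pyarrow-flight" wheel provides "pyarrow_flight".
try:
    from pyarrow_flight import *  # noqa: F401,F403
except ImportError as exc:
    raise ImportError(
        "pyarrow.flight requires the optional 'pyarrow-flight' package; "
        "install it with: pip install pyarrow-flight"
    ) from exc
{code}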



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10227) [Ruby] Use a table size as the default for parquet chunk_size

2020-10-07 Thread Shoichi Kagawa (Jira)
Shoichi Kagawa created ARROW-10227:
--

 Summary: [Ruby] Use a table size as the default for parquet 
chunk_size
 Key: ARROW-10227
 URL: https://issues.apache.org/jira/browse/ARROW-10227
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Ruby
Reporter: Shoichi Kagawa






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10207) [C++] Unary kernels that results in a list have no preallocated offset buffer

2020-10-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-10207:
-
Summary: [C++] Unary kernels that results in a list have no preallocated 
offset buffer  (was: C++] Unary kernels that results in a list have no 
preallocated offset buffer)

> [C++] Unary kernels that results in a list have no preallocated offset buffer
> -
>
> Key: ARROW-10207
> URL: https://issues.apache.org/jira/browse/ARROW-10207
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Maarten Breddels
>Priority: Minor
> Fix For: 3.0.0
>
>
> I noticed in
> [https://github.com/apache/arrow/pull/8271]
> That a string->list[string] kernel does not have the offsets preallocated in 
> the output. I believe there is a preference for not doing allocations in 
> kernels, so this can be optimized at a higher level. I think it can also be 
> done in this case. 
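For readers unfamiliar with the layout, the offsets buffer in question is the second buffer of a list array. A small pyarrow sketch (purely illustrative, not the kernel code discussed above):

{code:python}
import pyarrow as pa

# A list<string> array carries an int32 offsets buffer alongside its values.
arr = pa.array([["a", "b"], ["c"], []], type=pa.list_(pa.string()))
offsets = arr.buffers()[1]           # buffers()[0] is the validity bitmap
print(offsets.size)                  # 16 bytes: (3 slots + 1) int32 offsets
{code}

It is this buffer that a string->list[string] kernel would either have preallocated for it or allocate itself while running.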



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10224) Build Python 3.9 wheels

2020-10-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-10224:


Assignee: Terence Honles

> Build Python 3.9 wheels
> ---
>
> Key: ARROW-10224
> URL: https://issues.apache.org/jira/browse/ARROW-10224
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Terence Honles
>Assignee: Terence Honles
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Now that Python 3.9 is out, there should be wheels built for it.
> I have taken an initial stab at building the 3.9 wheels and have tested 

> with the docker image {{python:3.9-buster}} with a {{manylinux2010}} build of 
> {{pyarrow}}.
> The goal of this change will be to get a review and identify what is or is 
> not working at this point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209945#comment-17209945
 ] 

Andy Grove commented on ARROW-10226:


Query works fine against tbl files but not against parquet files (it's reading 
the wrong columns somehow). Spark works fine so the issue is not with the 
Parquet files. Really odd to find this now.

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10134) [C++][Dataset] Add ParquetFileFragment::num_row_groups property

2020-10-07 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-10134.
--
Resolution: Fixed

Issue resolved by pull request 8317
[https://github.com/apache/arrow/pull/8317]

> [C++][Dataset] Add ParquetFileFragment::num_row_groups property
> ---
>
> Key: ARROW-10134
> URL: https://issues.apache.org/jira/browse/ARROW-10134
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/6534#issuecomment-699512602, comment 
> by [~rjzamora]:
> bq. it would be great to have access to the total row-group count for the 
> fragment from a {{num_row_groups}} attribute (which pyarrow should be able to 
> get without parsing all row-group metadata/statistics - I think?).
> One question is: does this attribute correspond to the row groups in the 
> parquet file, or to the (subset of) row groups represented by the fragment? 
> I expect the second (so if you do SplitByRowGroup, you would get a fragment 
> with num_row_groups==1), but this might be a potentially confusing aspect of 
> the attribute.
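A small pyarrow sketch of the two interpretations discussed above (the file path is a placeholder, and it assumes the num_row_groups property from this issue is available):

{code:python}
import pyarrow.dataset as ds

# Placeholder path to a Parquet file with multiple row groups.
dataset = ds.dataset("example.parquet", format="parquet")
fragment = next(iter(dataset.get_fragments()))

# First interpretation: row groups in the whole file backing the fragment.
print(fragment.num_row_groups)

# Second interpretation: after SplitByRowGroup each sub-fragment represents
# exactly one row group, so it would report num_row_groups == 1.
for piece in fragment.split_by_row_group():
    print(piece.num_row_groups)
{code}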



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8296) [C++][Dataset] IpcFileFormat should support writing files with compressed buffers

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8296:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] IpcFileFormat should support writing files with compressed 
> buffers
> -
>
> Key: ARROW-8296
> URL: https://issues.apache.org/jira/browse/ARROW-8296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9782) [C++][Dataset] Ability to write ".feather" files with IpcFileFormat

2020-10-07 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-9782.
-
Resolution: Fixed

Issue resolved by pull request 8305
[https://github.com/apache/arrow/pull/8305]

> [C++][Dataset] Ability to write ".feather" files with IpcFileFormat
> ---
>
> Key: ARROW-9782
> URL: https://issues.apache.org/jira/browse/ARROW-9782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python, R
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> With the new dataset writing bindings, one can do {{ds.write_dataset(data, 
> format="feather")}} (Python) or {{write_dataset(data, format = "feather")}} 
> (R) to write a dataset to feather files. 
> However, because "feather" is just an alias for the IpcFileFormat, it will 
> currently write all files with the {{.ipc}} extension.   
> I think this can be a bit confusing, since many people will be more familiar 
> with "feather" and expect such an extension. 
> (more generally, ".ipc" is maybe not the best default, since it's not a very 
> descriptive extension. Something like ".arrow" might be better?)
> cc [~npr] [~bkietz]
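For reference, a pyarrow sketch of how the written file names can already be controlled explicitly (the table and output directory are placeholders, and the basename_template argument may not be available in every version):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})  # placeholder data

# "feather" is an alias for the IPC file format; without an explicit template
# the written files currently end up with the ".ipc" extension discussed above.
ds.write_dataset(table, "out_dir", format="feather",
                 basename_template="part-{i}.feather")
{code}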



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209937#comment-17209937
 ] 

Andy Grove commented on ARROW-10226:


The query also returns the wrong results ... grouping by l_comment (high 
cardinality) instead of l_returnflag (low cardinality)

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6720) [JAVA][C++]Support Parquet Read and Write in Java

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-6720:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [JAVA][C++]Support Parquet Read and Write in Java
> -
>
> Key: ARROW-6720
> URL: https://issues.apache.org/jira/browse/ARROW-6720
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Affects Versions: 0.15.0
>Reporter: Chendi.Xue
>Assignee: Chendi.Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 38.5h
>  Remaining Estimate: 0h
>
> We added a new Java interface to support Parquet read and write from HDFS or 
> local files.
> The motivation is that when loading and dumping Parquet data in Java, we can 
> currently only use row-based put and get methods. Since Arrow already has a 
> C++ implementation for loading and dumping Parquet, we wrapped that code as 
> Java APIs.
> In testing we noticed that, for our workload, performance improved more than 
> 2x compared with the row-based load and dump, so we want to contribute the 
> code to Arrow.
> Since this is a completely independent change, there are no changes to the 
> existing Arrow code. We added two folders: java/adapter/parquet and 
> cpp/src/jni/parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6720) [JAVA][C++]Support Parquet Read and Write in Java

2020-10-07 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209927#comment-17209927
 ] 

Krisztian Szucs commented on ARROW-6720:


It would be nice to have a status update here. Until then, I'm postponing it to 3.0.

> [JAVA][C++]Support Parquet Read and Write in Java
> -
>
> Key: ARROW-6720
> URL: https://issues.apache.org/jira/browse/ARROW-6720
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java
>Affects Versions: 0.15.0
>Reporter: Chendi.Xue
>Assignee: Chendi.Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 38.5h
>  Remaining Estimate: 0h
>
> We added a new Java interface to support Parquet read and write from HDFS or 
> local files.
> The motivation is that when loading and dumping Parquet data in Java, we can 
> currently only use row-based put and get methods. Since Arrow already has a 
> C++ implementation for loading and dumping Parquet, we wrapped that code as 
> Java APIs.
> In testing we noticed that, for our workload, performance improved more than 
> 2x compared with the row-based load and dump, so we want to contribute the 
> code to Arrow.
> Since this is a completely independent change, there are no changes to the 
> existing Arrow code. We added two folders: java/adapter/parquet and 
> cpp/src/jni/parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10056) [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of flatbuffer-encoded Footer failed.

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10056:

Fix Version/s: (was: 2.0.0)
   3.0.0

> [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of 
> flatbuffer-encoded Footer failed.
> ---
>
> Key: ARROW-10056
> URL: https://issues.apache.org/jira/browse/ARROW-10056
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: CentOS7
> conda environment with pyarrow 1.0.1, numpy 1.19.1 and pandas 1.1.1
>Reporter: Gert Hulselmans
>Priority: Major
> Fix For: 3.0.0
>
>
> pyarrow writes an invalid Feather v2 file, which it can't read afterwards.
> {code:java}
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> The following code reproduces the problem for me:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.feather as pf  # needed for the pf.* calls below
>
> nbr_regions = 1223024
> nbr_motifs = 4891
>
> # Create (big) dataframe.
> df = pd.DataFrame(
>     np.arange(nbr_regions * nbr_motifs, dtype=np.float32).reshape((nbr_regions, nbr_motifs)),
>     index=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
>     columns=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs')
> )
>
> # Transpose dataframe.
> df_transposed = df.transpose()
>
> # Write transposed dataframe to Feather v2 format.
> pf.write_feather(df_transposed, 'df_transposed.feather')
>
> # Trying to read the transposed dataframe back from Feather v2 format results in this error:
> df_transposed_read = pf.read_feather('df_transposed.feather')
> {code}
> {code:python}
> ---
> OSError   Traceback (most recent call last)
>  in 
> > 1 df_transposed_read = pf.read_feather('df_transposed.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py
>  in read_feather(source, columns, use_threads, memory_map)
> 213 """
> 214 _check_pandas_version()
> --> 215 return (read_table(source, columns=columns, memory_map=memory_map)
> 216 .to_pandas(use_threads=use_threads))
> 217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py
>  in read_table(source, columns, memory_map)
> 235 """
> 236 reader = ext.FeatherReader()
> --> 237 reader.open(source, use_memory_map=memory_map)
> 238
> 239 if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi
>  in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Later I discovered that it happens also if the original dataframe is created 
> in the transposed order:
> {code:python}
> # Create (big) dataframe.
> df_without_transpose = pd.DataFrame(
>     np.arange(nbr_motifs * nbr_regions, dtype=np.float32).reshape((nbr_motifs, nbr_regions)),
>     index=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs'),
>     columns=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
> )
>
> pf.write_feather(df_without_transpose, 'df_without_transpose.feather')
> df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> ---
> OSError   Traceback (most recent call last)
>  in 
> > 1 df_without_transpose_read = 
> pf.read_feather('df_without_transpose.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py
>  in read_feather(source, columns, use_threads, memory_map)
> 213 """
> 214 _check_pandas_version()
> --> 215 return (read_table(source, columns=columns, memory_map=memory_map)
> 216 .to_pandas(use_threads=use_threads))
> 217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py
>  in read_table(source, columns, memory_map)
> 235 """
> 236 reader = ext.FeatherReader()
> --> 237 reader.open(source, use_memory_map=memory_map)
> 238
> 239 if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi
>  in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi
>  in 

[jira] [Updated] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10226:
---
Description: 
I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and when 
I try to run the TPC-H benchmark, it never completes and eventually uses up 
all 64 GB of RAM.

I can run Spark against the data set and the query completes in 24 seconds, 
which IIRC is how long it took before.

It is possible that something is odd on my environment, but it is also 
possible/likely that this is a real bug.

I am investigating this and will update the Jira once I know more.

I also went back to old commits that were working for me before and they show 
the same issue so I don't think this is related to a recent code change.

  was:
I re-installed my desktop a few days ago and when I try to run the TPC-H 
benchmark, it never completes and eventually uses up all 64 GB of RAM.

I can run Spark against the data set and the query completes in 24 seconds, 
which IIRC is how long it took before.

It is possible that something is odd on my environment, but it is also 
possible/likely that this is a real bug.

I am investigating this and will update the Jira once I know more.


> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-10226:
--

 Summary: [Rust] [DataFusion] TPC-H query 1 no longer completes for 
100GB dataset
 Key: ARROW-10226
 URL: https://issues.apache.org/jira/browse/ARROW-10226
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 2.0.0


I re-installed my desktop a few days ago and when I try to run the TPC-H 
benchmark, it never completes and eventually uses up all 64 GB of RAM.

I can run Spark against the data set and the query completes in 24 seconds, 
which IIRC is how long it took before.

It is possible that something is odd on my environment, but it is also 
possible/likely that this is a real bug.

I am investigating this and will update the Jira once I know more.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf

2020-10-07 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209913#comment-17209913
 ] 

Krisztian Szucs commented on ARROW-7494:


It's not likely to land in 2.0 so postponing.

> [Java] Remove reader index and writer index from ArrowBuf
> -
>
> Key: ARROW-7494
> URL: https://issues.apache.org/jira/browse/ARROW-7494
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Jacques Nadeau
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The reader and writer indexes and their associated functionality don't belong 
> on a chunk of memory; they exist only because of the inheritance from ByteBuf. 
> As part of removing the ByteBuf inheritance, we should also remove the reader 
> and writer indexes from ArrowBuf. They waste heap memory for a rarely used 
> utility. In general, a slice can be used instead of the reader/writer index 
> pattern.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-7494:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Java] Remove reader index and writer index from ArrowBuf
> -
>
> Key: ARROW-7494
> URL: https://issues.apache.org/jira/browse/ARROW-7494
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Jacques Nadeau
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The reader and writer indexes and their associated functionality don't belong 
> on a chunk of memory; they exist only because of the inheritance from ByteBuf. 
> As part of removing the ByteBuf inheritance, we should also remove the reader 
> and writer indexes from ArrowBuf. They waste heap memory for a rarely used 
> utility. In general, a slice can be used instead of the reader/writer index 
> pattern.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9843) [C++] Implement Between trinary kernel

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9843:
--
Labels: pull-request-available  (was: )

> [C++] Implement Between trinary kernel
> --
>
> Key: ARROW-9843
> URL: https://issues.apache.org/jira/browse/ARROW-9843
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A specialized {{between(arr, left_bound, right_bound)}} kernel would avoid 
> multiple scans and AND operation
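For reference, the two-scan composition such a kernel would replace looks roughly like this in pyarrow (illustrative values only; the between() kernel itself is what the issue proposes and is not shown here):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 5, 10, 15, 20])

# Today: two comparison scans plus an AND over the intermediate bitmaps.
mask = pc.and_(pc.greater_equal(arr, 5), pc.less_equal(arr, 15))
print(mask)  # [false, true, true, true, false]

# A dedicated between(arr, left_bound, right_bound) kernel could compute the
# same mask in a single pass over the input.
{code}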



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9572) [CI][Homebrew] Properly enable Gandiva and improve testing

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9572:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [CI][Homebrew] Properly enable Gandiva and improve testing
> --
>
> Key: ARROW-9572
> URL: https://issues.apache.org/jira/browse/ARROW-9572
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Packaging
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 3.0.0
>
>
> ARROW-9086 enabled Gandiva in our Homebrew formula, but when I tried to add 
> that to the official Homebrew formula at release time, it failed. See some 
> discussion at https://github.com/Homebrew/homebrew-core/pull/58581, though 
> unfortunately the build logs are gone. 
> It turns out that the testing that Homebrew does is more thorough than the 
> install/audit we do in CI. See 
> https://github.com/Homebrew/homebrew-core/pull/58581/checks?check_run_id=915732878
>  for example. They install, build the bottle, then remove all dependencies 
> and reinstall the bottle. Since this failed, what I think it means is that 
> `llvm` is not a build-only dependency for Gandiva--it built but couldn't run 
> successfully because `llvm` had been removed.
> cc [~kou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9572) [CI][Homebrew] Properly enable Gandiva and improve testing

2020-10-07 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209908#comment-17209908
 ] 

Krisztian Szucs commented on ARROW-9572:


Since it has been added upstream I'm postponing to 3.0.

> [CI][Homebrew] Properly enable Gandiva and improve testing
> --
>
> Key: ARROW-9572
> URL: https://issues.apache.org/jira/browse/ARROW-9572
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Packaging
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> ARROW-9086 enabled Gandiva in our Homebrew formula, but when I tried to add 
> that to the official Homebrew formula at release time, it failed. See some 
> discussion at https://github.com/Homebrew/homebrew-core/pull/58581, though 
> unfortunately the build logs are gone. 
> It turns out that the testing that Homebrew does is more thorough than the 
> install/audit we do in CI. See 
> https://github.com/Homebrew/homebrew-core/pull/58581/checks?check_run_id=915732878
>  for example. They install, build the bottle, then remove all dependencies 
> and reinstall the bottle. Since this failed, what I think it means is that 
> `llvm` is not a build-only dependency for Gandiva--it built but couldn't run 
> successfully because `llvm` had been removed.
> cc [~kou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10225) [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests

2020-10-07 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10225:
--

Assignee: Neville Dipale

> [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests
> ---
>
> Key: ARROW-10225
> URL: https://issues.apache.org/jira/browse/ARROW-10225
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>
> The Arrow spec makes the null bitmap optional if an array has no nulls 
> [~carols10cents], so the tests were failing because we're comparing 
> `None` with a 100% populated bitmap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10225) [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10225:
---
Labels: pull-request-available  (was: )

> [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests
> ---
>
> Key: ARROW-10225
> URL: https://issues.apache.org/jira/browse/ARROW-10225
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Arrow spec makes the null bitmap optional if an array has no nulls 
> [~carols10cents], so the tests were failing because we're comparing 
> `None` with a 100% populated bitmap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10225) [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests

2020-10-07 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10225:
--

 Summary: [Rust] [Parquet] Fix null bitmap comparisons in roundtrip 
tests
 Key: ARROW-10225
 URL: https://issues.apache.org/jira/browse/ARROW-10225
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


The Arrow spec makes the null bitmap optional if an array has no nulls 
[~carols10cents], so the tests were failing because we're comparing 
`None` with a 100% populated bitmap.
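As a small illustration of why the validity bitmap can legitimately be absent (shown with pyarrow for brevity; the spec behaviour is the same for the Rust implementation):

{code:python}
import pyarrow as pa

# An array with no nulls may omit its validity bitmap entirely...
no_nulls = pa.array([1, 2, 3])
print(no_nulls.buffers()[0])    # None -- no bitmap allocated

# ...while an array containing a null carries an actual bitmap.
with_null = pa.array([1, None, 3])
print(with_null.buffers()[0])   # <pyarrow.Buffer ...>
{code}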



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef

2020-10-07 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209906#comment-17209906
 ] 

Krisztian Szucs commented on ARROW-10215:
-

Postponing to 3.0.

> [Rust] [DataFusion] Rename "Source" typedef
> ---
>
> Key: ARROW-10215
> URL: https://issues.apache.org/jira/browse/ARROW-10215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Minor
> Fix For: 3.0.0
>
>
> The name "Source" for this type doesn't make sense to me. I would like to 
> discuss alternate names for it.
> {code:java}
> type Source = Box; {code}
> My first thoughts are:
>  * RecordBatchIterator
>  * RecordBatchStream
>  * SendableRecordBatchReader



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10215:

Fix Version/s: (was: 2.0.0)
   3.0.0

> [Rust] [DataFusion] Rename "Source" typedef
> ---
>
> Key: ARROW-10215
> URL: https://issues.apache.org/jira/browse/ARROW-10215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Minor
> Fix For: 3.0.0
>
>
> The name "Source" for this type doesn't make sense to me. I would like to 
> discuss alternate names for it.
> {code:java}
> type Source = Box; {code}
> My first thoughts are:
>  * RecordBatchIterator
>  * RecordBatchStream
>  * SendableRecordBatchReader



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9847) [Rust] Inconsistent use of import arrow:: vs crate::arrow::

2020-10-07 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209903#comment-17209903
 ] 

Krisztian Szucs commented on ARROW-9847:


[~andygrove] Is it going to make it into 2.0? If not, please postpone it to 3.0.

> [Rust] Inconsistent use of import arrow:: vs crate::arrow::
> ---
>
> Key: ARROW-9847
> URL: https://issues.apache.org/jira/browse/ARROW-9847
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> Both the DataFusion and Parquet crates have a mix of "import arrow::" and 
> "import crate::arrow::" and we should standardize on one or the other.
>  
> Whichever standard we use should be enforced in build.rs so CI fails on PRs 
> that do not follow the standard.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9001) [R] Box outputs as correct type in call_function

2020-10-07 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209901#comment-17209901
 ] 

Krisztian Szucs commented on ARROW-9001:


[~romainfrancois] [~npr] Based on the github conversation I assume we can 
postpone it to 3.0. Please reset the fix version if it can be included in 2.0.

> [R] Box outputs as correct type in call_function
> 
>
> Key: ARROW-9001
> URL: https://issues.apache.org/jira/browse/ARROW-9001
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Wes McKinney
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> This would prevent segfaults caused by putting the externalptr in the wrong 
> kind of R6 container, plus allow us to skip a bunch of hackery where we try 
> to track or guess the class of the object returned from call_function (which 
> could be Array, ChunkedArray, Scalar, RecordBatch, or Table).
> reticulate does something along these lines; it's a subclass of environment, 
> I think, but not exactly an R6 class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9001) [R] Box outputs as correct type in call_function

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9001:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [R] Box outputs as correct type in call_function
> 
>
> Key: ARROW-9001
> URL: https://issues.apache.org/jira/browse/ARROW-9001
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Wes McKinney
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> This would prevent segfaults caused by putting the externalptr in the wrong 
> kind of R6 container, plus allow us to skip a bunch of hackery where we try 
> to track or guess the class of the object returned from call_function (which 
> could be Array, ChunkedArray, Scalar, RecordBatch, or Table).
> reticulate does something along these lines; it's a subclass of environment, 
> I think, but not exactly an R6 class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10168) [Rust] [Parquet] Extend arrow schema conversion to projected fields

2020-10-07 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10168.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8354
[https://github.com/apache/arrow/pull/8354]

> [Rust] [Parquet] Extend arrow schema conversion to projected fields
> ---
>
> Key: ARROW-10168
> URL: https://issues.apache.org/jira/browse/ARROW-10168
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When writing Arrow data to Parquet, we serialise the schema's IPC 
> representation. This schema is then read back by the Parquet reader, and used 
> to preserve the array type information from the original Arrow data.
> However, we do not rely on the above mechanism when reading projected columns 
> from a Parquet file; i.e. if we have a file with 3 columns but only read 2 of 
> them, we do not yet rely on the serialised arrow schema, and can thus lose 
> type information.
> This behaviour was deliberately left out, as the function 
> *parquet_to_arrow_schema_by_columns* does not check for the existence of the 
> arrow schema in the metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10144) [Flight] Add support for using the TLS_SNI extension

2020-10-07 Thread James Duong (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Duong resolved ARROW-10144.
-
Resolution: Invalid

Verified as already working by [~tifflhl]

> [Flight] Add support for using the TLS_SNI extension
> 
>
> Key: ARROW-10144
> URL: https://issues.apache.org/jira/browse/ARROW-10144
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
> Fix For: 3.0.0
>
>
> When using encryption, add support for the TLS_SNI extension 
> (https://en.wikipedia.org/wiki/Server_Name_Indication).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10144) [Flight] Add support for using the TLS_SNI extension

2020-10-07 Thread James Duong (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Duong reassigned ARROW-10144:
---

Assignee: James Duong

> [Flight] Add support for using the TLS_SNI extension
> 
>
> Key: ARROW-10144
> URL: https://issues.apache.org/jira/browse/ARROW-10144
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
> Fix For: 3.0.0
>
>
> When using encryption, add support for the TLS_SNI extension 
> (https://en.wikipedia.org/wiki/Server_Name_Indication).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9853) [RUST] Implement "take" kernel for dictionary arrays

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9853:
---
Fix Version/s: (was: 3.0.0)
   2.0.0

> [RUST] Implement "take" kernel for dictionary arrays
> 
>
> Key: ARROW-9853
> URL: https://issues.apache.org/jira/browse/ARROW-9853
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9585) [Rust] Remove duplicated to-do line in DataFusion readme

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9585:
---
Fix Version/s: 2.0.0

> [Rust] Remove duplicated to-do line in DataFusion readme
> 
>
> Key: ARROW-9585
> URL: https://issues.apache.org/jira/browse/ARROW-9585
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation, Rust - DataFusion
>Reporter: Paul Whalen
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9536) Missing parameters in PlasmaOutOfMemoryException.java

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9536:
---
Fix Version/s: 2.0.0

> Missing parameters in PlasmaOutOfMemoryException.java
> -
>
> Key: ARROW-9536
> URL: https://issues.apache.org/jira/browse/ARROW-9536
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.17.0, 0.17.1
>Reporter: Xudingyu
>Assignee: Xudingyu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> When catch PlasmaOutOfMemoryException
> It shows
>  Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.arrow.plasma.exceptions.PlasmaOutOfMemoryException: method 
> (Ljava/lang/String;)V not found
>   at org.apache.arrow.plasma.PlasmaClientJNI.create(Native Method)
>   at org.apache.arrow.plasma.PlasmaClient.create(PlasmaClient.java:143)
>   at 
> org.apache.arrow.plasma.PlasmaClientTest.doPlasmaOutOfMemoryExceptionTest(PlasmaClientTest.java:287)
>   at 
> org.apache.arrow.plasma.PlasmaClientTest.main(PlasmaClientTest.java:308)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10224) Build Python 3.9 wheels

2020-10-07 Thread Terence Honles (Jira)
Terence Honles created ARROW-10224:
--

 Summary: Build Python 3.9 wheels
 Key: ARROW-10224
 URL: https://issues.apache.org/jira/browse/ARROW-10224
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Terence Honles


Now that Python 3.9 is out, there should be wheels built for it.

I have taken an initial stab at building the 3.9 wheels and have tested with 
the docker image {{python:3.9-buster}} with a {{manylinux2010}} build of 
{{pyarrow}}.

The goal of this change will be to get a review and identify what is or is not 
working at this point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10144) [Flight] Add support for using the TLS_SNI extension

2020-10-07 Thread Tiffany Lam (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209883#comment-17209883
 ] 

Tiffany Lam commented on ARROW-10144:
-

[~jduong] I have verified that there are existing TLS SNI configurations 
available in the FlightClient implementations and they work. 

*Java Client* - When connecting with a FlightClient using TLS, add builder 
option overrideHostname.

*Python Client* - When connecting with a FlightClient using TLS, add a 
connection argument called generic_options. 
[generic-options|https://github.com/apache/arrow/blob/732e333c49555f696e5c1885629e2fafa8d0fd65/python/pyarrow/_flight.pyx#L1014]
 is a list of tuples that stores other generic gRPC connection arguments. To 
the list of generic options, add the following tuple 
('grpc.ssl_target_name_override', server_name).

*C++ Client* - Python client wraps the C++ client. 

 

This ticket can be closed.
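A minimal Python sketch of the override described above (the endpoint, certificate path, and server name are placeholders):

{code:python}
import pyarrow.flight as flight

# Placeholder CA certificate used to validate the server's TLS certificate.
with open("ca.pem", "rb") as f:
    root_certs = f.read()

client = flight.FlightClient(
    "grpc+tls://10.0.0.5:8815",          # e.g. an IP or load-balancer address
    tls_root_certs=root_certs,
    # Generic gRPC channel argument used for the SNI / hostname override.
    generic_options=[("grpc.ssl_target_name_override", "flight.example.com")],
)
{code}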

> [Flight] Add support for using the TLS_SNI extension
> 
>
> Key: ARROW-10144
> URL: https://issues.apache.org/jira/browse/ARROW-10144
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Java, Python
>Reporter: James Duong
>Priority: Major
> Fix For: 3.0.0
>
>
> When using encryption, add support for the TLS_SNI extension 
> (https://en.wikipedia.org/wiki/Server_Name_Indication).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9508) [Release][APT][Yum] Enable verification for arm64 binaries

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9508:
---
Fix Version/s: (was: 1.0.0)
   2.0.0

> [Release][APT][Yum] Enable verification for arm64 binaries
> --
>
> Key: ARROW-9508
> URL: https://issues.apache.org/jira/browse/ARROW-9508
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9328) [C++][Gandiva] Add LTRIM, RTRIM, BTRIM functions for string

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9328:
---
Fix Version/s: (was: 1.0.0)
   2.0.0

> [C++][Gandiva] Add LTRIM, RTRIM, BTRIM functions for string
> ---
>
> Key: ARROW-9328
> URL: https://issues.apache.org/jira/browse/ARROW-9328
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10223) [C++] Use timestamp parsers for date32() CSV parsing

2020-10-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10223:
---
Priority: Minor  (was: Major)

> [C++] Use timestamp parsers for date32() CSV parsing
> 
>
> Key: ARROW-10223
> URL: https://issues.apache.org/jira/browse/ARROW-10223
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Minor
>
> Followup to ARROW-9964. Consider the simple CSV (well, one column so there's 
> no comma needed)
> {code}
> "time"
> "23/09/2020"
> {code}
> If I specify the column as type timestamp, I can provide a timestamp_parser 
> to parse it. But if I specify it as date32, the timestamp_parsers don't get 
> invoked and I get an error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9534) [Rust] [DataFusion] Implement functions for creating literal expressions for all types

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9534:
---
Fix Version/s: (was: 1.0.0)
   2.0.0

> [Rust] [DataFusion] Implement functions for creating literal expressions for 
> all types
> --
>
> Key: ARROW-9534
> URL: https://issues.apache.org/jira/browse/ARROW-9534
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Minor
>  Labels: beginner, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> In logical_plan.rs we have the function `lit_str`. We should add equivalents 
> for all supported literal types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10223) [C++] Use timestamp parsers for date32() CSV parsing

2020-10-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10223:
---

 Summary: [C++] Use timestamp parsers for date32() CSV parsing
 Key: ARROW-10223
 URL: https://issues.apache.org/jira/browse/ARROW-10223
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
Assignee: Antoine Pitrou


Followup to ARROW-9964. Consider the simple CSV (well, one column so there's no 
comma needed)

{code}
"time"
"23/09/2020"
{code}

If I specify the column as type timestamp, I can provide a timestamp_parser to 
parse it. But if I specify it as date32, the timestamp_parsers don't get 
invoked and I get an error.
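A pyarrow sketch of the working timestamp path versus the failing date32 path (the file name is a placeholder for a CSV with the contents shown above):

{code:python}
import pyarrow as pa
import pyarrow.csv as csv

# Works: the custom parser is consulted when the column is typed as timestamp.
opts_ts = csv.ConvertOptions(
    column_types={"time": pa.timestamp("s")},
    timestamp_parsers=["%d/%m/%Y"],
)
table = csv.read_csv("example.csv", convert_options=opts_ts)

# Fails at the time of this report: the same parsers are not consulted when
# the column is typed as date32, so "23/09/2020" cannot be converted.
opts_date = csv.ConvertOptions(
    column_types={"time": pa.date32()},
    timestamp_parsers=["%d/%m/%Y"],
)
table = csv.read_csv("example.csv", convert_options=opts_date)  # raises
{code}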




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9205) [Documentation] Fix typos in Columnar.rst

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9205:
---
Fix Version/s: (was: 1.0.0)
   2.0.0

> [Documentation] Fix typos in Columnar.rst
> -
>
> Key: ARROW-9205
> URL: https://issues.apache.org/jira/browse/ARROW-9205
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9621) [Python] test_move_file() is failed with fsspec 0.8.0

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9621:
---
Fix Version/s: 2.0.0

> [Python] test_move_file() is failed with fsspec 0.8.0
> -
>
> Key: ARROW-9621
> URL: https://issues.apache.org/jira/browse/ARROW-9621
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kouhei Sutou
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.1, 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It works with fsspec 0.7.4: 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34414340/job/os9t8kj9t4afgym9
> Failed with fsspec 0.8.0: 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/34422556/job/abedu9it26qvfxkm
> {noformat}
> == FAILURES 
> ===
> __ test_move_file[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] 
> ___
> fs = 
> pathfn = . at 0x003D04F70B58>
> def test_move_file(fs, pathfn):
> s = pathfn('test-move-source-file')
> t = pathfn('test-move-target-file')
> 
> with fs.open_output_stream(s):
> pass
> 
> >   fs.move(s, t)
> pyarrow\tests\test_fs.py:798: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> pyarrow\_fs.pyx:519: in pyarrow._fs.FileSystem.move
> check_status(self.fs.Move(source, destination))
> pyarrow\_fs.pyx:1024: in pyarrow._fs._cb_move
> handler.move(frombytes(src), frombytes(dest))
> pyarrow\fs.py:199: in move
> self.fs.mv(src, dest, recursive=True)
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:744: in mv
> self.copy(path1, path2, recursive=recursive, maxdepth=maxdepth)
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\spec.py:719: in copy
> self.cp_file(p1, p2, **kwargs)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> self =  0x003D01096A78>
> path1 = 'test-move-source-file/', path2 = 'test-move-target-file/'
> kwargs = {'maxdepth': None}
> def cp_file(self, path1, path2, **kwargs):
> if self.isfile(path1):
> >   self.store[path2] = MemoryFile(self, path2, 
> > self.store[path1].getbuffer())
> E   KeyError: 'test-move-source-file/'
> C:\Miniconda37-x64\envs\arrow\lib\site-packages\fsspec\implementations\memory.py:134:
>  KeyError
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9973) [Java] JDBC DateConsumer does not allow dates before epoch

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9973:
---
Fix Version/s: 2.0.0

> [Java] JDBC DateConsumer does not allow dates before epoch
> --
>
> Key: ARROW-9973
> URL: https://issues.apache.org/jira/browse/ARROW-9973
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.0.1
>Reporter: Patrick Woody
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> DateConsumer seems to do an overflow check when converting from a SQL Date 
> and treats a negative TimeUnit.MILLISECONDS.toDays() result as invalid. That is 
> exactly how any date before 1970-01-01 is represented, though, so unfortunately 
> the adapter breaks for these values.
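
For illustration only (plain Python, not the Java adapter code): dates before the epoch are legitimately represented as negative day counts, so a negative value by itself does not indicate overflow.

{code:python}
from datetime import date

epoch = date(1970, 1, 1)
print((date(1969, 12, 31) - epoch).days)  # -1: a valid pre-epoch date
print((date(1970, 1, 2) - epoch).days)    # 1
{code}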



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9853) [RUST] Implement "take" kernel for dictionary arrays

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-9853:
---
Fix Version/s: 3.0.0

> [RUST] Implement "take" kernel for dictionary arrays
> 
>
> Key: ARROW-9853
> URL: https://issues.apache.org/jira/browse/ARROW-9853
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 1.0.0
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9997) [Python] StructScalar.as_py() fails if the type has duplicate field names

2020-10-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-9997:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Python] StructScalar.as_py() fails if the type has duplicate field names
> -
>
> Key: ARROW-9997
> URL: https://issues.apache.org/jira/browse/ARROW-9997
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 3.0.0
>
>
> {{StructScalar}} currently extends an abstract Mapping interface. Since the 
> type allows duplicate field names we cannot provide that API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10183) [C++] Create a ForEach library function that runs on an iterator of futures

2020-10-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10183:
-
Summary: [C++] Create a ForEach library function that runs on an iterator 
of futures  (was: Create a ForEach library function that runs on an iterator of 
futures)

> [C++] Create a ForEach library function that runs on an iterator of futures
> ---
>
> Key: ARROW-10183
> URL: https://issues.apache.org/jira/browse/ARROW-10183
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
> Attachments: arrow-continuation-flow.jpg
>
>
> This method should take in an iterator of futures and a callback and pull an 
> item off the iterator, "await" it, run the callback on it, and then fetch the 
> next item from the iterator.
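
A sketch of the intended control flow, written in Python asyncio purely for illustration (the real implementation is C++):

{code:python}
import asyncio

async def for_each(future_iter, callback):
    # Pull one future at a time, "await" it, run the callback on the result,
    # and only then fetch the next item from the iterator.
    for fut in future_iter:
        callback(await fut)

async def main():
    futures = (asyncio.sleep(0.01, result=i) for i in range(3))
    await for_each(futures, print)

asyncio.run(main())
{code}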



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1614) [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

2020-10-07 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209872#comment-17209872
 ] 

Rok Mihevc commented on ARROW-1614:
---

I'd like to contribute to this work and will have time available next week. 
[~chrish42] could I help out somehow?

> [C++] Add a Tensor logical value type with constant dimensions, implemented 
> using ExtensionType
> ---
>
> Key: ARROW-1614
> URL: https://issues.apache.org/jira/browse/ARROW-1614
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Format
>Reporter: Wes McKinney
>Priority: Major
>
> In an Arrow table, we would like to add support for a column whose value 
> cells each contain a tensor, with all tensors having the same 
> dimensions. These would be stored as a binary value, plus some metadata to 
> store type and shape/strides.
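
A rough Python sketch of the storage idea (fixed-size binary plus shape metadata); the class name and shape are invented and this is not the proposed C++ design:

{code:python}
import pyarrow as pa

class FixedShapeTensorType(pa.PyExtensionType):
    """Each cell stores one 2x3 float32 tensor as fixed-size binary."""

    def __init__(self):
        self.shape = (2, 3)
        storage = pa.binary(4 * 2 * 3)  # 6 float32 values per cell
        pa.PyExtensionType.__init__(self, storage)

    def __reduce__(self):
        # Required so the type can be pickled/reconstructed.
        return FixedShapeTensorType, ()

print(FixedShapeTensorType())
{code}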



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10139) [C++] Add support for building arrow_testing without building tests

2020-10-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-10139.
--
Resolution: Fixed

Issue resolved by pull request 8356
[https://github.com/apache/arrow/pull/8356]

> [C++] Add support for building arrow_testing without building tests
> ---
>
> Key: ARROW-10139
> URL: https://issues.apache.org/jira/browse/ARROW-10139
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuri
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {{ARROW_BUILD_TESTS}} installs the following arrow_testing related files 
> implicitly:
> {noformat}
> lib/cmake/arrow/ArrowTestingConfig.cmake
> lib/cmake/arrow/ArrowTestingConfigVersion.cmake
> lib/cmake/arrow/ArrowTestingTargets-%%CMAKE_BUILD_TYPE%%.cmake
> lib/cmake/arrow/ArrowTestingTargets.cmake
> lib/cmake/arrow/FindArrowTesting.cmake
> lib/libarrow_testing.so
> lib/libarrow_testing.so.100
> lib/libarrow_testing.so.100.1.0
> libdata/pkgconfig/arrow-testing.pc
> {noformat}
> If we have {{ARROW_TESTING}} or something, users can do it explicitly.
> The original GitHub bug report: [https://github.com/apache/arrow/issues/8306]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9997) [Python] StructScalar.as_py() fails if the type has duplicate field names

2020-10-07 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209862#comment-17209862
 ] 

Krisztian Szucs commented on ARROW-9997:


I find this issue a bit pressing before the release, but I'm not sure about the 
desired resolution. Perhaps returning a multimap/multidict-like object 
from {{scalar.as_py()}} would work (although it could cause backward 
incompatibilities if users expect dictionary instances). 

[~apitrou] [~jorisvandenbossche] thoughts?
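
For reference, a small sketch of the problem; the list-of-pairs shape at the end is only one hypothetical resolution:

{code:python}
import pyarrow as pa

# Duplicate field names are legal in an Arrow struct type...
typ = pa.struct([("a", pa.int32()), ("a", pa.string())])
print(typ)

# ...but a Python dict can keep only one "a" key, so as_py() cannot
# faithfully return {"a": ...}. A hypothetical alternative would be a list
# of (name, value) pairs, e.g. [("a", 1), ("a", "x")].
{code}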

> [Python] StructScalar.as_py() fails if the type has duplicate field names
> -
>
> Key: ARROW-9997
> URL: https://issues.apache.org/jira/browse/ARROW-9997
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> {{StructScalar}} currently extends an abstract Mapping interface. Since the 
> type allows duplicate field names we cannot provide that API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6607) [Python] Support for set/list columns when converting from Pandas

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-6607:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Python] Support for set/list columns when converting from Pandas
> -
>
> Key: ARROW-6607
> URL: https://issues.apache.org/jira/browse/ARROW-6607
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
> Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in 
> Windows 10
>Reporter: Giora Simchoni
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 3.0.0
>
>
> Hi,
> Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...
> ```python
> import pandas as pd
> df = pd.DataFrame({'a': [1,2,3], 'b': [set([1,2]), set([2,3]), 
> set([3,4,5])]})
> df.to_feather('test.ft')
> ```
> I get:
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 93, in write
>  table = Table.from_pandas(df, preserve_index=False)
>  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in dataframe_to_arrays
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in 
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 487, in convert_column
>  raise e
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 481, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert {1, 2} with type set: did not 
> recognize Python value type when inferring an Arrow data type', 'Conversion 
> failed for column b with type object')
> ```
> And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.
> Questions:
> (1) Is it possible to support these kind of set/list columns?
> (2) Anyone has an idea on how to deal with this? I *cannot* unnest these 
> set/list columns as this would explode the DataFrame. My only other idea is 
> to convert set `{1,2}` into a string `1,2` and parse it after reading the 
> file. And hoping it won't be slow.
>  
> Update:
> With lists column the error is different:
> ```python
> import pandas as pd
> df = pd.DataFrame({'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})
> df.to_feather('test.ft')
> ```
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 97, in write
>  self.writer.write_array(name, col.data.chunk(0))
>  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
>  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: list
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6607) [Python] Support for set/list columns when converting from Pandas

2020-10-07 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209859#comment-17209859
 ] 

Krisztian Szucs commented on ARROW-6607:


[~jorisvandenbossche] I'm postponing it to 3.0 so we can elaborate more on the 
desired behavior. 

> [Python] Support for set/list columns when converting from Pandas
> -
>
> Key: ARROW-6607
> URL: https://issues.apache.org/jira/browse/ARROW-6607
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
> Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in 
> Windows 10
>Reporter: Giora Simchoni
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> Hi,
> Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...
> ```python
> import pandas as pd
> df = pd.DataFrame({'a': [1,2,3], 'b': [set([1,2]), set([2,3]), 
> set([3,4,5])]})
> df.to_feather('test.ft')
> ```
> I get:
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 93, in write
>  table = Table.from_pandas(df, preserve_index=False)
>  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in dataframe_to_arrays
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 496, in 
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 487, in convert_column
>  raise e
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 481, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert {1, 2} with type set: did not 
> recognize Python value type when inferring an Arrow data type', 'Conversion 
> failed for column b with type object')
> ```
> And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.
> Questions:
> (1) Is it possible to support these kind of set/list columns?
> (2) Anyone has an idea on how to deal with this? I *cannot* unnest these 
> set/list columns as this would explode the DataFrame. My only other idea is 
> to convert set `{1,2}` into a string `1,2` and parse it after reading the 
> file. And hoping it won't be slow.
>  
> Update:
> With lists column the error is different:
> ```python
> import pandas as pd
> df = pd.DataFrame({'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})
> df.to_feather('test.ft')
> ```
> ```
> Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2131, in to_feather
>  to_feather(self, fname)
>  File 
> "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py",
>  line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
> line 97, in write
>  self.writer.write_array(name, col.data.chunk(0))
>  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
>  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: list
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10099) [C++][Dataset] Also allow integer partition fields to be dictionary encoded

2020-10-07 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-10099.
--
Resolution: Fixed

Issue resolved by pull request 8367
[https://github.com/apache/arrow/pull/8367]

> [C++][Dataset] Also allow integer partition fields to be dictionary encoded
> ---
>
> Key: ARROW-10099
> URL: https://issues.apache.org/jira/browse/ARROW-10099
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In ARROW-8647, we added the option to indicate that partition field 
> columns should be dictionary encoded, but it currently only does this for 
> string types, not for integer types (with the reasoning that for integers, 
> dictionary encoding does not give any memory efficiency gains). 
> In dask, they have been using categorical dtypes for _all_ partition fields, 
> also if they are integers. They would like to keep doing this (apart from 
> memory efficiency, using categorical/dictionary type also gives information 
> about all uniques values of the column, without having to calculate this), so 
> it would be nice to enable this use case. 
> So I think we could either simply always dictionary encode integers as well when 
> {{max_partition_dictionary_size}} indicates partition fields should be 
> dictionary encoded, or have an additional option to indicate that 
> integer partition fields should also be encoded (if the other option indicates 
> dictionary encoding should be used).
> Based on feedback from the dask PR using the dataset API at 
> https://github.com/dask/dask/pull/6534#issuecomment-698723009
> cc [~rjzamora] [~bkietz]
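
For context, a sketch of how a user would request this from Python (the path is made up, and the {{infer_dictionary}} flag on {{HivePartitioning.discover}} is assumed to be the Python spelling of this option):

{code:python}
import pyarrow.dataset as ds

# Ask partition discovery to dictionary-encode partition fields; with this
# change that also covers integer-typed fields such as year=2020 directories.
part = ds.HivePartitioning.discover(infer_dictionary=True)
dataset = ds.dataset("/path/to/partitioned/data", partitioning=part)
print(dataset.schema)  # partition fields appear as dictionary<...> types
{code}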



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9266) [Python][Packaging] Enable S3 support in macOS wheels

2020-10-07 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-9266.

Resolution: Fixed

Issue resolved by pull request 8315
[https://github.com/apache/arrow/pull/8315]

> [Python][Packaging] Enable S3 support in macOS wheels
> -
>
> Key: ARROW-9266
> URL: https://issues.apache.org/jira/browse/ARROW-9266
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Antoine Pitrou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10222) [C++] Add FileSystem::MakeUri() to serialize file locations to URIs

2020-10-07 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-10222:


 Summary: [C++] Add FileSystem::MakeUri() to serialize file 
locations to URIs
 Key: ARROW-10222
 URL: https://issues.apache.org/jira/browse/ARROW-10222
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.1
Reporter: Ben Kietzman
Assignee: Antoine Pitrou
 Fix For: 3.0.0


Making the transform FS -> URI bijective would greatly simplify unambiguous 
location of files and serialization of filesystems. Something like:

{code}
Result<std::string> FileSystem::MakeUri(std::string path = "/");
{code}

Difficulties:
-  SubTreeFileSystem::MakeUri("/") would probably return a URI referring to its 
base directory in the wrapped filesystem, which wouldn't roundtrip to a 
SubTreeFileSystem with FileSystemFromUri
- Not all of s3's parameters are supported in a URI
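
For orientation, the existing half of the roundtrip as seen from Python; the inverse is shown only as a hypothetical call named after this proposal:

{code:python}
import pyarrow.fs as fs

# Existing direction: URI -> (filesystem, path inside that filesystem)
filesystem, path = fs.FileSystem.from_uri("file:///tmp/data/file.parquet")
print(type(filesystem).__name__, path)

# Proposed inverse (hypothetical, does not exist yet):
# uri = filesystem.make_uri(path)
# assert fs.FileSystem.from_uri(uri) == (filesystem, path)
{code}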



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10204) [RUST] [Datafusion] Test failure in aggregate_grouped_empty with simd feature enabled

2020-10-07 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10204.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8378
[https://github.com/apache/arrow/pull/8378]

> [RUST] [Datafusion] Test failure in aggregate_grouped_empty with simd feature 
> enabled
> -
>
> Key: ARROW-10204
> URL: https://issues.apache.org/jira/browse/ARROW-10204
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code}
>  execution::context::tests::aggregate_grouped_empty stdout 
> thread 'execution::context::tests::aggregate_grouped_empty' panicked at 
> 'assertion failed: `(left == right)`
>   left: `["0,0.0"]`,
>  right: `[]`', datafusion/src/execution/context.rs:883:9
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10221) Javascript toArray() method ignores nulls on some types.

2020-10-07 Thread Ben Schmidt (Jira)
Ben Schmidt created ARROW-10221:
---

 Summary: Javascript toArray() method ignores nulls on some types.
 Key: ARROW-10221
 URL: https://issues.apache.org/jira/browse/ARROW-10221
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.17.1
Reporter: Ben Schmidt


The .toArray() javascript method of vectors includes a shortcut to return the 
underlying typed array; but this doesn't respect null values, and so can return 
the wrong number.

 

```
v = arrow.Vector.from({values: [1, 2, 3, 4, 5, null, 6], type: new arrow.Int32()})

v.toArray()[5] // Incorrectly returns '0'

v.get(5) // Correctly returns null
```

 

Solution: eliminate the fast method and always return JavaScript arrays. It might 
be better to keep the old method for cases where the vector is guaranteed to have no nulls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10220) Cache javascript utf-8 dictionary keys?

2020-10-07 Thread Ben Schmidt (Jira)
Ben Schmidt created ARROW-10220:
---

 Summary: Cache javascript utf-8 dictionary keys?
 Key: ARROW-10220
 URL: https://issues.apache.org/jira/browse/ARROW-10220
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: 1.0.1
Reporter: Ben Schmidt


String decoding from arrow tables is a major bottleneck in using arrow in 
Javascript: it can take a second to decode a million rows. For plain utf-8 types I'm 
not sure what could be done, but some memoization would help utf-8 dictionary 
types.

Currently, the javascript implementation decodes a utf-8 string every time you 
request an item from a dictionary with utf-8 data. If arrow cached the decoded 
strings to a native js Map, routine operations like looping over all the 
entries in a text column might be on the order of 10x faster. Here's an 
observable notebook [benchmarking that and a couple other 
strategies|https://observablehq.com/@bmschmidt/faster-arrow-dictionary-unpacking].

I would file a pull request, but 1) I would have to learn some typescript to do 
so, and 2) this idea may be undesirable because it creates new objects that 
will increase the memory footprint of a table, rather than just using the typed 
arrays.

Some discussion of how the real-world issues here affect the arquero project is 
[here|https://github.com/uwdata/arquero/issues/1].
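
The memoization itself is tiny; a language-agnostic sketch of the idea (Python here, not the actual JS/TypeScript code):

{code:python}
# Decode each distinct dictionary value at most once and reuse the result.
dictionary = [b"apple", b"banana"]   # encoded dictionary values
indices = [0, 1, 0, 0, 1]            # per-row dictionary codes
cache = {}

def value_at(i):
    code = indices[i]
    if code not in cache:            # decode only on first use
        cache[code] = dictionary[code].decode("utf-8")
    return cache[code]

print([value_at(i) for i in range(len(indices))])
{code}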

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10219) [C++] csv::TableReader column names, Read() arguments

2020-10-07 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209762#comment-17209762
 ] 

Neal Richardson commented on ARROW-10219:
-

I didn't know about include_columns, thanks.

Here are two use cases for being able to get the column names without reading the 
whole table:

* R's various CSV readers all let you specify column types as an unnamed vector 
of types; column names can also be specified but via a different argument. But 
the arrow csv reader currently can't do this: you can't specify column types 
while allowing the column names to be read from the file. So in this case, I'd 
like to be able to instantiate a TableReader with the other given options, 
query to get the column names, and then use those to create the fully specified 
TableReader to call Read on.
* Some of R's CSV readers let you specify columns to keep in (or exclude from) 
the resulting data frame either by integer indices or by some expression (e.g. 
{{starts_with("something")}}). In order to pass those to 
{{ConvertOptions::include_columns}}, I need to get the column names from the 
CSV so that I can translate those.
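
For the first use case, a workaround sketch that is possible today (file name and types are made up): peek at the header line to learn the names, then build fully specified options for the real read.

{code:python}
import pyarrow as pa
import pyarrow.csv as csv

# Read just the header line to learn the column names (assumes a simple,
# unquoted header).
with open("data.csv") as f:
    names = f.readline().rstrip("\n").split(",")

# R-style "unnamed vector of types", zipped with the discovered names.
types = [pa.timestamp("s"), pa.float64()]
convert = csv.ConvertOptions(column_types=dict(zip(names, types)))
table = csv.read_csv("data.csv", convert_options=convert)
{code}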

> [C++] csv::TableReader column names, Read() arguments
> -
>
> Key: ARROW-10219
> URL: https://issues.apache.org/jira/browse/ARROW-10219
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 3.0.0
>
>
> Some feature requests:
> * csv::TableReader {{column_names}} method, and/or {{schema}} method. This 
> will (in most cases) require IO to get these from the file, but that's fine. 
> There are use cases (we've seen in R) where it would help to be able to get 
> the names from the file (e.g. when you specify column types, it's a map of 
> column name to type, so you can't currently specify types without also 
> specifying names)
> * Add Read(std::vector) like how feather (and parquet?) have so that you 
> don't have to parse and allocate columns you don't want.
> cc [~apitrou] [~romainfrancois]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3822) [C++] parquet::arrow::FileReader::GetRecordBatchReader may not iterate through chunked columns completely

2020-10-07 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-3822:

Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++] parquet::arrow::FileReader::GetRecordBatchReader may not iterate 
> through chunked columns completely
> -
>
> Key: ARROW-3822
> URL: https://issues.apache.org/jira/browse/ARROW-3822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> EDIT: https://github.com/apache/arrow/pull/3911#issuecomment-473679153
> We don't currently test that all data is iterated through when reading from a 
> Parquet file where the result is chunked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6972) [C#] Should support StructField arrays

2020-10-07 Thread Eric Erhardt (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Erhardt resolved ARROW-6972.
-
Resolution: Fixed

Issue resolved by pull request 8348
[https://github.com/apache/arrow/pull/8348]

> [C#] Should support StructField arrays
> --
>
> Key: ARROW-6972
> URL: https://issues.apache.org/jira/browse/ARROW-6972
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Cameron Murray
>Assignee: Prashanth Govindarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The C# implementation of Arrow does not support struct arrays and, more 
> generally, complex types.
> I notice ARROW-6870 addresses Dictionary arrays, however this is not as 
> flexible as structs (for example, it cannot mix data types).
> The source does have a stub for StructArray, however there is no Builder nor an 
> example of how to use it, so I assume it is not supported.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9964) [C++] CSV date support

2020-10-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9964.
---
Resolution: Fixed

Issue resolved by pull request 8381
[https://github.com/apache/arrow/pull/8381]

> [C++] CSV date support
> --
>
> Key: ARROW-9964
> URL: https://issues.apache.org/jira/browse/ARROW-9964
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is no support for reading date type from CSV file. I'd like to read 
> such a value:
> {code:java}
> 1991-02-03
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10219) [C++] csv::TableReader column names, Read() arguments

2020-10-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209724#comment-17209724
 ] 

Antoine Pitrou commented on ARROW-10219:


I'm not sure I understand #1, can you explain a bit more?
As for #2, by giving {{ConvertOptions::include_columns}} you can already 
restrict which columns you want to convert.

> [C++] csv::TableReader column names, Read() arguments
> -
>
> Key: ARROW-10219
> URL: https://issues.apache.org/jira/browse/ARROW-10219
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 3.0.0
>
>
> Some feature requests:
> * csv::TableReader {{column_names}} method, and/or {{schema}} method. This 
> will (in most cases) require IO to get these from the file, but that's fine. 
> There are use cases (we've seen in R) where it would help to be able to get 
> the names from the file (e.g. when you specify column types, it's a map of 
> column name to type, so you can't currently specify types without also 
> specifying names)
> * Add Read(std::vector) like how feather (and parquet?) have so that you 
> don't have to parse and allocate columns you don't want.
> cc [~apitrou] [~romainfrancois]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10219) [C++] csv::TableReader column names, Read() arguments

2020-10-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10219:
---

 Summary: [C++] csv::TableReader column names, Read() arguments
 Key: ARROW-10219
 URL: https://issues.apache.org/jira/browse/ARROW-10219
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
 Fix For: 3.0.0


Some feature requests:

* csv::TableReader {{column_names}} method, and/or {{schema}} method. This will 
(in most cases) require IO to get these from the file, but that's fine. There 
are use cases (we've seen in R) where it would help to be able to get the names 
from the file (e.g. when you specify column types, it's a map of column name to 
type, so you can't currently specify types without also specifying names)
* Add Read(std::vector) like how feather (and parquet?) have so that you 
don't have to parse and allocate columns you don't want.

cc [~apitrou] [~romainfrancois]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9645) [Python] Deprecate the legacy pyarrow.filesystem interface

2020-10-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9645.
---
Resolution: Fixed

Issue resolved by pull request 8149
[https://github.com/apache/arrow/pull/8149]

> [Python] Deprecate the legacy pyarrow.filesystem interface
> --
>
> Key: ARROW-9645
> URL: https://issues.apache.org/jira/browse/ARROW-9645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The {{pyarrow.filesystem}} interfaces are dubbed "legacy" (in favor of 
> {{pyarrow.fs}}), but at some point we should actually deprecate (and 
> eventually remove) them. 
> There is probably still some work to do before that: ensure the new 
> filesystems can be used instead in all places (eg in pyarrow.parquet), 
> improve the docs about the new filesystems, ..
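
For comparison, a minimal sketch of the replacement API users would migrate to (path is made up):

{code:python}
import pyarrow.fs as fs
import pyarrow.parquet as pq

# New-style filesystem from pyarrow.fs, passed explicitly instead of the
# legacy pyarrow.filesystem classes.
local = fs.LocalFileSystem()
table = pq.read_table("/tmp/data/file.parquet", filesystem=local)
{code}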



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10196) [C++] Add Future::DeferNotOk()

2020-10-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10196.

Resolution: Fixed

Issue resolved by pull request 8362
[https://github.com/apache/arrow/pull/8362]

> [C++] Add Future::DeferNotOk()
> --
>
> Key: ARROW-10196
> URL: https://issues.apache.org/jira/browse/ARROW-10196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Provide a static method mapping Result<Future<T>> -> Future<T>. If the Result 
> is an error, a finished future containing its Status will be constructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9974) [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset

2020-10-07 Thread Ashish Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209705#comment-17209705
 ] 

Ashish Gupta commented on ARROW-9974:
-

Tried...

export MALLOC_MMAP_THRESHOLD_=65536

same error "OSError: Out of memory: malloc of size 131072 failed"

 

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Assignee: Ben Kietzman
>Priority: Critical
>  Labels: dataset
> Fix For: 3.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all files. I updated the pyarrow to 
> latest version 1.0.1 from 0.13.0 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with older version. My machine has 256Gb memory way more than 
> enough to load the data which requires < 10Gb. You can use below code to 
> generate the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
> # create a big dataframe
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
> df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> def read_works():
> # below code works to read
> df = []
> for i in range(5000):
> df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> def read_errors():
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine 
> with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> fnames = []
> for i in range(5000):
> fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10181) [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)

2020-10-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10181.
--
Resolution: Fixed

Issue resolved by pull request 8353
[https://github.com/apache/arrow/pull/8353]

> [Rust] Arrow tests fail to compile on Raspberry Pi (32 bit)
> ---
>
> Key: ARROW-10181
> URL: https://issues.apache.org/jira/browse/ARROW-10181
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Raspberry Pi still tends to use 32-bit operating systems although there is a 
> beta 64-bit version of Raspbian. It would be nice to be able to at least 
> disable these tests when running on 32-bit. 
> {code:java}
> error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:421:25
> |
> 421 | assert_eq!(ceil(100, 10), 10);
> | ^^^
> |
> = note: `#[deny(overflowing_literals)]` on by default
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:422:29
> |
> 422 | assert_eq!(ceil(10, 100), 1);
> | ^^^
> |
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`error: literal out of range for `usize`
>--> arrow/src/util/bit_util.rs:423:25
> |
> 423 | assert_eq!(ceil(100, 10), 10);
> | ^^^
> |
> = note: the literal `100` does not fit into the type `usize` 
> whose range is `0..=4294967295`
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9974) [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset

2020-10-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209676#comment-17209676
 ] 

Antoine Pitrou commented on ARROW-9974:
---

[~kgashish] Can you try what I suggested above?

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Assignee: Ben Kietzman
>Priority: Critical
>  Labels: dataset
> Fix For: 3.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all files. I updated the pyarrow to 
> latest version 1.0.1 from 0.13.0 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with older version. My machine has 256Gb memory way more than 
> enough to load the data which requires < 10Gb. You can use below code to 
> generate the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
> # create a big dataframe
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
> df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> def read_works():
> # below code works to read
> df = []
> for i in range(5000):
> df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> def read_errors():
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine 
> with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> fnames = []
> for i in range(5000):
> fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10030) [Rust] Support fromIter and toIter

2020-10-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10030.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8211
[https://github.com/apache/arrow/pull/8211]

> [Rust] Support fromIter and toIter
> --
>
> Key: ARROW-10030
> URL: https://issues.apache.org/jira/browse/ARROW-10030
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Proposal for comments: 
> [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]
> (dump of the document above)
> Rust Arrow supports two main computational models:
>  # Batch Operations, that leverage some form of vectorization
>  # Element-by-element operations, that emerge in more complex operations
> This document concerns element-by-element operations, that are common outside 
> of the library (and sometimes in the library).
> h2. Element-by-element operations
> These operations are programmatically written as:
>  # Downcast the array to its specific type
>  # Initialize buffers
>  # Iterate over indices and perform the operation, appending to the buffers 
> accordingly
>  # Create ArrayData with the required null bitmap, buffers, childs, etc.
>  # return ArrayRef from ArrayData
>  
> We can split this process in 3 parts:
>  # Initialization (1 and 2)
>  # Iteration (3)
>  # Finalization (4 and 5)
> Currently, the API that we offer to our users is:
>  # as_any() to downcast the array based on its DataType
>  # Builders for all types, that users can initialize, matching the downcasted 
> array
>  # Iterate
>  ## Use for i in (0..array.len())
>  ## Use {{Array::value(i)}} and {{Array::is_valid(i)/is_null(i)}}
>  ## use builder.append_value(new_value) or builder.append_null()
>  # Finish the builder and wrap the result in an Arc
> This API has some issues:
>  # value(i) +is unsafe+, even though it is not marked as such
>  # builders are usually slow due to the checks that they need to perform
>  # The API is not intuitive
> h2. Proposal
> This proposal aims at improving this API in 2 specific ways:
>  * Implement IntoIterator, yielding Iterator<Item=T> and Iterator<Item=Option<T>>
>  * Implement FromIterator for Item=T and Item=Option<T>
> so that users can write:
> {code:java}
> // incoming array
> let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
> let array = Arc::new(array) as ArrayRef;
> let array = array.as_any().downcast_ref::<Int32Array>().unwrap();
> // to and from iter, with a +1
> let result: Int32Array = array
>     .iter()
>     .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
>     .collect();
> let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]); 
> assert_eq!(result, expected);
> {code}
>  
> This results in an API that is:
>  # efficient, as it is our responsibility to create `FromIterator` that are 
> efficient in populating the buffers/child etc from an iterator
>  # Safe, as it does not allow segfaults
>  # Simple, as users do not need to worry about Builders, buffers, etc, only 
> native Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9974) [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset

2020-10-07 Thread Ashish Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209671#comment-17209671
 ] 

Ashish Gupta commented on ARROW-9974:
-

Anyone tried to reproduce on centos-8?

> [Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while 
> reading large number of files using ParquetDataset
> ---
>
> Key: ARROW-9974
> URL: https://issues.apache.org/jira/browse/ARROW-9974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Ashish Gupta
>Assignee: Ben Kietzman
>Priority: Critical
>  Labels: dataset
> Fix For: 3.0.0
>
> Attachments: legacy_false.txt, legacy_true.txt
>
>
> [https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe]
> I have a dataframe split and stored in more than 5000 files. I use 
> ParquetDataset(fnames).read() to load all files. I updated the pyarrow to 
> latest version 1.0.1 from 0.13.0 and it has started throwing "OSError: Out of 
> memory: malloc of size 131072 failed". The same code on the same machine 
> still works with older version. My machine has 256Gb memory way more than 
> enough to load the data which requires < 10Gb. You can use below code to 
> generate the issue on your side.
> {code}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> def generate():
> # create a big dataframe
> df = pd.DataFrame({'A': np.arange(5000)})
> df['F1'] = np.random.randn(5000) * 100
> df['F2'] = np.random.randn(5000) * 100
> df['F3'] = np.random.randn(5000) * 100
> df['F4'] = np.random.randn(5000) * 100
> df['F5'] = np.random.randn(5000) * 100
> df['F6'] = np.random.randn(5000) * 100
> df['F7'] = np.random.randn(5000) * 100
> df['F8'] = np.random.randn(5000) * 100
> df['F9'] = 'ABCDEFGH'
> df['F10'] = 'ABCDEFGH'
> df['F11'] = 'ABCDEFGH'
> df['F12'] = 'ABCDEFGH01234'
> df['F13'] = 'ABCDEFGH01234'
> df['F14'] = 'ABCDEFGH01234'
> df['F15'] = 'ABCDEFGH01234567'
> df['F16'] = 'ABCDEFGH01234567'
> df['F17'] = 'ABCDEFGH01234567'
> # split and save data to 5000 files
> for i in range(5000):
> df.iloc[i*1:(i+1)*1].to_parquet(f'{i}.parquet', index=False)
> def read_works():
> # below code works to read
> df = []
> for i in range(5000):
> df.append(pd.read_parquet(f'{i}.parquet'))
> df = pd.concat(df)
> def read_errors():
> # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine 
> with version 0.13.0)
> # tried use_legacy_dataset=False, same issue
> fnames = []
> for i in range(5000):
> fnames.append(f'{i}.parquet')
> len(fnames)
> df = pq.ParquetDataset(fnames).read(use_threads=False)
>  
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10172) [Python] pyarrow.concat_arrays segfaults if a resulting StringArray's capacity overflows

2020-10-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10172:
-
Summary: [Python] pyarrow.concat_arrays segfaults if a resulting 
StringArray's capacity overflows  (was: [Python] cancat_arrays requires upcast 
for large array)

> [Python] pyarrow.concat_arrays segfaults if a resulting StringArray's 
> capacity overflows
> 
>
> Key: ARROW-10172
> URL: https://issues.apache.org/jira/browse/ARROW-10172
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
>
> I'm sorry if this was already reported, but there's an overflow issue in 
> concatenation of large arrays
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: str_array = pa.array(['a' * 128] * 10**8)
> In [3]: large_array = pa.concat_arrays([str_array] * 50)
> Segmentation fault (core dumped)
> {code}
> I suppose that  this should be handled by upcast to large_string.
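
A manual version of the suggested upcast, as a sketch (much smaller sizes than in the report, just to show the cast):

{code:python}
import pyarrow as pa

# Upcast to large_string (64-bit offsets) before concatenating, so the
# combined offsets cannot overflow int32.
str_array = pa.array(["a" * 128] * 10_000)
large = pa.concat_arrays([str_array.cast(pa.large_string())] * 50)
print(large.type)  # large_string
{code}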



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10172) [Python] cancat_arrays requires upcast for large array

2020-10-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10172:
-
Summary: [Python] cancat_arrays requires upcast for large array  (was: 
cancat_arrays requires upcast for large array)

> [Python] cancat_arrays requires upcast for large array
> --
>
> Key: ARROW-10172
> URL: https://issues.apache.org/jira/browse/ARROW-10172
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Artem KOZHEVNIKOV
>Priority: Major
>
> I'm sorry if this was already reported, but there's an overflow issue in 
> concatenation of large arrays
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: str_array = pa.array(['a' * 128] * 10**8)
> In [3]: large_array = pa.concat_arrays([str_array] * 50)
> Segmentation fault (core dumped)
> {code}
> I suppose that  this should be handled by upcast to large_string.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10140) [Python][C++] No data for map column of a parquet file created from pyarrow and pandas

2020-10-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209667#comment-17209667
 ] 

Wes McKinney commented on ARROW-10140:
--

I'm reopening this until someone confirms that this case is adequately tested

> [Python][C++] No data for map column of a parquet file created from pyarrow 
> and pandas
> --
>
> Key: ARROW-10140
> URL: https://issues.apache.org/jira/browse/ARROW-10140
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Chen Ming
>Assignee: Micah Kornfield
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py, 
> test_map_2.0.0.parquet
>
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by 
> pyarrow.
> I followed 
> [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries]
>  to convert a pandas DF to an arrow table, then call write_table to output a 
> parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>  'col1': pd.Series([
>  [('id', 'something'), ('value2', 'else')],
>  [('id', 'something2'), ('value','else2')],
>  ]),
>  'col2': pd.Series(['foo', 'bar'])
>  })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my developing 
> computer:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> And generated the test_map.parquet file (attached as test_map.parquet) 
> successfully.
> Then I use parquet-tools (1.11.1) to read the file, but get the following 
> output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
> repeated group key_value {
>   required binary key (STRING);
>   optional binary value (STRING);
> }
>   }
>   optional binary col2 (STRING);
> }{code}
> Am I doing something wrong? 
> We need to output the data to parquet files, and query them later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10140) [Python][C++] No data for map column of a parquet file created from pyarrow and pandas

2020-10-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209667#comment-17209667
 ] 

Wes McKinney edited comment on ARROW-10140 at 10/7/20, 4:39 PM:


I'm reopening this until someone confirms that this case is adequately tested 
in our test suite.


was (Author: wesmckinn):
I'm reopening this until someone confirms that this case is adequately tested

> [Python][C++] No data for map column of a parquet file created from pyarrow 
> and pandas
> --
>
> Key: ARROW-10140
> URL: https://issues.apache.org/jira/browse/ARROW-10140
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Chen Ming
>Assignee: Micah Kornfield
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py, 
> test_map_2.0.0.parquet
>
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by 
> pyarrow.
> I followed 
> [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries]
>  to convert a pandas DF to an arrow table, then call write_table to output a 
> parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>  'col1': pd.Series([
>  [('id', 'something'), ('value2', 'else')],
>  [('id', 'something2'), ('value','else2')],
>  ]),
>  'col2': pd.Series(['foo', 'bar'])
>  })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my developing 
> computer:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> And generated the test_map.parquet file (attached as test_map.parquet) 
> successfully.
> Then I use parquet-tools (1.11.1) to read the file, but get the following 
> output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
> repeated group key_value {
>   required binary key (STRING);
>   optional binary value (STRING);
> }
>   }
>   optional binary col2 (STRING);
> }{code}
> Am I doing something wrong? 
> We need to output the data to parquet files, and query them later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-10140) [Python][C++] No data for map column of a parquet file created from pyarrow and pandas

2020-10-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-10140:
--

> [Python][C++] No data for map column of a parquet file created from pyarrow 
> and pandas
> --
>
> Key: ARROW-10140
> URL: https://issues.apache.org/jira/browse/ARROW-10140
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
>Reporter: Chen Ming
>Assignee: Micah Kornfield
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py, 
> test_map_2.0.0.parquet
>
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by 
> pyarrow.
> I followed 
> [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries]
>  to convert a pandas DF to an arrow table, then call write_table to output a 
> parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>  'col1': pd.Series([
>  [('id', 'something'), ('value2', 'else')],
>  [('id', 'something2'), ('value','else2')],
>  ]),
>  'col2': pd.Series(['foo', 'bar'])
>  })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my developing 
> computer:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> And generated the test_map.parquet file (attached as test_map.parquet) 
> successfully.
> Then I use parquet-tools (1.11.1) to read the file, but get the following 
> output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
> repeated group key_value {
>   required binary key (STRING);
>   optional binary value (STRING);
> }
>   }
>   optional binary col2 (STRING);
> }{code}
> Am I doing something wrong? 
> We need to output the data to parquet files, and query them later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10088) [R] Don't store "data.table" pointer in metadata

2020-10-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10088:

Fix Version/s: (was: 2.0.0)

> [R] Don't store "data.table" pointer in metadata
> 
>
> Key: ARROW-10088
> URL: https://issues.apache.org/jira/browse/ARROW-10088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Kyle Kavanagh
>Assignee: Romain Francois
>Priority: Major
>
> Issues with metadata$r:
> * The ".internal.selfref" attribute from data.table is an externalptr, which 
> won't be valid to serialize and restore, so it needs to be dropped (and then 
> presumably also the data.table class too)
> -
> Original description:
> I've got a proprietary dataset where one of the columns is an integer64 but 
> all of the values would fit within 32bits.  As I understand it, arrow/feather 
> will downcast that column when the data is read back into R (not ideal IMO, 
> but not an issue generally).  However, I'm having some trouble with a 
> specific dataset. 
> When I read in the data, the column is set to the class "integer64", however 
> the column type (typeof) is 'integer' and not 'double', which is the 
> underlying type used by bit64.  This mismatch causes R data.table to error 
> out 
> ([https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325)]
> I do not have any issue with integer64 columns which have values > 2^32, and 
> suspiciously I am also unable to recreate the issue by manually creating a 
> data.table with an int64 column with small values (e.g 
> data.table(col=as.integer64(c(1,2,3))) )
> I did look thru the arrow::r cpp source and couldnt find an obvious case 
> where the underlying storage array would be an integer but also have the 
> 'integer64' class attr assigned...  A fix would either be to remove the 
> integer64 class attr, or ensure that the underlying data store is a REALSXP 
> instead of INTEGERSXP
> My company's network policies wont let me upload the sample dataset, hoping 
> to see if this triggers an immediate thoughts.  If not, I can try to figure 
> our how to upload the dataset or otherwise provide details from it as 
> requested.
>  
> {code:java}
> > arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> > test = arrow::read_feather('~/test.feather')
> > class(test$testCol)
> [1] "integer64" "np.ulong"
> > typeof(test$testCol)
> [1] "integer"
> > str(test)
> Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  1 variable: $ 
> testCol:Error in as.character.integer64(object) :  REAL() can only be applied 
> to a 'numeric', not a 'integer'
> #In the larger original dataset, it handles most columns properly, only the 
> 'testCol' breaks things.  Note the difference:
> > typeof(df$goodCol)
> [1] "double"
> > class(df$goodCol)
> [1] "integer64" "np.ulong"
> > typeof(df$testCol)
> [1] "integer"
> > class(df$testCol)
> [1] "integer64" "np.ulong"
> > str(df)
> Classes ‘data.table’ and 'data.frame':  214781 obs. of  17 variables: 
> $ goodCol:integer64 159977700604025 ... 
> $ testCol:Error in as.character.integer64(object) :
> > sessionInfo()
> R version 3.6.1 (2019-07-05)Platform: x86_64-pc-linux-gnu (64-bit)Running 
> under: Red Hat Enterprise Linux Server 7.7 (Maipo)
> Matrix products: defaultBLAS:   /usr/lib64/libblas.so.3.4.2LAPACK: 
> /usr/lib64/liblapack.so.3.4.2locale: 
> [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 
> [7] LC_PAPER=en_US.UTF-8   LC_NAME=C [9] LC_ADDRESS=C   
> LC_TELEPHONE=C[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:[1] stats graphics  grDevices utils datasets  
> methods   baseother attached packages:[1] data.table_1.13.0 bit64_4.0.5   
> bit_4.0.4loaded via a namespace (and not attached): [1] Rcpp_1.0.5   
> lattice_0.20-41  arrow_1.0.1 [4] assertthat_0.2.1 rappdirs_0.3.1  
>  grid_3.6.1 [7] R6_2.4.1 jsonlite_1.7.1   magrittr_1.5[10] 
> rlang_0.4.7  Matrix_1.2-18vctrs_0.3.4[13] 
> reticulate_1.14-9001 tools_3.6.1  glue_1.4.2[16] purrr_0.3.4  
> compiler_3.6.1   tidyselect_1.1.0{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10088) [R] Don't store "data.table" pointer in metadata

2020-10-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10088:

Summary: [R] Don't store "data.table" pointer in metadata  (was: [R] Issues 
in restoring R metadata for "integer64", "data.table" classes)

> [R] Don't store "data.table" pointer in metadata
> 
>
> Key: ARROW-10088
> URL: https://issues.apache.org/jira/browse/ARROW-10088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Kyle Kavanagh
>Assignee: Romain Francois
>Priority: Major
> Fix For: 2.0.0
>
>
> Issues with metadata$r:
> * The ".internal.selfref" attribute from data.table is an externalptr, which 
> won't be valid to serialize and restore, so it needs to be dropped (and then 
> presumably also the data.table class too)
> -
> Original description:
> I've got a proprietary dataset where one of the columns is an integer64 but 
> all of the values would fit within 32bits.  As I understand it, arrow/feather 
> will downcast that column when the data is read back into R (not ideal IMO, 
> but not an issue generally).  However, I'm having some trouble with a 
> specific dataset. 
> When I read in the data, the column is set to the class "integer64", however 
> the column type (typeof) is 'integer' and not 'double', which is the 
> underlying type used by bit64.  This mismatch causes R data.table to error 
> out 
> ([https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325)]
> I do not have any issue with integer64 columns which have values > 2^32, and 
> suspiciously I am also unable to recreate the issue by manually creating a 
> data.table with an int64 column with small values (e.g 
> data.table(col=as.integer64(c(1,2,3))) )
> I did look thru the arrow::r cpp source and couldnt find an obvious case 
> where the underlying storage array would be an integer but also have the 
> 'integer64' class attr assigned...  A fix would either be to remove the 
> integer64 class attr, or ensure that the underlying data store is a REALSXP 
> instead of INTEGERSXP
> My company's network policies wont let me upload the sample dataset, hoping 
> to see if this triggers an immediate thoughts.  If not, I can try to figure 
> our how to upload the dataset or otherwise provide details from it as 
> requested.
>  
> {code:java}
> > arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> > test = arrow::read_feather('~/test.feather')
> > class(test$testCol)
> [1] "integer64" "np.ulong"
> > typeof(test$testCol)
> [1] "integer"
> > str(test)
> Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  1 variable: $ 
> testCol:Error in as.character.integer64(object) :  REAL() can only be applied 
> to a 'numeric', not a 'integer'
> #In the larger original dataset, it handles most columns properly, only the 
> 'testCol' breaks things.  Note the difference:
> > typeof(df$goodCol)
> [1] "double"
> > class(df$goodCol)
> [1] "integer64" "np.ulong"
> > typeof(df$testCol)
> [1] "integer"
> > class(df$testCol)
> [1] "integer64" "np.ulong"
> > str(df)
> Classes ‘data.table’ and 'data.frame':  214781 obs. of  17 variables: 
> $ goodCol:integer64 159977700604025 ... 
> $ testCol:Error in as.character.integer64(object) :
> > sessionInfo()
> R version 3.6.1 (2019-07-05)Platform: x86_64-pc-linux-gnu (64-bit)Running 
> under: Red Hat Enterprise Linux Server 7.7 (Maipo)
> Matrix products: defaultBLAS:   /usr/lib64/libblas.so.3.4.2LAPACK: 
> /usr/lib64/liblapack.so.3.4.2locale: 
> [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 
> [7] LC_PAPER=en_US.UTF-8   LC_NAME=C [9] LC_ADDRESS=C   
> LC_TELEPHONE=C[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:[1] stats graphics  grDevices utils datasets  
> methods   baseother attached packages:[1] data.table_1.13.0 bit64_4.0.5   
> bit_4.0.4loaded via a namespace (and not attached): [1] Rcpp_1.0.5   
> lattice_0.20-41  arrow_1.0.1 [4] assertthat_0.2.1 rappdirs_0.3.1  
>  grid_3.6.1 [7] R6_2.4.1 jsonlite_1.7.1   magrittr_1.5[10] 
> rlang_0.4.7  Matrix_1.2-18vctrs_0.3.4[13] 
> reticulate_1.14-9001 tools_3.6.1  glue_1.4.2[16] purrr_0.3.4  
> compiler_3.6.1   tidyselect_1.1.0{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10088) [R] Issues in restoring R metadata for "integer64", "data.table" classes

2020-10-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10088:

Description: 
Issues with metadata$r:

* The ".internal.selfref" attribute from data.table is an externalptr, which 
won't be valid to serialize and restore, so it needs to be dropped (and then 
presumably also the data.table class too)

-
Original description:

I've got a proprietary dataset where one of the columns is an integer64 but all 
of the values would fit within 32bits.  As I understand it, arrow/feather will 
downcast that column when the data is read back into R (not ideal IMO, but not 
an issue generally).  However, I'm having some trouble with a specific dataset. 

When I read in the data, the column is set to the class "integer64", however 
the column type (typeof) is 'integer' and not 'double', which is the underlying 
type used by bit64.  This mismatch causes R data.table to error out 
([https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325)]

I do not have any issue with integer64 columns which have values > 2^32, and 
suspiciously I am also unable to recreate the issue by manually creating a 
data.table with an int64 column with small values (e.g 
data.table(col=as.integer64(c(1,2,3))) )

I did look through the arrow::r cpp source and couldn't find an obvious case where 
the underlying storage array would be an integer but also have the 'integer64' 
class attr assigned...  A fix would either be to remove the integer64 class 
attr, or ensure that the underlying data store is a REALSXP instead of 
INTEGERSXP.

My company's network policies won't let me upload the sample dataset, hoping to 
see if this triggers any immediate thoughts.  If not, I can try to figure out 
how to upload the dataset or otherwise provide details from it as requested.

 
{code:java}
> arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> test = arrow::read_feather('~/test.feather')
> class(test$testCol)
[1] "integer64" "np.ulong"
> typeof(test$testCol)
[1] "integer"

> str(test)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  1 variable: $ 
testCol:Error in as.character.integer64(object) :  REAL() can only be applied 
to a 'numeric', not a 'integer'


#In the larger original dataset, it handles most columns properly, only the 
'testCol' breaks things.  Note the difference:
> typeof(df$goodCol)
[1] "double"
> class(df$goodCol)
[1] "integer64" "np.ulong"

> typeof(df$testCol)
[1] "integer"
> class(df$testCol)
[1] "integer64" "np.ulong"

> str(df)
Classes ‘data.table’ and 'data.frame':  214781 obs. of  17 variables: 
$ goodCol:integer64 159977700604025 ... 
$ testCol:Error in as.character.integer64(object) :

> sessionInfo()
R version 3.6.1 (2019-07-05)Platform: x86_64-pc-linux-gnu (64-bit)Running 
under: Red Hat Enterprise Linux Server 7.7 (Maipo)
Matrix products: defaultBLAS:   /usr/lib64/libblas.so.3.4.2LAPACK: 
/usr/lib64/liblapack.so.3.4.2locale: 

[1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 
[7] LC_PAPER=en_US.UTF-8   LC_NAME=C [9] LC_ADDRESS=C   
LC_TELEPHONE=C[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:[1] stats graphics  grDevices utils datasets  
methods   baseother attached packages:[1] data.table_1.13.0 bit64_4.0.5   
bit_4.0.4loaded via a namespace (and not attached): [1] Rcpp_1.0.5   
lattice_0.20-41  arrow_1.0.1 [4] assertthat_0.2.1 rappdirs_0.3.1   
grid_3.6.1 [7] R6_2.4.1 jsonlite_1.7.1   magrittr_1.5[10] 
rlang_0.4.7  Matrix_1.2-18vctrs_0.3.4[13] reticulate_1.14-9001 
tools_3.6.1  glue_1.4.2[16] purrr_0.3.4  compiler_3.6.1   
tidyselect_1.1.0{code}

  was:
Issues with metadata$r:

* Handling integer64 (and subclasses) when relating to automatic downcasting
* The ".internal.selfref" attribute from data.table is an externalptr, which 
won't be valid to serialize and restore, so it needs to be dropped (and then 
presumably also the data.table class too)

-
Original description:

I've got a proprietary dataset where one of the columns is an integer64 but all 
of the values would fit within 32bits.  As I understand it, arrow/feather will 
downcast that column when the data is read back into R (not ideal IMO, but not 
an issue generally).  However, I'm having some trouble with a specific dataset. 

When I read in the data, the column is set to the class "integer64", however 
the column type (typeof) is 'integer' and not 'double', which is the underlying 
type used by bit64.  This mismatch causes R data.table to error out 
([https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325)]

I do not have any issue with integer64 columns which have values > 2^32, and 
suspiciously I am also 

[jira] [Commented] (ARROW-10088) [R] Issues in restoring R metadata for "integer64", "data.table" classes

2020-10-07 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209665#comment-17209665
 ] 

Neal Richardson commented on ARROW-10088:
-

The integer64 subclass issue doesn't reproduce on master because we've changed 
how that metadata is collected and we no longer keep "class" for integer64: 
https://github.com/apache/arrow/blob/master/r/R/table.R#L188-L189

{code}
> df <- data.frame(a = bit64::integer64(1L))
> class(df$a) <- c(class(df$a), "something_else")
> class(df$a)
[1] "integer64"  "something_else"
> b <- record_batch(df)
> b
RecordBatch
1 rows x 1 columns
$a 

See $metadata for additional Schema metadata
> b$metadata
$r
 'arrow_r_metadata' chr 
"A\n3\n198146\n197888\n5\nUTF-8\n531\n2\n531\n1\n16\n1\n262153\n10\ndata.frame\n1026\n1\n262153\n5\nnames\n16\n1"|
 __truncated__
List of 2
 $ attributes:List of 1
  ..$ class: chr "data.frame"
 $ columns   :List of 1
  ..$ a: NULL
{code}

The data.table .internal.selfref is still kept, though. I would imagine that 
that should be a problem, but I'm not experienced enough with data.table to 
know exactly how, and my naive attempts have not been able to cause a failure.

> [R] Issues in restoring R metadata for "integer64", "data.table" classes
> 
>
> Key: ARROW-10088
> URL: https://issues.apache.org/jira/browse/ARROW-10088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
>Reporter: Kyle Kavanagh
>Assignee: Romain Francois
>Priority: Major
> Fix For: 2.0.0
>
>
> Issues with metadata$r:
> * Handling integer64 (and subclasses) when relating to automatic downcasting
> * The ".internal.selfref" attribute from data.table is an externalptr, which 
> won't be valid to serialize and restore, so it needs to be dropped (and then 
> presumably also the data.table class too)
> -
> Original description:
> I've got a proprietary dataset where one of the columns is an integer64 but 
> all of the values would fit within 32bits.  As I understand it, arrow/feather 
> will downcast that column when the data is read back into R (not ideal IMO, 
> but not an issue generally).  However, I'm having some trouble with a 
> specific dataset. 
> When I read in the data, the column is set to the class "integer64", however 
> the column type (typeof) is 'integer' and not 'double', which is the 
> underlying type used by bit64.  This mismatch causes R data.table to error 
> out 
> ([https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325)]
> I do not have any issue with integer64 columns which have values > 2^32, and 
> suspiciously I am also unable to recreate the issue by manually creating a 
> data.table with an int64 column with small values (e.g 
> data.table(col=as.integer64(c(1,2,3))) )
> I did look thru the arrow::r cpp source and couldnt find an obvious case 
> where the underlying storage array would be an integer but also have the 
> 'integer64' class attr assigned...  A fix would either be to remove the 
> integer64 class attr, or ensure that the underlying data store is a REALSXP 
> instead of INTEGERSXP
> My company's network policies wont let me upload the sample dataset, hoping 
> to see if this triggers an immediate thoughts.  If not, I can try to figure 
> our how to upload the dataset or otherwise provide details from it as 
> requested.
>  
> {code:java}
> > arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> > test = arrow::read_feather('~/test.feather')
> > class(test$testCol)
> [1] "integer64" "np.ulong"
> > typeof(test$testCol)
> [1] "integer"
> > str(test)
> Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  1 variable: $ 
> testCol:Error in as.character.integer64(object) :  REAL() can only be applied 
> to a 'numeric', not a 'integer'
> #In the larger original dataset, it handles most columns properly, only the 
> 'testCol' breaks things.  Note the difference:
> > typeof(df$goodCol)
> [1] "double"
> > class(df$goodCol)
> [1] "integer64" "np.ulong"
> > typeof(df$testCol)
> [1] "integer"
> > class(df$testCol)
> [1] "integer64" "np.ulong"
> > str(df)
> Classes ‘data.table’ and 'data.frame':  214781 obs. of  17 variables: 
> $ goodCol:integer64 159977700604025 ... 
> $ testCol:Error in as.character.integer64(object) :
> > sessionInfo()
> R version 3.6.1 (2019-07-05)Platform: x86_64-pc-linux-gnu (64-bit)Running 
> under: Red Hat Enterprise Linux Server 7.7 (Maipo)
> Matrix products: defaultBLAS:   /usr/lib64/libblas.so.3.4.2LAPACK: 
> /usr/lib64/liblapack.so.3.4.2locale: 
> [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 
> [7] LC_PAPER=en_US.UTF-8   LC_NAME=C [9] 

[jira] [Resolved] (ARROW-10217) [CI] Run fewer GitHub Actions jobs

2020-10-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-10217.
-
Resolution: Fixed

Issue resolved by pull request 8380
[https://github.com/apache/arrow/pull/8380]

> [CI] Run fewer GitHub Actions jobs
> --
>
> Key: ARROW-10217
> URL: https://issues.apache.org/jira/browse/ARROW-10217
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10218) [Python] [C++] Errors when building pyarrow from source

2020-10-07 Thread Andrew Wieteska (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wieteska resolved ARROW-10218.
-
Resolution: Fixed

> [Python] [C++] Errors when building pyarrow from source
> ---
>
> Key: ARROW-10218
> URL: https://issues.apache.org/jira/browse/ARROW-10218
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Wieteska
>Priority: Major
>
> Not 100% sure this is the right place for this (maybe users/devs mailing 
> lists would be better?)
> In any case, I've recently been having trouble building pyarrow from source 
> and I'm not sure where to debug. I use the following script (which worked 
> fine until the last week or so):
>  
> {code:java}
> export ARROW_HOME=$CONDA_PREFIX
> cd cpp/build
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>  -DCMAKE_INSTALL_LIBDIR=lib \
>  -DARROW_WITH_BZ2=ON \
>  -DARROW_WITH_ZLIB=ON \
>  -DARROW_WITH_ZSTD=ON \
>  -DARROW_WITH_LZ4=ON \
>  -DARROW_WITH_SNAPPY=ON \
>  -DARROW_WITH_BROTLI=ON \
>  -DARROW_PARQUET=ON \
>  -DARROW_PYTHON3=ON \
>  -DARROW_IPC=ON \
>  -DARROW_BUILD_TESTS=ON \
>  ..
> make -j4
> sudo make install
> cd ../../python
> export PYARROW_WITH_PARQUET=1
> python setup.py build_ext --inplace
> cd ..
> {code}
> and on current master (Oct 7th '20) I get this error:
>  
> {code:java}
> [ 5%] Compiling Cython CXX source for _fs...
> [ 5%] Built target _fs_pyx
> Scanning dependencies of target _fs
> [ 11%] Building CXX object CMakeFiles/_fs.dir/_fs.cpp.o
> /home/andrew/git_repo/arrow/python/build/temp.linux-x86_64-3.7/_fs.cpp:700:10:
>  fatal error: arrow/python/ipc.h: No such file or directory
>  #include "arrow/python/ipc.h"
>  ^~~~
> compilation terminated.
> make[2]: *** [CMakeFiles/_fs.dir/build.make:82: CMakeFiles/_fs.dir/_fs.cpp.o] 
> Error 1
> make[1]: *** [CMakeFiles/Makefile2:138: CMakeFiles/_fs.dir/all] Error 2
> make: *** [Makefile:103: all] Error 2
> error: command 'cmake' failed with exit status 2
> {code}
>  
> I'm pretty sure this is an issue with the local toolchain because I've pushed 
> to a Python PR since this happened and CI is green. 
> I appreciate any hints on how to go about solving this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10218) [Python] [C++] Errors when building pyarrow from source

2020-10-07 Thread Andrew Wieteska (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209648#comment-17209648
 ] 

Andrew Wieteska commented on ARROW-10218:
-

That was it. Thanks so much!!!

> [Python] [C++] Errors when building pyarrow from source
> ---
>
> Key: ARROW-10218
> URL: https://issues.apache.org/jira/browse/ARROW-10218
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Wieteska
>Priority: Major
>
> Not 100% sure this is the right place for this (maybe users/devs mailing 
> lists would be better?)
> In any case, I've recently been having trouble building pyarrow from source 
> and I'm not sure where to debug. I use the following script (which worked 
> fine until the last week or so):
>  
> {code:java}
> export ARROW_HOME=$CONDA_PREFIX
> cd cpp/build
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>  -DCMAKE_INSTALL_LIBDIR=lib \
>  -DARROW_WITH_BZ2=ON \
>  -DARROW_WITH_ZLIB=ON \
>  -DARROW_WITH_ZSTD=ON \
>  -DARROW_WITH_LZ4=ON \
>  -DARROW_WITH_SNAPPY=ON \
>  -DARROW_WITH_BROTLI=ON \
>  -DARROW_PARQUET=ON \
>  -DARROW_PYTHON3=ON \
>  -DARROW_IPC=ON \
>  -DARROW_BUILD_TESTS=ON \
>  ..
> make -j4
> sudo make install
> cd ../../python
> export PYARROW_WITH_PARQUET=1
> python setup.py build_ext --inplace
> cd ..
> {code}
> and on current master (Oct 7th '20) I get this error:
>  
> {code:java}
> [ 5%] Compiling Cython CXX source for _fs...
> [ 5%] Built target _fs_pyx
> Scanning dependencies of target _fs
> [ 11%] Building CXX object CMakeFiles/_fs.dir/_fs.cpp.o
> /home/andrew/git_repo/arrow/python/build/temp.linux-x86_64-3.7/_fs.cpp:700:10:
>  fatal error: arrow/python/ipc.h: No such file or directory
>  #include "arrow/python/ipc.h"
>  ^~~~
> compilation terminated.
> make[2]: *** [CMakeFiles/_fs.dir/build.make:82: CMakeFiles/_fs.dir/_fs.cpp.o] 
> Error 1
> make[1]: *** [CMakeFiles/Makefile2:138: CMakeFiles/_fs.dir/all] Error 2
> make: *** [Makefile:103: all] Error 2
> error: command 'cmake' failed with exit status 2
> {code}
>  
> I'm pretty sure this is an issue with the local toolchain because I've pushed 
> to a Python PR since this happened and CI is green. 
> I appreciate any hints on how to go about solving this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10218) [Python] [C++] Errors when building pyarrow from source

2020-10-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209646#comment-17209646
 ] 

Antoine Pitrou commented on ARROW-10218:


It's {{-DARROW_PYTHON=ON}} and not {{-DARROW_PYTHON3=ON}}.

> [Python] [C++] Errors when building pyarrow from source
> ---
>
> Key: ARROW-10218
> URL: https://issues.apache.org/jira/browse/ARROW-10218
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andrew Wieteska
>Priority: Major
>
> Not 100% sure this is the right place for this (maybe users/devs mailing 
> lists would be better?)
> In any case, I've recently been having trouble building pyarrow from source 
> and I'm not sure where to debug. I use the following script (which worked 
> fine until the last week or so):
>  
> {code:java}
> export ARROW_HOME=$CONDA_PREFIX
> cd cpp/build
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>  -DCMAKE_INSTALL_LIBDIR=lib \
>  -DARROW_WITH_BZ2=ON \
>  -DARROW_WITH_ZLIB=ON \
>  -DARROW_WITH_ZSTD=ON \
>  -DARROW_WITH_LZ4=ON \
>  -DARROW_WITH_SNAPPY=ON \
>  -DARROW_WITH_BROTLI=ON \
>  -DARROW_PARQUET=ON \
>  -DARROW_PYTHON3=ON \
>  -DARROW_IPC=ON \
>  -DARROW_BUILD_TESTS=ON \
>  ..
> make -j4
> sudo make install
> cd ../../python
> export PYARROW_WITH_PARQUET=1
> python setup.py build_ext --inplace
> cd ..
> {code}
> and on current master (Oct 7th '20) I get this error:
>  
> {code:java}
> [ 5%] Compiling Cython CXX source for _fs...
> [ 5%] Built target _fs_pyx
> Scanning dependencies of target _fs
> [ 11%] Building CXX object CMakeFiles/_fs.dir/_fs.cpp.o
> /home/andrew/git_repo/arrow/python/build/temp.linux-x86_64-3.7/_fs.cpp:700:10:
>  fatal error: arrow/python/ipc.h: No such file or directory
>  #include "arrow/python/ipc.h"
>  ^~~~
> compilation terminated.
> make[2]: *** [CMakeFiles/_fs.dir/build.make:82: CMakeFiles/_fs.dir/_fs.cpp.o] 
> Error 1
> make[1]: *** [CMakeFiles/Makefile2:138: CMakeFiles/_fs.dir/all] Error 2
> make: *** [Makefile:103: all] Error 2
> error: command 'cmake' failed with exit status 2
> {code}
>  
> I'm pretty sure this is an issue with the local toolchain because I've pushed 
> to a Python PR since this happened and CI is green. 
> I appreciate any hints on how to go about solving this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10217) [CI] Run fewer GitHub Actions jobs

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10217:
---
Labels: pull-request-available  (was: )

> [CI] Run fewer GitHub Actions jobs
> --
>
> Key: ARROW-10217
> URL: https://issues.apache.org/jira/browse/ARROW-10217
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10214) [Python] UnicodeDecodeError when printing schema with binary metadata

2020-10-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10214.

Resolution: Fixed

Issue resolved by pull request 8379
[https://github.com/apache/arrow/pull/8379]

> [Python] UnicodeDecodeError when printing schema with binary metadata
> -
>
> Key: ARROW-10214
> URL: https://issues.apache.org/jira/browse/ARROW-10214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.0, 0.17.1, 1.0.0, 1.0.1
> Environment: Python 3.6 - 3.8
>Reporter: Paul Balanca
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The following small example raises a `UnicodeDecodeError` error, since 
> PyArrow version 0.17.0:
> {code:java}
> import pyarrow as pa
> bdata = 
> b"\xff\xff\xff\xff8\x02\x00\x00\x10\x00\x00\x00\x00\x00\n\x00\x0c\x00\x06\x00\x05\x00\x08\x00\n\x00\x00\x00\x00\x01\x04\x00\x0c\x00\x00\x00\x08\x00\x08\x00\x00\x00\x04\x00\x08\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x04\x00\x00\x00\x1a\xff\xff\xff\x00\x00\x00\x0c\xd0\x00\x00\x00\x9c\x00\x00\x00\x90\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00P\x00\x00\x00\x04\x00\x00\x00\xc0\xfe\xff\xff\x08\x00\x00\x00
>  \x00\x00\x00\x14\x00\x00\x00ARROW:extension:name\x00\x00\x00\x00\x1b"
> t = pa.table({"data": pa.array([1, 2])}, metadata={b"k": bdata})
> print(t.schema){code}
> In our case, the binary data is coming from the serialization of another 
> PyArrow schema. But I guess the error can appear with any binary metadata in 
> the schema.
> The print used to work fine with PyArrow 0.16, getting this output:
> {code:java}
> data: int64
> metadata
> 
> OrderedDict([(b'k',
>   b'\xff\xff\xff\xff8\x02\x00\x00\x10\x00\x00\x00\x00\x00\n\x00'
>   
> b'\x0c\x00\x06\x00\x05\x00\x08\x00\n\x00\x00\x00\x00\x01\x04\x00'
>   b'\x0c\x00\x00\x00\x08\x00\x08\x00\x00\x00\x04\x00'
>   b'\x08\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00'
>   b'\x00\x01\x00\x00\x04\x00\x00\x00\x1a\xff\xff\xff'
>   b'\x00\x00\x00\x0c\xd0\x00\x00\x00\x9c\x00\x00\x00'
>   b'\x90\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00P\x00\x00\x00'
>   b'\x04\x00\x00\x00\xc0\xfe\xff\xff\x08\x00\x00\x00 \x00\x00\x00'
>   b'\x14\x00\x00\x00ARROW:extension:name\x00\x00\x00\x00\x1b')])
> {code}
> I can work on a patch to reverse the behaviour back to PyArrow 0.16?
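
Until a fix lands, one possible workaround (only a sketch, not the patch that resolved this issue) is to render the fields and the metadata separately, so binary values are shown via {{repr()}} instead of being decoded as UTF-8:

{code:java}
import pyarrow as pa

def describe_schema(schema: pa.Schema) -> str:
    # List the fields, then show metadata keys/values as bytes reprs
    # so arbitrary binary data cannot trigger a UnicodeDecodeError.
    lines = [f"{field.name}: {field.type}" for field in schema]
    if schema.metadata:
        lines.append("metadata")
        lines.extend(f"  {key!r}: {value!r}" for key, value in schema.metadata.items())
    return "\n".join(lines)

t = pa.table({"data": pa.array([1, 2])}, metadata={b"k": b"\xff\xfe binary \x00 bytes"})
print(describe_schema(t.schema))
{code}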



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9964) [C++] CSV date support

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9964:
--
Labels: pull-request-available  (was: )

> [C++] CSV date support
> --
>
> Key: ARROW-9964
> URL: https://issues.apache.org/jira/browse/ARROW-9964
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is no support for reading date type from CSV file. I'd like to read 
> such a value:
> {code:java}
> 1991-02-03
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10218) [Python] [C++] Errors when building pyarrow from source

2020-10-07 Thread Andrew Wieteska (Jira)
Andrew Wieteska created ARROW-10218:
---

 Summary: [Python] [C++] Errors when building pyarrow from source
 Key: ARROW-10218
 URL: https://issues.apache.org/jira/browse/ARROW-10218
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Andrew Wieteska


Not 100% sure this is the right place for this (maybe users/devs mailing lists 
would be better?)

In any case, I've recently been having trouble building pyarrow from source and 
I'm not sure where to debug. I use the following script (which worked fine 
until the last week or so):

 
{code:java}
export ARROW_HOME=$CONDA_PREFIX
cd cpp/build
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
 -DCMAKE_INSTALL_LIBDIR=lib \
 -DARROW_WITH_BZ2=ON \
 -DARROW_WITH_ZLIB=ON \
 -DARROW_WITH_ZSTD=ON \
 -DARROW_WITH_LZ4=ON \
 -DARROW_WITH_SNAPPY=ON \
 -DARROW_WITH_BROTLI=ON \
 -DARROW_PARQUET=ON \
 -DARROW_PYTHON3=ON \
 -DARROW_IPC=ON \
 -DARROW_BUILD_TESTS=ON \
 ..
make -j4
sudo make install

cd ../../python
export PYARROW_WITH_PARQUET=1
python setup.py build_ext --inplace
cd ..
{code}
and on current master (Oct 7th '20) I get this error:

 
{code:java}
[ 5%] Compiling Cython CXX source for _fs...
[ 5%] Built target _fs_pyx
Scanning dependencies of target _fs
[ 11%] Building CXX object CMakeFiles/_fs.dir/_fs.cpp.o
/home/andrew/git_repo/arrow/python/build/temp.linux-x86_64-3.7/_fs.cpp:700:10: 
fatal error: arrow/python/ipc.h: No such file or directory
 #include "arrow/python/ipc.h"
 ^~~~
compilation terminated.
make[2]: *** [CMakeFiles/_fs.dir/build.make:82: CMakeFiles/_fs.dir/_fs.cpp.o] 
Error 1
make[1]: *** [CMakeFiles/Makefile2:138: CMakeFiles/_fs.dir/all] Error 2
make: *** [Makefile:103: all] Error 2
error: command 'cmake' failed with exit status 2
{code}
 

I'm pretty sure this is an issue with the local toolchain because I've pushed 
to a Python PR since this happened and CI is green. 

I appreciate any hints on how to go about solving this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10093) [R] Add ability to opt-out of int64 -> int demotion

2020-10-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-10093.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8341
[https://github.com/apache/arrow/pull/8341]

> [R] Add ability to opt-out of int64 -> int demotion
> ---
>
> Key: ARROW-10093
> URL: https://issues.apache.org/jira/browse/ARROW-10093
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Kyle Kavanagh
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, if arrow detects that every value in an int64 column can fit in a 
> 32bit int, it will downcast the column and set the type to integer instead of 
> integer64.  Not having a mechanism to disable this optimization makes it 
> tricky if you have many parallel processes (think HPC use case) performing 
> the same calculation but potentially outputting different result values, some 
> being >2^32 and others not.  When you go to collect the resulting feather 
> files from the parallel processes, the types across the files may not line up.
> Feature request is to provide an option to disable this demotion and maintain 
> the source column type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10100) [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids

2020-10-07 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209624#comment-17209624
 ] 

Joris Van den Bossche commented on ARROW-10100:
---

[~bkietz] thoughts on the return value for an empty set of row groups?

> [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of 
> row group ids
> ---
>
> Key: ARROW-10100
> URL: https://issues.apache.org/jira/browse/ARROW-10100
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> From discussion at 
> https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the 
> dataset API in their parquet reader), it might be useful to somehow "subset" 
> or read a subset of a ParquetFileFragment for a specific set of row group ids.
> Use cases:
> * Read only a set of row groups ids (this is similar as 
> {{ParquetFile.read_row_groups}}), eg because you want to control the size of 
> the resulting table by reading subsets of row groups
> * Get a ParquetFileFragment with a subset of row groups (eg based on a 
> filter) to then eg get the statistics of only those row groups
> The first case could for example be solved by adding a {{row_groups}} keyword 
> to {{ParquetFileFragment.to_table}} (but, this is then a keyword specific to 
> the parquet format, and we should then probably also add it to {{scan}} et 
> al).
> The second case is something you can in principle do yourself manually by 
> recreating a fragment with {{fragment.format.make_fragment(fragment.path, 
> ..., row_groups=[...])}}. However, this is a) a bit cumbersome and b) 
> statistics might need to be parsed again?  
> The statistics of a set of filtered row groups could also be obtained by 
> using {{split_by_row_group(filter)}} (and then get the statistics of each of 
> the fragments), but if you then want a single fragment, you need to recreate 
> a fragment with the obtained row group ids.
> So one idea I have now (but mostly brainstorming here). Would it be useful to 
> have a method to create a "subsetted" ParquetFileFragment, either based on a 
> list of row group ids ({{fragment.subset(row_groups=[...])}} or either based 
> on a filter ({{fragment.subset(filter=...)}}, which would be equivalent as 
> split_by_row_group+recombining into a single fragment) ?
> cc [~bkietz] [~rjzamora]
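
For reference, a rough sketch of the manual recreation described above. This is only an illustration: the dataset path and filter are made up, and names such as {{row_groups}} / {{id}} follow the current Python dataset API and may differ between versions.

{code:java}
import pyarrow.dataset as ds
import pyarrow.fs as fs

local = fs.LocalFileSystem()
dataset = ds.dataset("data/", format="parquet", filesystem=local)
fragment = next(iter(dataset.get_fragments()))

# Split on a filter, then rebuild a single fragment from the surviving
# row group ids -- the cumbersome path this issue proposes to replace.
pieces = fragment.split_by_row_group(ds.field("x") > 0)
row_group_ids = [rg.id for piece in pieces for rg in piece.row_groups]

subset = fragment.format.make_fragment(
    fragment.path, filesystem=local, row_groups=row_group_ids
)
print(subset.to_table().num_rows)
{code}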



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10217) [CI] Run fewer GitHub Actions jobs

2020-10-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10217:
---

 Summary: [CI] Run fewer GitHub Actions jobs
 Key: ARROW-10217
 URL: https://issues.apache.org/jira/browse/ARROW-10217
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 2.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10214) [Python] UnicodeDecodeError when printing schema with binary metadata

2020-10-07 Thread Paul Balanca (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209614#comment-17209614
 ] 

Paul Balanca commented on ARROW-10214:
--

That was fast! Thanks, amazing support :)

> [Python] UnicodeDecodeError when printing schema with binary metadata
> -
>
> Key: ARROW-10214
> URL: https://issues.apache.org/jira/browse/ARROW-10214
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.0, 0.17.1, 1.0.0, 1.0.1
> Environment: Python 3.6 - 3.8
>Reporter: Paul Balanca
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The following small example raises a `UnicodeDecodeError` error, since 
> PyArrow version 0.17.0:
> {code:java}
> import pyarrow as pa
> bdata = 
> b"\xff\xff\xff\xff8\x02\x00\x00\x10\x00\x00\x00\x00\x00\n\x00\x0c\x00\x06\x00\x05\x00\x08\x00\n\x00\x00\x00\x00\x01\x04\x00\x0c\x00\x00\x00\x08\x00\x08\x00\x00\x00\x04\x00\x08\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00\x00\x01\x00\x00\x04\x00\x00\x00\x1a\xff\xff\xff\x00\x00\x00\x0c\xd0\x00\x00\x00\x9c\x00\x00\x00\x90\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00P\x00\x00\x00\x04\x00\x00\x00\xc0\xfe\xff\xff\x08\x00\x00\x00
>  \x00\x00\x00\x14\x00\x00\x00ARROW:extension:name\x00\x00\x00\x00\x1b"
> t = pa.table({"data": pa.array([1, 2])}, metadata={b"k": bdata})
> print(t.schema){code}
> In our case, the binary data is coming from the serialization of another 
> PyArrow schema. But I guess the error can appear with any binary metadata in 
> the schema.
> The print used to work fine with PyArrow 0.16, getting this output:
> {code:java}
> data: int64
> metadata
> 
> OrderedDict([(b'k',
>   b'\xff\xff\xff\xff8\x02\x00\x00\x10\x00\x00\x00\x00\x00\n\x00'
>   
> b'\x0c\x00\x06\x00\x05\x00\x08\x00\n\x00\x00\x00\x00\x01\x04\x00'
>   b'\x0c\x00\x00\x00\x08\x00\x08\x00\x00\x00\x04\x00'
>   b'\x08\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00'
>   b'\x00\x01\x00\x00\x04\x00\x00\x00\x1a\xff\xff\xff'
>   b'\x00\x00\x00\x0c\xd0\x00\x00\x00\x9c\x00\x00\x00'
>   b'\x90\x00\x00\x00\x04\x00\x00\x00\x02\x00\x00\x00P\x00\x00\x00'
>   b'\x04\x00\x00\x00\xc0\xfe\xff\xff\x08\x00\x00\x00 \x00\x00\x00'
>   b'\x14\x00\x00\x00ARROW:extension:name\x00\x00\x00\x00\x1b')])
> {code}
> I can work on a patch to reverse the behaviour back to PyArrow 0.16?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7960) [C++][Parquet] Add support for schema translation from parquet nodes back to arrow for missing types

2020-10-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7960.
---
Resolution: Fixed

Issue resolved by pull request 8376
[https://github.com/apache/arrow/pull/8376]

> [C++][Parquet] Add support for schema translation from parquet nodes back to 
> arrow for missing types
> 
>
> Key: ARROW-7960
> URL: https://issues.apache.org/jira/browse/ARROW-7960
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Map seems to be the most obvious one missing.  Without additional metadata I 
> don't think FixedSizeList is possible.  LargeList could probably be determined 
> empirically while parsing, if there are any entries that exceed the int32 range 
> (or with metadata).  Need to also double-check that struct is supported



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10216) [Rust] Simd implementation of min/max aggregation kernels for primitive types

2020-10-07 Thread Jira
Jörn Horstmann created ARROW-10216:
--

 Summary: [Rust] Simd implementation of min/max aggregation kernels 
for primitive types
 Key: ARROW-10216
 URL: https://issues.apache.org/jira/browse/ARROW-10216
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jörn Horstmann


Using a similar approach to the sum kernel (ARROW-10015). Instead of 
initializing the accumulator with 0, we'd need the largest/smallest possible 
value for each ArrowNumericType (e.g. u64::MAX or ±Inf).

Pseudo code for a min aggregation:
{code}
// initialize accumulator
min_acc = +Inf
// aggregate each chunk
min_acc = min(min_acc, select(valid, value, +Inf))
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

