[jira] [Commented] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-09 Thread Derek Marsh (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211552#comment-17211552
 ] 

Derek Marsh commented on ARROW-10260:
-

I appreciate the opportunity to contribute.
https://github.com/apache/arrow/pull/8422

> [Python] Missing MapType to Pandas dtype
> 
>
> Key: ARROW-10260
> URL: https://issues.apache.org/jira/browse/ARROW-10260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The Map type conversion to Pandas done in ARROW-10151 forgot to add dtype 
> mapping for {{to_pandas_dtype()}}
>  
> {code:java}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
>
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3-...> in <module>
> ----> 1 d.to_pandas_dtype()
>
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()
>
> NotImplementedError: map<int64, double>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10260:
---
Labels: pull-request-available  (was: )

> [Python] Missing MapType to Pandas dtype
> 
>
> Key: ARROW-10260
> URL: https://issues.apache.org/jira/browse/ARROW-10260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Map type conversion to Pandas done in ARROW-10151 forgot to add dtype 
> mapping for {{to_pandas_dtype()}}
>  
> {code:java}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
>
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3-...> in <module>
> ----> 1 d.to_pandas_dtype()
>
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()
>
> NotImplementedError: map<int64, double>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-09 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211537#comment-17211537
 ] 

Jorge Leitão commented on ARROW-10261:
--

Makes sense to me. :)

> [Rust] [BREAKING] Lists should take Field instead of DataType
> -
>
> Key: ARROW-10261
> URL: https://issues.apache.org/jira/browse/ARROW-10261
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Integration, Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
>
> There is currently no way of tracking nested field metadata on lists. For 
> example, if a list's children are nullable, there's no way of telling just by 
> looking at the Field.
> This causes problems with integration testing, and also affects Parquet 
> roundtrips.
> I propose the breaking change of [Large|FixedSize]List taking a Field instead 
> of Box<DataType>, as this will overcome this issue, and ensure that the Rust 
> implementation passes integration tests.
> CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] 
> as this addresses some of the roundtrip failures).
> I'm leaning towards this landing in 3.0.0, as I'd love for us to have 
> completed or made significant traction on the Arrow Parquet writer (and 
> reader), and integration testing, by then.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-09 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211521#comment-17211521
 ] 

Bryan Cutler commented on ARROW-10260:
--

Should be a quick fix, so marking this for 2.0.0

> [Python] Missing MapType to Pandas dtype
> 
>
> Key: ARROW-10260
> URL: https://issues.apache.org/jira/browse/ARROW-10260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Major
> Fix For: 2.0.0
>
>
> The Map type conversion to Pandas done in ARROW-10151 forgot to add dtype 
> mapping for {{to_pandas_dtype()}}
>  
> {code:java}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
>
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3-...> in <module>
> ----> 1 d.to_pandas_dtype()
>
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()
>
> NotImplementedError: map<int64, double>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-09 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-10260:
-
Fix Version/s: 2.0.0

> [Python] Missing MapType to Pandas dtype
> 
>
> Key: ARROW-10260
> URL: https://issues.apache.org/jira/browse/ARROW-10260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Major
> Fix For: 2.0.0
>
>
> The Map type conversion to Pandas done in ARROW-10151 forgot to add dtype 
> mapping for {{to_pandas_dtype()}}
>  
> {code:java}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
>
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3-...> in <module>
> ----> 1 d.to_pandas_dtype()
>
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()
>
> NotImplementedError: map<int64, double>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8810) [R] Add documentation about Parquet format, appending to stream format

2020-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-8810.

Resolution: Fixed

> [R] Add documentation about Parquet format, appending to stream format
> --
>
> Key: ARROW-8810
> URL: https://issues.apache.org/jira/browse/ARROW-8810
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Carl Boettiger
>Assignee: Neal Richardson
>Priority: Minor
> Fix For: 2.0.0
>
>
> Is it possible to append new rows to an existing .parquet file using the R 
> client's arrow::write_parquet(), in a manner similar to the `append=TRUE` 
> argument in text-based output formats like write.table()? 
>  
> Apologies as this is perhaps more a question of documentation or user 
> interface, or maybe just my ignorance. 
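
For what it's worth, the behavior the added documentation describes is that a finished Parquet file cannot be appended to in place; the closest equivalent is writing multiple row groups while a writer is still open. A minimal sketch of that in pyarrow (Python shown here, since the R bindings wrap the same C++ writer):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})

# A closed Parquet file cannot be appended to, but while the writer is
# open you can keep writing row groups, which is the closest analogue:
with pq.ParquetWriter("out.parquet", table.schema) as writer:
    writer.write_table(table)  # first chunk of rows
    writer.write_table(table)  # more rows, stored as a second row group
{code}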



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10257) [R] Prepare news/docs for 2.0 release

2020-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-10257.
-
Resolution: Fixed

Issue resolved by pull request 8421
[https://github.com/apache/arrow/pull/8421]

> [R] Prepare news/docs for 2.0 release
> -
>
> Key: ARROW-10257
> URL: https://issues.apache.org/jira/browse/ARROW-10257
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10261:
--

 Summary: [Rust] [BREAKING] Lists should take Field instead of 
DataType
 Key: ARROW-10261
 URL: https://issues.apache.org/jira/browse/ARROW-10261
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Integration, Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


There is currently no way of tracking nested field metadata on lists. For 
example, if a list's children are nullable, there's no way of telling just by 
looking at the Field.

This causes problems with integration testing, and also affects Parquet 
roundtrips.

I propose the breaking change of [Large|FixedSize]List taking a Field instead 
of Box<DataType>, as this will overcome this issue, and ensure that the Rust 
implementation passes integration tests.

CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] as 
this addresses some of the roundtrip failures).

I'm leaning towards this landing in 3.0.0, as I'd love for us to have completed 
or made significant traction on the Arrow Parquet writer (and reader), and 
integration testing, by then.
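
For a sense of what this buys, pyarrow's list factory already accepts either a bare DataType or a full Field, which is the same distinction proposed here for the Rust types. A quick pyarrow illustration (not the Rust API):

{code:python}
import pyarrow as pa

# With a bare DataType, the child field defaults to a nullable "item":
print(pa.list_(pa.int64()))
# list<item: int64>

# With a Field, child nullability is tracked on the type itself:
print(pa.list_(pa.field("item", pa.int64(), nullable=False)))
# list<item: int64 not null>
{code}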



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Parquet

2020-10-09 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-9812:

Description: 
Hi,

I'm having problems using 'map' data type in Arrow/parquet/pandas.

I'm able to convert a pandas data frame to Arrow with a map data type.

When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
type is written correctly.

When I read back Parquet to Arrow, it fails saying "reading list of structs" is 
not supported. It seems that map is stored as list of structs.

There are two problems here:
 # -Map data type doesn't work from Arrow -> Pandas-. Fixed in ARROW-10151
 # Map data type doesn't get written to or read from Arrow -> Parquet.

Questions:

1. Am I doing something wrong? Is there a way to get these to work? 

2. If these are unsupported features, will this be fixed in a future version? 
Do you have plans or an ETA?

The following code example (followed by output) should demonstrate the issues:

I'm using Arrow 1.0.0 and Pandas 1.0.5.

Thanks!

Mayur
{code:java}
$ cat arrowtest.py

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
import traceback as tb
import io

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df1 = pd.DataFrame({'a': [[('b', '2')]]})
print(f'df1')
print(f'{df1}')

print(f'Pandas -> Arrow')
try:
    t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a',
                              pa.map_(pa.string(), pa.string()))]))
    print('PASSED')
    print(t1)
except:
    print(f'FAILED')
    tb.print_exc()

print(f'Arrow -> Pandas')
try:
    t1.to_pandas()
    print('PASSED')
except:
    print(f'FAILED')
    tb.print_exc()

print(f'Arrow -> Parquet')

fh = io.BytesIO()
try:
    pq.write_table(t1, fh)
    print('PASSED')
except:
    print('FAILED')
    tb.print_exc()

print(f'Parquet -> Arrow')
try:
    t2 = pq.read_table(source=fh)
    print('PASSED')
    print(t2)
except:
    print('FAILED')
    tb.print_exc()
{code}
{code:java}
$ python3.6 arrowtest.py
PyArrow Version = 1.0.0 
Pandas Version = 1.0.5 
df1
          a
0  [(b, 2)]

Pandas -> Arrow
PASSED
pyarrow.Table
a: map<string, string>
 child 0, entries: struct<key: string not null, value: string> not null
 child 0, key: string not null
 child 1, value: string

Arrow -> Pandas
FAILED
Traceback (most recent call last):
File "arrowtest.py", line 26, in <module>
  t1.to_pandas()
File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
  blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
  list(extension_columns.keys()))
File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.

Arrow -> Parquet
PASSED

Parquet -> Arrow
FAILED
Traceback (most recent call last):
File "arrowtest.py", line 43, in <module>
  t2 = pq.read_table(source=fh)
File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
  use_pandas_metadata=use_pandas_metadata)
File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
  use_threads=use_threads
File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
{code}
Updated to indicate to Pandas conversion done, but not yet for Parquet.

  was:
Hi,

I'm having problems using 'map' data type in Arrow/parquet/pandas.

I'm able to convert a pandas data frame to Arrow with a map data type.

But, -Arrow to Pandas doesn't work.-  Fixed in ARROW-10151

When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
type is written correctly.

When I read back Parquet to Arrow, it fails saying "reading list of structs" is 
not supported. It seems that map is stored as list of structs.

There are two problems here:
 # -Map data type doesn't work from Arrow -> Pandas-. Fixed in ARROW-10151
 # Map data type doesn't get written to or read from Arrow -> Parquet.

Questions:

1. Am I doing something wrong? Is there a way to get these to work? 

2. If these are unsupported features, will this be fixed in a future version? 
Do you have plans or an ETA?

The following code example (followed by output) should demonstrate the issues:

I'm using Arrow 1.0.0 and Pandas 1.0.5.

Thanks!

Mayur

[jira] [Updated] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Parquet

2020-10-09 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-9812:

Description: 
Hi,

I'm having problems using 'map' data type in Arrow/parquet/pandas.

I'm able to convert a pandas data frame to Arrow with a map data type.

But, -Arrow to Pandas doesn't work.-  Fixed in ARROW-10151

When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
type is written correctly.

When I read back Parquet to Arrow, it fails saying "reading list of structs" is 
not supported. It seems that map is stored as list of structs.

There are two problems here:
 # -Map data type doesn't work from Arrow -> Pandas-. Fixed in ARROW-10151
 # Map data type doesn't get written to or read from Arrow -> Parquet.

Questions:

1. Am I doing something wrong? Is there a way to get these to work? 

2. If these are unsupported features, will this be fixed in a future version? 
Do you have plans or an ETA?

The following code example (followed by output) should demonstrate the issues:

I'm using Arrow 1.0.0 and Pandas 1.0.5.

Thanks!

Mayur
{code:java}
$ cat arrowtest.py

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
import traceback as tb
import io

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df1 = pd.DataFrame({'a': [[('b', '2')]]})
print(f'df1')
print(f'{df1}')

print(f'Pandas -> Arrow')
try:
    t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a',
                              pa.map_(pa.string(), pa.string()))]))
    print('PASSED')
    print(t1)
except:
    print(f'FAILED')
    tb.print_exc()

print(f'Arrow -> Pandas')
try:
    t1.to_pandas()
    print('PASSED')
except:
    print(f'FAILED')
    tb.print_exc()

print(f'Arrow -> Parquet')

fh = io.BytesIO()
try:
    pq.write_table(t1, fh)
    print('PASSED')
except:
    print('FAILED')
    tb.print_exc()

print(f'Parquet -> Arrow')
try:
    t2 = pq.read_table(source=fh)
    print('PASSED')
    print(t2)
except:
    print('FAILED')
    tb.print_exc()
{code}
{code:java}
$ python3.6 arrowtest.py
PyArrow Version = 1.0.0 
Pandas Version = 1.0.5 
df1
          a
0  [(b, 2)]

Pandas -> Arrow
PASSED
pyarrow.Table
a: map<string, string>
 child 0, entries: struct<key: string not null, value: string> not null
 child 0, key: string not null
 child 1, value: string

Arrow -> Pandas
FAILED
Traceback (most recent call last):
File "arrowtest.py", line 26, in <module>
  t1.to_pandas()
File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
  blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
  list(extension_columns.keys()))
File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.

Arrow -> Parquet
PASSED

Parquet -> Arrow
FAILED
Traceback (most recent call last):
File "arrowtest.py", line 43, in <module>
  t2 = pq.read_table(source=fh)
File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
  use_pandas_metadata=use_pandas_metadata)
File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
  use_threads=use_threads
File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
{code}
Updated to indicate to Pandas conversion done, but not yet for Parquet.

  was:
Hi,

I'm having problems using 'map' data type in Arrow/parquet/pandas.

I'm able to convert a pandas data frame to Arrow with a map data type.

But, Arrow to Pandas doesn't work.

When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
type is written correctly.

When I read back Parquet to Arrow, it fails saying "reading list of structs" is 
not supported. It seems that map is stored as list of structs.

There are two problems here:
 # Map data type doesn't work from Arrow -> Pandas.
 # Map data type doesn't get written to or read from Arrow -> Parquet.

Questions:

1. Am I doing something wrong? Is there a way to get these to work? 

2. If these are unsupported features, will this be fixed in a future version? 
Do you have plans or an ETA?

The following code example (followed by output) should demonstrate the issues:

I'm using Arrow 1.0.0 and Pandas 1.0.5.

Thanks!


[jira] [Updated] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Parquet

2020-10-09 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-9812:

Summary: [Python] Map data types doesn't work from Arrow to Parquet  (was: 
[Python] Map data types doesn't work from Arrow to Pandas and Parquet)

> [Python] Map data types doesn't work from Arrow to Parquet
> --
>
> Key: ARROW-9812
> URL: https://issues.apache.org/jira/browse/ARROW-9812
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mayur Srivastava
>Priority: Major
>
> Hi,
> I'm having problems using 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> But, Arrow to Pandas doesn't work.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # Map data type doesn't work from Arrow -> Pandas.
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will this be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a',
>                               pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
>     print(f'FAILED')
>     tb.print_exc()
>
> print(f'Arrow -> Pandas')
> try:
>     t1.to_pandas()
>     print('PASSED')
> except:
>     print(f'FAILED')
>     tb.print_exc()
>
> print(f'Arrow -> Parquet')
>
> fh = io.BytesIO()
> try:
>     pq.write_table(t1, fh)
>     print('PASSED')
> except:
>     print('FAILED')
>     tb.print_exc()
>
> print(f'Parquet -> Arrow')
> try:
>     t2 = pq.read_table(source=fh)
>     print('PASSED')
>     print(t2)
> except:
>     print('FAILED')
>     tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0 
> Pandas Version = 1.0.5 
> df1
>           a
> 0  [(b, 2)]
>
> Pandas -> Arrow
> PASSED
> pyarrow.Table
> a: map<string, string>
>  child 0, entries: struct<key: string not null, value: string> not null
>  child 0, key: string not null
>  child 1, value: string
>
> Arrow -> Pandas
> FAILED
> Traceback (most recent call last):
> File "arrowtest.py", line 26, in <module>
>   t1.to_pandas()
> File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
> File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
>   blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
>   list(extension_columns.keys()))
> File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
> File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
>
> Arrow -> Parquet
> PASSED
>
> Parquet -> Arrow
> FAILED
> Traceback (most recent call last):
> File "arrowtest.py", line 43, in <module>
>   t2 = pq.read_table(source=fh)
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
>   use_pandas_metadata=use_pandas_metadata)
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
>   use_threads=use_threads
> File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
> File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
> File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
> File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet

2020-10-09 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211472#comment-17211472
 ] 

Bryan Cutler edited comment on ARROW-9812 at 10/9/20, 11:50 PM:


Hi [~admrsh], I implemented Map type to Pandas conversion recently in 
ARROW-10151, but it looks like I forgot the line you pointed out in 
{{types.pxi}}. That should be in for the upcoming release; if you are able to 
do a PR before it's cut - likely today or tomorrow - that would be great. 
Otherwise, I can go ahead and add it. I will update this Jira to reflect that 
Pandas conversion is complete. I made ARROW-10260 to add {{to_pandas_dtype}}. 
Thanks!


was (Author: bryanc):
Hi [~admrsh], I implemented Map type to Pandas conversion recently in 
ARROW-10151, but it looks like I forgot the line you pointed out in 
{{types.pxi}}. That should be in for the upcoming release; if you are able to 
do a PR before it's cut - likely today or tomorrow - that would be great. 
Otherwise, I can go ahead and add it. I will update this Jira to reflect that 
Pandas conversion is complete. Thanks!

> [Python] Map data types doesn't work from Arrow to Pandas and Parquet
> -
>
> Key: ARROW-9812
> URL: https://issues.apache.org/jira/browse/ARROW-9812
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mayur Srivastava
>Priority: Major
>
> Hi,
> I'm having problems using 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> But, Arrow to Pandas doesn't work.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # Map data type doesn't work from Arrow -> Pandas.
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will this be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a',
>                               pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
>     print(f'FAILED')
>     tb.print_exc()
>
> print(f'Arrow -> Pandas')
> try:
>     t1.to_pandas()
>     print('PASSED')
> except:
>     print(f'FAILED')
>     tb.print_exc()
>
> print(f'Arrow -> Parquet')
>
> fh = io.BytesIO()
> try:
>     pq.write_table(t1, fh)
>     print('PASSED')
> except:
>     print('FAILED')
>     tb.print_exc()
>
> print(f'Parquet -> Arrow')
> try:
>     t2 = pq.read_table(source=fh)
>     print('PASSED')
>     print(t2)
> except:
>     print('FAILED')
>     tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0 
> Pandas Version = 1.0.5 
> df1
>           a
> 0  [(b, 2)]
>
> Pandas -> Arrow
> PASSED
> pyarrow.Table
> a: map<string, string>
>  child 0, entries: struct<key: string not null, value: string> not null
>  child 0, key: string not null
>  child 1, value: string
>
> Arrow -> Pandas
> FAILED
> Traceback (most recent call last):
> File "arrowtest.py", line 26, in <module>
>   t1.to_pandas()
> File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
> File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
>   blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
>   list(extension_columns.keys()))
> File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
> File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
>
> Arrow -> Parquet
> PASSED
>
> Parquet -> Arrow
> FAILED
> Traceback (most recent call last):
> File "arrowtest.py", line 43, in <module>
>   t2 = pq.read_table(source=fh)
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
>   use_pandas_metadata=use_pandas_metadata)
> File 

[jira] [Created] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-09 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-10260:


 Summary: [Python] Missing MapType to Pandas dtype
 Key: ARROW-10260
 URL: https://issues.apache.org/jira/browse/ARROW-10260
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Bryan Cutler


The Map type conversion to Pandas done in ARROW-10151 forgot to add dtype 
mapping for {{to_pandas_dtype()}}

 
{code:java}
In [2]: d = pa.map_(pa.int64(), pa.float64())

In [3]: d.to_pandas_dtype()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-...> in <module>
----> 1 d.to_pandas_dtype()

~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()

NotImplementedError: map<int64, double>
{code}
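
For context, the eventual fix is a one-line entry in the type-id-to-dtype table in {{types.pxi}} (see the ARROW-9812 discussion elsewhere in this digest), after which MapType converts like the other nested types. A sketch of the post-fix behavior, assuming pyarrow 2.0.0 or later:

{code:python}
import numpy as np
import pyarrow as pa

# With the _Type_MAP -> np.object_ entry in place, MapType reports the
# generic NumPy object dtype instead of raising NotImplementedError:
t = pa.map_(pa.int64(), pa.float64())
print(t.to_pandas_dtype())   # <class 'numpy.object_'>
assert t.to_pandas_dtype() is np.object_
{code}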



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet

2020-10-09 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211472#comment-17211472
 ] 

Bryan Cutler commented on ARROW-9812:
-

Hi [~admrsh], I implemented Map type to Pandas conversion recently in 
ARROW-10151, but it looks like I forgot the line you pointed out in 
{{types.pxi}}. That should be in for the upcoming release; if you are able to 
do a PR before it's cut - likely today or tomorrow - that would be great. 
Otherwise, I can go ahead and add it. I will update this Jira to reflect that 
Pandas conversion is complete. Thanks!

> [Python] Map data types doesn't work from Arrow to Pandas and Parquet
> -
>
> Key: ARROW-9812
> URL: https://issues.apache.org/jira/browse/ARROW-9812
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mayur Srivastava
>Priority: Major
>
> Hi,
> I'm having problems using 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> But, Arrow to Pandas doesn't work.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # Map data type doesn't work from Arrow -> Pandas.
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will this be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a',
>                               pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
>     print(f'FAILED')
>     tb.print_exc()
>
> print(f'Arrow -> Pandas')
> try:
>     t1.to_pandas()
>     print('PASSED')
> except:
>     print(f'FAILED')
>     tb.print_exc()
>
> print(f'Arrow -> Parquet')
>
> fh = io.BytesIO()
> try:
>     pq.write_table(t1, fh)
>     print('PASSED')
> except:
>     print('FAILED')
>     tb.print_exc()
>
> print(f'Parquet -> Arrow')
> try:
>     t2 = pq.read_table(source=fh)
>     print('PASSED')
>     print(t2)
> except:
>     print('FAILED')
>     tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0 
> Pandas Version = 1.0.5 
> df1
>           a
> 0  [(b, 2)]
>
> Pandas -> Arrow
> PASSED
> pyarrow.Table
> a: map<string, string>
>  child 0, entries: struct<key: string not null, value: string> not null
>  child 0, key: string not null
>  child 1, value: string
>
> Arrow -> Pandas
> FAILED
> Traceback (most recent call last):
> File "arrowtest.py", line 26, in <module>
>   t1.to_pandas()
> File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
> File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
>   blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
>   list(extension_columns.keys()))
> File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
> File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
>
> Arrow -> Parquet
> PASSED
>
> Parquet -> Arrow
> FAILED
> Traceback (most recent call last):
> File "arrowtest.py", line 43, in <module>
>   t2 = pq.read_table(source=fh)
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
>   use_pandas_metadata=use_pandas_metadata)
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
>   use_threads=use_threads
> File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
> File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
> File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
> File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
> {code}

[jira] [Created] (ARROW-10259) [Rust] Support field metadata

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10259:
--

 Summary: [Rust] Support field metadata
 Key: ARROW-10259
 URL: https://issues.apache.org/jira/browse/ARROW-10259
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


The biggest hurdle to adding field metadata is HashMap and HashSet not 
implementing Hash, Ord and PartialOrd.

I was thinking of implementing the metadata as a Vec<(String, String)> to 
overcome this limitation, and then serializing correctly to JSON.
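
For comparison, field metadata on the Python side is already exposed as a flat string-to-string mapping, the same shape a Vec<(String, String)> would serialize to. A small pyarrow illustration:

{code:python}
import pyarrow as pa

# Field metadata is a flat key/value mapping attached to the field:
f = pa.field("id", pa.int64(), metadata={"description": "primary key"})
print(f.metadata)   # {b'description': b'primary key'}
{code}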



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10258) [Rust] Support extension arrays

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10258:
--

 Summary: [Rust] Support extension arrays
 Key: ARROW-10258
 URL: https://issues.apache.org/jira/browse/ARROW-10258
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Integration, Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


This should include:
 * supporting the Arrow format
 * supporting field metadata

We can optionally:
 * support recognising known extensions (like UUID)

I'm mainly opening this up for wider visibility; I noticed that I was catching 
strays from metadata integration tests failing because Field doesn't support 
metadata :(



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10258) [Rust] Support extension arrays

2020-10-09 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10258:
---
Fix Version/s: 3.0.0

> [Rust] Support extension arrays
> ---
>
> Key: ARROW-10258
> URL: https://issues.apache.org/jira/browse/ARROW-10258
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Integration, Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Priority: Major
> Fix For: 3.0.0
>
>
> This should include:
>  * supporting the Arrow format
>  * supporting field metadata
> We can optionally:
>  * support recognising known extensions (like UUID)
> I'm mainly opening this up for wider visibility; I noticed that I was 
> catching strays from metadata integration tests failing because Field doesn't 
> support metadata :(



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10257) [R] Prepare news/docs for 2.0 release

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10257:
---
Labels: pull-request-available  (was: )

> [R] Prepare news/docs for 2.0 release
> -
>
> Key: ARROW-10257
> URL: https://issues.apache.org/jira/browse/ARROW-10257
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8810) [R] Add documentation about Parquet format, appending to stream format

2020-10-09 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211468#comment-17211468
 ] 

Neal Richardson commented on ARROW-8810:


Doing in ARROW-10257

> [R] Add documentation about Parquet format, appending to stream format
> --
>
> Key: ARROW-8810
> URL: https://issues.apache.org/jira/browse/ARROW-8810
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Carl Boettiger
>Assignee: Neal Richardson
>Priority: Minor
> Fix For: 2.0.0
>
>
> Is it possible to append new rows to an existing .parquet file using the R 
> client's arrow::write_parquet(), in a manner similar to the `append=TRUE` 
> argument in text-based output formats like write.table()? 
>  
> Apologies as this is perhaps more a question of documentation or user 
> interface, or maybe just my ignorance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10257) [R] Prepare news/docs for 2.0 release

2020-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10257:
---

 Summary: [R] Prepare news/docs for 2.0 release
 Key: ARROW-10257
 URL: https://issues.apache.org/jira/browse/ARROW-10257
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 2.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8296) [C++][Dataset] IpcFileFormat should support writing files with compressed buffers

2020-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-8296.

Resolution: Fixed

Issue resolved by pull request 8389
[https://github.com/apache/arrow/pull/8389]

> [C++][Dataset] IpcFileFormat should support writing files with compressed 
> buffers
> -
>
> Key: ARROW-8296
> URL: https://issues.apache.org/jira/browse/ARROW-8296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9870) [R] Friendly interface for filesystems (S3)

2020-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-9870.

Resolution: Fixed

Issue resolved by pull request 8351
[https://github.com/apache/arrow/pull/8351]

> [R] Friendly interface for filesystems (S3)
> ---
>
> Key: ARROW-9870
> URL: https://issues.apache.org/jira/browse/ARROW-9870
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The Filesystem methods don't provide a human-friendly interface for basic 
> operations like ls, mkdir, etc. Since we provide access to S3 and potentially 
> other cloud storage, it would be nice to have simple methods for exploring it.
> Additional ideas:
> * S3Bucket class/constructor: it's basically a SubTreeFileSystem containing 
> S3FS and a path, except that we can auto-detect a bucket's region.
> * Add a class like the FileLocator C++ struct: list(fs, path). It's _also_ kinda 
> like a SubTreeFileSystem, but with different methods and intents. Aside from 
> use in ls/mkdir/cp, it could be used in file reader/writers instead of having 
> an extra {{filesystem}} argument added everywhere, e.g. 
> {{fs$path("path/to/file")}}. See 
> https://github.com/apache/arrow/pull/8197#discussion_r494325934
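
For reference, the pyarrow spelling of the S3Bucket idea is a SubTreeFileSystem wrapped around an S3FileSystem; the bucket name and region below are placeholders:

{code:python}
from pyarrow import fs

# A "bucket" is essentially a SubTreeFileSystem rooted at the bucket path;
# the ticket proposes auto-detecting the region instead of passing it in.
s3 = fs.S3FileSystem(region="us-east-1")        # placeholder region
bucket = fs.SubTreeFileSystem("my-bucket", s3)  # placeholder bucket name

# ls-style operations then work relative to the subtree, e.g.:
# bucket.get_file_info(fs.FileSelector("data", recursive=True))
{code}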



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10114) [R] Segfault in to_dataframe_parallel with deeply nested structs

2020-10-09 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-10114.
-
Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8411
[https://github.com/apache/arrow/pull/8411]

> [R] Segfault in to_dataframe_parallel with deeply nested structs
> 
>
> Key: ARROW-10114
> URL: https://issues.apache.org/jira/browse/ARROW-10114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: > sessionInfo()
> R version 3.6.3 (2020-02-29)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Linux Mint 19.3
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
> locale:
>  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C  
>  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=sv_SE.UTF-8LC_MESSAGES=en_US.UTF-8   
>  [7] LC_PAPER=sv_SE.UTF-8   LC_NAME=C 
>  [9] LC_ADDRESS=C   LC_TELEPHONE=C
> [11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C   
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base 
> other attached packages:
> [1] arrow_1.0.1
> loaded via a namespace (and not attached):
>  [1] tidyselect_1.1.0 bit_4.0.4compiler_3.6.3   magrittr_1.5
>  [5] assertthat_0.2.1 R6_2.4.1 glue_1.4.1   Rcpp_1.0.5  
>  [9] bit64_4.0.2  vctrs_0.3.2  rlang_0.4.7  purrr_0.3.4 
>Reporter: Markus Skyttner
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: Dockerfile, Makefile, reprex_10114.R
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A .jsonl file (newline separated JSON) created from open data available at 
> [ftp://ftp.libris.kb.se/pub/spa/swepub-deduplicated-2019-12-29.zip] is used 
> with the R package arrow (installed from CRAN) using the following statement:
> > arrow::read_json_arrow("~/.config/swepub/head.jsonl")
> It crashes RStudio with no error message. At the R prompt, the error message 
> is:
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>  SET_VECTOR_ELT() can only be applied to a 'list', not a 'integer'
> The file "head.jsonl" above was created from the extracted zip's .jsonl file 
> with the *nix "head -1 $BIG_JSONL_FILE" command. It can be parsed with 
> jsonlite and tidyjson.
> Also got this error message at one point:
> > arrow::read_json_arrow("head.jsonl", as_data_frame = TRUE)
> *** caught segfault ***
> address 0x8, cause 'memory not mapped'
> Traceback:
>  1: structure(x, extra_cols = colonnade[extra_cols], class = 
> "pillar_squeezed_colonnade")
>  2: new_colonnade_sqeezed(out, colonnade = x, extra_cols = extra_cols)
>  3: pillar::squeeze(x$mcf, width = width)
>  4: format.trunc_mat(mat)
>  5: format(mat)
>  6: format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
>  7: format(x, ..., n = n, width = width, n_extra = n_extra)
>  8: paste0(..., collapse = "\n")
>  9: cli::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
> 10: print.tbl(x)
> 11: (function (x, ...) UseMethod("print"))(x)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10255:
---
Labels: pull-request-available  (was: )

> [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
> ---
>
> Key: ARROW-10255
> URL: https://issues.apache.org/jira/browse/ARROW-10255
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: 0.17.1
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Presently most of our public classes can't be easily 
> [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library 
> consumers. This is a problem for libraries that only need to use parts of 
> Arrow.
> For example, the vis.gl projects have an integration test that imports three 
> of our simpler classes and tests the resulting bundle size:
> {code:javascript}
> import {Schema, Field, Float32} from 'apache-arrow';
> // | Bundle Size| Compressed 
> // | 202KB (207112) KB  | 45KB (46618) KB
> {code}
> We can help solve this with the following changes:
> * Add "sideEffects": false to our ESM package.json
> * Reorganize our imports to only include what's needed
> * Eliminate or move some static/member methods to standalone exported 
> functions
> * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't 
> compile in its own Buffer shim
> * Removing flatbuffers namespaces from generated TS because these defeat 
> Webpack's tree-shaking ability
> Candidate functions for removal/moving to standalone functions:
> * Schema.new, Schema.from, Schema.prototype.compareTo
> * Field.prototype.compareTo
> * Type.prototype.compareTo
> * Table.new, Table.from
> * Column.new
> * Vector.new, Vector.from
> * RecordBatchReader.from
> After applying a few of the above changes to the Schema and flatbuffers 
> files, I was able to reduce the vis.gl import size by 90%:
> {code:javascript}
> // Bundle Size  | Compressed
> // 24KB (24942) KB  | 6KB (6154) KB
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10256) [C++][Flight] Disable -Werror carefully

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10256:
---
Labels: pull-request-available  (was: )

> [C++][Flight] Disable -Werror carefully
> ---
>
> Key: ARROW-10256
> URL: https://issues.apache.org/jira/browse/ARROW-10256
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10256) [C++][Flight] Disable -Werror carefully

2020-10-09 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10256:


 Summary: [C++][Flight] Disable -Werror carefully
 Key: ARROW-10256
 URL: https://issues.apache.org/jira/browse/ARROW-10256
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10220) [JS] Cache javascript utf-8 dictionary keys?

2020-10-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10220:
-
Summary: [JS] Cache javascript utf-8 dictionary keys?  (was: Cache 
javascript utf-8 dictionary keys?)

> [JS] Cache javascript utf-8 dictionary keys?
> 
>
> Key: ARROW-10220
> URL: https://issues.apache.org/jira/browse/ARROW-10220
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Affects Versions: 1.0.1
>Reporter: Ben Schmidt
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> String decoding from arrow tables is a major bottleneck in using arrow in 
> Javascript–it can take a second to decode a million rows. For utf-8 types, 
> I'm not sure what could be done; but some memoization would help utf-8 
> dictionary types.
> Currently, the javascript implementation decodes a utf-8 string every time 
> you request an item from a dictionary with utf-8 data. If arrow cached the 
> decoded strings to a native js Map, routine operations like looping over all 
> the entries in a text column might be on the order of 10x faster. Here's an 
> observable notebook [benchmarking that and a couple other 
> strategies|https://observablehq.com/@bmschmidt/faster-arrow-dictionary-unpacking].
> I would file a pull request, but 1) I would have to learn some typescript to 
> do so, and 2) this idea may be undesirable because it creates new objects 
> that will increase the memory footprint of a table, rather than just using 
> the typed arrays.
> Some discussion of how the real-world issues here affect the arquero project 
> is [here|https://github.com/uwdata/arquero/issues/1].
>  
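
The proposed memoization is small; here is a language-neutral sketch (written in Python rather than the JS/TypeScript implementation, with illustrative names only):

{code:python}
# Sketch of the proposed caching: decode each distinct dictionary slot once,
# then serve repeated keys from a plain map.
_decoded = {}

def lookup(indices, dictionary, i):
    """Decoded string for row i of a dictionary-encoded utf-8 column."""
    key = indices[i]
    if key not in _decoded:                        # first hit: pay the utf-8 cost
        _decoded[key] = dictionary[key].decode("utf-8")
    return _decoded[key]                           # repeats: dict lookup only

rows = [0, 1, 0, 0]
dictionary = [b"foo", b"bar"]
print([lookup(rows, dictionary, i) for i in range(4)])
# ['foo', 'bar', 'foo', 'foo'] -- only two decode calls for four rows
{code}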



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking

2020-10-09 Thread Paul Taylor (Jira)
Paul Taylor created ARROW-10255:
---

 Summary: [JS] Reorganize imports and exports to be more friendly 
to ESM tree-shaking
 Key: ARROW-10255
 URL: https://issues.apache.org/jira/browse/ARROW-10255
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: 0.17.1
Reporter: Paul Taylor
Assignee: Paul Taylor


Presently most of our public classes can't be easily 
[tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library consumers. 
This is a problem for libraries that only need to use parts of Arrow.

For example, the vis.gl projects have an integration test that imports three of 
our simpler classes and tests the resulting bundle size:

{code:javascript}
import {Schema, Field, Float32} from 'apache-arrow';

// | Bundle Size| Compressed 
// | 202KB (207112) KB  | 45KB (46618) KB
{code}

We can help solve this with the following changes:
* Add "sideEffects": false to our ESM package.json
* Reorganize our imports to only include what's needed
* Eliminate or move some static/member methods to standalone exported functions
* Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't compile 
in its own Buffer shim
* Removing flatbuffers namespaces from generated TS because these defeat 
Webpack's tree-shaking ability

Candidate functions for removal/moving to standalone functions:
* Schema.new, Schema.from, Schema.prototype.compareTo
* Field.prototype.compareTo
* Type.prototype.compareTo
* Table.new, Table.from
* Column.new
* Vector.new, Vector.from
* RecordBatchReader.from

After applying a few of the above changes to the Schema and flatbuffers files, 
I was able to reduce the vis.gl import size by 90%:
{code:javascript}
// Bundle Size  | Compressed
// 24KB (24942) KB  | 6KB (6154) KB
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10254) [R] Revisit (ab)use of SubTreeFileSystem

2020-10-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10254:
---

 Summary: [R] Revisit (ab)use of SubTreeFileSystem
 Key: ARROW-10254
 URL: https://issues.apache.org/jira/browse/ARROW-10254
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 3.0.0


Followup to ARROW-9870



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10252) [Python] Add option to skip inclusion of Arrow headers in Python installation

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10252:
---
Labels: pull-request-available  (was: )

> [Python] Add option to skip inclusion of Arrow headers in Python installation
> -
>
> Key: ARROW-10252
> URL: https://issues.apache.org/jira/browse/ARROW-10252
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We don't want to have them as part of the conda package as the single source 
> should be {{arrow-cpp}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10253) [Python] Don't bundle plasma-store-server in pyarrow conda package

2020-10-09 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-10253:


 Summary: [Python] Don't bundle plasma-store-server in pyarrow 
conda package
 Key: ARROW-10253
 URL: https://issues.apache.org/jira/browse/ARROW-10253
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn


We currently have it in the {{arrow-cpp}} and the {{pyarrow}} conda package, we 
should only have it in {{arrow-cpp}} as this is always there and also the 
source of the binary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10252) [Python] Add option to skip inclusion of Arrow headers in Python installation

2020-10-09 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-10252:


 Summary: [Python] Add option to skip inclusion of Arrow headers in 
Python installation
 Key: ARROW-10252
 URL: https://issues.apache.org/jira/browse/ARROW-10252
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn


We don't want to have them as part of the conda package as the single source 
should be {{arrow-cpp}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel

2020-10-09 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-10251:
---
Description: 
MemTable::load() should load partitions in parallel using async tasks, rather 
than loading one partition at a time.

Also, we should make batch size configurable. It is currently hard-coded to 
1024*1024 which can be quite inefficient.

  was:
MemTable::load() should load partitions in parallel using async tasks, rather 
than loading onw partition at a time.

Also, we should make batch size configurable. It is currently hard-coded to 
1024*1024 which can be quite inefficient.


> [Rust] [DataFusion] MemTable::load() should load partitions in parallel
> ---
>
> Key: ARROW-10251
> URL: https://issues.apache.org/jira/browse/ARROW-10251
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>  Labels: beginner
> Fix For: 3.0.0
>
>
> MemTable::load() should load partitions in parallel using async tasks, rather 
> than loading one partition at a time.
> Also, we should make batch size configurable. It is currently hard-coded to 
> 1024*1024 which can be quite inefficient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel

2020-10-09 Thread Andy Grove (Jira)
Andy Grove created ARROW-10251:
--

 Summary: [Rust] [DataFusion] MemTable::load() should load 
partitions in parallel
 Key: ARROW-10251
 URL: https://issues.apache.org/jira/browse/ARROW-10251
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
 Fix For: 3.0.0


MemTable::load() should load partitions in parallel using async tasks, rather 
than loading one partition at a time.

Also, we should make batch size configurable. It is currently hard-coded to 
1024*1024 which can be quite inefficient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet

2020-10-09 Thread Derek Marsh (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211270#comment-17211270
 ] 

Derek Marsh edited comment on ARROW-9812 at 10/9/20, 6:33 PM:
--

Hi all,

I've searched existing issues as best I can and this issue mentions "1. Map 
data type doesn't work from Arrow -> Pandas."

I built the project from master at roughly 15:00 UTC today (October 9) and 
added one line before I built pyarrow:
{code:java}
_Type_MAP: np.object_,{code}
after this line: 
[types.pxi|https://github.com/apache/arrow/blob/master/python/pyarrow/types.pxi#L49]

This enables Table.to_pandas() to convert a MapType to List[Tuple[...]].
{code:java}
>>> import pyarrow as pa
>>> d = pa.map_(pa.int64(), pa.float64())
>>> d.to_pandas_dtype()
<class 'numpy.object_'>
{code}
{code:java}
>>> tbl
pyarrow.Table
stored_on: double
vals: map<int64, double>
 child 0, entries: struct<key: int64 not null, value: double> not null
 child 0, key: int64 not null
 child 1, value: double
>>> tbl.to_pydict()
{'stored_on': [1585347700.204351], 'vals': [[(514, 12.0), (515, 1300.0), (519, 
125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), 
(3240, 3.0)]]}
>>> df = tbl.to_pandas()
>>> df.vals
0 [(514, 12.0), (515, 1300.0), (519, 125.0), (29...
Name: vals, dtype: object
>>> df.vals.iloc[0]
[(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), 
(3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)]
>>> df.vals.iloc[0][0]
(514, 12.0){code}
I understand this is a very trivial working example, but am interested in what 
the maintainers think about this solution and whether it merits further 
testing/consideration.

Thanks.


was (Author: admrsh):
Hi all,

I've searched existing issues as best I can and this issue mentions "1. Map 
data type doesn't work from Arrow -> Pandas."

I built the project from master at roughly 15:00 UTC today (October 9) and 
added one line before I built pyarrow:
{code:java}
_Type_MAP: np.object_,{code}
after this line: 
[types.pxi|https://github.com/apache/arrow/blob/master/python/pyarrow/types.pxi#L49]

This enables Table.to_pandas() to convert a MapType to List[Tuple[...]].
{code:java}
>>> import pyarrow as pa
>>> d = pa.map_(pa.int64(), pa.float64())
>>> d.to_pandas_dtype()
<class 'numpy.object_'>
{code}
{code:java}
>>> tbl
pyarrow.Table
stored_on: double
vals: map<int64, double>
 child 0, entries: struct<key: int64 not null, value: double> not null
 child 0, key: int64 not null
 child 1, value: double
>>> tbl.to_pydict()
{'stored_on': [1585347700.204351], 'vals': [[(514, 12.0), (515, 1300.0), (519, 
125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), 
(3240, 3.0)]]}
>>> df = tbl.to_pandas()
>>> df.vals
0 [(514, 12.0), (515, 1300.0), (519, 125.0), (29...
Name: vals, dtype: object
>>> df.vals.iloc[0]
[(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), 
(3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)]
>>> df.vals.iloc[0][0]
(514, 12.0){code}
I understand this is a very trivial working example, but am interested in what 
the maintainers think about this solution and whether it merits further 
testing/consideration.

Thanks.

> [Python] Map data types doesn't work from Arrow to Pandas and Parquet
> -
>
> Key: ARROW-9812
> URL: https://issues.apache.org/jira/browse/ARROW-9812
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mayur Srivastava
>Priority: Major
>
> Hi,
> I'm having problems using 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> But, Arrow to Pandas doesn't work.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # Map data type doesn't work from Arrow -> Pandas.
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will this be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', 
>         pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
> 

[jira] [Commented] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet

2020-10-09 Thread Derek Marsh (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211270#comment-17211270
 ] 

Derek Marsh commented on ARROW-9812:


Hi all,

I've searched existing issues as best I can and this issue mentions "1. Map 
data type doesn't work from Arrow -> Pandas."

I built the project from master at roughly 15:00 UTC today (October 9) and 
added one line before I built pyarrow:
{code:java}
_Type_MAP: np.object_,{code}
after this line: 
[types.pxi|https://github.com/apache/arrow/blob/master/python/pyarrow/types.pxi#L49]

This enables Table.to_pandas() to convert a MapType to List[Tuple[...]].
{code:java}
>>> import pyarrow as pa
>>> d = pa.map_(pa.int64(), pa.float64())
>>> d.to_pandas_dtype()
<class 'numpy.object_'>
{code}
{code:java}
>>> tbl
pyarrow.Table
stored_on: double
vals: map<int64, double>
 child 0, entries: struct<key: int64 not null, value: double> not null
 child 0, key: int64 not null
 child 1, value: double
>>> tbl.to_pydict()
{'stored_on': [1585347700.204351], 'vals': [[(514, 12.0), (515, 1300.0), (519, 
125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), 
(3240, 3.0)]]}
>>> df = tbl.to_pandas()
>>> df.vals
0 [(514, 12.0), (515, 1300.0), (519, 125.0), (29...
Name: vals, dtype: object
>>> df.vals.iloc[0]
[(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), 
(3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)]
>>> df.vals.iloc[0][0]
(514, 12.0){code}
I understand this is a very trivial working example, but am interested in what 
the maintainers think about this solution and whether it merits further 
testing/consideration.

Thanks.

> [Python] Map data types doesn't work from Arrow to Pandas and Parquet
> -
>
> Key: ARROW-9812
> URL: https://issues.apache.org/jira/browse/ARROW-9812
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mayur Srivastava
>Priority: Major
>
> Hi,
> I'm having problems using 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> But, Arrow to Pandas doesn't work.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data 
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs" 
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
>  # Map data type doesn't work from Arrow -> Pandas.
>  # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work? 
> 2. If these are unsupported features, will this be fixed in a future version? 
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> 
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> 
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> 
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', 
>         pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
>     print(f'FAILED')
>     tb.print_exc()
> 
> print(f'Arrow -> Pandas')
> try:
>     t1.to_pandas()
>     print('PASSED')
> except:
>     print(f'FAILED')
>     tb.print_exc()
> 
> print(f'Arrow -> Parquet')
> fh = io.BytesIO()
> try:
>     pq.write_table(t1, fh)
>     print('PASSED')
> except:
>     print('FAILED')
>     tb.print_exc()
> 
> print(f'Parquet -> Arrow')
> try:
>     t2 = pq.read_table(source=fh)
>     print('PASSED')
>     print(t2)
> except:
>     print('FAILED')
>     tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0 
> Pandas Version = 1.0.5 
> df1
>           a
> 0  [(b, 2)]
>  
> Pandas -> Arrow 
> PASSED 
> pyarrow.Table 
> a: map<string, string>
>  child 0, entries: struct<key: string not null, value: string> not null
>  child 0, key: string not null
>  child 1, value: string 
>  
> Arrow -> Pandas 
> FAILED 
> Traceback (most recent call last):
> File "arrowtest.py", line 26, in <module>
>     t1.to_pandas() 
> File "pyarrow/array.pxi", line 715, in 
> pyarrow.lib._PandasConvertible.to_pandas 
> File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas 
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 
> 779, in table_to_blockmanager 
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes) 
> File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 
> 1115, in _table_to_blocks 
>     list(extension_columns.keys())) 
> File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks 
> File "pyarrow/error.pxi", line 

[jira] [Updated] (ARROW-9956) [C++][Gandiva] Implement Binary string function in Gandiva

2020-10-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9956:

Summary: [C++][Gandiva] Implement Binary string function in Gandiva  (was: 
Implement Binary string function in Gandiva)

> [C++][Gandiva] Implement Binary string function in Gandiva
> --
>
> Key: ARROW-9956
> URL: https://issues.apache.org/jira/browse/ARROW-9956
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Naman Udasi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Implementation for the new binary_string function in Gandiva.
> The function takes in a normal string or a hexadecimal string (e.g. 
> _\x41\x20\x42\x20\x43_) and converts it to VARBINARY (byte array).
> It is generally used with CAST functions.
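> 
> For illustration, a small Python sketch of the conversion the function 
> performs conceptually (the real implementation is C++ inside Gandiva; this 
> helper is only an approximation, not the Gandiva code):
> {code:python}
> # Python sketch: interpret embedded \xNN escapes and return the raw bytes.
> # Not the Gandiva implementation, just the concept it implements.
> def binary_string(s: str) -> bytes:
>     return s.encode("ascii").decode("unicode_escape").encode("latin-1")
> 
> print(binary_string(r"\x41\x20\x42\x20\x43"))  # b'A B C'
> {code}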



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef

2020-10-09 Thread Jorge Leitão (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10215.
--
Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8408
[https://github.com/apache/arrow/pull/8408]

> [Rust] [DataFusion] Rename "Source" typedef
> ---
>
> Key: ARROW-10215
> URL: https://issues.apache.org/jira/browse/ARROW-10215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The name "Source" for this type doesn't make sense to me. I would like to 
> discuss alternate names for it.
> {code:java}
> type Source = Box<dyn RecordBatchReader + Send>; {code}
> My first thoughts are:
>  * RecordBatchIterator
>  * RecordBatchStream
>  * SendableRecordBatchReader



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10206) [Python][C++][FlightRPC] Add client option to disable server validation

2020-10-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10206:
-
Component/s: C++

> [Python][C++][FlightRPC] Add client option to disable server validation
> ---
>
> Key: ARROW-10206
> URL: https://issues.apache.org/jira/browse/ARROW-10206
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Note that this requires using grpc-cpp version 1.25 or higher.
> This requires using GRPC's TlsCredentials class, which is in a different 
> namespace for 1.25-1.31 vs. 1.32+ as well.
> This class and its related options provide an option to disable server 
> certificate checks and require the caller to supply a callback to be used 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10206) [Python][C++][FlightRPC] Add client option to disable server validation

2020-10-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10206:
-
Component/s: Python

> [Python][C++][FlightRPC] Add client option to disable server validation
> ---
>
> Key: ARROW-10206
> URL: https://issues.apache.org/jira/browse/ARROW-10206
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Note that this requires using grpc-cpp version 1.25 or higher.
> This requires using GRPC's TlsCredentials class, which is in a different 
> namespace for 1.25-1.31 vs. 1.32+ as well.
> This class and its related options provide an option to disable server 
> certificate checks and require the caller to supply a callback to be used 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10206) [Python][C++][FlightRPC] Add client option to disable server validation

2020-10-09 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-10206.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8325
[https://github.com/apache/arrow/pull/8325]

> [Python][C++][FlightRPC] Add client option to disable server validation
> ---
>
> Key: ARROW-10206
> URL: https://issues.apache.org/jira/browse/ARROW-10206
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Note that this requires using grpc-cpp version 1.25 or higher.
> This requires using GRPC's TlsCredentials class, which is in a different 
> namespace for 1.25-1.31 vs. 1.32+ as well.
> This class and its related options provide an option to disable server 
> certificate checks and require the caller to supply a callback to be used 
> instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10250) [FlightRPC][C++] Remove default constructor for FlightClientOptions

2020-10-09 Thread David Li (Jira)
David Li created ARROW-10250:


 Summary: [FlightRPC][C++] Remove default constructor for 
FlightClientOptions
 Key: ARROW-10250
 URL: https://issues.apache.org/jira/browse/ARROW-10250
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Reporter: David Li
 Fix For: 3.0.0


We should delete the default constructor for FlightClientOptions and require 
the struct to always be initialized with Defaults().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10248:
---
Labels: pull-request-available  (was: )

> [C++][Dataset] Dataset writing does not write schema metadata
> -
>
> Key: ARROW-10248
> URL: https://issues.apache.org/jira/browse/ARROW-10248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Not sure if this is related to the writing refactor that landed yesterday, 
> but `write_dataset` does not preserve the schema metadata (e.g. used for pandas 
> metadata):
> {code}
> In [20]: df = pd.DataFrame({'a': [1, 2, 3]})
> In [21]: table = pa.Table.from_pandas(df)
> In [22]: table.schema
> Out[22]: 
> a: int64
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
> 396
> In [23]: ds.write_dataset(table, "test_write_dataset_pandas", 
> format="parquet")
> In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema
> Out[24]: 
> a: int64
>   -- field metadata --
>   PARQUET:field_id: '1'
> {code}
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
> yet look into how easy it would be to fix.
> cc [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation

2020-10-09 Thread Jorge Leitão (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211092#comment-17211092
 ] 

Jorge Leitão commented on ARROW-10243:
--

All great ideas. Yes!

> [Rust] [Datafusion] Optimize literal expression evaluation
> --
>
> Key: ARROW-10243
> URL: https://issues.apache.org/jira/browse/ARROW-10243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Priority: Major
> Attachments: flamegraph.svg
>
>
> While benchmarking the tpch query I noticed that the physical literal 
> expression takes up a sizable amount of time. I think the creation of the 
> corresponding array for numeric literals can be sped up by creating Buffer 
> and ArrayData directly without going through a builder. That also allows us 
> to skip building a null bitmap for non-null literals.
> I'm also wondering whether it might be possible to cache the created array. 
> For queries without a WHERE clause, I'd expect all batches except the last to 
> have the same length. I'm not sure though where to store the cached value.
> Another possible optimization could be to cast literals already on the 
> logical plan side. In the tpch query the literal `1` is of type `u64` in the 
> logical plan and then needs to be processed by a cast kernel to convert to 
> `f64` for usage in an arithmetic expression.
> The attached flamegraph is of 10 runs of tpch, with the data being loaded 
> into memory before running the queries (See ARROW-10240).
> {code}
> flamegraph ./target/release/tpch --iterations 10 --path ../tpch-dbgen 
> --format tbl --query 1 --batch-size 4096 -c1 --load
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10175) [CI] Nightly hdfs integration test job fails

2020-10-09 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10175.
-
Resolution: Fixed

Issue resolved by pull request 8413
[https://github.com/apache/arrow/pull/8413]

> [CI] Nightly hdfs integration test job fails
> 
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Two tests fail:
> https://github.com/ursa-labs/crossbow/runs/1204680589
> [removed bogus investigation]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10203) [Doc] Capture guidance for endianness support in contributors guide.

2020-10-09 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10203:

Summary: [Doc] Capture guidance for endianness support in contributors 
guide.  (was: Capture guidance for endianness support in contributors guide.)

> [Doc] Capture guidance for endianness support in contributors guide.
> 
>
> Key: ARROW-10203
> URL: https://issues.apache.org/jira/browse/ARROW-10203
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccak7z5t--hhhr9dy43pyhd6m-xou4qogwqvlwzsg-koxxjpt...@mail.gmail.com%3e



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10231) [CI] Unable to download minio in arm32v7 docker image

2020-10-09 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-10231.
-
Resolution: Fixed

Issue resolved by pull request 8396
[https://github.com/apache/arrow/pull/8396]

> [CI] Unable to download minio in arm32v7 docker image
> -
>
> Key: ARROW-10231
> URL: https://issues.apache.org/jira/browse/ARROW-10231
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See build log https://github.com/apache/arrow/runs/1224947766#step:5:2021



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8355) [Python] Reduce the number of pandas dependent test cases in test_feather

2020-10-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-8355.
--
Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8244
[https://github.com/apache/arrow/pull/8244]

> [Python] Reduce the number of pandas dependent test cases in test_feather
> -
>
> Key: ARROW-8355
> URL: https://issues.apache.org/jira/browse/ARROW-8355
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> See comment https://github.com/apache/arrow/pull/6849#discussion_r404160096



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata

2020-10-09 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-10248:


Assignee: Ben Kietzman

> [C++][Dataset] Dataset writing does not write schema metadata
> -
>
> Key: ARROW-10248
> URL: https://issues.apache.org/jira/browse/ARROW-10248
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 2.0.0
>
>
> Not sure if this is related to the writing refactor that landed yesterday, 
> but `write_dataset` does not preserve the schema metadata (e.g. used for pandas 
> metadata):
> {code}
> In [20]: df = pd.DataFrame({'a': [1, 2, 3]})
> In [21]: table = pa.Table.from_pandas(df)
> In [22]: table.schema
> Out[22]: 
> a: int64
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
> 396
> In [23]: ds.write_dataset(table, "test_write_dataset_pandas", 
> format="parquet")
> In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema
> Out[24]: 
> a: int64
>   -- field metadata --
>   PARQUET:field_id: '1'
> {code}
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
> yet look into how easy it would be to fix.
> cc [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7957) [Python] ParquetDataset cannot take HadoopFileSystem as filesystem

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7957:
--
Labels: pull-request-available  (was: )

> [Python] ParquetDataset cannot take HadoopFileSystem as filesystem
> --
>
> Key: ARROW-7957
> URL: https://issues.apache.org/jira/browse/ARROW-7957
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Catherine
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{from pyarrow.fs import HadoopFileSystem}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
>  {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
>  {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
>  
> has error:
>  {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}
>  
> When I tried using the deprecated {{HadoopFileSystem}}:
> {{import pyarrow}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
> {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
> {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}
> {{pa_schema = dataset.schema.to_arrow_schema()}}
> {{pieces = dataset.pieces}}
> {{for piece in pieces: }}
> {{    print(piece.path)}}
>  
> {{piece.path}} loses the {{hdfs://localhost:9000}} prefix.
>  
> I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
> filesystem?
> And {{piece.path}} should have the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7957) [Python] ParquetDataset cannot take HadoopFileSystem as filesystem

2020-10-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7957:
-
Fix Version/s: (was: 3.0.0)
   2.0.0

> [Python] ParquetDataset cannot take HadoopFileSystem as filesystem
> --
>
> Key: ARROW-7957
> URL: https://issues.apache.org/jira/browse/ARROW-7957
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Catherine
>Assignee: Joris Van den Bossche
>Priority: Critical
> Fix For: 2.0.0
>
>
> {{from pyarrow.fs import HadoopFileSystem}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
>  {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
>  {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
>  
> has error:
>  {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}
>  
> When I tried using the deprecated {{HadoopFileSystem}}:
> {{import pyarrow}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
> {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
> {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}
> {{pa_schema = dataset.schema.to_arrow_schema()}}
> {{pieces = dataset.pieces}}
> {{for piece in pieces: }}
> {{    print(piece.path)}}
>  
> {{piece.path}} loses the {{hdfs://localhost:9000}} prefix.
>  
> I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
> filesystem?
> And {{piece.path}} should have the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10245) [CI] Update the conda docker images to use miniforge instead of miniconda

2020-10-09 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211045#comment-17211045
 ] 

Uwe Korn commented on ARROW-10245:
--

This adds ppc64le, aarch64 and osx-arm64 as supported architectures and should 
only require a different download URL. Be aware that miniforge doesn't include 
{{defaults}} as a default channel, and that it is unavailable for Windows.

> [CI] Update the conda docker images to use miniforge instead of miniconda
> -
>
> Key: ARROW-10245
> URL: https://issues.apache.org/jira/browse/ARROW-10245
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> So we could support more architectures 
> https://github.com/conda-forge/miniforge
> cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10175) [CI] Nightly hdfs integration test job fails

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10175:
---
Labels: pull-request-available  (was: )

> [CI] Nightly hdfs integration test job fails
> 
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Two tests fail:
> https://github.com/ursa-labs/crossbow/runs/1204680589
> [removed bogus investigation]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10175) [CI] Nightly hdfs integration test job fails

2020-10-09 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210982#comment-17210982
 ] 

Joris Van den Bossche commented on ARROW-10175:
---

OK, I needed to pass {{use_legacy_dataset=True}}, because the default is now 
to use the dataset implementation, which of course doesn't work when passing 
legacy filesystems.

Now, the first error, which comes from reading from a URI and _not_ passing a 
legacy HadoopFileSystem object, seems a legitimate bug (because passing a URI 
should "just" use the new implementation):

{code}
pyarrow.lib.ArrowInvalid: Path 
'/tmp/pyarrow-test-838/multi-parquet-uri-48569714efc74397816722c9c6723191/0.parquet'
 is not relative to '/user/root'
{code}
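
For reference, a minimal sketch of the workaround for the second failure 
({{path}} and {{legacy_hdfs}} are placeholders for the test fixtures, not the 
actual test code):

{code:python}
import pyarrow.parquet as pq

# Sketch: when handing in a legacy (deprecated) HadoopFileSystem object,
# opt back into the legacy implementation explicitly.
table = pq.read_table(path, filesystem=legacy_hdfs, use_legacy_dataset=True)
{code}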

> [CI] Nightly hdfs integration test job fails
> 
>
> Key: ARROW-10175
> URL: https://issues.apache.org/jira/browse/ARROW-10175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Python
>Reporter: Neal Richardson
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>
> Two tests fail:
> https://github.com/ursa-labs/crossbow/runs/1204680589
> [removed bogus investigation]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present

2020-10-09 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210933#comment-17210933
 ] 

Joris Van den Bossche commented on ARROW-10246:
---

This was already reported as ARROW-10237 and fixed in the meantime.  
But we should have notified the mailing list about it, sorry about that! Thanks 
for looking into it anyway.

> [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when 
> duplicate values are present
> -
>
> Key: ARROW-10246
> URL: https://issues.apache.org/jira/browse/ARROW-10246
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Matt Jadczak
>Priority: Major
>
> Copying this from [the mailing 
> list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]
> We can observe the following odd behaviour when round-tripping data via 
> parquet using pyarrow, when the data contains dictionary arrays with 
> duplicate values.
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> my_table = pa.Table.from_batches(
>     [
>         pa.RecordBatch.from_arrays(
>             [
>                 pa.array([0, 1, 2, 3, 4]),
>                 pa.DictionaryArray.from_arrays(
>                     pa.array([0, 1, 2, 3, 4]),
>                     pa.array(['a', 'd', 'c', 'd', 'e'])
>                 )
>             ],
>             names=['foo', 'bar']
>         )
>     ]
> )
> my_table.validate(full=True)
> 
> pq.write_table(my_table, "foo.parquet")
> 
> read_table = pq.ParquetFile("foo.parquet").read()
> read_table.validate(full=True)
> 
> print(my_table.column(1).to_pylist())
> print(read_table.column(1).to_pylist())
> 
> assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> {code}
> Both tables pass full validation, yet the last three lines print:
> {code:java}
> ['a', 'd', 'c', 'd', 'e']
> ['a', 'd', 'c', 'e', 'a']
> Traceback (most recent call last):
>  File 
> "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 
> 29, in <module>
>  assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> AssertionError{code}
> Which clearly doesn't look right!
>  
> It seems to me that the reason this is happening is that when re-encoding an 
> Arrow dictionary as a Parquet one, the function at
> [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]
> is called to create a Parquet DictEncoder out of the Arrow dictionary data. 
> This internally uses a map from value to index, and this map is constructed 
> by continually calling GetOrInsert on a memo table. When called with 
> duplicate values as in Al's example, the duplicates do not cause a new 
> dictionary index to be allocated, but instead return the existing one (which 
> is just ignored). However, the caller assumes that the resulting Parquet 
> dictionary uses the exact same indices as the Arrow one, and proceeds to just 
> copy the index data directly. In Al's example, this results in an invalid 
> dictionary index being written (that it is somehow wrapped around when 
> reading again, rather than crashing, is potentially a second bug).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present

2020-10-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-10246.
-
Resolution: Duplicate

> [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when 
> duplicate values are present
> -
>
> Key: ARROW-10246
> URL: https://issues.apache.org/jira/browse/ARROW-10246
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Matt Jadczak
>Priority: Major
>
> Copying this from [the mailing 
> list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]
> We can observe the following odd behaviour when round-tripping data via 
> parquet using pyarrow, when the data contains dictionary arrays with 
> duplicate values.
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> my_table = pa.Table.from_batches(
>     [
>         pa.RecordBatch.from_arrays(
>             [
>                 pa.array([0, 1, 2, 3, 4]),
>                 pa.DictionaryArray.from_arrays(
>                     pa.array([0, 1, 2, 3, 4]),
>                     pa.array(['a', 'd', 'c', 'd', 'e'])
>                 )
>             ],
>             names=['foo', 'bar']
>         )
>     ]
> )
> my_table.validate(full=True)
> 
> pq.write_table(my_table, "foo.parquet")
> 
> read_table = pq.ParquetFile("foo.parquet").read()
> read_table.validate(full=True)
> 
> print(my_table.column(1).to_pylist())
> print(read_table.column(1).to_pylist())
> 
> assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> {code}
> Both tables pass full validation, yet the last three lines print:
> {code:java}
> ['a', 'd', 'c', 'd', 'e']
> ['a', 'd', 'c', 'e', 'a']
> Traceback (most recent call last):
>  File 
> "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 
> 29, in <module>
>  assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> AssertionError{code}
> Which clearly doesn't look right!
>  
> It seems to me that the reason this is happening is that when re-encoding an 
> Arrow dictionary as a Parquet one, the function at
> [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]
> is called to create a Parquet DictEncoder out of the Arrow dictionary data. 
> This internally uses a map from value to index, and this map is constructed 
> by continually calling GetOrInsert on a memo table. When called with 
> duplicate values as in Al's example, the duplicates do not cause a new 
> dictionary index to be allocated, but instead return the existing one (which 
> is just ignored). However, the caller assumes that the resulting Parquet 
> dictionary uses the exact same indices as the Arrow one, and proceeds to just 
> copy the index data directly. In Al's example, this results in an invalid 
> dictionary index being written (that it is somehow wrapped around when 
> reading again, rather than crashing, is potentially a second bug).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9952) [Python] Use pyarrow.dataset writing for pq.write_to_dataset

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9952:
--
Labels: pull-request-available  (was: )

> [Python] Use pyarrow.dataset writing for pq.write_to_dataset
> 
>
> Key: ARROW-9952
> URL: https://issues.apache.org/jira/browse/ARROW-9952
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that ARROW-9658 and ARROW-9893 are in, we can explore using the 
> {{pyarrow.dataset}} writing capabilities in {{parquet.write_to_dataset}}.
> Similarly to what was done in {{pq.read_table}}, we could initially have a 
> keyword to switch between both implementations, eventually defaulting to the 
> new datasets one, and deprecating the old (inefficient) Python implementation.
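> 
> A hypothetical sketch of what such a keyword could look like (the name 
> {{use_legacy_dataset}} mirrors {{pq.read_table}} and is an assumption here, 
> not a settled API):
> {code:python}
> import pyarrow.parquet as pq
> 
> # Hypothetical API sketch: route write_to_dataset through pyarrow.dataset.
> # `table` is any pyarrow Table; the keyword name is assumed, not final.
> pq.write_to_dataset(
>     table,
>     root_path="my_dataset",
>     partition_cols=["year"],
>     use_legacy_dataset=False,
> )
> {code}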



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10249) [Rust]: Support Dictionary types for ListArrays in arrow json reader

2020-10-09 Thread Mahmut Bulut (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahmut Bulut updated ARROW-10249:
-
Summary: [Rust]: Support Dictionary types for ListArrays in arrow json 
reader  (was: [Rust]: Support Dictionary types in arrow json reader)

> [Rust]: Support Dictionary types for ListArrays in arrow json reader
> 
>
> Key: ARROW-10249
> URL: https://issues.apache.org/jira/browse/ARROW-10249
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Mahmut Bulut
>Priority: Major
>
> Currently, dictionary types are not supported in Arrow JSON reader. It would 
> be nice to add dictionary type support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10249) [Rust]: Support Dictionary types for ListArrays in arrow json reader

2020-10-09 Thread Mahmut Bulut (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahmut Bulut updated ARROW-10249:
-
Description: Currently, dictionary types for ListArrays are not supported 
in Arrow JSON reader. It would be nice to add dictionary type support.  (was: 
Currently, dictionary types are not supported in Arrow JSON reader. It would be 
nice to add dictionary type support.)

> [Rust]: Support Dictionary types for ListArrays in arrow json reader
> 
>
> Key: ARROW-10249
> URL: https://issues.apache.org/jira/browse/ARROW-10249
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Mahmut Bulut
>Priority: Major
>
> Currently, dictionary types for ListArrays are not supported in the Arrow 
> JSON reader. It would be nice to add dictionary type support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10249) [Rust]: Support Dictionary types in arrow json reader

2020-10-09 Thread Mahmut Bulut (Jira)
Mahmut Bulut created ARROW-10249:


 Summary: [Rust]: Support Dictionary types in arrow json reader
 Key: ARROW-10249
 URL: https://issues.apache.org/jira/browse/ARROW-10249
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Mahmut Bulut


Currently, dictionary types are not supported in Arrow JSON reader. It would be 
nice to add dictionary type support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata

2020-10-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10248:
-

 Summary: [C++][Dataset] Dataset writing does not write schema 
metadata
 Key: ARROW-10248
 URL: https://issues.apache.org/jira/browse/ARROW-10248
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


Not sure if this is related to the writing refactor that landed yesterday, but 
`write_dataset` does not preserve the schema metadata (eg used for pandas 
metadata):

{code}
In [20]: df = pd.DataFrame({'a': [1, 2, 3]})

In [21]: table = pa.Table.from_pandas(df)

In [22]: table.schema
Out[22]: 
a: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 396

In [23]: ds.write_dataset(table, "test_write_dataset_pandas", format="parquet")

In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema
Out[24]: 
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
yet look into how easy it would be to fix.

cc [~bkietz]
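
For comparison, a minimal sketch (same table as above; file name arbitrary) 
suggesting the metadata is dropped by the dataset writer specifically, since 
writing the file directly keeps it:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Writing the same table directly (bypassing ds.write_dataset) keeps the
# pandas schema metadata, so the dataset writer appears to drop it.
table = pa.Table.from_pandas(pd.DataFrame({'a': [1, 2, 3]}))
pq.write_table(table, "test_write_table.parquet")
print(pq.read_table("test_write_table.parquet").schema.metadata)
# expected to still contain the b'pandas' entry
{code}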



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field

2020-10-09 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210866#comment-17210866
 ] 

Joris Van den Bossche commented on ARROW-10247:
---

cc [~bkietz]

> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -
>
> Key: ARROW-10247
> URL: https://issues.apache.org/jira/browse/ARROW-10247
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 2.0.0
>
>
> When the column to use for partitioning is dictionary encoded, we get this 
> error:
> {code}
> In [9]: import pyarrow.dataset as ds
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
> ...: table = pa.table([
> ...: pa.array(range(len(part))),
> ...: pa.array(part).dictionary_encode(),
> ...: ], names=['col', 'part'])
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ---
> ArrowTypeErrorTraceback (most recent call last)
>  in 
> > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, 
> base_dir, basename_template, format, partitioning, schema, filesystem, 
> file_options, use_threads)
> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> --> 775 filesystem, partitioning, file_options, use_threads,
> 776 )
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: scalar xxx (of type string) is invalid for part: 
> dictionary<values=string, indices=int32, ordered=0>
> In ../src/arrow/dataset/filter.cc, line 1082, code: 
> VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, 
> [&](const std::string& name, const std::shared_ptr& value) { auto&& 
> _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { 
> ::arrow::Status __s = 
> ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if 
> ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); 
> _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, 
> "(_error_or_value28).status()"); return _st; } } while (0); } while (false); 
> auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const 
> auto& field = schema_->field(match[0]); if 
> (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", 
> value->ToString(), " (of type ", *value->type, ") is invalid for ", 
> field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); 
> })
> In ../src/arrow/dataset/file_base.cc, line 321, code: 
> (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> This seems a quite normal use case, as this column will typically be 
> repeated many times (and we also support reading such a column back as 
> dictionary type, so a roundtrip is currently not possible in this case).
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
> yet look into how easy it would be to fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field

2020-10-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10247:
-

 Summary: [C++][Dataset] Cannot write dataset with dictionary 
column as partition field
 Key: ARROW-10247
 URL: https://issues.apache.org/jira/browse/ARROW-10247
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 2.0.0


When the column to use for partitioning is dictionary encoded, we get this 
error:

{code}
In [9]: import pyarrow.dataset as ds

In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
...: table = pa.table([
...: pa.array(range(len(part))),
...: pa.array(part).dictionary_encode(),
...: ], names=['col', 'part'])

In [11]: part = ds.partitioning(table.select(["part"]).schema)

In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
partitioning=part)
---
ArrowTypeErrorTraceback (most recent call last)
 in 
> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
partitioning=part)

~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, base_dir, 
basename_template, format, partitioning, schema, filesystem, file_options, 
use_threads)
773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
--> 775 filesystem, partitioning, file_options, use_threads,
776 )

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: scalar xxx (of type string) is invalid for part: 
dictionary<values=string, indices=int32, ordered=0>
In ../src/arrow/dataset/filter.cc, line 1082, code: 
VisitConjunctionMembers(*and_.left_operand(), visitor)
In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, [&](const 
std::string& name, const std::shared_ptr& value) { auto&& 
_error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { 
::arrow::Status __s = 
::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if 
((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); 
_st.AddContextLine("../src/arrow/dataset/partition.cc", 257, 
"(_error_or_value28).status()"); return _st; } } while (0); } while (false); 
auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const 
auto& field = schema_->field(match[0]); if 
(!value->type->Equals(field->type())) { return Status::TypeError("scalar ", 
value->ToString(), " (of type ", *value->type, ") is invalid for ", 
field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); })
In ../src/arrow/dataset/file_base.cc, line 321, code: 
(_error_or_value24).status()
In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
{code}

This seems a quite normal use case, as this column will typically be 
repeated many times (and we also support reading such a column back as 
dictionary type, so a roundtrip is currently not possible in this case).

I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
yet look into how easy it would be to fix.
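
A possible interim workaround (an untested sketch; {{table}} is the table from 
the session above) is to decode the dictionary column before writing, at the 
cost of losing the dictionary type on the roundtrip:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Sketch of a workaround: cast the dictionary column back to plain strings
# before writing, since partitioning on a string column works.
idx = table.schema.get_field_index("part")
plain = table.column("part").cast(pa.string())
table_plain = table.set_column(idx, "part", plain)

part = ds.partitioning(pa.schema([("part", pa.string())]))
ds.write_dataset(table_plain, "test_dataset_dict_part", format="parquet",
                 partitioning=part)
{code}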



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10114) [R] Segfault in to_dataframe_parallel with deeply nested structs

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10114:
---
Labels: pull-request-available  (was: )

> [R] Segfault in to_dataframe_parallel with deeply nested structs
> 
>
> Key: ARROW-10114
> URL: https://issues.apache.org/jira/browse/ARROW-10114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: > sessionInfo()
> R version 3.6.3 (2020-02-29)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Linux Mint 19.3
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
> locale:
>  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C  
>  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=sv_SE.UTF-8LC_MESSAGES=en_US.UTF-8   
>  [7] LC_PAPER=sv_SE.UTF-8   LC_NAME=C 
>  [9] LC_ADDRESS=C   LC_TELEPHONE=C
> [11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C   
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base 
> other attached packages:
> [1] arrow_1.0.1
> loaded via a namespace (and not attached):
>  [1] tidyselect_1.1.0 bit_4.0.4compiler_3.6.3   magrittr_1.5
>  [5] assertthat_0.2.1 R6_2.4.1 glue_1.4.1   Rcpp_1.0.5  
>  [9] bit64_4.0.2  vctrs_0.3.2  rlang_0.4.7  purrr_0.3.4 
>Reporter: Markus Skyttner
>Assignee: Romain Francois
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
> Attachments: Dockerfile, Makefile, reprex_10114.R
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A .jsonl file (newline separated JSON) created from open data available at 
> [ftp://ftp.libris.kb.se/pub/spa/swepub-deduplicated-2019-12-29.zip] is used 
> with the R package arrow (installed from CRAN) using the following statement:
> > arrow::read_json_arrow("~/.config/swepub/head.jsonl")
> It crashes RStudio with no error message. At the R prompt, the error message 
> is:
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>  SET_VECTOR_ELT() can only be applied to a 'list', not a 'integer'
> The file "head.jsonl" above was created from the extracted zip's .jsonl file 
> with the *nix "head -1 $BIG_JSONL_FILE" command. It can be parsed with 
> jsonlite and tidyjson.
> Also got this error message at one point:
> > arrow::read_json_arrow("head.jsonl", as_data_frame = TRUE)
> *** caught segfault ***
> address 0x8, cause 'memory not mapped'
> Traceback:
>  1: structure(x, extra_cols = colonnade[extra_cols], class = 
> "pillar_squeezed_colonnade")
>  2: new_colonnade_sqeezed(out, colonnade = x, extra_cols = extra_cols)
>  3: pillar::squeeze(x$mcf, width = width)
>  4: format.trunc_mat(mat)
>  5: format(mat)
>  6: format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
>  7: format(x, ..., n = n, width = width, n_extra = n_extra)
>  8: paste0(..., collapse = "\n")
>  9: cli::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
> 10: print.tbl(x)
> 11: (function (x, ...) UseMethod("print"))(x)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present

2020-10-09 Thread Matt Jadczak (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Jadczak updated ARROW-10246:
-
Component/s: Python
 C++

> [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when 
> duplicate values are present
> -
>
> Key: ARROW-10246
> URL: https://issues.apache.org/jira/browse/ARROW-10246
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Matt Jadczak
>Priority: Major
>
> Copying this from [the mailing 
> list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]
> We can observe the following odd behaviour when round-tripping data via 
> parquet using pyarrow, when the data contains dictionary arrays with 
> duplicate values.
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> my_table = pa.Table.from_batches(
>     [
>         pa.RecordBatch.from_arrays(
>             [
>                 pa.array([0, 1, 2, 3, 4]),
>                 pa.DictionaryArray.from_arrays(
>                     pa.array([0, 1, 2, 3, 4]),
>                     pa.array(['a', 'd', 'c', 'd', 'e'])
>                 )
>             ],
>             names=['foo', 'bar']
>         )
>     ]
> )
> my_table.validate(full=True)
> 
> pq.write_table(my_table, "foo.parquet")
> 
> read_table = pq.ParquetFile("foo.parquet").read()
> read_table.validate(full=True)
> 
> print(my_table.column(1).to_pylist())
> print(read_table.column(1).to_pylist())
> 
> assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> {code}
> Both tables pass full validation, yet the last three lines print:
> {code:java}
> ['a', 'd', 'c', 'd', 'e']
> ['a', 'd', 'c', 'e', 'a']
> Traceback (most recent call last):
>  File 
> "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 
> 29, in <module>
>  assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> AssertionError{code}
> Which clearly doesn't look right!
>  
> It seems to me that the reason this is happening is that when re-encoding an 
> Arrow dictionary as a Parquet one, the function at
> [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]
> is called to create a Parquet DictEncoder out of the Arrow dictionary data. 
> This internally uses a map from value to index, and this map is constructed 
> by continually calling GetOrInsert on a memo table. When called with 
> duplicate values as in Al's example, the duplicates do not cause a new 
> dictionary index to be allocated, but instead return the existing one (which 
> is just ignored). However, the caller assumes that the resulting Parquet 
> dictionary uses the exact same indices as the Arrow one, and proceeds to just 
> copy the index data directly. In Al's example, this results in an invalid 
> dictionary index being written (that it is somehow wrapped around when 
> reading again, rather than crashing, is potentially a second bug).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present

2020-10-09 Thread Matt Jadczak (Jira)
Matt Jadczak created ARROW-10246:


 Summary: [Python] Incorrect conversion of Arrow dictionary to 
Parquet dictionary when duplicate values are present
 Key: ARROW-10246
 URL: https://issues.apache.org/jira/browse/ARROW-10246
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Matt Jadczak


Copying this from [the mailing 
list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E]

We can observe the following odd behaviour when round-tripping data via parquet 
using pyarrow, when the data contains dictionary arrays with duplicate values.

 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

my_table = pa.Table.from_batches(
    [
        pa.RecordBatch.from_arrays(
            [
                pa.array([0, 1, 2, 3, 4]),
                pa.DictionaryArray.from_arrays(
                    pa.array([0, 1, 2, 3, 4]),
                    pa.array(['a', 'd', 'c', 'd', 'e'])
                )
            ],
            names=['foo', 'bar']
        )
    ]
)
my_table.validate(full=True)

pq.write_table(my_table, "foo.parquet")

read_table = pq.ParquetFile("foo.parquet").read()
read_table.validate(full=True)

print(my_table.column(1).to_pylist())
print(read_table.column(1).to_pylist())

assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
{code}
Both tables pass full validation, yet the last three lines print:


{code:java}
['a', 'd', 'c', 'd', 'e']
['a', 'd', 'c', 'e', 'a']
Traceback (most recent call last):
  File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
    assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
AssertionError
{code}
Which clearly doesn't look right!

 

It seems to me that the reason this is happening is that when re-encoding an 
Arrow dictionary as a Parquet one, the function at

[https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773]

is called to create a Parquet DictEncoder out of the Arrow dictionary data. 
This internally uses a map from value to index, and this map is constructed by 
continually calling GetOrInsert on a memo table. When called with duplicate 
values as in Al's example, the duplicates do not cause a new dictionary index 
to be allocated, but instead return the existing one (which is just ignored). 
However, the caller assumes that the resulting Parquet dictionary uses the 
exact same indices as the Arrow one, and proceeds to just copy the index data 
directly. In Al's example, this results in an invalid dictionary index being 
written (that it is somehow wrapped around when reading again, rather than 
crashing, is potentially a second bug).
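
To make the index mismatch concrete, here is a minimal Python model of the 
encoding step described above. It is illustrative only (plain dicts and lists, 
not pyarrow internals), and it reproduces the corrupted output, including the 
wrap-around of the out-of-range index:

{code:python}
# Simplified model of the re-encoding step; all names are illustrative.
arrow_dict = ['a', 'd', 'c', 'd', 'e']   # contains a duplicate 'd'
arrow_indices = [0, 1, 2, 3, 4]

memo = {}    # value -> parquet dictionary index, as GetOrInsert builds it
remap = []   # arrow index -> parquet index (what the current caller drops)
for value in arrow_dict:
    remap.append(memo.setdefault(value, len(memo)))

parquet_dict = list(memo)                # ['a', 'd', 'c', 'e']

# Buggy behaviour: copy the Arrow indices verbatim; index 4 is now out of
# range for the 4-entry dictionary and wraps around.
buggy = [parquet_dict[i % len(parquet_dict)] for i in arrow_indices]
# Correct behaviour: rewrite each index through the memo-table mapping.
fixed = [parquet_dict[remap[i]] for i in arrow_indices]

print(buggy)  # ['a', 'd', 'c', 'e', 'a']
print(fixed)  # ['a', 'd', 'c', 'd', 'e']
{code}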



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9695) [Rust][DataFusion] Improve documentation on LogicalPlan variants

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-9695:
---

Assignee: Andrew Lamb

> [Rust][DataFusion] Improve documentation on LogicalPlan variants
> 
>
> Key: ARROW-9695
> URL: https://issues.apache.org/jira/browse/ARROW-9695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I think we could improve the documentation somewhat on LogicalPlan nodes. I 
> will submit a PR with a proposal. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9733) [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-9733:
---

Assignee: Jorge Leitão

> [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
> -
>
> Key: ARROW-9733
> URL: https://issues.apache.org/jira/browse/ARROW-9733
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: repro.csv
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> h2. Reproducer:
> Create a table with a string column:
> {code}
> CREATE EXTERNAL TABLE repro(a INT, b VARCHAR)
> STORED AS CSV
> WITH HEADER ROW
> LOCATION 'repro.csv';
> {code}
> The contents of repro.csv are as follows (also attached):
> {code}
> a,b
> 1,One
> 1,Two
> 2,One
> 2,Two
> 2,Two
> {code}
> Now, run a query that tries to aggregate that column:
> {code}
> select a, count(b) from repro group by a;
> {code}
> *Actual Behavior*:
> {code}
> > select a, count(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> {code}
> *Expected Behavior*:
> The query runs and produces results
> {code}
> a, count(b)
> 1,2
> 2,3
> {code}
> h2. Discussion
> Using Min/Max aggregates on varchar also doesn't work (but should):
> {code}
> > select a, min(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> > select a, max(b) from repro group by a;
> ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for 
> result of aggregate expression")))
> {code}
> Fascinatingly, these formulations work fine:
> {code}
> > select a, count(a) from repro group by a;
> +---+--+
> | a | count(a) |
> +---+--+
> | 2 | 3|
> | 1 | 2|
> +---+--+
> 2 row in set. Query took 0 seconds.
> > select a, count(1) from repro group by a;
> +---+-+
> | a | count(UInt8(1)) |
> +---+-+
> | 2 | 3   |
> | 1 | 2   |
> +---+-+
> 2 row in set. Query took 0 seconds.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9759) [Rust] [DataFusion] Implement DataFrame::sort

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-9759:
---

Assignee: Andy Grove

> [Rust] [DataFusion] Implement DataFrame::sort
> -
>
> Key: ARROW-9759
> URL: https://issues.apache.org/jira/browse/ARROW-9759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Implement DataFrame::sort



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9742) [Rust] Create one standard DataFrame API

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-9742:
---

Assignee: Andy Grove

> [Rust] Create one standard DataFrame API
> 
>
> Key: ARROW-9742
> URL: https://issues.apache.org/jira/browse/ARROW-9742
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> There was a discussion in the last Arrow sync call about the fact that there
> are numerous Rust DataFrame projects, and that it would be good to have one
> standard in the Arrow repo.
> I do think it would be good to have a DataFrame trait in Arrow, with an
> implementation in DataFusion, making it possible for other projects to
> extend/replace the implementation, e.g. for distributed compute or for GPU
> compute, as two examples.
> [~jhorstmann] Does this capture what you were suggesting in the call?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10231) [CI] Unable to download minio in arm32v7 docker image

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-10231:
-
Component/s: CI

> [CI] Unable to download minio in arm32v7 docker image
> -
>
> Key: ARROW-10231
> URL: https://issues.apache.org/jira/browse/ARROW-10231
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See build log https://github.com/apache/arrow/runs/1224947766#step:5:2021



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9760) [Rust] [DataFusion] Implement DataFrame::explain

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-9760:
---

Assignee: Jorge Leitão

> [Rust] [DataFusion] Implement DataFrame::explain
> 
>
> Key: ARROW-9760
> URL: https://issues.apache.org/jira/browse/ARROW-9760
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Affects Versions: 2.0.0
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Implement DataFrame::explain - we already have explain implemented in the SQL 
> API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9793) [Rust] [DataFusion] Tests failing in master

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-9793:
---

Assignee: Jorge Leitão

> [Rust] [DataFusion] Tests failing in master
> ---
>
> Key: ARROW-9793
> URL: https://issues.apache.org/jira/browse/ARROW-9793
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10245) [CI] Update the conda docker images to use miniforge instead of miniconda

2020-10-09 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10245:
---

 Summary: [CI] Update the conda docker images to use miniforge 
instead of miniconda
 Key: ARROW-10245
 URL: https://issues.apache.org/jira/browse/ARROW-10245
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs


So we can support more architectures: https://github.com/conda-forge/miniforge

cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9518) [Python] Deprecate pyarrow serialization

2020-10-09 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-9518.

Resolution: Fixed

Issue resolved by pull request 8255
[https://github.com/apache/arrow/pull/8255]

> [Python] Deprecate pyarrow serialization
> 
>
> Key: ARROW-9518
> URL: https://issues.apache.org/jira/browse/ARROW-9518
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available, pyarrow-serialization
> Fix For: 2.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Per mailing list discussion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10216) [Rust] Simd implementation of min/max aggregation kernels for primitive types

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann reassigned ARROW-10216:
--

Assignee: Jörn Horstmann

> [Rust] Simd implementation of min/max aggregation kernels for primitive types
> -
>
> Key: ARROW-10216
> URL: https://issues.apache.org/jira/browse/ARROW-10216
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Major
>
> Using a similar approach to the sum kernel (ARROW-10015). Instead of
> initializing the accumulator with 0, we'd need the largest/smallest possible
> value for each ArrowNumericType (e.g. u64::MAX or +/-Inf).
> Pseudo code for the min aggregation:
> {code}
> // initialize accumulator
> min_acc = +Inf
> // aggregate each chunk
> min_acc = min(min_acc, select(valid, value, +Inf))
> {code}
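
A runnable illustration of the pseudo code above, written with plain numpy
rather than the proposed Rust SIMD kernel (the select-with-identity idea is
the same):

{code:python}
import numpy as np

# Lanes whose validity bit is unset are replaced with +Inf, the identity
# element for min, so they cannot affect the running minimum.
values = np.array([3.0, 1.0, 7.0, 2.0])
valid = np.array([True, False, True, True])   # the 1.0 is null

min_acc = np.inf                              # initialize accumulator
chunk = np.where(valid, values, np.inf)       # select(valid, value, +Inf)
min_acc = min(min_acc, chunk.min())           # aggregate the chunk

print(min_acc)  # 2.0 -- the masked-out 1.0 is ignored
{code}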



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9956) Implement Binary string function in Gandiva

2020-10-09 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-9956.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8201
[https://github.com/apache/arrow/pull/8201]

> Implement Binary string function in Gandiva
> ---
>
> Key: ARROW-9956
> URL: https://issues.apache.org/jira/browse/ARROW-9956
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Naman Udasi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Implementation of the new binary_string function in Gandiva.
> The function takes in a normal string or a hexadecimal string (e.g.
> _\x41\x20\x42\x20\x43_) and converts it to VARBINARY (a byte array).
> It is generally used with CAST functions.
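
For intuition, a small Python sketch of the hexadecimal-to-bytes conversion
the function performs (illustrative only; the actual implementation is C++
inside Gandiva):

{code:python}
# Decode the hexadecimal form from the description into raw bytes.
s = r"\x41\x20\x42\x20\x43"
raw = bytes(int(token, 16) for token in s.split(r"\x") if token)
print(raw)  # b'A B C'
{code}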



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10244:
---
Labels: pull-request-available  (was: )

> [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset
> 
>
> Key: ARROW-10244
> URL: https://issues.apache.org/jira/browse/ARROW-10244
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset

2020-10-09 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-10244:
-

Assignee: Joris Van den Bossche

> [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset
> 
>
> Key: ARROW-10244
> URL: https://issues.apache.org/jira/browse/ARROW-10244
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset

2020-10-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10244:
-

 Summary: [Python][Docs] Add docs on using 
pyarrow.dataset.parquet_dataset
 Key: ARROW-10244
 URL: https://issues.apache.org/jira/browse/ARROW-10244
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 2.0.0
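
For context, the kind of usage the requested docs would cover; a hedged
sketch with a hypothetical path:

{code:python}
import pyarrow.dataset as ds

# parquet_dataset() builds a Dataset from a Parquet "_metadata" summary
# file instead of discovering fragments by crawling the directory.
dataset = ds.parquet_dataset("/data/mydataset/_metadata")
table = dataset.to_table()
{code}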






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann updated ARROW-10243:
---
Description: 
While benchmarking the TPC-H query I noticed that the physical literal
expression takes up a sizable amount of time. I think the creation of the
corresponding array for numeric literals can be sped up by creating Buffer and
ArrayData directly, without going through a builder. That would also make it
possible to skip building a null bitmap for non-null literals.

I'm also wondering whether it might be possible to cache the created array. For
queries without a WHERE clause, I'd expect all batches except the last to have
the same length. I'm not sure, though, where to store the cached value.

Another possible optimization could be to cast literals already on the logical
plan side. In the TPC-H query the literal `1` is of type `u64` in the logical
plan and then needs to be processed by a cast kernel to convert it to `f64` for
use in an arithmetic expression.

The attached flamegraph is of 10 runs of the TPC-H query, with the data being
loaded into memory before running the queries (see ARROW-10240).

{code}
flamegraph ./target/release/tpch --iterations 10 --path ../tpch-dbgen --format 
tbl --query 1 --batch-size 4096 -c1 --load
{code}

  was:
While benchmarking the TPC-H query I noticed that the physical literal
expression takes up a sizable amount of time. I think the creation of the
corresponding array for numeric literals can be sped up by creating Buffer and
ArrayData directly, without going through a builder. That would also make it
possible to skip building a null bitmap for non-null literals.

I'm also wondering whether it might be possible to cache the created array. For
queries without a WHERE clause, I'd expect all batches except the last to have
the same length. I'm not sure, though, where to store the cached value.

Another possible optimization could be to cast literals already on the logical
plan side. In the TPC-H query the literal `1` is of type `u64` in the logical
plan and then needs to be processed by a cast kernel to convert it to `f64` for
use in an arithmetic expression.

The attached flamegraph is of 10 runs of the TPC-H query, with the data being
loaded into memory before running the queries (see ARROW-10240).


> [Rust] [Datafusion] Optimize literal expression evaluation
> --
>
> Key: ARROW-10243
> URL: https://issues.apache.org/jira/browse/ARROW-10243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Priority: Major
> Attachments: flamegraph.svg
>
>
> While benchmarking the TPC-H query I noticed that the physical literal
> expression takes up a sizable amount of time. I think the creation of the
> corresponding array for numeric literals can be sped up by creating Buffer
> and ArrayData directly, without going through a builder. That would also make
> it possible to skip building a null bitmap for non-null literals.
> I'm also wondering whether it might be possible to cache the created array.
> For queries without a WHERE clause, I'd expect all batches except the last to
> have the same length. I'm not sure, though, where to store the cached value.
> Another possible optimization could be to cast literals already on the
> logical plan side. In the TPC-H query the literal `1` is of type `u64` in the
> logical plan and then needs to be processed by a cast kernel to convert it to
> `f64` for use in an arithmetic expression.
> The attached flamegraph is of 10 runs of the TPC-H query, with the data being
> loaded into memory before running the queries (see ARROW-10240).
> {code}
> flamegraph ./target/release/tpch --iterations 10 --path ../tpch-dbgen 
> --format tbl --query 1 --batch-size 4096 -c1 --load
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann updated ARROW-10243:
---
Description: 
While benchmarking the TPC-H query I noticed that the physical literal
expression takes up a sizable amount of time. I think the creation of the
corresponding array for numeric literals can be sped up by creating Buffer and
ArrayData directly, without going through a builder. That would also make it
possible to skip building a null bitmap for non-null literals.

I'm also wondering whether it might be possible to cache the created array. For
queries without a WHERE clause, I'd expect all batches except the last to have
the same length. I'm not sure, though, where to store the cached value.

Another possible optimization could be to cast literals already on the logical
plan side. In the TPC-H query the literal `1` is of type `u64` in the logical
plan and then needs to be processed by a cast kernel to convert it to `f64` for
use in an arithmetic expression.

The attached flamegraph is of 10 runs of the TPC-H query, with the data being
loaded into memory before running the queries (see ARROW-10240).

  was:
While benchmarking the TPC-H query I noticed that the physical literal
expression takes up a sizable amount of time. I think the creation of the
corresponding array for numeric literals can be sped up by creating Buffer and
ArrayData directly, without going through a builder. That would also make it
possible to skip building a null bitmap for non-null literals.

I'm also wondering whether it might be possible to cache the created array. For
queries without a WHERE clause, I'd expect all batches except the last to have
the same length. I'm not sure, though, where to store the cached value.

Another possible optimization could be to cast literals already on the logical
plan side. In the TPC-H query the literal `1` is of type `u64` in the logical
plan and then needs to be processed by a cast kernel to convert it to `f64` for
use in an arithmetic expression.


> [Rust] [Datafusion] Optimize literal expression evaluation
> --
>
> Key: ARROW-10243
> URL: https://issues.apache.org/jira/browse/ARROW-10243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Priority: Major
> Attachments: flamegraph.svg
>
>
> While benchmarking the TPC-H query I noticed that the physical literal
> expression takes up a sizable amount of time. I think the creation of the
> corresponding array for numeric literals can be sped up by creating Buffer
> and ArrayData directly, without going through a builder. That would also make
> it possible to skip building a null bitmap for non-null literals.
> I'm also wondering whether it might be possible to cache the created array.
> For queries without a WHERE clause, I'd expect all batches except the last to
> have the same length. I'm not sure, though, where to store the cached value.
> Another possible optimization could be to cast literals already on the
> logical plan side. In the TPC-H query the literal `1` is of type `u64` in the
> logical plan and then needs to be processed by a cast kernel to convert it to
> `f64` for use in an arithmetic expression.
> The attached flamegraph is of 10 runs of the TPC-H query, with the data being
> loaded into memory before running the queries (see ARROW-10240).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann updated ARROW-10243:
---
Attachment: flamegraph.svg

> [Rust] [Datafusion] Optimize literal expression evaluation
> --
>
> Key: ARROW-10243
> URL: https://issues.apache.org/jira/browse/ARROW-10243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Priority: Major
> Attachments: flamegraph.svg
>
>
> While benchmarking the TPC-H query I noticed that the physical literal
> expression takes up a sizable amount of time. I think the creation of the
> corresponding array for numeric literals can be sped up by creating Buffer
> and ArrayData directly, without going through a builder. That would also make
> it possible to skip building a null bitmap for non-null literals.
> I'm also wondering whether it might be possible to cache the created array.
> For queries without a WHERE clause, I'd expect all batches except the last to
> have the same length. I'm not sure, though, where to store the cached value.
> Another possible optimization could be to cast literals already on the
> logical plan side. In the TPC-H query the literal `1` is of type `u64` in the
> logical plan and then needs to be processed by a cast kernel to convert it to
> `f64` for use in an arithmetic expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation

2020-10-09 Thread Jira
Jörn Horstmann created ARROW-10243:
--

 Summary: [Rust] [Datafusion] Optimize literal expression evaluation
 Key: ARROW-10243
 URL: https://issues.apache.org/jira/browse/ARROW-10243
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jörn Horstmann


While benchmarking the TPC-H query I noticed that the physical literal
expression takes up a sizable amount of time. I think the creation of the
corresponding array for numeric literals can be sped up by creating Buffer and
ArrayData directly, without going through a builder. That would also make it
possible to skip building a null bitmap for non-null literals.

I'm also wondering whether it might be possible to cache the created array. For
queries without a WHERE clause, I'd expect all batches except the last to have
the same length. I'm not sure, though, where to store the cached value.

Another possible optimization could be to cast literals already on the logical
plan side. In the TPC-H query the literal `1` is of type `u64` in the logical
plan and then needs to be processed by a cast kernel to convert it to `f64` for
use in an arithmetic expression.
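
A sketch of the "cast at plan time" idea, using pyarrow for illustration (the
ticket concerns DataFusion's Rust planner, so this is only an analogy): casting
the scalar literal once at planning time replaces a per-batch cast kernel over
a constant array.

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

lit_u64 = pa.scalar(1, type=pa.uint64())
lit_f64 = pc.cast(lit_u64, pa.float64())  # done once, at planning time
print(lit_f64)  # 1.0
{code}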



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-09 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210700#comment-17210700
 ] 

Jörn Horstmann commented on ARROW-10240:


Hi [~andygrove], I was already working on this and should have assigned the
ticket to myself directly.

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The tpch benchmark runtime seems to be dominated by csv parsing code and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations it should be easier to profile
> and find bottlenecks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10240:
---
Labels: pull-request-available  (was: )

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The tpch benchmark runtime seems to be dominated by csv parsing code and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations it should be easier to profile
> and find bottlenecks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9879) [Python] ChunkedArray.__getitem__ doesn't work with numpy scalars

2020-10-09 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-9879.

Resolution: Fixed

Issue resolved by pull request 8072
[https://github.com/apache/arrow/pull/8072]

> [Python] ChunkedArray.__getitem__ doesn't work with numpy scalars
> -
>
> Key: ARROW-9879
> URL: https://issues.apache.org/jira/browse/ARROW-9879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
>  
> {code:java}
> import pyarrow as pa
> import numpy as np
> pa.chunked_array(pa.array([1,2]))[np.int32(0)]
> {code}
> fails with the error {{TypeError: key must either be a slice or integer}}.
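
Until the fix, coercing the numpy scalar to a plain Python int works around
the error; a minimal user-side sketch:

{code:python}
import numpy as np
import pyarrow as pa

arr = pa.chunked_array([pa.array([1, 2])])
print(arr[int(np.int32(0))])  # coerce the numpy scalar before indexing
{code}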



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann reassigned ARROW-10240:
--

Assignee: Jörn Horstmann

> [Rust] [Datafusion] Optionally load tpch data into memory before running 
> benchmark query
> 
>
> Key: ARROW-10240
> URL: https://issues.apache.org/jira/browse/ARROW-10240
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>
> The tpch benchmark runtime seems to be dominated by csv parsing code and it 
> is really difficult to see any performance hotspots related to actual query 
> execution in a flamegraph.
> With the data in memory and more iterations it should be easier to profile
> and find bottlenecks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1614) [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType

2020-10-09 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210634#comment-17210634
 ] 

Bryan Cutler commented on ARROW-1614:
-

[~rokm] For our purposes it wasn't necessary to use pyarrow.Tensor, but there
are some limitations with it currently, so maybe there are some trade-offs.
Please go ahead and start if you like, and I'd be happy to help review and
discuss further.

> [C++] Add a Tensor logical value type with constant dimensions, implemented 
> using ExtensionType
> ---
>
> Key: ARROW-1614
> URL: https://issues.apache.org/jira/browse/ARROW-1614
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Format
>Reporter: Wes McKinney
>Priority: Major
>
> In an Arrow table, we would like to add support for a column that has values 
> cells each containing a tensor value, with all tensors having the same 
> dimensions. These would be stored as a binary value, plus some metadata to 
> store type and shape/strides.
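
A hedged Python sketch of the idea (the ticket asks for a C++ ExtensionType;
this uses the Python extension-type hooks only to make the design concrete,
and the type name "example.fixed_shape_tensor" is made up):

{code:python}
import json
import pyarrow as pa

class FixedShapeTensorType(pa.ExtensionType):
    """Each cell stores one flattened tensor; the constant shape lives in
    the type-level metadata rather than per value."""

    def __init__(self, value_type, shape):
        self.shape = tuple(shape)
        size = 1
        for dim in self.shape:
            size *= dim
        storage = pa.list_(value_type, size)  # fixed-size list per cell
        pa.ExtensionType.__init__(self, storage, "example.fixed_shape_tensor")

    def __arrow_ext_serialize__(self):
        return json.dumps({"shape": list(self.shape)}).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        shape = json.loads(serialized.decode())["shape"]
        return cls(storage_type.value_type, shape)

tensor_cells = FixedShapeTensorType(pa.float32(), (2, 2))
{code}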



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef

2020-10-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-10215:


Assignee: Jorge Leitão

> [Rust] [DataFusion] Rename "Source" typedef
> ---
>
> Key: ARROW-10215
> URL: https://issues.apache.org/jira/browse/ARROW-10215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The name "Source" for this type doesn't make sense to me. I would like to 
> discuss alternate names for it.
> {code:java}
> type Source = Box<dyn RecordBatchReader + Send>; {code}
> My first thoughts are:
>  * RecordBatchIterator
>  * RecordBatchStream
>  * SendableRecordBatchReader



--
This message was sent by Atlassian Jira
(v8.3.4#803005)