[jira] [Created] (ARROW-10140) No data for map column of a parquet file created from pyarrow and pandas
Chen Ming created ARROW-10140:
---------------------------------

             Summary: No data for map column of a parquet file created from pyarrow and pandas
                 Key: ARROW-10140
                 URL: https://issues.apache.org/jira/browse/ARROW-10140
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
            Reporter: Chen Ming
         Attachments: test_map.py

Hi, I'm having problems reading parquet files with a 'map' column created by pyarrow. I followed https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries to convert a pandas DataFrame to an Arrow table, then called write_table to output a parquet file (we also referred to https://issues.apache.org/jira/browse/ARROW-9812):

{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df = pd.DataFrame({
    'col1': pd.Series([
        [('id', 'something'), ('value2', 'else')],
        [('id', 'something2'), ('value', 'else2')],
    ]),
    'col2': pd.Series(['foo', 'bar'])
})

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema)
pq.write_table(table, './test_map.parquet')
{code}

The above code (attached as test_map.py) runs smoothly on my development machine:

{code:java}
PyArrow Version = 1.0.1
Pandas Version = 1.1.2
{code}

and generates the test_map.parquet file (attached as test_map.parquet) successfully.
Then I used parquet-tools (1.11.1) to read the file, but got the following output:

{code:java}
$ java -jar parquet-tools-1.11.1.jar head test_map.parquet
col1:
.key_value:
.key_value:
col2 = foo
col1:
.key_value:
.key_value:
col2 = bar
{code}

I also checked the schema of the parquet file:

{code:java}
$ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
message schema {
  optional group col1 (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
  optional binary col2 (STRING);
}
{code}

Am I doing something wrong? We need to write the data to parquet files and query them later.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-10141) [Rust][Arrow] Improve performance of filter kernel
Andrew Lamb created ARROW-10141:
-----------------------------------

             Summary: [Rust][Arrow] Improve performance of filter kernel
                 Key: ARROW-10141
                 URL: https://issues.apache.org/jira/browse/ARROW-10141
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Andrew Lamb

As [~jorgecarleitao] noted here: https://github.com/apache/arrow/pull/8303#issuecomment-701328143

The performance of the filter kernel (and likely others) could be improved by avoiding intermediate copies. The code currently:

# creates a Vec<Option<T>> through an iteration
# copies the Vec<Option<T>> to the two buffers (when from_opt_vec is called)

It may be more efficient to create the buffers during the iteration, so that we avoid the copy (Vec -> buffers). In other words, the code in from_opt_vec could be "injected" into the filter execution, where the MutableBuffer and the offsets and values buffers are created before the loop, and new elements are written directly to them.

(As a side note, this is why he proposed ARROW-10030, https://github.com/apache/arrow/pull/8211: IMO there is some boilerplate copy-pasting to
* initialize buffers
* iterate
* create ArrayData from buffers

which will continue to grow as we add more kernels, and whose pattern seems to be a FromIter of fixed size.)
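The pattern being proposed can be illustrated in Python (a language-neutral sketch, not Arrow's actual Rust code): instead of collecting surviving values into an intermediate list and copying it into the output afterwards, size the output buffer up front and write each surviving element straight into it during a single iteration:

```python
import array

values = array.array('q', range(10))     # int64 input values
mask = [v % 2 == 0 for v in values]      # the filter's boolean mask

# Copy-free pattern: preallocate the output buffer at its final size...
out = array.array('q', bytes(8 * sum(mask)))

# ...then write survivors directly into it, no intermediate list.
i = 0
for v, keep in zip(values, mask):
    if keep:
        out[i] = v
        i += 1

print(list(out))  # [0, 2, 4, 6, 8]
```

This is the same idea as creating the MutableBuffer before the loop in the Rust kernel: the second pass over an intermediate Vec disappears entirely.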
[jira] [Created] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading
Joris Van den Bossche created ARROW-10145:
---------------------------------------------

             Summary: [C++][Dataset] Integer-like partition field values outside int32 range error on reading
                 Key: ARROW-10145
                 URL: https://issues.apache.org/jira/browse/ARROW-10145
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche

From https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset

Small reproducer:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'part': [3760212050]*10, 'col': range(10)})
pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])

In [35]: pq.read_table("test_int64_partition/")
...
ArrowInvalid: error parsing '3760212050' as scalar of type int32
In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
In ../src/arrow/dataset/partition.cc, line 218, code: (_error_or_value26).status()
In ../src/arrow/dataset/partition.cc, line 229, code: (_error_or_value27).status()
In ../src/arrow/dataset/discovery.cc, line 256, code: (_error_or_value17).status()

In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
Out[36]:
pyarrow.Table
col: int64
part: dictionary
{code}
[jira] [Created] (ARROW-10148) [Rust] Add documentation to lib.rs
Jorge Leitão created ARROW-10148:
------------------------------------

             Summary: [Rust] Add documentation to lib.rs
                 Key: ARROW-10148
                 URL: https://issues.apache.org/jira/browse/ARROW-10148
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust
            Reporter: Jorge Leitão
            Assignee: Jorge Leitão
             Fix For: 2.0.0

Currently, the crate page looks rather empty. This issue aims to move the documentation from the README to the crate docs, so that it reaches a broader audience and follows Rust best practices.
[jira] [Created] (ARROW-10146) [Python] Parquet metadata to_dict raises attribute error
Florian Jetter created ARROW-10146:
--------------------------------------

             Summary: [Python] Parquet metadata to_dict raises attribute error
                 Key: ARROW-10146
                 URL: https://issues.apache.org/jira/browse/ARROW-10146
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Florian Jetter

h2. Description

When accessing row group metadata and trying to convert it to a dict using the method {{to_dict}}, I sometimes receive an AttributeError. This can be consistently reproduced with an empty dataframe (see example below), but I have also seen it for non-empty dataframes. I couldn't track down what makes the non-empty cases special, hence the example below.

h2. Expected behaviour

I would expect to_dict to always return a dictionary with the appropriate metadata and statistics, regardless of the file content.

h2. Minimal Example

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": [1]}).head(0)
table = pa.Table.from_pandas(df)

buf = pa.BufferOutputStream()
pq.write_table(table, buf)
reader = pa.BufferReader(buf.getvalue())
parquet_file = pq.ParquetFile(reader)

# Raises AttributeError
parquet_file.metadata.to_dict()
{code}

h3. Traceback

{code:java}
~/miniconda3/envs/kartothek-dev/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.FileMetaData.to_dict()

~/miniconda3/envs/kartothek-dev/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.RowGroupMetaData.to_dict()

~/miniconda3/envs/kartothek-dev/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ColumnChunkMetaData.to_dict()

AttributeError: 'NoneType' object has no attribute 'to_dict'
{code}

h3. Versions

{code:java}
In [28]: pa.__version__
Out[28]: '1.0.1'

In [29]: pd.__version__
Out[29]: '1.0.5'
{code}
[jira] [Created] (ARROW-10147) [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default
Wes McKinney created ARROW-10147:
------------------------------------

             Summary: [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default
                 Key: ARROW-10147
                 URL: https://issues.apache.org/jira/browse/ARROW-10147
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Wes McKinney
             Fix For: 2.0.0

Originally reported in https://github.com/apache/arrow/issues/8270. Here's a minimal reproduction:

{code}
In [24]: idx = pd.RangeIndex(0, 4, name=np.int64(6))

In [25]: df = pd.DataFrame(index=idx)

In [26]: pa.table(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in
----> 1 pa.table(df)

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.table()

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/code/arrow/python/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    604     pandas_metadata = construct_metadata(df, column_names, index_columns,
    605                                          index_descriptors, preserve_index,
--> 606                                          types)
    607     metadata = deepcopy(schema.metadata) if schema.metadata else dict()
    608     metadata.update(pandas_metadata)

~/code/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, column_names, index_levels, index_descriptors, preserve_index, types)
    243             'version': pa.__version__
    244         },
--> 245         'pandas_version': _pandas_api.version
    246     }).encode('utf8')
    247 }

~/miniconda/envs/arrow-3.7/lib/python3.7/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    229         cls is None and indent is None and separators is None and
    230         default is None and not sort_keys and not kw):
--> 231         return _default_encoder.encode(obj)
    232     if cls is None:
    233         cls = JSONEncoder

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in iterencode(self, o, _one_shot)
    255             self.key_separator, self.item_separator, self.sort_keys,
    256             self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in default(self, o)
    177
    178         """
--> 179         raise TypeError(f'Object of type {o.__class__.__name__} '
    180                         f'is not JSON serializable')
    181

TypeError: Object of type int64 is not JSON serializable
{code}
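The sort of fix involved can be sketched with json.dumps's `default` hook, which lets the encoder coerce NumPy integer scalars to native ints before encoding (a generic sketch, not the actual patch):

```python
import json

import numpy as np

def np_default(obj):
    # Coerce NumPy integer scalars to plain Python ints
    if isinstance(obj, np.integer):
        return int(obj)
    raise TypeError(f'Object of type {type(obj).__name__} '
                    f'is not JSON serializable')

# np.int64(6) as an index name would otherwise raise, as in the traceback
print(json.dumps({'name': np.int64(6)}, default=np_default))  # {"name": 6}
```

An equivalent fix inside construct_metadata would be to normalize index names to native Python types before they reach json.dumps.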
[jira] [Created] (ARROW-10144) Add support for using the TLS_SNI extension
James Duong created ARROW-10144:
-----------------------------------

             Summary: Add support for using the TLS_SNI extension
                 Key: ARROW-10144
                 URL: https://issues.apache.org/jira/browse/ARROW-10144
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, FlightRPC, Java, Python
            Reporter: James Duong

When using encryption, add support for the TLS_SNI extension (https://en.wikipedia.org/wiki/Server_Name_Indication).
[jira] [Created] (ARROW-10142) [C++] RecordBatchStreamReader should use StreamDecoder
Antoine Pitrou created ARROW-10142:
--------------------------------------

             Summary: [C++] RecordBatchStreamReader should use StreamDecoder
                 Key: ARROW-10142
                 URL: https://issues.apache.org/jira/browse/ARROW-10142
             Project: Apache Arrow
          Issue Type: Wish
          Components: C++
            Reporter: Antoine Pitrou
             Fix For: 3.0.0

There's no reason to duplicate some of the stream reading logic, and re-using StreamDecoder would ensure that the behaviour of both classes matches.
[jira] [Created] (ARROW-10143) [C++] ArrayRangeEquals should accept EqualOptions
Antoine Pitrou created ARROW-10143:
--------------------------------------

             Summary: [C++] ArrayRangeEquals should accept EqualOptions
                 Key: ARROW-10143
                 URL: https://issues.apache.org/jira/browse/ARROW-10143
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Antoine Pitrou
             Fix For: 3.0.0

Besides, the underlying implementations of ArrayEquals and ArrayRangeEquals should be shared (right now they are duplicated).