[jira] [Created] (ARROW-10140) No data for map column of a parquet file created from pyarrow and pandas
Chen Ming created ARROW-10140:
---------------------------------

             Summary: No data for map column of a parquet file created from pyarrow and pandas
                 Key: ARROW-10140
                 URL: https://issues.apache.org/jira/browse/ARROW-10140
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
            Reporter: Chen Ming
         Attachments: test_map.py

Hi, I'm having problems reading parquet files with a 'map' column created by pyarrow. I followed https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries to convert a pandas DataFrame to an Arrow table, then called write_table to output a parquet file (we also referred to https://issues.apache.org/jira/browse/ARROW-9812):

{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df = pd.DataFrame({
    'col1': pd.Series([
        [('id', 'something'), ('value2', 'else')],
        [('id', 'something2'), ('value', 'else2')],
    ]),
    'col2': pd.Series(['foo', 'bar'])
})

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema)
pq.write_table(table, './test_map.parquet')
{code}

The above code (attached as test_map.py) runs smoothly on my development machine:

{code:java}
PyArrow Version = 1.0.1
Pandas Version = 1.1.2
{code}

and generates the test_map.parquet file (attached as test_map.parquet) successfully.
Then I used parquet-tools (1.11.1) to read the file, but got the following output:

{code:java}
$ java -jar parquet-tools-1.11.1.jar head test_map.parquet
col1:
.key_value:
.key_value:
col2 = foo
col1:
.key_value:
.key_value:
col2 = bar
{code}

I also checked the schema of the parquet file:

{code:java}
$ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
message schema {
  optional group col1 (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
  optional binary col2 (STRING);
}
{code}

Am I doing something wrong? We need to write the data to parquet files and query them later.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-10141) [Rust][Arrow] Improve performance of filter kernel
Andrew Lamb created ARROW-10141:
-----------------------------------

             Summary: [Rust][Arrow] Improve performance of filter kernel
                 Key: ARROW-10141
                 URL: https://issues.apache.org/jira/browse/ARROW-10141
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Andrew Lamb

As [~jorgecarleitao] noted here: https://github.com/apache/arrow/pull/8303#issuecomment-701328143

The performance of the filter kernel (and likely others) could be improved by avoiding intermediate copies. The code currently:

# creates a Vec<Option<T>> through an iteration
# copies the Vec<Option<T>> to the two buffers (when from_opt_vec is called)

It may be more efficient to create the buffers during the iteration, so that we avoid the copy (Vec -> buffers). In other words, the code in from_opt_vec could be "injected" into the filter execution, where the MutableBuffer and the offsets and values buffers are created before the loop, and new elements are written directly to them.

(As a side note, this is why he proposed ARROW-10030, https://github.com/apache/arrow/pull/8211: IMO there is some boilerplate copy-pasting to
* initialize buffers
* iterate
* create ArrayData from buffers

which will continue to grow as we add more kernels, and whose pattern seems to be a FromIter of fixed size.)
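The pattern being proposed can be illustrated in Python (a language-neutral sketch, not Arrow's actual Rust code): instead of collecting surviving values into an intermediate list and copying it into the output afterwards, size the output buffer up front and write each surviving element straight into it during a single iteration:

```python
import array

values = array.array('q', range(10))     # int64 input values
mask = [v % 2 == 0 for v in values]      # the filter's boolean mask

# Copy-free pattern: preallocate the output buffer at its final size...
out = array.array('q', bytes(8 * sum(mask)))

# ...then write survivors directly into it, no intermediate list.
i = 0
for v, keep in zip(values, mask):
    if keep:
        out[i] = v
        i += 1

print(list(out))  # [0, 2, 4, 6, 8]
```

This is the same idea as creating the MutableBuffer before the loop in the Rust kernel: the second pass over an intermediate Vec disappears entirely.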
[jira] [Created] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading
Joris Van den Bossche created ARROW-10145:
---------------------------------------------

             Summary: [C++][Dataset] Integer-like partition field values outside int32 range error on reading
                 Key: ARROW-10145
                 URL: https://issues.apache.org/jira/browse/ARROW-10145
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche

From https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset

Small reproducer:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'part': [3760212050]*10, 'col': range(10)})
pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])

In [35]: pq.read_table("test_int64_partition/")
...
ArrowInvalid: error parsing '3760212050' as scalar of type int32
In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
In ../src/arrow/dataset/partition.cc, line 218, code: (_error_or_value26).status()
In ../src/arrow/dataset/partition.cc, line 229, code: (_error_or_value27).status()
In ../src/arrow/dataset/discovery.cc, line 256, code: (_error_or_value17).status()

In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
Out[36]:
pyarrow.Table
col: int64
part: dictionary
{code}
[jira] [Created] (ARROW-10148) [Rust] Add documentation to lib.rs
Jorge Leitão created ARROW-10148:
------------------------------------

             Summary: [Rust] Add documentation to lib.rs
                 Key: ARROW-10148
                 URL: https://issues.apache.org/jira/browse/ARROW-10148
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust
            Reporter: Jorge Leitão
            Assignee: Jorge Leitão
             Fix For: 2.0.0

Currently, the crate page looks rather empty. This issue aims to move the documentation from the README to the crate docs, so that it reaches a broader audience and follows Rust best practices.
[jira] [Created] (ARROW-10146) [Python] Parquet metadata to_dict raises attribute error
Florian Jetter created ARROW-10146:
--------------------------------------

             Summary: [Python] Parquet metadata to_dict raises attribute error
                 Key: ARROW-10146
                 URL: https://issues.apache.org/jira/browse/ARROW-10146
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Florian Jetter

h2. Description

When accessing row group metadata and trying to convert it to a dict using the method {{to_dict}}, I sometimes receive an AttributeError. This can be consistently reproduced with an empty dataframe (see example below), but I have also seen it for non-empty dataframes. I couldn't track down what makes the non-empty cases special, hence the example below.

h2. Expected behaviour

I would expect to_dict to always return a dictionary with the appropriate metadata and statistics, regardless of the file content.

h2. Minimal Example

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": [1]}).head(0)
table = pa.Table.from_pandas(df)

buf = pa.BufferOutputStream()
pq.write_table(table, buf)
reader = pa.BufferReader(buf.getvalue())
parquet_file = pq.ParquetFile(reader)

# Raises AttributeError
parquet_file.metadata.to_dict()
{code}

h3. Traceback

{code:java}
~/miniconda3/envs/kartothek-dev/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.FileMetaData.to_dict()

~/miniconda3/envs/kartothek-dev/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.RowGroupMetaData.to_dict()

~/miniconda3/envs/kartothek-dev/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ColumnChunkMetaData.to_dict()

AttributeError: 'NoneType' object has no attribute 'to_dict'
{code}

h3. Versions

{code:java}
In [28]: pa.__version__
Out[28]: '1.0.1'

In [29]: pd.__version__
Out[29]: '1.0.5'
{code}
[jira] [Created] (ARROW-10147) [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default
Wes McKinney created ARROW-10147:
------------------------------------

             Summary: [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default
                 Key: ARROW-10147
                 URL: https://issues.apache.org/jira/browse/ARROW-10147
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Wes McKinney
             Fix For: 2.0.0

Originally reported in https://github.com/apache/arrow/issues/8270. Here's a minimal reproduction:

{code}
In [24]: idx = pd.RangeIndex(0, 4, name=np.int64(6))

In [25]: df = pd.DataFrame(index=idx)

In [26]: pa.table(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in
----> 1 pa.table(df)

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.table()

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/code/arrow/python/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    604     pandas_metadata = construct_metadata(df, column_names, index_columns,
    605                                          index_descriptors, preserve_index,
--> 606                                          types)
    607     metadata = deepcopy(schema.metadata) if schema.metadata else dict()
    608     metadata.update(pandas_metadata)

~/code/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, column_names, index_levels, index_descriptors, preserve_index, types)
    243             'version': pa.__version__
    244         },
--> 245         'pandas_version': _pandas_api.version
    246     }).encode('utf8')
    247 }

~/miniconda/envs/arrow-3.7/lib/python3.7/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    229         cls is None and indent is None and separators is None and
    230         default is None and not sort_keys and not kw):
--> 231         return _default_encoder.encode(obj)
    232     if cls is None:
    233         cls = JSONEncoder

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in iterencode(self, o, _one_shot)
    255             self.key_separator, self.item_separator, self.sort_keys,
    256             self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in default(self, o)
    177
    178         """
--> 179         raise TypeError(f'Object of type {o.__class__.__name__} '
    180                         f'is not JSON serializable')
    181

TypeError: Object of type int64 is not JSON serializable
{code}
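The sort of fix involved can be sketched with json.dumps's `default` hook, which lets the encoder coerce NumPy integer scalars to native ints before encoding (a generic sketch, not the actual patch):

```python
import json

import numpy as np

def np_default(obj):
    # Coerce NumPy integer scalars to plain Python ints
    if isinstance(obj, np.integer):
        return int(obj)
    raise TypeError(f'Object of type {type(obj).__name__} '
                    f'is not JSON serializable')

# np.int64(6) as an index name would otherwise raise, as in the traceback
print(json.dumps({'name': np.int64(6)}, default=np_default))  # {"name": 6}
```

An equivalent fix inside construct_metadata would be to normalize index names to native Python types before they reach json.dumps.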
[jira] [Created] (ARROW-10144) Add support for using the TLS_SNI extension
James Duong created ARROW-10144:
-----------------------------------

             Summary: Add support for using the TLS_SNI extension
                 Key: ARROW-10144
                 URL: https://issues.apache.org/jira/browse/ARROW-10144
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, FlightRPC, Java, Python
            Reporter: James Duong

When using encryption, add support for the TLS_SNI extension (https://en.wikipedia.org/wiki/Server_Name_Indication).
[jira] [Created] (ARROW-10142) [C++] RecordBatchStreamReader should use StreamDecoder
Antoine Pitrou created ARROW-10142:
--------------------------------------

             Summary: [C++] RecordBatchStreamReader should use StreamDecoder
                 Key: ARROW-10142
                 URL: https://issues.apache.org/jira/browse/ARROW-10142
             Project: Apache Arrow
          Issue Type: Wish
          Components: C++
            Reporter: Antoine Pitrou
             Fix For: 3.0.0

There's no reason to duplicate some of the stream reading logic, and re-using StreamDecoder would ensure that the behaviour of both classes matches.
[jira] [Created] (ARROW-10143) [C++] ArrayRangeEquals should accept EqualOptions
Antoine Pitrou created ARROW-10143:
--------------------------------------

             Summary: [C++] ArrayRangeEquals should accept EqualOptions
                 Key: ARROW-10143
                 URL: https://issues.apache.org/jira/browse/ARROW-10143
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Antoine Pitrou
             Fix For: 3.0.0

Besides, the underlying implementations of ArrayEquals and ArrayRangeEquals should be shared (right now they are duplicated).