[ https://issues.apache.org/jira/browse/ARROW-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-2503.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.10.0

Issue resolved by pull request 1945
[https://github.com/apache/arrow/pull/1945]

> Trailing space character in RowGroup statistics of pyarrow.parquet.ParquetFile
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-2503
>                 URL: https://issues.apache.org/jira/browse/ARROW-2503
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Julius Neuffer
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> When reading a parquet file containing a string column, the _RowGroup_
> statistics contain a trailing space character for the string column. The
> example below shows the behavior.
> {code}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> # create and write arrow table as parquet
> df = pd.DataFrame({'string_column': ['some', 'string', 'values', 'here']})
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'example.parquet')
> # read parquet file metadata and print string column statistics
> pq_file = pq.ParquetFile(open('example.parquet', 'rb'))
> print(pq_file.metadata.row_group(0).column(0).statistics.max) # yields b'values '
> print(pq_file.metadata.row_group(0).column(0).statistics.min) # yields b'here '
> {code}
> For other data types I did not observe this problem, even though the
> statistics are always strings.
> When reading the same file with _fastparquet_, there is no trailing space
> character, which implies that this problem occurs in the reading path of
> _pyarrow.parquet_. I am aware that this might well be an issue with
> _parquet-cpp_, but as I face this bug as a _pyarrow_ user, I report it here.
> I'll try to investigate this further and report back here.
>
> *Update:*
> The trailing space is added in _parquet-cpp_. _pyarrow_ calls the function
> _FormatStatValue_ which adds the trailing space
> (https://github.com/apache/parquet-cpp/blob/master/src/parquet/types.cc#L52).
> There is no comment there to explain it. Does anyone here know what the
> reason is?
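
For readers still on 0.9.0, the following is a minimal sketch of a possible workaround, not something taken from the issue or from pull request 1945. It assumes the padding is exactly one trailing space, as in the reported output. `strip_stat_padding` is a hypothetical helper and not part of pyarrow. Note that it would also shorten a value that legitimately ends in a space, so it is only suitable where that cannot occur.

```python
def strip_stat_padding(value: bytes) -> bytes:
    """Drop exactly one trailing space, if present (hypothetical helper)."""
    return value[:-1] if value.endswith(b" ") else value


# Statistics values as reported in the issue for pyarrow 0.9.0.
reported_max = b"values "
reported_min = b"here "

cleaned_max = strip_stat_padding(reported_max)
cleaned_min = strip_stat_padding(reported_min)
print(cleaned_max, cleaned_min)
```

With the fix in 0.10.0 the helper should no longer be needed, and applying it there could remove a genuine trailing space.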