[jira] [Commented] (ARROW-4139) [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is set
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861272#comment-16861272 ]

Wes McKinney commented on ARROW-4139:
-------------------------------------

I can pick up this one for 0.14.0, but someone else is free to take it from me.

> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is set
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-4139
>                 URL: https://issues.apache.org/jira/browse/ARROW-4139
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Matthew Rocklin
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: parquet, pull-request-available, python
>             Fix For: 0.14.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> When writing Pandas data to Parquet format and reading it back again, I find
> that the statistics of text columns are stored as byte arrays rather than as
> unicode text.
>
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding
> of how best to manage statistics. (I'd be quite happy to learn that it was
> the latter.)
> Here is a minimal example
>
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
>
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> rg = piece.row_group(0)
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
>
> >>> c
>   file_offset: 63
>   file_path:
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
>     has_min_max: True
>     min: b'a'
>     max: b'a'
>     null_count: 0
>     distinct_count: 0
>     num_values: 1
>     physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
>
> >>> type(c.statistics.min)
> bytes
> {code}
>
> My guess is that we would want to store a logical type in the statistics like
> UNICODE, though I don't have enough experience with Parquet data types to
> know if this is a good idea or possible.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
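Until the library performs this cast itself, downstream code can apply the conversion by hand. A minimal sketch of that workaround (the helper name and the converted-type string are illustrative, not part of the pyarrow API):

```python
def to_logical_stat(value, converted_type):
    """Cast a raw (physical) statistic to its logical value.

    Illustrative helper: `converted_type` is assumed to be the
    Parquet ConvertedType name as a string, e.g. "UTF8".
    """
    if converted_type == "UTF8" and isinstance(value, bytes):
        return value.decode("utf-8")
    # Fall back to the physical value for types we don't handle.
    return value

# The b'a' min from the report comes back as text:
print(to_logical_stat(b"a", "UTF8"))        # -> a
print(type(to_logical_stat(b"a", "UTF8")))  # -> <class 'str'>
```

A real fix would key off the ConvertedType stored in the file's schema rather than a caller-supplied string.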
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829738#comment-16829738 ]

Deepak Majeti commented on ARROW-4139:
--------------------------------------

The statistics are fixed for UTF-8 types:
https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/types.cc#L260

https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/metadata.cc#L140 is an out-of-date comment and must be fixed. See the comment here:
https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/metadata.cc#L558
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829730#comment-16829730 ]

Wes McKinney commented on ARROW-4139:
-------------------------------------

I don't think there is a sorting issue with UTF-8 types -- [~mdeepak] worked through this, I believe.
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829633#comment-16829633 ]

Michael Eaton commented on ARROW-4139:
--------------------------------------

The following comment seems to suggest that there is a latent issue sorting unsigned types, specifically mentioning UTF-8. If this is indeed the case, then the sorting will have to be fixed before progress can continue on this issue.

https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/metadata.cc#L140
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823647#comment-16823647 ]

Michael Eaton commented on ARROW-4139:
--------------------------------------

I submitted a rough WIP PR in order to facilitate conversation.
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823146#comment-16823146 ]

Wes McKinney commented on ARROW-4139:
-------------------------------------

I added this to my list of things to look at early this week.
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823025#comment-16823025 ]

Uwe L. Korn commented on ARROW-4139:
------------------------------------

Have a look at https://github.com/apache/arrow/pull/2623/files, which implements predicate pushdown in Arrow on the Python side. There should also be some code in there that handles the physical-to-logical conversion.
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822085#comment-16822085 ]

Matthew Rocklin commented on ARROW-4139:
----------------------------------------

I don't have strong thoughts about the API. I mostly care that downstream projects don't have to special-case a variety of types. I expect that special-casing would be quite brittle and break over time.

Ideally we wouldn't be special-casing these things in Arrow either. My guess is that there is already code somewhere in Arrow that knows how to convert these values in a consistent way. Hopefully that same code would be engaged for statistics as well; that way we would have confidence that things wouldn't drift in the future.
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821482#comment-16821482 ]

Michael Eaton commented on ARROW-4139:
--------------------------------------

I was able to access the logical types using pyarrow as:

{code:python}
md = piece.get_metadata(pq.ParquetFile)
rg = md.row_group(0)
ps = pq.ParquetSchema(md)
lt = ps.column(i).logical_type
{code}

and then do the appropriate conversion using decode('UTF8') and np.datetime64() respectively.

[~mrocklin], does it work for you to write an accessor that returns the converted statistics for a given column? Or what would work best for your use case? e.g. get_statistics(column_index)
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819225#comment-16819225 ]

Michael Eaton commented on ARROW-4139:
--------------------------------------

Adding a decode("utf8") to ParquetType_BYTE_ARRAY seems simple enough and yields min/max of type utf8. I can only guess that this breaks statistics for logical types that depend on ParquetType_BYTE_ARRAY. What are these types, and how should one go about adding a logical accessor if necessary?
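For context on that question: per the Parquet format specification, the ConvertedTypes that sit on top of the BYTE_ARRAY physical type include UTF8, ENUM, JSON, BSON, and DECIMAL, so an unconditional decode("utf8") would at least mishandle DECIMAL statistics. A sketch of dispatching on the converted type (function name and fallback behavior are illustrative, not pyarrow API):

```python
def decode_byte_array_stat(raw, converted_type):
    """Decode a BYTE_ARRAY statistic according to its ConvertedType.

    Sketch only: a real implementation would also need the column's
    scale/precision for DECIMAL, and stricter validation for ENUM/JSON.
    """
    if converted_type in ("UTF8", "JSON", "ENUM"):
        return raw.decode("utf-8")
    if converted_type == "DECIMAL":
        # DECIMAL over BYTE_ARRAY is an unscaled big-endian
        # two's-complement integer (scale handling omitted here).
        return int.from_bytes(raw, "big", signed=True)
    return raw  # BSON and plain binary stay as bytes

print(decode_byte_array_stat(b"a", "UTF8"))            # -> a
print(decode_byte_array_stat(b"\x01\x00", "DECIMAL"))  # -> 256
```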
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802291#comment-16802291 ]

Matthew Rocklin commented on ARROW-4139:
----------------------------------------

Perhaps relatedly, it would be useful to get logical values for statistics of other types as well. Here is another issue I ran into with timestamps:

{code}
In [1]: import pandas as pd

In [2]: df = pd.util.testing.makeTimeDataFrame()

In [3]: df.head()
Out[3]:
                   A         B         C         D
2000-01-03  1.255856 -1.092558 -1.454595  0.898535
2000-01-04 -1.006590  0.640467 -2.249877  0.068293
2000-01-05 -1.525559  0.567070  1.039230 -0.967301
2000-01-06 -0.773395 -1.565619  0.025786  0.106949
2000-01-07 -0.079000  0.367165  1.746211 -0.097441

In [4]: df.to_parquet('foo.parquet')

In [5]: import pyarrow.parquet as pq

In [6]: p = pq.ParquetDataset('foo.parquet')

In [7]: piece = p.pieces[0]

In [8]: md = piece.get_metadata(open_file_func=lambda fn: open(fn, mode='rb'))

In [9]: rg = md.row_group(0)

In [10]: rg.column(4)
Out[10]:
  file_offset: 1913
  file_path:
  physical_type: INT64
  num_values: 30
  path_in_schema: __index_level_0__
  is_stats_set: True
  statistics:
    has_min_max: True
    min: 94685760
    max: 95022720
    null_count: 0
    distinct_count: 0
    num_values: 30
    physical_type: INT64
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 1635
  data_page_offset: 1842
  total_compressed_size: 278
  total_uncompressed_size: 325

In [11]: rg.column(4).statistics.min  # I want this to be some sort of timestamp object
Out[11]: 94685760
{code}

Somewhat unrelatedly, there is a lot of boilerplate to get down to that information. If there are nicer ways to get to statistics I'd be interested in hearing about them.
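A sketch of the conversion being asked for here, assuming the INT64 statistic carries a TIMESTAMP_MILLIS or TIMESTAMP_MICROS converted type (the helper name and unit strings are illustrative, not pyarrow API):

```python
from datetime import datetime, timezone

# Seconds-per-unit divisors for the two INT64 timestamp ConvertedTypes.
_DIVISOR = {"TIMESTAMP_MILLIS": 1_000, "TIMESTAMP_MICROS": 1_000_000}

def stat_to_timestamp(raw, converted_type):
    """Turn a raw INT64 statistic into a timezone-aware datetime."""
    return datetime.fromtimestamp(raw / _DIVISOR[converted_type], tz=timezone.utc)

# 2000-01-03T00:00:00Z expressed as milliseconds since the epoch:
print(stat_to_timestamp(946_857_600_000, "TIMESTAMP_MILLIS"))  # -> 2000-01-03 00:00:00+00:00
```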
[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732236#comment-16732236 ]

Uwe L. Korn commented on ARROW-4139:
------------------------------------

We currently always return the physical types in all cases for the statistics. We should add {{logical_{min,max}}} as additional accessors.
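The proposed {{logical_{min,max}}} accessors could be a thin wrapper over the physical values. A sketch of the shape such an accessor might take (class and property names are hypothetical, not the eventual pyarrow API; only the UTF8 case from this issue is shown):

```python
class LogicalStatistics:
    """Illustrative wrapper pairing physical min/max with a ConvertedType."""

    def __init__(self, physical_min, physical_max, converted_type):
        self._min = physical_min
        self._max = physical_max
        self._type = converted_type

    def _convert(self, value):
        # Only the UTF8 case from this issue; other converted
        # types would dispatch here as well.
        if self._type == "UTF8" and isinstance(value, bytes):
            return value.decode("utf-8")
        return value

    @property
    def logical_min(self):
        return self._convert(self._min)

    @property
    def logical_max(self):
        return self._convert(self._max)

stats = LogicalStatistics(b"a", b"a", "UTF8")
print(stats.logical_min, stats.logical_max)  # -> a a
```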