[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126918#comment-17126918 ]

Wes McKinney commented on ARROW-1644:

Thanks. We should follow up on the mailing list discussion and see what is the latest game plan for implementing the Parquet nested reader. Some of my colleagues should be able to help.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and
> list nesting levels
> ----------
>
>                 Key: ARROW-1644
>                 URL: https://issues.apache.org/jira/browse/ARROW-1644
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>    Affects Versions: 0.8.0
>            Reporter: DB Tsai
>            Assignee: Micah Kornfield
>            Priority: Major
>              Labels: parquet, pull-request-available
>
> We have many nested parquet files generated from Apache Spark for ranking
> problems, and we would like to load them in python for other programs to
> consume.
> The schema looks like
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be
> able to load the nested parquet in pyarrow.
> Any insight about this?
> Thanks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
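For readers wondering why a list of structs is harder to read than flat columns, here is a minimal sketch of how Parquet's maximum definition and repetition levels could be derived for the leaf columns of the schema quoted above. It assumes Spark wrote the standard three-level LIST layout (`items.list.element`); the physical Parquet schema is not shown in the report, so the `SCHEMA_PATHS` table and the `max_levels` helper are illustrative names, not Arrow or Parquet APIs.

```python
# Hypothetical physical layout for the quoted Spark schema, ASSUMING the
# standard three-level Parquet LIST encoding. Each leaf column is described
# by the (field name, repetition) pairs on the path from the root to the leaf.
SCHEMA_PATHS = {
    "profile_id": [("profile_id", "optional")],
    "country_iso_code": [("country_iso_code", "optional")],
    "items.list.element.show_title_id": [
        ("items", "required"),          # array (nullable = false)
        ("list", "repeated"),           # repeated group holding the elements
        ("element", "required"),        # struct (containsNull = false)
        ("show_title_id", "optional"),  # integer (nullable = true)
    ],
    "items.list.element.duration": [
        ("items", "required"),
        ("list", "repeated"),
        ("element", "required"),
        ("duration", "optional"),       # double (nullable = true)
    ],
}


def max_levels(path):
    """Return (max definition level, max repetition level) for one leaf."""
    # Every optional or repeated field on the path adds one definition level.
    definition = sum(1 for _, rep in path if rep in ("optional", "repeated"))
    # Every repeated field on the path adds one repetition level.
    repetition = sum(1 for _, rep in path if rep == "repeated")
    return definition, repetition


LEVELS = {column: max_levels(path) for column, path in SCHEMA_PATHS.items()}

for column, (definition, repetition) in LEVELS.items():
    print(f"{column}: max_def={definition} max_rep={repetition}")
```

Under these assumptions both struct members carry level streams that also encode the boundaries of the enclosing list, so a reader has to rebuild one set of list offsets for the struct from its children. That is the kind of mixed struct and list nesting this issue asks for.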
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126858#comment-17126858 ]

Eric Czech commented on ARROW-1644:

This may be another useful example: [https://storage.googleapis.com/open-targets-data-releases/20.04/input/evidence-files/progeny-2018-07-23.json.gz]

It's a 620K (uncompressed) set of json records with gene pathways that regulate various types of cancer. It has a good mix of structs within structs, arrays with structs, arrays of structs that are themselves in other structs, etc.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086167#comment-17086167 ]

Dmitry Kalinkin commented on ARROW-1644:

Here: https://transfer.sh/w4IQ0/test_nested.parquet

This is written with rust/parquet and tested to be readable with parquet-tools
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085408#comment-17085408 ]

Micah Kornfield commented on ARROW-1644:

For anyone who is interested in reading mixed-level data: if you are able to provide sample Parquet files (probably no more than 5-10 MB of data) to run microbenchmarks against, it would help ensure we are writing code with the right trade-offs.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027060#comment-17027060 ]

Wes McKinney commented on ARROW-1644:

Please see the recent e-mail discussion

https://lists.apache.org/thread.html/r831c41a3f9b353cfc187dc2092515fe707d4ebf13cbb727524f1c600%40%3Cdev.arrow.apache.org%3E
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026874#comment-17026874 ]

Zack Gancarz commented on ARROW-1644:

Hi Wes, any progress on this one? It seems like a common use case, as a lot of people need to save nested protobufs to .parquet. Thank you kindly
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979367#comment-16979367 ]

Rinke Hoekstra commented on ARROW-1644:

[~davlee1...@yahoo.com] good point about the multi-threaded loading. However (at the risk of being pedantic), the file is valid JSON Lines but not valid JSON: most if not all JSON parsers will raise an exception at the missing enclosing brackets and missing comma.

In any case, the issue is now raised at: https://issues.apache.org/jira/browse/ARROW-7226

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and
> list nesting levels
> ----------
>
>                 Key: ARROW-1644
>                 URL: https://issues.apache.org/jira/browse/ARROW-1644
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>    Affects Versions: 0.8.0
>            Reporter: DB Tsai
>            Assignee: Micah Kornfield
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 1.0.0
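The distinction drawn in this comment can be checked with Python's standard `json` module. The sketch below is illustrative only: it reuses the two records from the pyarrow documentation example quoted elsewhere in this thread, and the helper names `parses_as_single_document` and `parse_json_lines` are made up for the example, not part of any library.

```python
import json

# The two records from the documentation example, one JSON object per line.
JSON_LINES_TEXT = (
    '{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}\n'
    '{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}\n'
)


def parses_as_single_document(text):
    """Return True if the whole text is one valid JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        # A second top-level object after the first counts as extra data.
        return False
    return True


def parse_json_lines(text):
    """Parse line-delimited JSON: every non-empty line is its own document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(JSON_LINES_TEXT)
print("single JSON document:", parses_as_single_document(JSON_LINES_TEXT))
print("records parsed line by line:", len(records))
```

Because each line parses independently, a reader can split the input on newlines and hand the chunks to separate workers, which is the property the multi-threading remark refers to.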
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979362#comment-16979362 ]

David Lee commented on ARROW-1644:

The format is valid. http://jsonlines.org

Line-delimited JSON is a better format for data, since you can leverage threads to speed up read operations.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979326#comment-16979326 ]

Joris Van den Bossche commented on ARROW-1644:

[~RinkeHoekstra] that looks unrelated (the json reader is mostly independent from the parquet IO). Can you open a separate JIRA ticket?
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979315#comment-16979315 ]

Rinke Hoekstra commented on ARROW-1644:

I was just trying this with the example found in the pyarrow docs at [http://arrow.apache.org/docs/python/json.html]

The documented example does not work. Is this related to this issue, or is it another matter?

It says to load the following JSON file:

```
{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}
```

I fixed this to make it valid (but that's another issue):

```
[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}},
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]
```

Then reading the JSON from a file called `my_data.json`:

```
from pyarrow import json
table = json.read_json("my_data.json")
```

Gives the following error:

```
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
 in ()
      1 from pyarrow import json
----> 2 table = json.read_json('test.json')

~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/_json.pyx in pyarrow._json.read_json()

~/.local/share/virtualenvs/parquet-ifRxINoC/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: JSON parse error: A column changed from object to array
```
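As the replies in this thread point out, the documented example is line-delimited JSON (one object per line), so wrapping the records in a top-level array is what changes the shape the reader sees. For a file that has already been wrapped this way, a minimal stdlib-only sketch of converting it back to one record per line could look like the following; `array_to_json_lines` is a hypothetical helper written for this illustration, not a pyarrow function.

```python
import json


def array_to_json_lines(array_text):
    """Convert a JSON document holding a top-level array into JSON Lines."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Serialize each record on its own line, without the enclosing brackets
    # or the commas between records.
    return "".join(json.dumps(record) + "\n" for record in records)


# The "fixed" array form from the comment above.
ARRAY_TEXT = (
    '[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}, '
    '{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}]'
)

LINES_TEXT = array_to_json_lines(ARRAY_TEXT)
print(LINES_TEXT, end="")
```

The resulting text has the same one-object-per-line layout as the original documentation example.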
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969712#comment-16969712 ]

Micah Kornfield commented on ARROW-1644:

The code isn't really usable as-is, since it is based on the old repo and a lot of changes have been made since then (and it had a performance regression). I haven't had time to work on this, but I still hope to get some bandwidth in the next month or so. If there are motivated parties, I'm happy to remove my name from the assignment.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969465#comment-16969465 ]

William Young commented on ARROW-1644:

Are there plans to merge this code? I have a use-case.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911731#comment-16911731 ]

Wes McKinney commented on ARROW-1644:
-------------------------------------

Note that contributing to other parts of the project helps free up developers to work on larger projects like this.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911426#comment-16911426 ]

Brian Phillips commented on ARROW-1644:
---------------------------------------

My main use case for (py)arrow is converting deeply nested protobuf data to Parquet for storage. Currently I'm forced to store it as JSON instead because there is no nested data support. I would love to see this implemented, but unfortunately I can't be much help as I don't know C++.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911378#comment-16911378 ]

Wes McKinney commented on ARROW-1644:
-------------------------------------

Nope, the large projects I had planned are done. The only further work I'd be interested in would be expanding the encoders / low-level column reader/writer classes to handle more dictionary-encoded types. None of that should affect the nested data disassembly / reassembly logic. One of my goals with these recent refactors was actually to move the "flat" serialization/deserialization code "out of the way" (since the prior effort on this caused performance regressions on flat data)
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911011#comment-16911011 ]

Micah Kornfield commented on ARROW-1644:
----------------------------------------

[~bhogan-mitre] there isn't much of an update. I put this off a little because @wesm was doing some major refactoring. If you want to contribute, we can probably divide the work between read and write (provided @wesm isn't planning on any more major refactorings ...).
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910669#comment-16910669 ]

Wes McKinney commented on ARROW-1644:
-------------------------------------

I'm not aware of any updates; there are no patches available yet
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890246#comment-16890246 ]

Micah Kornfield commented on ARROW-1644:
----------------------------------------

Since there has been some interest on the old PR, I'll give a quick status update. I'm about 50% done with the write path, and I hope to have it finished by the end of this week or next. I'll then start on the read path. It is likely I will try to leverage some code from the old PR or [https://github.com/apache/arrow/pull/4066], but I will have a better idea once I take a close look.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890158#comment-16890158 ]

ASF GitHub Bot commented on ARROW-1644:
---------------------------------------

wesm commented on pull request #462: ARROW-1644: [C++] Initial cut of implementing deserialization of arbitrary nested groups from Parquet to Arrow
URL: https://github.com/apache/parquet-cpp/pull/462

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877937#comment-16877937 ]

Wes McKinney commented on ARROW-1644:
-------------------------------------

Yes, essentially.

1-level encoding

{code}
group schema {
  optional INT32 some_other_value
  repeated T list_item;
}
{code}

2-level encoding

{code}
group schema {
  optional INT32 some_other_value
  repeated group list_value {
    optional/required T list_item;
  }
}
{code}

3-level encoding

{code}
group schema {
  optional INT32 some_other_value
  optional/required group list_value LIST {
    repeated group box {
      optional/required T list_item;
    }
  }
}
{code}

note

* The 1-level encoding can only encode {{array NOT NULL}}.
* The 2-level encoding can only encode {{array NOT NULL}}
* The 3-level encoding can encode either nullity of list items or the lists themselves {{array [nullable?]}}

The decode path is slightly different for the 1/2 level cases
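To make the difference between the three encodings concrete, here is a minimal sketch (not the Arrow or parquet-cpp implementation) that computes the maximum definition and repetition levels of the {{list_item}} leaf under each layout, using the standard Dremel-style counting rule: every non-required ancestor adds one definition level, and every repeated ancestor adds one repetition level. The function name {{max_levels}} and the path constants are hypothetical, introduced only for illustration, and the "optional" variants of the 2- and 3-level schemas are assumed.

```python
from typing import List, Tuple

def max_levels(path: List[str]) -> Tuple[int, int]:
    """Return (max definition level, max repetition level) for a leaf.

    `path` lists the repetition type of every field from just below the
    schema root down to the leaf itself: "required", "optional" or "repeated".
    """
    # Each optional or repeated field on the path can be "not defined",
    # so each one contributes a definition level.
    definition = sum(1 for rep in path if rep != "required")
    # Only repeated fields start new list entries.
    repetition = sum(1 for rep in path if rep == "repeated")
    return definition, repetition

# Paths to `list_item` in the three layouts from the comment above
# (assumption: the "optional" choice is taken wherever the schema allows it).
ONE_LEVEL = ["repeated"]                            # repeated T list_item
TWO_LEVEL = ["repeated", "optional"]                # repeated group / optional T
THREE_LEVEL = ["optional", "repeated", "optional"]  # optional LIST group / repeated group / optional T

if __name__ == "__main__":
    for name, path in [("1-level", ONE_LEVEL),
                       ("2-level", TWO_LEVEL),
                       ("3-level", THREE_LEVEL)]:
        print(name, max_levels(path))
```

Under these assumptions the 1-level layout has a single definition level above zero, so it can only say "empty list" or "value present". The 2-level layout gains one extra level for a null item, and the 3-level layout gains another for a null list, which matches the nullability notes in the comment.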
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877444#comment-16877444 ]

Micah Kornfield commented on ARROW-1644:
----------------------------------------

Are you referencing https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types?
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877438#comment-16877438 ]

Wes McKinney commented on ARROW-1644:
-------------------------------------

[~emkornfi...@gmail.com] please keep in mind there are multiple styles of nested data encoding (1-, 2-, and 3-level list encoding). Which one a file uses can be determined from the schema, so we'll probably need to support all 3 kinds.