[jira] [Created] (ARROW-16893) Add quoting style support for pyarrow.csv.WriteOptions

2022-06-23 Thread David Lee (Jira)
David Lee created ARROW-16893:
-

 Summary: Add quoting style support for pyarrow.csv.WriteOptions
 Key: ARROW-16893
 URL: https://issues.apache.org/jira/browse/ARROW-16893
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 8.0.0
Reporter: David Lee


https://issues.apache.org/jira/browse/ARROW-14905

The quoting style option was added for C++, but it is not yet exposed in Python.

The pyarrow.csv writer currently produces CSV files in which every string is 
double quoted, with no option to leave strings unquoted.

The C++ default for quoting style is "needed":

"portfolioID","marketValue","notionalMarketValue","weight","notionalWeight"
"ABCXYZ12345",26260.74,0.039716113109573174,26260.74,0.039716113109573174



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16629) Apache Arrow Flight transport speed improvement for list structures

2022-05-23 Thread David Lee (Jira)
David Lee created ARROW-16629:
-

 Summary: Apache Arrow Flight transport speed improvement for list 
structures
 Key: ARROW-16629
 URL: https://issues.apache.org/jira/browse/ARROW-16629
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC
Affects Versions: 8.0.0
Reporter: David Lee


I just started testing using Arrow Flight to send results from a GraphQL server 
with FlightServer() running on it.

GraphQL defines a schema for your data output which can be mapped to an Arrow 
schema, so I thought it would make sense to try using Arrow Flight to transport 
results instead of REST-style JSON records.

Arrow Flight was 66% faster in all cases, but it didn't scale as the number of 
child records increased. I suspect that serializing structs or lists needs some 
improvement.

Here is the discussion I opened, including links to the test scripts.

[https://github.com/mirumee/ariadne/discussions/867]

10 records: 0.049 seconds faster (80% faster)
1 records: 0.109 seconds faster (66% faster)
10 million records: 54 seconds faster (66% faster)
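
For context, here is a minimal sketch of this kind of setup using the public 
pyarrow.flight API (the server class, port and ticket value are hypothetical; 
the real test scripts are in the linked discussion):

{code:python}
import pyarrow as pa
import pyarrow.flight as flight

class GraphQLFlightServer(flight.FlightServerBase):  # hypothetical name
    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        # In the real test the table is built from GraphQL resolver output.
        self._table = pa.table({"length": [10], "time_spent": [0.0]})

    def do_get(self, context, ticket):
        return flight.RecordBatchStream(self._table)

# server side: GraphQLFlightServer().serve()  # blocks
# client side:
client = flight.connect("grpc://localhost:8815")
reader = client.do_get(flight.Ticket(b"test_lists"))
result = reader.read_all()  # pyarrow.Table
{code}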

Also, here is the data structure that is sent across the wire:

pyarrow.Table
data: struct<test_lists: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>>
  child 0, test_lists: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>
      child 0, float_list: list<item: double>
          child 0, item: double
      child 1, int_list: list<item: int64>
          child 0, item: int64
      child 2, length: int64
      child 3, string_list: list<item: string>
          child 0, item: string
      child 4, time_spent: double

data: [
  -- is_valid: all not null
  -- child 0 type: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>
    -- is_valid: all not null
    -- child 0 type: list<item: double>
[[13.500371672273381,17.747395152140353,28.973205439157457,1.361443415643098,19.029191125636135,14.62284718057391,18.44333922481529,7.906278860251386,14.402464768126993,5.826040531772251]]
    -- child 1 type: list<item: int64>
[[23,3,21,15,20,4,10,16,23,25]]
    -- child 2 type: int64
[10]
    -- child 3 type: list<item: string>
[["qypsupwtxy","vrxptpspyt","qpvruwsuqq","ywwpyxrvrt","wswutpxxqv","tsyypstxvv","ytprpqsxsx","wtwsxvprvu","suwtrvqvwp","wtsrwywwty"]]
    -- child 4 type: double
[0]]
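
For reference, a table with this layout can be built directly with pyarrow's 
type constructors; a sketch with placeholder values:

{code:python}
import pyarrow as pa

test_lists_type = pa.struct([
    ("float_list", pa.list_(pa.float64())),
    ("int_list", pa.list_(pa.int64())),
    ("length", pa.int64()),
    ("string_list", pa.list_(pa.string())),
    ("time_spent", pa.float64()),
])
schema = pa.schema([("data", pa.struct([("test_lists", test_lists_type)]))])

table = pa.table({
    "data": [{"test_lists": {"float_list": [13.5, 17.7],
                             "int_list": [23, 3],
                             "length": 10,
                             "string_list": ["qypsupwtxy", "vrxptpspyt"],
                             "time_spent": 0.0}}]
}, schema=schema)
{code}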



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-11-21 Thread David Lee (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979362#comment-16979362
 ] 

David Lee edited comment on ARROW-1644 at 11/21/19 3:38 PM:


The format is valid. [http://jsonlines.org|http://jsonlines.org/]
 Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.

You also added a comma and bracket incorrectly which turned valid jsonl to 
invalid json. They should be outside the curly braces.

{{[{"a": [1, 2], "b": {"c": true, "d": "1991-02-03",
{{{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01" ]


was (Author: davlee1...@yahoo.com):
The format is valid. [http://jsonlines.org|http://jsonlines.org/]
 Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.

You also added a comma and bracket incorrectly which turned valid jsonl to 
invalid json. They should be outside the curly braces.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-11-21 Thread David Lee (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979362#comment-16979362
 ] 

David Lee edited comment on ARROW-1644 at 11/21/19 3:33 PM:


The format is valid. [http://jsonlines.org|http://jsonlines.org/]
 Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.

You also added a comma and bracket incorrectly which turned valid jsonl to 
invalid json. They should be outside the curly braces.


was (Author: davlee1...@yahoo.com):
The format is valid. [http://jsonlines.org|http://jsonlines.org/]
 Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.

You also added a comma incorrectly above which turned valid jsonl to invalid 
json.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-11-21 Thread David Lee (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979362#comment-16979362
 ] 

David Lee edited comment on ARROW-1644 at 11/21/19 3:30 PM:


The format is valid. [http://jsonlines.org|http://jsonlines.org/]
 Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.

You also added a comma incorrectly above which turned valid jsonl to invalid 
json.


was (Author: davlee1...@yahoo.com):
The format is valid. [http://jsonlines.org|http://jsonlines.org/]
 Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-11-21 Thread David Lee (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979362#comment-16979362
 ] 

David Lee edited comment on ARROW-1644 at 11/21/19 3:27 PM:


The format is valid. [http://jsonlines.org|http://jsonlines.org/]
 Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.


was (Author: davlee1...@yahoo.com):
The format is valid. http://jsonlines.org
Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-11-21 Thread David Lee (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979362#comment-16979362
 ] 

David Lee commented on ARROW-1644:
--

The format is valid. http://jsonlines.org
Line delimited json is a better format for data since you can leverage threads 
to speed up read operations.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-23 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891152#comment-16891152
 ] 

David Lee commented on ARROW-6001:
--

 

Table.from_pydict() in 0.14.1 looks fine. The code I originally reviewed iterated 
through the ordered dictionary keys instead of the schema field names.

Here are some test samples for to_pylist() and from_pylist():

 
{code:python}
import pyarrow as pa

test_schema = pa.schema([
    pa.field('id', pa.int16()),
    pa.field('struct_test', pa.list_(pa.struct([pa.field("child_id", pa.int16()),
                                                pa.field("child_name", pa.string())]))),
    pa.field('list_test', pa.list_(pa.int16()))
])
test_data = [
    {'id': 1, 'struct_test': [{'child_id': 11, 'child_name': '_11'},
                              {'child_id': 12, 'child_name': '_12'}], 'list_test': [1, 2, 3]},
    {'id': 2, 'struct_test': [{'child_id': 21, 'child_name': '_21'}], 'list_test': [4, 5]}
]
test_tbl = from_pylist(test_data, schema=test_schema)
test_list = to_pylist(test_tbl)
test_tbl
test_list
{code}
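
(Aside: later pyarrow releases, 7.0 and up, ship built-in equivalents of these 
helpers; a sketch assuming such a version:)

{code:python}
import pyarrow as pa

test_tbl = pa.Table.from_pylist(test_data, schema=test_schema)  # list of dicts -> Table
test_list = test_tbl.to_pylist()                                # Table -> list of dicts
{code}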
 

> Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve 
> pandas.to_dict()
> 
>
> Key: ARROW-6001
> URL: https://issues.apache.org/jira/browse/ARROW-6001
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: David Lee
>Priority: Minor
>
> I noticed that pyarrow.Table.to_pydict() exists, but 
> pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to 
> create one, but it doesn't take into account potential mismatches between 
> column order and number of columns.
> I'm including some code I've written which I've been using to handle arrow 
> conversions to ordered dictionaries and lists of dictionaries. I've also 
> included an example where this can be used to speed up pandas.to_dict() by a 
> factor of 6x.
>  
> {code:python}
> import pyarrow as pa
> 
> def from_pylist(pylist, names=None, schema=None, safe=True):
>     """
>     Converts a python list of dictionaries to a pyarrow table
>     :param pylist: list of dictionaries
>     :param names: list of column names
>     :param schema: pyarrow schema
>     :param safe: True or False
>     :return: arrow table
>     """
>     arrow_columns = list()
>     if schema:
>         for column in schema.names:
>             arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
>                                           safe=safe, type=schema.types[schema.get_field_index(column)]))
>         arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
>     else:
>         for column in names:
>             arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>         arrow_table = pa.Table.from_arrays(arrow_columns, names)
>     return arrow_table
> 
> def to_pylist(arrow_table, index_columns=None):
>     """
>     Converts a pyarrow table to a python list of dictionaries
>     :param arrow_table: arrow table
>     :param index_columns: columns to index
>     :return: python list of dictionaries
>     """
>     pydict = arrow_table.to_pydict()
>     if index_columns:
>         columns = arrow_table.schema.names
>         columns.append("_index")
>         pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
>                    if column == '_index' else pydict[column][row]
>                    for column in columns} for row in range(arrow_table.num_rows)]
>     else:
>         pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
>                   for row in range(arrow_table.num_rows)]
>     return pylist
> 
> def from_pydict(pydict, names=None, schema=None, safe=True):
>     """
>     Converts a python ordered dictionary to a pyarrow table
>     :param pydict: ordered dictionary
>     :param names: list of column names
>     :param schema: pyarrow schema
>     :param safe: True or False
>     :return: arrow table
>     """
>     arrow_columns = list()
>     dict_columns = list(pydict.keys())
>     if schema:
>         for column in schema.names:
>             if column in pydict:
>                 arrow_columns.append(pa.array(pydict[column], safe=safe,
>                                               type=schema.types[schema.get_field_index(column)]))
>             else:
>                 arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
>                                               type=schema.types[schema.get_field_index(column)]))
>         arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
>     else:
>         if not names:
>             names = dict_columns
>         for column in names:
>             if column in dict_columns:
>                 arrow_columns.append(pa.array(pydict[column], safe=safe))
>             else:
>                 arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
>         arrow_table = pa.Table.from_arrays(arrow_columns, names)
>     return arrow_table
> def 

[jira] [Updated] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-22 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-6001:
-
Description: 
I noticed that pyarrow.Table.to_pydict() exists, but 
pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create 
one, but it doesn't take into account potential mismatches between column order 
and number of columns.

I'm including some code I've written which I've been using to handle arrow 
conversions to ordered dictionaries and lists of dictionaries. I've also 
included an example where this can be used to speed up pandas.to_dict() by a 
factor of 6x.

 
{code:python}
import pyarrow as pa

def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns a set of unique values for a list of columns.
    :param arrow_table: arrow table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
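
To illustrate the helpers above, a small hypothetical usage sketch (the sample 
schema and data are made up):

{code:python}
import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# "name" is absent from the input, so from_pydict() fills it with nulls.
tbl = from_pydict({"id": [1, 2]}, schema=schema)

rows = to_pylist(tbl)                    # [{'id': 1, 'name': None}, {'id': 2, 'name': None}]
keys = get_indexed_values(tbl, ["id"])   # {(1,), (2,)}
{code}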
Here are my benchmarks using pandas to arrow to python vs. pandas.to_dict()

 
{code:python}
import time

# benchmark pandas conversion to python objects
print('**benchmark 1 million rows**')
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python: " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python: " + str(total_time))

print('**benchmark 4 million rows**')
start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python: " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python: " + str(total_time))
{code}

[jira] [Updated] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-22 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-6001:
-
Description: 
I noticed that pyarrow.Table.to_pydict() exists, but 
pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create 
one, but it doesn't take into account potential mismatches between column order 
and number of columns.

I'm including some code I've written which I've been using to handle arrow 
conversions to ordered dictionaries and lists of dictionaries. I've also 
included an example where this can be used to speed up pandas.to_dict() by a 
factor of 6x.

 
{code:python}
import pyarrow as pa

def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns a set of unique values for a list of columns.
    :param arrow_table: arrow table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
Here are my benchmarks using pandas to arrow to python vs. pandas.to_dict()

 
{code:python}
import time

# benchmark pandas conversion to python objects
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 1 million rows - " + str(total_time))

start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 4 million rows - " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 1 million rows - " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 4 million rows - " + str(total_time))
{code}

[jira] [Updated] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-22 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-6001:
-
Description: 
I noticed that pyarrow.Table.to_pydict() exists, but 
pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create 
one, but it doesn't take into account potential mismatches between column order 
and number of columns.

I'm including some code I've written which I've been using to handle arrow 
conversions to ordered dictionaries and lists of dictionaries. I've also 
included an example where this can be used to speed up pandas.to_dict() by a 
factor of 6x.

 
{code:python}
import pyarrow as pa

def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns a set of unique values for a list of columns.
    :param arrow_table: arrow table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
Here are my benchmarks using pandas to arrow to python vs. pandas.to_dict()

 
{code:python}
import time

# benchmark pandas conversion to python objects
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 1 million rows - " + str(total_time))

start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 4 million rows - " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 1 million rows - " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 4 million rows - " + str(total_time))
{code}

[jira] [Updated] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-22 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-6001:
-
Description: 
I noticed that pyarrow.Table.to_pydict() exists, but there is no 
pyarrow.Table.from_pydict(). There is a proposed ticket to create one, but it 
doesn't take into account potential mismatches between column order and number 
of columns.

I'm including some code I've written which I've been using to handle arrow 
conversions to ordered dictionaries and lists of dictionaries. I've also 
included an example where this can be used to speed up pandas.to_dict() by a 
factor of 6x.

 
{code:python}
import pyarrow as pa

def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns a set of unique values for a list of columns.
    :param arrow_table: arrow table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
Here are my benchmarks using pandas to arrow to python vs. pandas.to_dict()

 
{code:python}
import time

# benchmark pandas conversion to python objects
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 1 million rows - " + str(total_time))

start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 4 million rows - " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 1 million rows - " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 4 million rows - " + str(total_time))
{code}

[jira] [Commented] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-22 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890345#comment-16890345
 ] 

David Lee commented on ARROW-6001:
--

Current implementation

> Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve 
> pandas.to_dict()
> 
>
> Key: ARROW-6001
> URL: https://issues.apache.org/jira/browse/ARROW-6001
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: David Lee
>Priority: Minor
>
> I noticed that pyarrow.Table.to_pydict() exists, but there is no 
> pyarrow.Table.from_pydict(). There is a proposed ticket to 
> create one, but it doesn't take into account potential mismatches between 
> column order and number of columns.
> I've attached some code I've written which I've been using to handle arrow to 
> ordered dictionaries and arrow to lists of dictionaries. I've also included 
> an example where this can be used to speed up pandas.to_dict() by a factor of 
> 6x.
>  
> {code:python}
> import pyarrow as pa
> 
> def from_pylist(pylist, names=None, schema=None, safe=True):
>     """
>     Converts a python list of dictionaries to a pyarrow table
>     :param pylist: list of dictionaries
>     :param names: list of column names
>     :param schema: pyarrow schema
>     :param safe: True or False
>     :return: arrow table
>     """
>     arrow_columns = list()
>     if schema:
>         for column in schema.names:
>             arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
>                                           safe=safe, type=schema.types[schema.get_field_index(column)]))
>         arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
>     else:
>         for column in names:
>             arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>         arrow_table = pa.Table.from_arrays(arrow_columns, names)
>     return arrow_table
> 
> def to_pylist(arrow_table, index_columns=None):
>     """
>     Converts a pyarrow table to a python list of dictionaries
>     :param arrow_table: arrow table
>     :param index_columns: columns to index
>     :return: python list of dictionaries
>     """
>     pydict = arrow_table.to_pydict()
>     if index_columns:
>         columns = arrow_table.schema.names
>         columns.append("_index")
>         pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
>                    if column == '_index' else pydict[column][row]
>                    for column in columns} for row in range(arrow_table.num_rows)]
>     else:
>         pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
>                   for row in range(arrow_table.num_rows)]
>     return pylist
> 
> def from_pydict(pydict, names=None, schema=None, safe=True):
>     """
>     Converts a python ordered dictionary to a pyarrow table
>     :param pydict: ordered dictionary
>     :param names: list of column names
>     :param schema: pyarrow schema
>     :param safe: True or False
>     :return: arrow table
>     """
>     arrow_columns = list()
>     dict_columns = list(pydict.keys())
>     if schema:
>         for column in schema.names:
>             if column in pydict:
>                 arrow_columns.append(pa.array(pydict[column], safe=safe,
>                                               type=schema.types[schema.get_field_index(column)]))
>             else:
>                 arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
>                                               type=schema.types[schema.get_field_index(column)]))
>         arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
>     else:
>         if not names:
>             names = dict_columns
>         for column in names:
>             if column in dict_columns:
>                 arrow_columns.append(pa.array(pydict[column], safe=safe))
>             else:
>                 arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
>         arrow_table = pa.Table.from_arrays(arrow_columns, names)
>     return arrow_table
> 
> def get_indexed_values(arrow_table, index_columns):
>     """
>     Returns a set of unique values for a list of columns.
>     :param arrow_table: arrow table
>     :param index_columns: list of column names
>     :return: set of tuples
>     """
>     pydict = arrow_table.to_pydict()
>     index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
>                      for row in range(arrow_table.num_rows)])
>     return index_set
> {code}
> Here are my benchmarks using pandas to arrow to python vs. pandas.to_dict()
>  
> {code:python}
> # benchmark panda conversion to python objects.
> start_time = time.time()
> python_df1 = panda_df1.to_dict(orient='records')
> total_time = time.time() - start_time
> print("pandas to python - 1 million rows - " + str(total_time))
> 

[jira] [Updated] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-22 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-6001:
-
Description: 
I noticed that pyarrow.Table.to_pydict() exists, but there is no 
pyarrow.Table.from_pydict(). There is a proposed ticket to create one, but it 
doesn't take into account potential mismatches between column order and number 
of columns.

I've attached some code I've written which I've been using to handle arrow to 
ordered dictionaries and arrow to lists of dictionaries. I've also included an 
example where this can be used to speed up pandas.to_dict() by a factor of 6x.

 
{code:python}
import pyarrow as pa

def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns a set of unique values for a list of columns.
    :param arrow_table: arrow table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
Here are my benchmarks using pandas to arrow to python vs. pandas.to_dict()

 
{code:java}
# benchmark panda conversion to python objects.
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 1 million rows - " + str(total_time))

start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 4 million rows - " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in 
arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 1 million rows - " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in 
arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 4 million rows - " + str(total_time))
{code}
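The test DataFrames aren't shown above; a sketch of how comparable ones might be 
built (my assumption, not the original data):
{code:python}
import numpy as np
import pandas as pd

# hypothetical frames sized to match the benchmark labels
panda_df1 = pd.DataFrame({'id': np.arange(1000000),
                          'value': np.random.rand(1000000)})
panda_df4 = pd.concat([panda_df1] * 4, ignore_index=True)
{code}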

[jira] [Created] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()

2019-07-22 Thread David Lee (JIRA)
David Lee created ARROW-6001:


 Summary: Add from_pydict(), from_pylist() and to_pylist() to 
pyarrow.Table + improve pandas.to_dict()
 Key: ARROW-6001
 URL: https://issues.apache.org/jira/browse/ARROW-6001
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: David Lee


I noticed that pyarrow.Table.to_pydict() exists, but pyarrow.Table.from_pydict() 
doesn't. There is a proposed ticket to create one, but it doesn't take into account 
potential mismatches between column order and number of columns.

I've attached some code I've written which I've been using to convert between arrow 
tables and ordered dictionaries, and between arrow tables and lists of dictionaries. 
I've also included an example where this can be used to speed up pandas.to_dict() by 
a factor of 20x.

 
{code:java}
def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row]
                                  for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns}
                  for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row]
                   for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary of columns to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(
                    pydict[column], safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns a set of unique values for a list of columns.
    :param arrow_table: arrow_table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row]
                            for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
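A quick from_pydict() example for reference (my illustration; it assumes 
import pyarrow as pa and the functions above, with made-up data):
{code:python}
import pyarrow as pa
from collections import OrderedDict

columns = OrderedDict([('name', ['Tom', 'Mark']), ('age', [10, 5])])
schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string())   # absent from the dict, so it becomes all nulls
])
table = from_pydict(columns, schema=schema)
assert table.to_pydict()['city'] == [None, None]
{code}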
Here are my benchmarks using pandas to arrow to python vs. pandas.to_dict()

 
{code:java}
# benchmark panda conversion to python objects.
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 1 million rows - " + str(total_time))

start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 4 million rows - " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in 
arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 1 million rows - " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in 
arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 4 million rows - " + str(total_time))
{code}

[jira] [Commented] (ARROW-4814) [Python] Exception when writing nested columns that are tuples to parquet

2019-03-11 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789782#comment-16789782
 ] 

David Lee commented on ARROW-4814:
--

Same issue as https://issues.apache.org/jira/browse/ARROW-1644. I also don't 
think you can just write a tuple to parquet without defined names for each 
tuple element. A tuple doesn't really convert into the JSON schema model. 

[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
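For illustration, a workaround I'd expect to sidestep the inference error is 
converting the tuples to lists before handing the frame to pyarrow (my sketch, 
not something from this ticket):
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'ALTS': [('G',), ('T', 'A')]})
df['ALTS'] = df['ALTS'].apply(list)   # tuples -> lists
table = pa.Table.from_pandas(df)      # ALTS is inferred as list<string>
{code}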

 

> [Python] Exception when writing nested columns that are tuples to parquet
> -
>
> Key: ARROW-4814
> URL: https://issues.apache.org/jira/browse/ARROW-4814
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
> Environment: 4.20.8-100.fc28.x86_64
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: pandas, parquet
> Attachments: df_to_parquet_fail.py, test.csv
>
>
> I get an exception when I try to write a {{pandas.DataFrame}} to a parquet 
> file where one of the columns has tuples in them.  I use tuples here because 
> it allows for easier querying in pandas (see ARROW-3806 for a more detailed 
> description).
> {code}
> Traceback (most recent call last):
>   File "df_to_parquet_fail.py", line 5, in 
> df.to_parquet("test.parquet")  # crashes
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2203, in to_parquet  
>  
> partition_cols=partition_cols, **kwargs)
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", 
> line 252, in to_parquet   
>  
> partition_cols=partition_cols, **kwargs)
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", 
> line 113, in write
>  
> table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
>   File "pyarrow/table.pxi", line 1141, in pyarrow.lib.Table.from_pandas
>   File 
> "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 431, in dataframe_to_arrays  
>  
> convert_types)]
>   File 
> "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 430, in <listcomp>
>  
> for c, t in zip(columns_to_convert,
>   File 
> "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 426, in convert_column   
>  
> raise e
>   File 
> "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 420, in convert_column   
>  
> return pa.array(col, type=ty, from_pandas=True, safe=safe)
>   File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ("Could not convert ('G',) with type tuple: did not 
> recognize Python value type when inferring an Arrow data type", 'Conversion 
> failed for column ALTS with type object')
> {code}
> The issue maybe replicated with the attached script and csv file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-1644) [Python] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-03-05 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784864#comment-16784864
 ] 

David Lee edited comment on ARROW-1644 at 3/6/19 1:21 AM:
--

I've been able to write parquet columns which are lists, but I haven't been 
able to write a column which is a list of struct(s)

This works:
{code:java}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('a', pa.list_(pa.string())),
    pa.field('b', pa.list_(pa.int32()))
])
{code}
This structure isn't supported yet
{code:java}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist', pa.list_(pa.struct([('a', pa.string()),
                                             ('b', pa.int32())])))
])

new_records = list()
new_records.append({'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]})
new_records.append({'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]})

arrow_columns = list()

for column in schema.names:
    arrow_columns.append(pa.array([v[column] for v in new_records],
                         type=schema.types[schema.get_field_index(column)]))

arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)

arrow_table
arrow_table[0]
arrow_table[1]
arrow_table[1][0]
arrow_table[1][1]

>>> pq.write_table(arrow_table, "test.parquet")
Traceback (most recent call last):
packages/pyarrow/parquet.py", line 1160, in write_table
writer.write_table(table, row_group_size=row_group_size)
self.writer.write_table(table, row_group_size=row_group_size)
File "pyarrow/_parquet.pyx", line 924, in 
pyarrow._parquet.ParquetWriter.write_table
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

{code}
Supporting structs is the missing piece to being able to save structured JSON 
as columnar parquet which would make json searchable.


was (Author: davlee1...@yahoo.com):
I've been able to write parquet columns which are lists, but I haven't been 
able to write a column which is a list of struct(s)

This works:
{code:java}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('a', pa.list_(pa.string())),
    pa.field('b', pa.list_(pa.int32()))
])
{code}
This structure isn't supported yet
{code:java}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist', pa.list_(pa.struct([('a', pa.string()),
                                             ('b', pa.int32())])))
])

new_records = list()
new_records.append({'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]})
new_records.append({'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]})

arrow_columns = list()

for column in schema.names:
    arrow_columns.append(pa.array([v[column] for v in new_records],
                         type=schema.types[schema.get_field_index(column)]))

arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)

arrow_table
arrow_table[0]
arrow_table[1]
arrow_table[1][0]
arrow_table[1][1]

>>> pq.write_table(arrow_table, "test.parquet")
Traceback (most recent call last):
packages/pyarrow/parquet.py", line 1160, in write_table
writer.write_table(table, row_group_size=row_group_size)
File "/proj/pag/python/current/lib/python3.6/site-packages/pyarrow/parquet.py", 
line 405, in write_table
self.writer.write_table(table, row_group_size=row_group_size)
File "pyarrow/_parquet.pyx", line 924, in 
pyarrow._parquet.ParquetWriter.write_table
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

{code}
Supporting structs is the missing piece to being able to save structured JSON 
as columnar parquet which would make json searchable.

> [Python] Read and write nested Parquet data with a mix of struct and list 
> nesting levels
> 
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Joshua Storck
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", 

[jira] [Commented] (ARROW-1644) [Python] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-03-05 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784864#comment-16784864
 ] 

David Lee commented on ARROW-1644:
--

I've been able to write parquet columns which are lists, but I haven't been 
able to write a column which is a list of struct(s)

This works:
{code:java}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('a', pa.list_(pa.string())),
    pa.field('b', pa.list_(pa.int32()))
])
{code}
This structure isn't supported yet
{code:java}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist', pa.list_(pa.struct([('a', pa.string()),
                                             ('b', pa.int32())])))
])

new_records = list()
new_records.append({'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]})
new_records.append({'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]})

arrow_columns = list()

for column in schema.names:
    arrow_columns.append(pa.array([v[column] for v in new_records],
                         type=schema.types[schema.get_field_index(column)]))

arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)

arrow_table
arrow_table[0]
arrow_table[1]
arrow_table[1][0]
arrow_table[1][1]

>>> pq.write_table(arrow_table, "test.parquet")
Traceback (most recent call last):
packages/pyarrow/parquet.py", line 1160, in write_table
writer.write_table(table, row_group_size=row_group_size)
File "/proj/pag/python/current/lib/python3.6/site-packages/pyarrow/parquet.py", 
line 405, in write_table
self.writer.write_table(table, row_group_size=row_group_size)
File "pyarrow/_parquet.pyx", line 924, in 
pyarrow._parquet.ParquetWriter.write_table
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children

{code}
Supporting structs is the missing piece to being able to save structured JSON 
as columnar parquet which would make json searchable.
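Until structs land, a stopgap that does write today is splitting the struct fields 
into parallel list columns (my workaround, not part of this ticket):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist_a', pa.list_(pa.string())),
    pa.field('testlist_b', pa.list_(pa.int32()))
])

new_records = [
    {'test_id': '123', 'testlist_a': ['xyz'], 'testlist_b': [22]},
    {'test_id': '789', 'testlist_a': ['aaa'], 'testlist_b': [33]}
]

# build one array per column, then write as plain list columns
arrow_columns = [pa.array([v[column] for v in new_records],
                          type=schema.types[schema.get_field_index(column)])
                 for column in schema.names]
pq.write_table(pa.Table.from_arrays(arrow_columns, schema.names),
               "test.parquet")
{code}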

> [Python] Read and write nested Parquet data with a mix of struct and list 
> nesting levels
> 
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Joshua Storck
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()

2019-01-09 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738614#comment-16738614
 ] 

David Lee commented on ARROW-4032:
--

Tests: With and Without safe=False

{code:python}
my_list = [
    {'a': 'one', 'b': 1},
    {'a': 'two', 'b': 2},
    {'a': 'three', 'b': 3},
    {'a': 'missing', 'b': None}
]

schema = pa.schema([
    pa.field('a', pa.string()),
    pa.field('b', pa.int16())
])

arrow_table = from_pylist(my_list, schema=schema)
arrow_table2 = pa.Table.from_pandas(pd.DataFrame(my_list), preserve_index=False)
arrow_table3 = pa.Table.from_pandas(pd.DataFrame(my_list), schema=schema,
                                    preserve_index=False, safe=False)

>>> arrow_table.schema
a: string
b: int16

>>> arrow_table2.schema
a: string
b: double
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [], "column_indexes": [], "columns": [{"na'
  b'me": "a", "field_name": "a", "pandas_type": "unicode", "nump'
  b'y_type": "object", "metadata": null}, {"name": "b", "field_n'
  b'ame": "b", "pandas_type": "float64", "numpy_type": "float64"'
  b', "metadata": null}], "pandas_version": "0.23.4"}')])

>>> arrow_table3.schema
a: string
b: int16
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [], "column_indexes": [], "columns": [{"na'
  b'me": "a", "field_name": "a", "pandas_type": "unicode", "nump'
  b'y_type": "object", "metadata": null}, {"name": "b", "field_n'
  b'ame": "b", "pandas_type": "int16", "numpy_type": "float64", '
  b'"metadata": null}], "pandas_version": "0.23.4"}')])

{code}
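The b: double in arrow_table2 above comes from pandas, not arrow: a null in an 
integer column promotes it to float64. A one-liner to see it (my illustration):
{code:python}
import pandas as pd

print(pd.DataFrame([{'b': 1}, {'b': None}])['b'].dtype)   # float64
{code}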

> [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and 
> to_pylist()
> --
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = pa.Table.from_arrays(arrow_columns, columns)
> if schema:
> arrow_table = arrow_table.cast(schema, safe=safe)
> return arrow_table
> test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', 
> 'dummy'])
> test_schema = pa.schema([
> pa.field('name', pa.string()),
> pa.field('age', pa.int16()),
> pa.field('city', pa.string()),
> pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()

2019-01-09 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738590#comment-16738590
 ] 

David Lee edited comment on ARROW-4032 at 1/9/19 7:42 PM:
--

Been testing this internally and haven't seen any problems or performance 
issues.. Removed all my pyarrow  <> pandas code so I don't have to deal with 
all the numpy problems with types and NULL support.

 
{code:python}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pydict = arrow_table.to_pydict()
    pylist = [{column: pydict[column][row]
               for column in arrow_table.schema.names}
              for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(
                    pydict[column], safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_table_keys(arrow_table, key_columns):
    pydict = arrow_table.to_pydict()
    keys_set = set([tuple([pydict[key_column][row]
                           for key_column in key_columns])
                    for row in range(arrow_table.num_rows)])
    return keys_set

 {code}
 


was (Author: davlee1...@yahoo.com):
Been testing this internally and haven't seen any problems or performance 
issues.. Removed all my pyarrow  <> pandas code so I don't have to deal with 
all the numpy problems with types and NULL support.

 
{code}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pydict = arrow_table.to_pydict()
    pylist = [{column: pydict[column][row]
               for column in arrow_table.schema.names}
              for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(
                    pydict[column], safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_table_keys(arrow_table, key_columns):
    pydict = arrow_table.to_pydict()
    keys_set = set([tuple([pydict[key_column][row]
                           for key_column in key_columns])
                    for row in range(arrow_table.num_rows)])
    return keys_set
{code}
 

> [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and 
> to_pylist()
> --
>
> 

[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()

2019-01-09 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738590#comment-16738590
 ] 

David Lee edited comment on ARROW-4032 at 1/9/19 7:40 PM:
--

Been testing this internally and haven't seen any problems or performance 
issues.. Removed all my pyarrow  <> pandas code so I don't have to deal with 
all the numpy problems with types and NULL support.

 
{code}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pydict = arrow_table.to_pydict()
    pylist = [{column: pydict[column][row]
               for column in arrow_table.schema.names}
              for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(
                    pydict[column], safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_table_keys(arrow_table, key_columns):
    pydict = arrow_table.to_pydict()
    keys_set = set([tuple([pydict[key_column][row]
                           for key_column in key_columns])
                    for row in range(arrow_table.num_rows)])
    return keys_set
{code}
 


was (Author: davlee1...@yahoo.com):
Been testing this internally and haven't seen any problems or performance 
issues.. Removed all my pyarrow  <> pandas code so I don't have to deal with 
all the numpy problems with types and NULL support.

 
{code:python}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pydict = arrow_table.to_pydict()
    pylist = [{column: pydict[column][row]
               for column in arrow_table.schema.names}
              for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(
                    pydict[column], safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_table_keys(arrow_table, key_columns):
    pydict = arrow_table.to_pydict()
    keys_set = set([tuple([pydict[key_column][row]
                           for key_column in key_columns])
                    for row in range(arrow_table.num_rows)])
    return keys_set
{code}
 

> [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and 
> to_pylist()
> --
>
> 

[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()

2019-01-09 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738590#comment-16738590
 ] 

David Lee commented on ARROW-4032:
--

Been testing this internally and haven't seen any problems or performance 
issues.. Removed all my pyarrow  <> pandas code so I don't have to deal with 
all the numpy problems with types and NULL support.

 
{code:python}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pydict = arrow_table.to_pydict()
    pylist = [{column: pydict[column][row]
               for column in arrow_table.schema.names}
              for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(
                    pydict[column], safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_table_keys(arrow_table, key_columns):
    pydict = arrow_table.to_pydict()
    keys_set = set([tuple([pydict[key_column][row]
                           for key_column in key_columns])
                    for row in range(arrow_table.num_rows)])
    return keys_set
{code}
 

> [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and 
> to_pylist()
> --
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = pa.Table.from_arrays(arrow_columns, columns)
> if schema:
> arrow_table = arrow_table.cast(schema, safe=safe)
> return arrow_table
> test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', 
> 'dummy'])
> test_schema = pa.schema([
> pa.field('name', pa.string()),
> pa.field('age', pa.int16()),
> pa.field('city', pa.string()),
> pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()

2018-12-17 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee edited comment on ARROW-4032 at 12/17/18 7:44 PM:


Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas.. 
{code}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    for row in range(arrow_table.num_rows):
        pylist.append({arrow_table.schema.names[i]: arrow_table[i][row]
                       for i in range(arrow_table.num_columns)})
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(
                    pydict[column], safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe,
                    type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array(
                    [None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table
{code}


was (Author: davlee1...@yahoo.com):
Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    for row in range(arrow_table.num_rows):
        pylist.append({arrow_table.schema.names[i]: arrow_table[i][row]
                       for i in range(arrow_table.num_columns)})
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_names = list()
    arrow_columns = list()
    for column, values in pydict.items():
        arrow_names.append(column)
        arrow_columns.append(pa.array(values))
    arrow_table = pa.Table.from_arrays(arrow_columns, arrow_names)
    return arrow_table
{code}

> [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and 
> to_pylist()
> --
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not 

[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()

2018-12-17 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723259#comment-16723259
 ] 

David Lee commented on ARROW-4032:
--

I'll see if I can do a git pull and submit a change..

> [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and 
> to_pylist()
> --
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = pa.Table.from_arrays(arrow_columns, columns)
> if schema:
> arrow_table = arrow_table.cast(schema, safe=safe)
> return arrow_table
> test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', 
> 'dummy'])
> test_schema = pa.schema([
> pa.field('name', pa.string()),
> pa.field('age', pa.int16()),
> pa.field('city', pa.string()),
> pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()

2018-12-17 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Summary: [Python] New pyarrow.Table functions: from_pydict(), from_pylist() 
and to_pylist()  (was: [Python] New pyarrow.Table.from_pylist() function)

> [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and 
> to_pylist()
> --
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = pa.Table.from_arrays(arrow_columns, columns)
> if schema:
> arrow_table = arrow_table.cast(schema, safe=safe)
> return arrow_table
> test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', 
> 'dummy'])
> test_schema = pa.schema([
> pa.field('name', pa.string()),
> pa.field('age', pa.int16()),
> pa.field('city', pa.string()),
> pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-17 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee edited comment on ARROW-4032 at 12/17/18 6:28 PM:


Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    for row in range(arrow_table.num_rows):
        pylist.append({arrow_table.schema.names[i]: arrow_table[i][row]
                       for i in range(arrow_table.num_columns)})
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_names = list()
    arrow_columns = list()
    for column, values in pydict.items():
        arrow_names.append(column)
        arrow_columns.append(pa.array(values))
    arrow_table = pa.Table.from_arrays(arrow_columns, arrow_names)
    return arrow_table
{code}
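One caveat with this simpler from_pydict() (my note): it ignores the names, schema 
and safe arguments and infers every type through pa.array(), so each value list 
must be the same length and the types fall out of the data, e.g.:
{code:python}
table = from_pydict({'a': ['x', 'y'], 'b': [1, 2]})   # b is inferred as int64
{code}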


was (Author: davlee1...@yahoo.com):
Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    for row in range(arrow_table.num_rows):
        pylist.append({arrow_table.schema.names[i]: arrow_table[i][row]
                       for i in range(arrow_table.num_columns)})
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_names = list()
    arrow_columns = list()
    for column, values in pydict.items():
        arrow_names.append(column)
        arrow_columns.append(pa.array(values))
    arrow_table = pa.Table.from_arrays(arrow_columns, arrow_names)
    return arrow_table
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = pa.Table.from_arrays(arrow_columns, columns)
> if schema:
> arrow_table = arrow_table.cast(schema, safe=safe)
> return arrow_table
> test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', 
> 'dummy'])
> test_schema = pa.schema([
> pa.field('name', pa.string()),
> pa.field('age', pa.int16()),
> pa.field('city', pa.string()),
> pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-17 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee edited comment on ARROW-4032 at 12/17/18 6:28 PM:


Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array(
                [v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    for row in range(arrow_table.num_rows):
        pylist.append({arrow_table.schema.names[i]: arrow_table[i][row]
                       for i in range(arrow_table.num_columns)})
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_names = list()
    arrow_columns = list()
    for column, values in pydict.items():
        arrow_names.append(column)
        arrow_columns.append(pa.array(values))
    arrow_table = pa.Table.from_arrays(arrow_columns, arrow_names)
    return arrow_table
{code}


was (Author: davlee1...@yahoo.com):
Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array(
            [v[column] if column in v else None for v in pylist],
            safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    columns = arrow_table.schema.names
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = pa.Table.from_arrays(arrow_columns, columns)
> if schema:
> arrow_table = arrow_table.cast(schema, safe=safe)
> return arrow_table
> test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', 
> 'dummy'])
> test_schema = pa.schema([
> pa.field('name', pa.string()),
> pa.field('age', pa.int16()),
> pa.field('city', pa.string()),
> pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-17 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee edited comment on ARROW-4032 at 12/17/18 3:53 PM:


Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array(
            [v[column] if column in v else None for v in pylist],
            safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    columns = arrow_table.schema.names
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}


was (Author: davlee1...@yahoo.com):
Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array(
            [v[column] if column in v else None for v in pylist],
            safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = arrow_table.schema.names
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = pa.Table.from_arrays(arrow_columns, columns)
> if schema:
> arrow_table = arrow_table.cast(schema, safe=safe)
> return arrow_table
> test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', 
> 'dummy'])
> test_schema = pa.schema([
> pa.field('name', pa.string()),
> pa.field('age', pa.int16()),
> pa.field('city', pa.string()),
> pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-14 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee edited comment on ARROW-4032 at 12/15/18 3:58 AM:


Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array(
            [v[column] if column in v else None for v in pylist],
            safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = arrow_table.schema.names
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}


was (Author: davlee1...@yahoo.com):
Ended up just writing from_pylist() and to_pylist(). They run much faster than 
going through pandas.
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-14 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee commented on ARROW-4032:
--

Ended up just writing from_pylist() and to_pylist(). They run much faster than 
going through pandas.
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-14 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee edited comment on ARROW-4032 at 12/15/18 3:53 AM:


Ended up just writing from_pylist() and to_pylist(). They run much faster than 
going through pandas.
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}


was (Author: davlee1...@yahoo.com):
Ended up just writing from_pylist() and to_pylist(). They run much faster than 
going through pandas.
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Summary: [Python] New pyarrow.Table.from_pylist() function  (was: [Python] 
New pyarrow.Table.from_pydict() function)

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pylist(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pylist(test_list, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(test_list, schema=test_schema)
{code}


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = 

[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(test_list, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pydict(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = 

[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> pylist = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pydict(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if 

[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> pylist = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pydict(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return 

[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721745#comment-16721745
 ] 

David Lee commented on ARROW-4032:
--

Updated the sample code to include Schema and Safe options.

Passing in a schema will allow conversions from microseconds to milliseconds.
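
A minimal sketch of that coercion path, reusing the from_pylist() from the sample code below (hypothetical rows; the truncation happens in the arrow_table.cast(schema, safe=safe) step):
{code:java}
import pyarrow as pa
from datetime import datetime

rows = [{"event": "x", "ts": datetime.now()}]  # datetime.now() carries microseconds

ms_schema = pa.schema([
    pa.field('event', pa.string()),
    pa.field('ts', pa.timestamp('ms'))
])

# safe=False permits the lossy microsecond -> millisecond cast
table = from_pylist(rows, schema=ms_schema, safe=False)
{code}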

> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> pylist = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pydict(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pydict(pylist, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])

{code}
Additional work would be needed to pass in a schema object if you want to 
refine data types further. I think the existing schema-handling code from 
from_pandas() would work for that.


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> pylist = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pydict(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pydict(pylist, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])

{code}
 

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(pylist, ['name', 'age', 'city', 'birthday', 'dummy'])

{code}
 


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": datetime.now()}
> ]
>
> def from_pydict(pylist, columns):
>     arrow_columns = list()
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     return arrow_table
>
> test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])

{code}
Additional work would be needed to pass in a schema object if you want to 
refine data types further. I think the existing schema-handling code from 
from_pandas() would work for that.

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])

{code}
 


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": datetime.now()}
> ]
>
> def from_pydict(pylist, columns):
>     arrow_columns = list()
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     return arrow_table
>
> test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])
> {code}
> Additional work would be needed to pass in a schema object if you want to 
> refine data types further. I think the existing schema-handling code from 
> from_pandas() would work for that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)
David Lee created ARROW-4032:


 Summary: [Python] New pyarrow.Table.from_pydict() function
 Key: ARROW-4032
 URL: https://issues.apache.org/jira/browse/ARROW-4032
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: David Lee


Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(pylist, ['name', 'age', 'city', 'birthday', 'dummy'])

{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps

2018-12-13 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720638#comment-16720638
 ] 

David Lee commented on ARROW-3907:
--

Yeah, I'm trying to figure out the best way to preserve INTs when converting 
JSON to Parquet.

The problem is more or less summarized here.
[https://pandas.pydata.org/pandas-docs/stable/gotchas.html]

There are a lot of gotchas with each step.

json.loads() works fine.

pandas.DataFrame() is a problem if every record doesn't contain the same 
columns.

Using pandas.DataFrame.reindex() to add missing columns adds a bunch of NaN 
values.

Adding NaN values will force change a column's dtype from INT64 to FLOAT64.

NaNs are a problem to begin with because if you convert them to Parquet you end 
up with zeros instead of nulls.

Running pandas.DataFrame.reindex(fill_value=None) doesn't work because passing 
in None is equal to pandas.DataFrame.reindex() without any params.

The only way to replace NaNs with None is with pandas.DataFrame.where().

After replacing NaNs you can then change the dtype of the column from FLOAT64 
back to INT64.

It's basically a lot of hoops to go through to preserve your original JSON INT 
as a Parquet INT.

Maybe the best solution is to create a pyarrow.Table.from_pydict() function to 
create an Arrow table from a python dictionary. We have this gap with 
pyarrow.Table.to_pydict(), pyarrow.Table.to_pandas() and 
pyarrow.Table.from_pandas().
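
A condensed sketch of those hoops (hypothetical records; the step order follows the gotchas above):
{code:java}
import json
import pandas as pd

records = json.loads('[{"a": 1, "b": "x"}, {"b": "y"}]')  # 'a' is missing from the second record

df = pd.DataFrame(records).reindex(columns=['a', 'b'])  # missing cells become NaN; 'a' is promoted to float64

# swap NaN for None so Parquet gets nulls instead of zeros
df = df.astype(object).where(df.notnull(), None)

# walk the surviving floats back to ints
df['a'] = df['a'].map(lambda v: int(v) if v is not None else None)
{code}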

> [Python] from_pandas errors when schemas are used with lower resolution 
> timestamps
> --
>
> Key: ARROW-3907
> URL: https://issues.apache.org/jira/browse/ARROW-3907
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
> Fix For: 0.11.1
>
>
> When passing in a schema object to from_pandas a resolution error occurs if 
> the schema uses a lower resolution timestamp. Do we need to also add 
> "coerce_timestamps" and "allow_truncated_timestamps" parameters found in 
> write_table() to from_pandas()?
> Error:
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would 
> lose data: 1532015191753713000', 'Conversion failed for column modified with 
> type datetime64[ns]')
> Code:
>  
> {code:java}
> processed_schema = pa.schema([
>     pa.field('Id', pa.string()),
>     pa.field('modified', pa.timestamp('ms')),
>     pa.field('records', pa.int32())
> ])
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
>  
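
For reference, a sketch of the write_table() route the quoted description asks about (hypothetical file name; df is as in the snippet above; coerce_timestamps and allow_truncated_timestamps are the write_table() parameters mentioned there):
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

# keep the nanosecond data in the table and truncate only at write time
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, 'data.parquet',
               coerce_timestamps='ms',
               allow_truncated_timestamps=True)
{code}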



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4

2018-12-10 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715888#comment-16715888
 ] 

David Lee commented on ARROW-3992:
--

Ok this worked. The instructions are missing one line after conda create:

conda activate pyarrow-dev

> pyarrow compile from source issues on RedHat 7.4
> 
>
> Key: ARROW-3992
> URL: https://issues.apache.org/jira/browse/ARROW-3992
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: David Lee
>Priority: Minor
>
> Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
> running into the same problems with RedHat 7.4.
> [https://arrow.apache.org/docs/python/development.html#development]
> Additional steps taken:
> Added double-conversion, glog and hypothesis: 
> {code:java}
> conda create -y -q -n pyarrow-dev \
> python=3.6 numpy six setuptools cython pandas pytest double-conversion \
> cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
> gflags brotli jemalloc lz4-c zstd -c conda-forge
> {code}
>  
> Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: 
> {code:java}
> export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
> py.test pyarrow
> {code}
>  
> Added extra symlinks with a period at the end to fix string concatenation 
> issues. Running setup.py for the first time didn't need this, but running 
> setup.py a second time would error out with:
> {code:java}
> CMake Error: File 
> /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. does not exist.
> {code}
>  
> There is an extra period at the end of the *.so files so I had to make 
> symlinks with extra periods. 
> {code:java}
> ln -s libparquet.so.12.0.0 libparquet.so.
> ln -s libplasma.so.12.0.0 libplasma.so.
> ln -s libarrow.so.12.0.0 libarrow.so.
> ln -s libarrow_python.so.12.0.0 libarrow_python.so.
> {code}
>  
> Creating a wheel file using --with-plasma gives the following error: 
> {code:java}
> error: [Errno 2] No such file or directory: 'release/plasma_store_server'
> {code}
> Had to create the wheel file without plasma, but it isn't packaged correctly. 
> The hacked symlinked shared libs are included instead of libarrow.so.12
> {code:java}
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4

2018-12-10 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee closed ARROW-3992.


Updated instructions work

> pyarrow compile from source issues on RedHat 7.4
> 
>
> Key: ARROW-3992
> URL: https://issues.apache.org/jira/browse/ARROW-3992
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: David Lee
>Priority: Minor
>
> Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
> running into the same problems with RedHat 7.4.
> [https://arrow.apache.org/docs/python/development.html#development]
> Additional steps taken:
> Added double-conversion, glog and hypothesis: 
> {code:java}
> conda create -y -q -n pyarrow-dev \
> python=3.6 numpy six setuptools cython pandas pytest double-conversion \
> cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
> gflags brotli jemalloc lz4-c zstd -c conda-forge
> {code}
>  
> Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: 
> {code:java}
> export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
> py.test pyarrow
> {code}
>  
> Added extra symlinks with a period at the end to fix string concatenation 
> issues. Running setup.py for the first time didn't need this, but running 
> setup.py a second time would error out with:
> {code:java}
> CMake Error: File 
> /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. does not exist.
> {code}
>  
> There is an extra period at the end of the *.so files so I had to make 
> symlinks with extra periods. 
> {code:java}
> ln -s libparquet.so.12.0.0 libparquet.so.
> ln -s libplasma.so.12.0.0 libplasma.so.
> ln -s libarrow.so.12.0.0 libarrow.so.
> ln -s libarrow_python.so.12.0.0 libarrow_python.so.
> {code}
>  
> Creating a wheel file using --with-plasma gives the following error: 
> {code:java}
> error: [Errno 2] No such file or directory: 'release/plasma_store_server'
> {code}
> Had to create the wheel file without plasma, but it isn't packaged correctly. 
> The hacked symlinked shared libs are included instead of libarrow.so.12
> {code:java}
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4

2018-12-10 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee resolved ARROW-3992.
--
Resolution: Not A Problem

Updated install instructions work

> pyarrow compile from source issues on RedHat 7.4
> 
>
> Key: ARROW-3992
> URL: https://issues.apache.org/jira/browse/ARROW-3992
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: David Lee
>Priority: Minor
>
> Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
> running into the same problems with RedHat 7.4.
> [https://arrow.apache.org/docs/python/development.html#development]
> Additional steps taken:
> Added double-conversion, glog and hypothesis: 
> {code:java}
> conda create -y -q -n pyarrow-dev \
> python=3.6 numpy six setuptools cython pandas pytest double-conversion \
> cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
> gflags brotli jemalloc lz4-c zstd -c conda-forge
> {code}
>  
> Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: 
> {code:java}
> export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
> py.test pyarrow
> {code}
>  
> Added extra symlinks with a period at the end to fix string concatenation 
> issues. Running setup.py for the first time didn't need this, but running 
> setup.py a second time would error out with:
> {code:java}
> CMake Error: File 
> /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. does not exist.
> {code}
>  
> There is an extra period at the end of the *.so files so I had to make 
> symlinks with extra periods. 
> {code:java}
> ln -s libparquet.so.12.0.0 libparquet.so.
> ln -s libplasma.so.12.0.0 libplasma.so.
> ln -s libarrow.so.12.0.0 libarrow.so.
> ln -s libarrow_python.so.12.0.0 libarrow_python.so.
> {code}
>  
> Creating a wheel file using --with-plasma gives the following error: 
> {code:java}
> error: [Errno 2] No such file or directory: 'release/plasma_store_server'
> {code}
> Had to create the wheel file without plasma, but it isn't packaged correctly. 
> The hacked symlinked shared libs are included instead of libarrow.so.12
> {code:java}
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so. -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so -> 
> build/bdist.linux-x86_64/wheel/pyarrow
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4

2018-12-10 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-3992:
-
Description: 
Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
running into the same problems with RedHat 7.4.

[https://arrow.apache.org/docs/python/development.html#development]

Additional steps taken:

Added double-conversion, glog and hypothesis: 
{code:java}
conda create -y -q -n pyarrow-dev \
python=3.6 numpy six setuptools cython pandas pytest double-conversion \
cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
gflags brotli jemalloc lz4-c zstd -c conda-forge
{code}
 

Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: 
{code:java}
export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
py.test pyarrow
{code}
 

Added extra symlinks with a period at the end to fix string concatenation 
issues. Running setup.py for the first time didn't need this, but running 
setup.py a second time would error out with:
{code:java}
CMake Error: File /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. 
does not exist.
{code}
 

There is an extra period at the end of the *.so files so I had to make symlinks 
with extra periods. 
{code:java}
ln -s libparquet.so.12.0.0 libparquet.so.
ln -s libplasma.so.12.0.0 libplasma.so.
ln -s libarrow.so.12.0.0 libarrow.so.
ln -s libarrow_python.so.12.0.0 libarrow_python.so.
{code}
 

Creating a wheel file using --with-plasma gives the following error: 
{code:java}
error: [Errno 2] No such file or directory: 'release/plasma_store_server'
{code}
Had to create the wheel file without plasma, but it isn't packaged correctly. 
The hacked symlinked shared libs are included instead of libarrow.so.12
{code:java}
copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so. -> 
build/bdist.linux-x86_64/wheel/pyarrow
copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so -> 
build/bdist.linux-x86_64/wheel/pyarrow
copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so. -> 
build/bdist.linux-x86_64/wheel/pyarrow
copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so -> 
build/bdist.linux-x86_64/wheel/pyarrow
copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so. -> 
build/bdist.linux-x86_64/wheel/pyarrow
copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so -> 
build/bdist.linux-x86_64/wheel/pyarrow

{code}
 

  was:
Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
running into the same problems with RedHat 7.4.

[https://arrow.apache.org/docs/python/development.html#development]

Additional steps taken:

Added double-conversion, glog and hypothesis: 
{code:java}
conda create -y -q -n pyarrow-dev \
python=3.6 numpy six setuptools cython pandas pytest double-conversion \
cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
gflags brotli jemalloc lz4-c zstd -c conda-forge
{code}
 

Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: 
{code:java}
export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
py.test pyarrow
{code}
 

Added extra symlinks with a period at the end to fix string concatenation 
issues. Running setup.py for the first time didn't need this, but running 
setup.py a second time would error out with:
{code:java}
CMake Error: File /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. 
does not exist.
{code}
 

There is an extra period at the end of the *.so files so I had to make symlinks 
with extra periods. 
{code:java}
ln -s libparquet.so.12.0.0 libparquet.so.
ln -s libplasma.so.12.0.0 libplasma.so.
ln -s libarrow.so.12.0.0 libarrow.so.
ln -s libarrow_python.so.12.0.0 libarrow_python.so.
{code}
 

Creating a wheel file using --with-plasma gives the following error: 
{code:java}
error: [Errno 2] No such file or directory: 'release/plasma_store_server'
{code}
Had to create the wheel file without plasma..


> pyarrow compile from source issues on RedHat 7.4
> 
>
> Key: ARROW-3992
> URL: https://issues.apache.org/jira/browse/ARROW-3992
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: David Lee
>Priority: Minor
>
> Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
> running into the same problems with RedHat 7.4.
> [https://arrow.apache.org/docs/python/development.html#development]
> Additional steps taken:
> Added double-conversion, glog and hypothesis: 
> {code:java}
> conda create -y -q -n pyarrow-dev \
> python=3.6 numpy six setuptools cython pandas pytest double-conversion \
> cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
> gflags brotli jemalloc lz4-c zstd -c conda-forge
> {code}
>  
> Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: 
> {code:java}
> export 

[jira] [Updated] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4

2018-12-10 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-3992:
-
Description: 
Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
running into the same problems with RedHat 7.4.

[https://arrow.apache.org/docs/python/development.html#development]

Additional steps taken:

Added double-conversion, glog and hypothesis: 
{code:java}
conda create -y -q -n pyarrow-dev \
python=3.6 numpy six setuptools cython pandas pytest double-conversion \
cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
gflags brotli jemalloc lz4-c zstd -c conda-forge
{code}
 

Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: 
{code:java}
export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
py.test pyarrow
{code}
 

Added extra symlinks with a period at the end to fix string concatenation 
issues. Running setup.py for the first time didn't need this, but running 
setup.py a second time would error out with:
{code:java}
CMake Error: File /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. 
does not exist.
{code}
 

There is an extra period at the end of the *.so files so I had to make symlinks 
with extra periods. 
{code:java}
ln -s libparquet.so.12.0.0 libparquet.so.
ln -s libplasma.so.12.0.0 libplasma.so.
ln -s libarrow.so.12.0.0 libarrow.so.
ln -s libarrow_python.so.12.0.0 libarrow_python.so.
{code}
 

Creating a wheel file using --with-plasma gives the following error: 
{code:java}
error: [Errno 2] No such file or directory: 'release/plasma_store_server'
{code}
Had to create the wheel file without plasma..

  was:
Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
running into the same problems with RedHat 7.4.

[https://arrow.apache.org/docs/python/development.html#development]

Additional steps taken:

Added double-conversion, glog and hypothesis:

 
{code:java}
conda create -y -q -n pyarrow-dev \
python=3.6 numpy six setuptools cython pandas pytest double-conversion \
cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
gflags brotli jemalloc lz4-c zstd -c conda-forge
{code}
Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow:

 

 
{code:java}
export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
py.test pyarrow
{code}
 

Added extra symlinks with a period at the end to fix string concatenation 
issues. Running setup.py for the first time didn't need this, but running 
setup.py a second time would error out with:
{code:java}
CMake Error: File /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. 
does not exist.
{code}
There is an extra period at the end of the *.so files so I had to make symlinks 
with extra periods.

 
{code:java}
ln -s libparquet.so.12.0.0 libparquet.so.
ln -s libplasma.so.12.0.0 libplasma.so.
ln -s libarrow.so.12.0.0 libarrow.so.
ln -s libarrow_python.so.12.0.0 libarrow_python.so.
{code}
Creating a wheel file using --with-plasma gives the following error:

 
{code:java}
error: [Errno 2] No such file or directory: 'release/plasma_store_server'
{code}
Had to create the wheel file without plasma..


> pyarrow compile from source issues on RedHat 7.4
> 
>
> Key: ARROW-3992
> URL: https://issues.apache.org/jira/browse/ARROW-3992
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: David Lee
>Priority: Minor
>
> Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
> running into the same problems with RedHat 7.4.
> [https://arrow.apache.org/docs/python/development.html#development]
> Additional steps taken:
> Added double-conversion, glog and hypothesis: 
> {code:java}
> conda create -y -q -n pyarrow-dev \
> python=3.6 numpy six setuptools cython pandas pytest double-conversion \
> cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
> gflags brotli jemalloc lz4-c zstd -c conda-forge
> {code}
>  
> Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: 
> {code:java}
> export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
> py.test pyarrow
> {code}
>  
> Added extra symlinks with a period at the end to fix string concatenation 
> issues. Running setup.py for the first time didn't need this, but running 
> setup.py a second time would error out with:
> {code:java}
> CMake Error: File 
> /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. does not exist.
> {code}
>  
> There is an extra period at the end of the *.so files so I had to make 
> symlinks with extra periods. 
> {code:java}
> ln -s libparquet.so.12.0.0 libparquet.so.
> ln -s libplasma.so.12.0.0 libplasma.so.
> ln -s libarrow.so.12.0.0 libarrow.so.
> ln -s libarrow_python.so.12.0.0 libarrow_python.so.
> {code}
>  
> 

[jira] [Created] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4

2018-12-10 Thread David Lee (JIRA)
David Lee created ARROW-3992:


 Summary: pyarrow compile from source issues on RedHat 7.4
 Key: ARROW-3992
 URL: https://issues.apache.org/jira/browse/ARROW-3992
 Project: Apache Arrow
  Issue Type: Bug
Reporter: David Lee


Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after 
running into the same problems with RedHat 7.4.

[https://arrow.apache.org/docs/python/development.html#development]

Additional steps taken:

Added double-conversion, glog and hypothesis:

 
{code:java}
conda create -y -q -n pyarrow-dev \
python=3.6 numpy six setuptools cython pandas pytest double-conversion \
cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
gflags brotli jemalloc lz4-c zstd -c conda-forge
{code}
Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow:

 

 
{code:java}
export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
py.test pyarrow
{code}
 

Added extra symlinks with a period at the end to fix string concatenation 
issues. Running setup.py for the first time didn't need this, but running 
setup.py a second time would error out with:
{code:java}
CMake Error: File /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. 
does not exist.
{code}
There is an extra period at the end of the *.so files so I had to make symlinks 
with extra periods.

 
{code:java}
ln -s libparquet.so.12.0.0 libparquet.so.
ln -s libplasma.so.12.0.0 libplasma.so.
ln -s libarrow.so.12.0.0 libarrow.so.
ln -s libarrow_python.so.12.0.0 libarrow_python.so.
{code}
Creating a wheel file using --with-plasma gives the following error:

 
{code:java}
error: [Errno 2] No such file or directory: 'release/plasma_store_server'
{code}
Had to create the wheel file without plasma..



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps

2018-12-07 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713374#comment-16713374
 ] 

David Lee commented on ARROW-3918:
--

Fixed in Master..

https://github.com/apache/arrow/commit/10b204ec2532d8e30be157bcfd3af53d41f42ffb

> [Python] ParquetWriter.write_table doesn't support coerce_timestamps or 
> allow_truncated_timestamps
> --
>
> Key: ARROW-3918
> URL: https://issues.apache.org/jira/browse/ARROW-3918
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
>
> Error: Table Schema does not match schema used to create file.
> The 0.11.1 release added these parameters to pyarrow.parquet.write_table(), 
> but they are missing from pyarrow.parquet.ParquetWriter.write_table().. I'm 
> seeing mismatches between the table schema and the file schema, but they are 
> identical in the error message with modified: timestamp[ms] column types in 
> both schemas. The only thing which looks odd is the Pandas metadata that has 
> a modified column with a panda datatype of datetime and a numpy datatype of 
> datetime64[ns]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3956) [Python] ParquetWriter.write_table isn't working

2018-12-07 Thread David Lee (JIRA)
David Lee created ARROW-3956:


 Summary: [Python] ParquetWriter.write_table isn't working
 Key: ARROW-3956
 URL: https://issues.apache.org/jira/browse/ARROW-3956
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: David Lee


ParquetWriter.write_table errors out saying the table schema does not match the 
file schema, but they do match.

 

Error:
{code:java}
>>> writer.write_table(arrow_table)
Traceback (most recent call last):
File "", line 1, in 
File "../lib/python3.6/site-packages/pyarrow/parquet.py", line 374, in 
write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
col1: int64
col2: int64
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "col1", "field_name": "col1", "pandas_type": "int64", "numpy_ty'
b'pe": "int64", "metadata": null}, {"name": "col2", "field_name": '
b'"col2", "pandas_type": "int64", "numpy_type": "int64", "metadata'
b'": null}], "pandas_version": "0.23.4"}'} vs.
file:
col1: int64
col2: int64
{code}
Test Script:
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

arrow_table = pa.Table.from_pandas(df, preserve_index=False)
arrow_table

pq.write_table(arrow_table, "test.parquet")

test_schema = pa.schema([
pa.field('col1', pa.int64()),
pa.field('col2', pa.int64())
])

writer = pq.ParquetWriter("test2.parquet", use_dictionary=True, schema = 
test_schema, compression='snappy')
writer.write_table(arrow_table)
writer.close()
{code}
write_table() works, but ParquetWriter.write_table does not..

I think something is wrong with the schema object.
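
Worth noting: in the error above, the only visible difference is the b'pandas' 
metadata attached to the table schema but absent from the hand-built 
test_schema, which suggests the mismatch check compares schema metadata too. A 
hedged workaround sketch (replace_schema_metadata is assumed to be available in 
the installed release):
{code:python}
# Continuing the test script above: strip the pandas metadata before writing
bare_table = arrow_table.replace_schema_metadata(None)

writer = pq.ParquetWriter("test3.parquet", schema=test_schema,
                          use_dictionary=True, compression='snappy')
writer.write_table(bare_table)
writer.close()
{code}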

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps

2018-12-07 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee closed ARROW-3918.

Resolution: Unresolved

Closing this one and opening a new ticket. Looks like write_table is broken.

> [Python] ParquetWriter.write_table doesn't support coerce_timestamps or 
> allow_truncated_timestamps
> --
>
> Key: ARROW-3918
> URL: https://issues.apache.org/jira/browse/ARROW-3918
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
>
> Error: Table Schema does not match schema used to create file.
> The 0.11.1 release added these parameters to pyarrow.parquet.write_table(), 
> but they are missing from pyarrow.parquet.ParquetWriter.write_table().. I'm 
> seeing mismatches between the table schema and the file schema, but they are 
> identical in the error message with modified: timestamp[ms] column types in 
> both schemas. The only thing which looks odd is the Pandas metadata that has 
> a modified column with a panda datatype of datetime and a numpy datatype of 
> datetime64[ns]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps

2018-12-03 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee closed ARROW-3907.

   Resolution: Not A Problem
Fix Version/s: 0.11.1

Closing for now. Not convinced safe is the best solution to address timestamp 
resolution. If a schema is used, it should be clear the intent is to convert 
pandas nanoseconds to a lower resolution. I think the same can be said for 
other types of conversions, like floats to ints.
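
A minimal sketch of the conversion in question, using a timestamp with 
sub-millisecond digits (safe is assumed to be the parameter name in the 
installed release):
{code:python}
import pandas as pd
import pyarrow as pa

# Sub-millisecond digits make the ns -> ms cast lossy
df = pd.DataFrame({"modified": pd.to_datetime(["2018-07-19 15:46:31.753713"])})
schema = pa.schema([pa.field("modified", pa.timestamp("ms"))])

# Default (safe=True) raises ArrowInvalid; safe=False opts into the truncation
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False, safe=False)
print(table.schema)  # modified: timestamp[ms]
{code}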

> [Python] from_pandas errors when schemas are used with lower resolution 
> timestamps
> --
>
> Key: ARROW-3907
> URL: https://issues.apache.org/jira/browse/ARROW-3907
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
> Fix For: 0.11.1
>
>
> When passing in a schema object to from_pandas a resolution error occurs if 
> the schema uses a lower resolution timestamp. Do we need to also add 
> "coerce_timestamps" and "allow_truncated_timestamps" parameters found in 
> write_table() to from_pandas()?
> Error:
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would 
> lose data: 1532015191753713000', 'Conversion failed for column modified with 
> type datetime64[ns]')
> Code:
>  
> {code:java}
> processed_schema = pa.schema([
> pa.field('Id', pa.string()),
> pa.field('modified', pa.timestamp('ms')),
> pa.field('records', pa.int32())
> ])
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps

2018-12-02 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706664#comment-16706664
 ] 

David Lee edited comment on ARROW-3918 at 12/3/18 5:17 AM:
---

Passed them into ParquetWriter and it still gives the same error..

File "../python3.6/site-packages/pyarrow/parquet.py", line 374, in write_table
 raise ValueError(msg)
 ValueError: Table schema does not match schema used to create file:
 table:
 Id: string
 modified: timestamp[ms]
 converter: string
 records: int32
 metadata
 
 {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
 b' "Id", "field_name": "Id", "'
 b'pandas_type": "unicode", "numpy_type": "object", "metadata": nul'
 b'l}, {"name": "modified", "field_name": "modified", "pandas_type"'
 b': "datetime", "numpy_type": "datetime64[ns]", "metadata": null},'
 b' {"name": "converter", "field_name": "converter", "pandas_type":'
 b' "unicode", "numpy_type": "object", "metadata": null}, {"name": '
 b'"records", "field_name": "records", "pandas_type": "int32", "num'
 b'py_type": "int64", "metadata": null}], "pandas_version": "0.23.4'
 b'"}'} vs.
 file:
 Id: string
 modified: timestamp[ms]
 converter: string
 records: int32

Code:

 
{code:java}
processed_schema = pa.schema([
pa.field('Id', pa.string()),
pa.field('modified', pa.timestamp('ms')),
pa.field('converter', pa.string()),
pa.field('records', pa.int32())
])

.

arrow_tables.append(pa.Table.from_pandas(df, schema=processed_schema, 
preserve_index=False, safe=False))

.

if len(arrow_tables) > 0:
    writer = pq.ParquetWriter(os.path.join(self.conf['work_dir'], processed_file),
                              schema=processed_schema, use_dictionary=True,
                              compression='snappy', coerce_timestamps='ms',
                              allow_truncated_timestamps=True)

    for v in arrow_tables:
        writer.write_table(v)
    writer.close()
{code}
 

 


was (Author: davlee1...@yahoo.com):
Passed them into ParquetWriter and it still gives the same error..

File "../python3.6/site-packages/pyarrow/parquet.py", line 374, in write_table
 raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
Id: string
modified: timestamp[ms]
converter: string
records: int32
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
 b' "Id", "field_name": "Id", "'
 b'pandas_type": "unicode", "numpy_type": "object", "metadata": nul'
 b'l}, {"name": "modified", "field_name": "modified", "pandas_type"'
 b': "datetime", "numpy_type": "datetime64[ns]", "metadata": null},'
 b' {"name": "converter", "field_name": "converter", "pandas_type":'
 b' "unicode", "numpy_type": "object", "metadata": null}, {"name": '
 b'"records", "field_name": "records", "pandas_type": "int32", "num'
 b'py_type": "int64", "metadata": null}], "pandas_version": "0.23.4'
 b'"}'} vs.
file:
Id: string
modified: timestamp[ms]
converter: string
records: int32

Code:

 
{code:java}
processed_schema = pa.schema([
pa.field('Id', pa.string()),
pa.field('modified', pa.timestamp('ms')),
pa.field('converter', pa.string()),
pa.field('records', pa.int32())
])

if len(arrow_tables) > 0:
    writer = pq.ParquetWriter(os.path.join(self.conf['work_dir'], processed_file),
                              schema=processed_schema, use_dictionary=True,
                              compression='snappy', coerce_timestamps='ms',
                              allow_truncated_timestamps=True)

    for v in arrow_tables:
        writer.write_table(v)
    writer.close()
{code}
 

 

> [Python] ParquetWriter.write_table doesn't support coerce_timestamps or 
> allow_truncated_timestamps
> --
>
> Key: ARROW-3918
> URL: https://issues.apache.org/jira/browse/ARROW-3918
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
>
> Error: Table Schema does not match schema used to create file.
> The 0.11.1 release added these parameters to pyarrow.parquet.write_table(), 
> but they are missing from pyarrow.parquet.ParquetWriter.write_table().. I'm 
> seeing mismatches between the table schema and the file schema, but they are 
> identical in the error message with modified: timestamp[ms] column types in 
> both schemas. The only thing which looks odd is the Pandas metadata that has 
> a modified column with a panda datatype of datetime and a numpy datatype of 
> datetime64[ns]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps

2018-12-02 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706664#comment-16706664
 ] 

David Lee commented on ARROW-3918:
--

Passed them into ParquetWriter and it still gives the same error..

File "../python3.6/site-packages/pyarrow/parquet.py", line 374, in write_table
 raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
Id: string
modified: timestamp[ms]
converter: string
records: int32
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
 b' "Id", "field_name": "Id", "'
 b'pandas_type": "unicode", "numpy_type": "object", "metadata": nul'
 b'l}, {"name": "modified", "field_name": "modified", "pandas_type"'
 b': "datetime", "numpy_type": "datetime64[ns]", "metadata": null},'
 b' {"name": "converter", "field_name": "converter", "pandas_type":'
 b' "unicode", "numpy_type": "object", "metadata": null}, {"name": '
 b'"records", "field_name": "records", "pandas_type": "int32", "num'
 b'py_type": "int64", "metadata": null}], "pandas_version": "0.23.4'
 b'"}'} vs.
file:
Id: string
modified: timestamp[ms]
converter: string
records: int32

Code:

 
{code:java}
processed_schema = pa.schema([
pa.field('Id', pa.string()),
pa.field('modified', pa.timestamp('ms')),
pa.field('converter', pa.string()),
pa.field('records', pa.int32())
])

if len(arrow_tables) > 0:
    writer = pq.ParquetWriter(os.path.join(self.conf['work_dir'], processed_file),
                              schema=processed_schema, use_dictionary=True,
                              compression='snappy', coerce_timestamps='ms',
                              allow_truncated_timestamps=True)

    for v in arrow_tables:
        writer.write_table(v)
    writer.close()
{code}
 

 

> [Python] ParquetWriter.write_table doesn't support coerce_timestamps or 
> allow_truncated_timestamps
> --
>
> Key: ARROW-3918
> URL: https://issues.apache.org/jira/browse/ARROW-3918
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
>
> Error: Table Schema does not match schema used to create file.
> The 0.11.1 release added these parameters to pyarrow.parquet.write_table(), 
> but they are missing from pyarrow.parquet.ParquetWriter.write_table().. I'm 
> seeing mismatches between the table schema and the file schema, but they are 
> identical in the error message with modified: timestamp[ms] column types in 
> both schemas. The only thing which looks odd is the Pandas metadata that has 
> a modified column with a panda datatype of datetime and a numpy datatype of 
> datetime64[ns]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps

2018-11-30 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705210#comment-16705210
 ] 

David Lee commented on ARROW-3907:
--

passing in safe=False works, but it is pretty hacky.. Another problem also pops 
up with ParquetWriter.write_table(). I'll open a separate ticket for that one.

The conversion from pandas nanoseconds to whatever timestamp resolution is 
declared using pa.timestamp() in the schema object worked fine in 0.11.0.

Having to pass in coerce_timestamps, allow_truncated_timestamps and safe is 
pretty messy.

 

> [Python] from_pandas errors when schemas are used with lower resolution 
> timestamps
> --
>
> Key: ARROW-3907
> URL: https://issues.apache.org/jira/browse/ARROW-3907
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: David Lee
>Priority: Major
>
> When passing in a schema object to from_pandas a resolution error occurs if 
> the schema uses a lower resolution timestamp. Do we need to also add 
> "coerce_timestamps" and "allow_truncated_timestamps" parameters found in 
> write_table() to from_pandas()?
> Error:
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would 
> lose data: 1532015191753713000', 'Conversion failed for column modified with 
> type datetime64[ns]')
> Code:
>  
> {code:java}
> processed_schema = pa.schema([
> pa.field('Id', pa.string()),
> pa.field('modified', pa.timestamp('ms')),
> pa.field('records', pa.int32())
> ])
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps

2018-11-30 Thread David Lee (JIRA)
David Lee created ARROW-3918:


 Summary: [Python] ParquetWriter.write_table doesn't support 
coerce_timestamps or allow_truncated_timestamps
 Key: ARROW-3918
 URL: https://issues.apache.org/jira/browse/ARROW-3918
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.11.1
Reporter: David Lee


Error: Table Schema does not match schema used to create file.

The 0.11.1 release added these parameters to pyarrow.parquet.write_table(), but 
they are missing from pyarrow.parquet.ParquetWriter.write_table().. I'm seeing 
mismatches between the table schema and the file schema, but they are identical 
in the error message with modified: timestamp[ms] column types in both schemas. 
The only thing which looks odd is the pandas metadata, which has a modified 
column with a pandas datatype of datetime and a numpy datatype of datetime64[ns].

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps

2018-11-29 Thread David Lee (JIRA)
David Lee created ARROW-3907:


 Summary: [Python] from_pandas errors when schemas are used with 
lower resolution timestamps
 Key: ARROW-3907
 URL: https://issues.apache.org/jira/browse/ARROW-3907
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.11.1
Reporter: David Lee


When passing in a schema object to from_pandas, a resolution error occurs if the 
schema uses a lower-resolution timestamp. Do we need to also add the 
"coerce_timestamps" and "allow_truncated_timestamps" parameters found in 
write_table() to from_pandas()?

Error:

pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would 
lose data: 1532015191753713000', 'Conversion failed for column modified with 
type datetime64[ns]')



Code:

 
{code:java}
processed_schema = pa.schema([
pa.field('Id', pa.string()),
pa.field('modified', pa.timestamp('ms')),
pa.field('records', pa.int32())
])

pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

2018-11-29 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703544#comment-16703544
 ] 

David Lee edited comment on ARROW-3728 at 11/29/18 5:33 PM:


I'm finding the same problem as well.. This is similar to: 
https://jira.apache.org/jira/browse/ARROW-3065
 I think the underlying pandas schema has changed between pyarrow releases, so I 
can't merge old files with new files.

On the topic of merging parquet files.. This is something I do to try to create 
128 meg parquet files to match the HDFS blocksize configured in Hadoop.

It is not possible to predetermine the size of a parquet file when you mix in 
dictionary encoding + snappy compression, but you can work around it by merging 
smaller parquet files together as row groups.

Save two million rows of data per parquet file. This ends up creating multiple 
parquet files around 10 megs each after encoding and compression.
 Figure out which files should be merged by adding their file sizes together 
until the sum comes in just under 128 megs, which is between 95% and 100% of 
128 * 1024 * 1024 bytes.
 Read each parquet file in as an Arrow table and write the Arrow table to a new 
file as a row group. This is both fast and memory efficient since you only need 
to put two million rows of data in memory at a time.
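
A minimal sketch of that merge pass (function and file names are hypothetical; 
picking batches whose on-disk sizes sum to 95-100% of the target is left to the 
caller):
{code:python}
import pyarrow.parquet as pq

def merge_as_row_groups(paths, out_path):
    """Append each small parquet file to out_path as its own row group(s)."""
    writer = None
    for path in paths:
        table = pq.read_table(path)   # only ~2 million rows in memory at a time
        if writer is None:
            writer = pq.ParquetWriter(out_path, schema=table.schema,
                                      use_dictionary=True, compression='snappy')
        writer.write_table(table)
    if writer is not None:
        writer.close()

# merge_as_row_groups(["part1.parquet", "part2.parquet"], "merged_128mb.parquet")
{code}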

On a separate topic I should probably open up a new issue / enhancement request.

A. Would it be possible to read a row group out of a parquet file, modify it as 
a pandas DataFrame and then write it back to the original parquet file?

B. Would it be possible to add a boolean hidden status column to every parquet 
file? A status of True would mean the row is valid. A status of False would 
mean the row is deleted. Dremio uses an internal flag in Arrow data sets when 
doing SQL Union operations. It is more efficient to flag a record as deleted 
instead of trying to delete it out of a columnar memory format. If we could 
introduce something for columnar parquet, you could in theory update parquet 
files by flagging the old record as deleted and reinserting the replacement 
record at the end of the existing file without having to shuffle / re-write the 
entire file.
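
To make proposal B concrete, a small sketch of the read path it would imply, 
with '_valid' as a hypothetical name for the hidden flag column:
{code:python}
import pandas as pd
import pyarrow as pa

# '_valid' stands in for the proposed hidden status column
df = pd.DataFrame({"id": ["a", "b", "c"], "_valid": [True, False, True]})
table = pa.Table.from_pandas(df, preserve_index=False)

# Readers would filter out soft-deleted rows instead of the file being rewritten
live = table.to_pandas()
live = live[live["_valid"]].drop(columns="_valid")
print(live["id"].tolist())  # ['a', 'c']
{code}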

 


was (Author: davlee1...@yahoo.com):
I'm finding the same problem as well.. This is similar to: 
https://jira.apache.org/jira/browse/ARROW-3065
I think the underlying panda schema has changed between pyarrow releases so I 
can't merge old files with new files.

On the topic of merging parquet files.. This is something I do to try to create 
128 meg parquet files to match the HDFS blocksize configured in Hadoop.

It is not possible to predetermine the size of a parquet file when you mix in 
dictionary encoding + snappy compression, but you can work around it be merging 
smaller parquet files together as row groups.

Save two million rows of data per parquet file. This ends up creating multiple 
parquet files around 10 megs each after encoding and compression.
Figure out which files should be merged by adding their file sizes together 
until it the sum comes in just under 128 megs which is between 95% and 100% of 
128 * 1024 * 1024 bytes.
Read each parquet file in as a arrow table and write the arrow table to a new 
file as a row group. This is both fast and memory efficient since you only need 
to put two million rows of data in memory at a time.
On a separate topic I should probably open up a new issue / enhancement request.

A. Would it be possible to read a row group out of parquet file, modify it as a 
panda and then write it back to the original parquet file?

B. Would it be possible to add a boolean hidden status column to every parquet 
file? A status of True would mean the row is valid. A status of False would 
mean the row is deleted. Dremio uses an internal flag in Arrow data sets when 
doing SQL Union operations. It is more efficient to flag a record as deleted 
instead of trying to delete it out of a columnar memory format. If we could 
introduce something for columnar parquet you could in theory update parquet 
files by flagging the old record as deleted and reinserting the replacement 
record at the end of the existing file without having to shuffle / re-write the 
entire file.

 

> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
>Reporter: Micah Williamson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> From: 
> 

[jira] [Commented] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

2018-11-29 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703544#comment-16703544
 ] 

David Lee commented on ARROW-3728:
--

I'm finding the same problem as well.. This is similar to: 
https://jira.apache.org/jira/browse/ARROW-3065
I think the underlying pandas schema has changed between pyarrow releases, so I 
can't merge old files with new files.

On the topic of merging parquet files.. This is something I do to try to create 
128 meg parquet files to match the HDFS blocksize configured in Hadoop.

It is not possible to predetermine the size of a parquet file when you mix in 
dictionary encoding + snappy compression, but you can work around it by merging 
smaller parquet files together as row groups.

Save two million rows of data per parquet file. This ends up creating multiple 
parquet files around 10 megs each after encoding and compression.
Figure out which files should be merged by adding their file sizes together 
until the sum comes in just under 128 megs, which is between 95% and 100% of 
128 * 1024 * 1024 bytes.
Read each parquet file in as an Arrow table and write the Arrow table to a new 
file as a row group. This is both fast and memory efficient since you only need 
to put two million rows of data in memory at a time.
On a separate topic I should probably open up a new issue / enhancement request.

A. Would it be possible to read a row group out of a parquet file, modify it as 
a pandas DataFrame and then write it back to the original parquet file?

B. Would it be possible to add a boolean hidden status column to every parquet 
file? A status of True would mean the row is valid. A status of False would 
mean the row is deleted. Dremio uses an internal flag in Arrow data sets when 
doing SQL Union operations. It is more efficient to flag a record as deleted 
instead of trying to delete it out of a columnar memory format. If we could 
introduce something for columnar parquet, you could in theory update parquet 
files by flagging the old record as deleted and reinserting the replacement 
record at the end of the existing file without having to shuffle / re-write the 
entire file.

 

> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
>Reporter: Micah Williamson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> From: 
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>  
> I am trying to merge multiple parquet files into one. Their schemas are 
> identical field-wise but my {{ParquetWriter}} is complaining that they are 
> not. After some investigation I found that the pandas meta in the schemas are 
> different, causing this error.
>  
> Sample-
> {code:python}
> import pyarrow.parquet as pq
> 
> pq_tables = []
> writer = None  # note: missing from the original sample; writer must exist first
> for file_ in files:
>     pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
>     pq_tables.append(pq_table)
>     if writer is None:
>         writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema,
>                                   use_deprecated_int96_timestamps=True)
>     writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
> writer.write_table(table=pq_table)
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 335, in write_table
> raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-24 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626470#comment-16626470
 ] 

David Lee commented on ARROW-3065:
--

In pyarrow 0.9.0 the pandas metadata still says float64, but it works..

 
{code:java}
>>> tbl1.schema
col1: string
col2: string
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": null}, {"name": "col2", "field_name'
b'": "col2", "pandas_type": "unicode", "numpy_type": "object", "me'
b'tadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl2.schema
col1: string
col2: string
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
b'type": "float64", "metadata": null}, {"name": "col2", "field_nam'
b'e": "col2", "pandas_type": "unicode", "numpy_type": "object", "m'
b'etadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl3.schema
col1: string
col2: string
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": null}, {"name": "col2", "field_name'
b'": "col2", "pandas_type": "unicode", "numpy_type": "object", "me'
b'tadata": null}], "pandas_version": "0.23.0"}'}
>>> tbl3[0]

chunk 0: 
[
'a',
'b',
'c',
'd',
'e',
'f',
'g',
'h'
]
chunk 1: 
[
'',
'',
'',
'',
'',
'',
'',
''
]

{code}
In the 0.10.0 example above that can't reproduce the error, tbl3[0] comes back 
with:

 
{code:java}
>>> tbl3[0]

[
[
"a",
"b",
"c",
"d",
"e",
"f",
"g",
"h"
],
[
"",
"",
"",
"",
"",
"",
"",
""
]
]


{code}
 

> [Python] concat_tables() failing from bad Pandas Metadata
> -
>
> Key: ARROW-3065
> URL: https://issues.apache.org/jira/browse/ARROW-3065
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: David Lee
>Priority: Major
> Fix For: 0.11.0
>
>
> Looks like the major bug from 
> https://issues.apache.org/jira/browse/ARROW-1941 is back...
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared..
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> In order to debug this I saved the first 4 arrow tables to 4 parquet files 
> and inspected the parquet files. The parquet schema is identical, but the 
> Pandas Metadata is different.
> {code:python}
> for i in range(5):
>  pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains empty strings is getting typed as 
> float64.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> 
> [
>   [
> "Z4",
> "SF",
> "J7",
> "W6",
> "L7",
> "Q9",
> "NE",
> "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> 
> [
>   [
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-24 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626336#comment-16626336
 ] 

David Lee edited comment on ARROW-3065 at 9/24/18 8:17 PM:
---

This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the 
column doesn't exist to start and is added using pandas.reindex(). The 
reasoning behind this is that the original file(s) being converted to parquet 
may or may not contain all 100+ columns.
{quote} 
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
 pa.field('col1', pa.string()),
 pa.field('col2', pa.string()),
 ])

df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])

df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)

tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])

Traceback (most recent call last):
{{ File "", line 1, in }}
 {{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}}
 {{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}}
 pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{code}
 

 
{quote}
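
For what it's worth, the float64 in the metadata looks like it comes from the 
reindex step: pandas fills the missing column with NaN, which is float64, and 
that dtype is what lands in the pandas metadata even though the schema casts 
the Arrow column to string. A quick way to see it:
{code:python}
import pandas as pd

df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])
df2 = df2.reindex(columns=["col1", "col2"])  # col1 is added as all-NaN
print(df2.dtypes)  # col1: float64, col2: object
{code}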


was (Author: davlee1...@yahoo.com):
This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the 
column doesn't exist to start and is added using pandas.reindex(). The 
reasoning behind this is the original file(s) being converted to parquet may or 
may not contain all 100+ columns.
{quote} 
{code:java}
import pandas as pd
 import pyarrow as pa
 import pyarrow.parquet as pq
schema = pa.schema([
 pa.field('col1', pa.string()),
 pa.field('col2', pa.string()),
 ])
df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])
df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)
tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False)
tbl3 = pa.concat_tables([tbl1, tbl2])
Traceback (most recent call last):
{{ File "", line 1, in }}
 {{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}}
 {{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}}
 pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{code}
 

 
{quote}

> [Python] concat_tables() failing from bad Pandas Metadata
> -
>
> Key: ARROW-3065
> URL: https://issues.apache.org/jira/browse/ARROW-3065
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: David Lee
>Priority: Major
> Fix For: 0.12.0
>
>
> Looks like the major bug from 
> https://issues.apache.org/jira/browse/ARROW-1941 is back...
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared..
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> In order to debug this I saved the first 4 arrow tables to 4 parquet files 
> and inspected the parquet files. The parquet schema is identical, but the 
> Pandas Metadata is different.
> {code:python}
> for i in range(5):
>  pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains empty strings is getting typed as 
> float64.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> 
> [
>   [
> "Z4",
> "SF",
> "J7",
> "W6",
> "L7",
> "Q9",
> "NE",
> "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> 
> [
>   [
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-24 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626336#comment-16626336
 ] 

David Lee edited comment on ARROW-3065 at 9/24/18 8:17 PM:
---

This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the 
column doesn't exist to start and is added using pandas.reindex(). The 
reasoning behind this is that the original file(s) being converted to parquet 
may or may not contain all 100+ columns.
{quote} 
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
 pa.field('col1', pa.string()),
 pa.field('col2', pa.string()),
 ])

df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])

df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)

tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])

Traceback (most recent call last):
{{ File "", line 1, in }}
{{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}}
{{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}}
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{code}
 

 
{quote}


was (Author: davlee1...@yahoo.com):
This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the 
column doesn't exist to start and is added using pandas.reindex(). The 
reasoning behind this is the original file(s) being converted to parquet may or 
may not contain all 100+ columns.
{quote} 
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
 pa.field('col1', pa.string()),
 pa.field('col2', pa.string()),
 ])

df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])

df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)

tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])

Traceback (most recent call last):
{{ File "", line 1, in }}
 {{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}}
 {{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}}
 pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{code}
 

 
{quote}

> [Python] concat_tables() failing from bad Pandas Metadata
> -
>
> Key: ARROW-3065
> URL: https://issues.apache.org/jira/browse/ARROW-3065
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: David Lee
>Priority: Major
> Fix For: 0.12.0
>
>
> Looks like the major bug from 
> https://issues.apache.org/jira/browse/ARROW-1941 is back...
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared..
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> In order to debug this I saved the first 4 arrow tables to 4 parquet files 
> and inspected the parquet files. The parquet schema is identical, but the 
> Pandas Metadata is different.
> {code:python}
> for i in range(5):
>  pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains empty strings is getting typed as 
> float64.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> 
> [
>   [
> "Z4",
> "SF",
> "J7",
> "W6",
> "L7",
> "Q9",
> "NE",
> "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> 
> [
>   [
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-24 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626336#comment-16626336
 ] 

David Lee edited comment on ARROW-3065 at 9/24/18 8:16 PM:
---

This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the 
column doesn't exist to start and is added using pandas.reindex(). The 
reasoning behind this is that the original file(s) being converted to parquet 
may or may not contain all 100+ columns.
{quote} 
{code:java}
import pandas as pd
 import pyarrow as pa
 import pyarrow.parquet as pq
schema = pa.schema([
 pa.field('col1', pa.string()),
 pa.field('col2', pa.string()),
 ])
df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])
df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)
tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False)
tbl3 = pa.concat_tables([tbl1, tbl2])
Traceback (most recent call last):
{{ File "", line 1, in }}
 {{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}}
 {{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}}
 pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{code}
 

 
{quote}


was (Author: davlee1...@yahoo.com):
This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the 
column doesn't exist to start and is added using pandas.reindex(). The 
reasoning behind this is the original file(s) being converted to parquet may or 
may not contain all 100+ columns.
{quote}import pandas as pd
 import pyarrow as pa
 import pyarrow.parquet as pq

schema = pa.schema([
 pa.field('col1', pa.string()),
 pa.field('col2', pa.string()),
 ])

df1 = pd.DataFrame([\{"col1": v, "col2": v} for v in list("abcdefgh")])
 df2 = pd.DataFrame([\{"col2": v} for v in list("abcdefgh")])

df1 = df1.reindex(columns=schema.names)
 df2 = df2.reindex(columns=schema.names)

tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False)

tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])

Traceback (most recent call last):

{\{ File "<stdin>", line 1, in <module>}}
 \{{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}}
 \{{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}}
 pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{quote}

> [Python] concat_tables() failing from bad Pandas Metadata
> -
>
> Key: ARROW-3065
> URL: https://issues.apache.org/jira/browse/ARROW-3065
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: David Lee
>Priority: Major
> Fix For: 0.12.0
>
>
> Looks like the major bug from 
> https://issues.apache.org/jira/browse/ARROW-1941 is back...
> After I downgraded from 0.10.0 to 0.9.0, the error disappeared..
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
> In order to debug this I saved the first 4 arrow tables to 4 parquet files 
> and inspected the parquet files. The parquet schema is identical, but the 
> Pandas Metadata is different.
> {code:python}
> for i in range(5):
>  pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
> It looks like a column which contains empty strings is getting typed as 
> float64.
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> 
> [
>   [
> "Z4",
> "SF",
> "J7",
> "W6",
> "L7",
> "Q9",
> "NE",
> "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> 
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
> "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> 
> [
>   [
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> "",
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata

2018-09-24 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626336#comment-16626336
 ] 

David Lee commented on ARROW-3065:
--

This test fails when tested against 0.10.0 but works in 0.9.0.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
 pa.field('col1', pa.string()),
 pa.field('col2', pa.string()),
])

df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])

df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)

tbl1 = pa.Table.from_pandas(df1, schema=schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema=schema, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{code}
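
A second way to normalize the inputs, again a sketch rather than the committed 
fix: cast every table to a single target schema so concat_tables() sees 
identical schemas. Table.cast() and Schema.remove_metadata() are the assumed 
pyarrow calls here; Table.cast() raises if a column cast would lose data.

{code:python}
import pyarrow as pa

def concat_normalized(tables):
    # Use the first table's schema, stripped of metadata, as the target.
    target = tables[0].schema.remove_metadata()
    # Casting to a common schema also discards per-table pandas metadata.
    return pa.concat_tables([t.cast(target) for t in tables])
{code}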




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3065) concat_tables() failing from bad Pandas Metadata

2018-08-16 Thread David Lee (JIRA)
David Lee created ARROW-3065:


 Summary: concat_tables() failing from bad Pandas Metadata
 Key: ARROW-3065
 URL: https://issues.apache.org/jira/browse/ARROW-3065
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.10.0
Reporter: David Lee
 Fix For: 0.9.0


Looks like the major bug from https://issues.apache.org/jira/browse/ARROW-1941 
is back...

After I downgraded from 0.10.0 to 0.9.0, the error disappeared..

{code:python}
new_arrow_table = pa.concat_tables(my_arrow_tables)

 File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
{code}

In order to debug this I saved the first 4 arrow tables to 4 parquet files and 
inspected the parquet files. The parquet schema is identical, but the Pandas 
Metadata is different.

{code:python}
for i in range(5):
 pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
{code}
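
To compare the embedded pandas metadata across those files without re-reading 
the data, something like this works (pq.read_schema() only reads the Parquet 
footer; the loop is illustrative, not the exact commands used above):

{code:python}
import json
import pyarrow.parquet as pq

for i in range(5):
    footer_schema = pq.read_schema("test" + str(i) + ".parquet")
    pandas_meta = json.loads(footer_schema.metadata[b'pandas'])
    for col in pandas_meta["columns"]:
        # Show the pandas/numpy types recorded for each column.
        print(i, col["name"], col["pandas_type"], col["numpy_type"])
{code}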

It looks like a column which contains empty strings is getting typed as float64.

{code:python}
>>> test1.schema
HoldingDetail_Id: string
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
{"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
"unicode", "numpy_type": "object", "metadata": null},

>>> test1[0]

[
  [
"Z4",
"SF",
"J7",
"W6",
"L7",
"Q9",
"NE",
"F7",


>>> test2.schema
HoldingDetail_Id: string
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
{"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": 
"unicode", "numpy_type": "float64", "metadata": null},

>>> test2[0]

[
  [
"",
"",
"",
"",
"",
"",
"",
"",
{code}
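
The float64 typing traces back to pandas, not Arrow: reindex() fills a missing 
column with NaN, and an all-NaN column is float64, which is what ends up in 
the pandas metadata even though the Arrow schema says string. A small 
demonstration, with a fill_value workaround that keeps the column as object 
dtype (the workaround is an illustration, not the eventual fix):

{code:python}
import pandas as pd

df = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])

# The missing column comes back all-NaN, so pandas types it float64.
print(df.reindex(columns=["col1", "col2"]).dtypes["col1"])                 # float64

# Filling with an empty string keeps it as object (string) dtype instead.
print(df.reindex(columns=["col1", "col2"], fill_value="").dtypes["col1"])  # object
{code}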




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)