[jira] [Updated] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-08-16 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5028:

Fix Version/s: 0.15.0  (was: 1.0.0)

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: dct.json.gz, dct.pickle.gz
>
>
> I am sorry that this bug report is rather long and the reproduction data is 
> large, but I was not able to reduce the data any further while still 
> triggering the problem. I was able to trigger this behavior on master and on 
> {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
>
> import numpy as np
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
>
>
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
>
>
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
>
>
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
>
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
>
> table = dct_to_table(dct)
>
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
>
> table2 = roundtrip(table)
>
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
>
> # if table2 is converted to pandas, you can also observe that some values at
> # the end of column b are `['']`, which is clearly not present in the
> # original data
> {code}
> I would also be thankful for any pointers on where the bug comes from, or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-07-10 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann updated ARROW-5028:
-
Attachment: dct.json.gz

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
> Attachments: dct.json.gz, dct.pickle.gz
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-06-23 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5028:

Fix Version/s: 1.0.0  (was: 0.14.0)

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
> Attachments: dct.pickle.gz
>
>





[jira] [Updated] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-05-06 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5028:
-
Labels: parquet  (was: )

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
> Attachments: dct.pickle.gz
>
>





[jira] [Updated] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-03-28 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5028:

Summary: [Python][C++] Arrow to Parquet conversion drops and corrupts 
values  (was: Arrow->Parquet conversion drops and corrupts values)

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
> Fix For: 0.14.0
>
> Attachments: dct.pickle.gz
>
>


