[jira] [Created] (ARROW-12635) [RUST] U64::MAX does not roundtrip through parquet
Marco Neumann created ARROW-12635: - Summary: [RUST] U64::MAX does not roundtrip through parquet Key: ARROW-12635 URL: https://issues.apache.org/jira/browse/ARROW-12635 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Marco Neumann Use the following test:
{code:java}
#[test]
fn u64_min_max() {
    let values = Arc::new(UInt64Array::from_iter_values(vec![u64::MIN, u64::MAX]));
    one_column_roundtrip("u64_min_max_single_column", values, false);
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
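Parquet has no unsigned 64-bit physical type, so u64 columns are typically stored as INT64 and must be reinterpreted on read; forgetting that final reinterpretation is one plausible cause of a roundtrip mismatch at u64::MAX. A stdlib-only sketch of the two reinterpretations (illustrative only; not the parquet crate's actual code):

```python
import struct

def u64_to_i64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer
    (two's complement), as a writer storing u64 in Parquet's INT64
    physical type would."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]

def i64_to_u64(value: int) -> int:
    """Inverse reinterpretation, which a reader must apply when the
    logical type says the column is unsigned."""
    return struct.unpack("<Q", struct.pack("<q", value))[0]

U64_MAX = 2**64 - 1

stored = u64_to_i64(U64_MAX)   # u64::MAX travels as -1 on the wire
restored = i64_to_u64(stored)  # only correct if the reader reinterprets
```

A reader that returns the raw INT64 value without the `i64_to_u64` step would hand back -1 instead of u64::MAX, which is exactly the kind of mismatch the test above would catch.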
[jira] [Commented] (ARROW-7712) [CI][Crossbow] Fix or delete fuzzit jobs
[ https://issues.apache.org/jira/browse/ARROW-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025739#comment-17025739 ] Marco Neumann commented on ARROW-7712: -- [~apitrou] I think we should focus on a single solution. I don't have a very strong opinion on that. Fuzzit was nice because they approached me and offered their solution including some assistance, but OSS-fuzz is the de facto standard for OSS. > [CI][Crossbow] Fix or delete fuzzit jobs > > > Key: ARROW-7712 > URL: https://issues.apache.org/jira/browse/ARROW-7712 > Project: Apache Arrow > Issue Type: Task > Components: C++, Continuous Integration >Reporter: Neal Richardson >Priority: Major > > Not sure we need them now that we're using the OSS-Fuzz project, but they're > broken. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6872) [C++][Python] Empty table with dictionary-columns raises ArrowNotImplementedError
Marco Neumann created ARROW-6872: Summary: [C++][Python] Empty table with dictionary-columns raises ArrowNotImplementedError Key: ARROW-6872 URL: https://issues.apache.org/jira/browse/ARROW-6872 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.15.0 Reporter: Marco Neumann
h2. Abstract
As a pyarrow user, I would expect to be able to create an empty table out of every schema that I created via pandas. This does not work for dictionary types (e.g. {{"category"}} dtypes).
h2. Test Case
This code:
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": pd.Series(["x", "y"], dtype="category")})
table = pa.Table.from_pandas(df)
schema = table.schema
table_empty = schema.empty_table()  # boom
{code}
produces this exception:
{noformat}
Traceback (most recent call last):
  File "arrow_bug.py", line 8, in <module>
    table_empty = schema.empty_table()
  File "pyarrow/types.pxi", line 860, in __iter__
  File "pyarrow/array.pxi", line 211, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Sequence converter for type dictionary not implemented
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5525) [C++][CI] Enable continuous fuzzing
[ https://issues.apache.org/jira/browse/ARROW-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931568#comment-16931568 ] Marco Neumann commented on ARROW-5525: -- {quote}[~marco.neumann.by] you are admin in the organisation. {quote} Didn't know that. [~pitrou] which mail address do you use for GitHub so I can add you to the Org? {quote}As far as I remember the fuzzing was a bit stalled as the arrow-ipc-fuzzing target was crashing constantly and it wasn't fix so it doesn't really accumulate any interesting corpus. {quote} I have tried to fix all known bugs and I fixed the CI, so it has been running more or less smoothly again for some weeks now. One thing we might change is to add some known arrow files to the seed corpus, so we don't rely solely on the fuzzer to find valid files during exploration. {quote}Also a lot was changed since we first integrated apache-arrow so if fuzzing is a again a priority I would love to help - transfer apache/arrow to new organisation (the old one was deprecated.) and update the Fuzzit CLI to latest version. {quote} That would help a lot, I think. > [C++][CI] Enable continuous fuzzing > --- > > Key: ARROW-5525 > URL: https://issues.apache.org/jira/browse/ARROW-5525 > Project: Apache Arrow > Issue Type: Test > Components: C++ >Reporter: Marco Neumann >Assignee: Yevgeny Pats >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 6.5h > Remaining Estimate: 0h > > Since fuzzing kinda only works if done as a continuous background job, we > should find a way of doing so. This likely requires another service than > Travis. Basic requirements are: > * master builds should be submitted for fuzzing > * project members should be informed about new crashes (ideally not via > public issue due to potential security impact) -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5525) [C++][CI] Enable continuous fuzzing
[ https://issues.apache.org/jira/browse/ARROW-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931519#comment-16931519 ] Marco Neumann commented on ARROW-5525: -- There's [https://fuzzit.dev/] where you can login via GitHub, but I think your account must be linked to the {{apache/arrow}} organization (on Fuzzit, not on GitHub). That (to my understanding) must be done by the Fuzzit support team ([~yevgenyp] ?). > [C++][CI] Enable continuous fuzzing > --- > > Key: ARROW-5525 > URL: https://issues.apache.org/jira/browse/ARROW-5525 > Project: Apache Arrow > Issue Type: Test > Components: C++ >Reporter: Marco Neumann >Assignee: Yevgeny Pats >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 6.5h > Remaining Estimate: 0h > > Since fuzzing kinda only works if done as a continuous background job, we > should find a way of doing so. This likely requires another service than > Travis. Basic requirements are: > * master builds should be submitted for fuzzing > * project members should be informed about new crashes (ideally not via > public issue due to potential security impact) -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6424) [C++][Fuzzing] Fuzzit nightly is broken
Marco Neumann created ARROW-6424: Summary: [C++][Fuzzing] Fuzzit nightly is broken Key: ARROW-6424 URL: https://issues.apache.org/jira/browse/ARROW-6424 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Marco Neumann Assignee: Marco Neumann We don't get any new fuzzit uploads anymore, see https://circleci.com/gh/ursa-labs/crossbow/2296 for details. Seems like the binary is not found anymore: {noformat} ... + pushd /build/cpp /build/cpp / + mkdir ./relwithdebinfo/out + cp ./relwithdebinfo/arrow-ipc-fuzzing-test ./relwithdebinfo/out/fuzzer cp: cannot stat './relwithdebinfo/arrow-ipc-fuzzing-test': No such file or directory Exited with code 1 {noformat} Looking at https://github.com/ursa-labs/crossbow/branches/all?utf8=%E2%9C%93=fuzzit , it seems it is broken as of the 19th of August, and very likely due to [438a140142be423b1b2af2399567a0a8aeba9aa1|https://github.com/apache/arrow/commit/438a140142be423b1b2af2399567a0a8aeba9aa1]. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6273) [C++][Fuzzing] Add fuzzer for parquet->arrow read path
Marco Neumann created ARROW-6273: Summary: [C++][Fuzzing] Add fuzzer for parquet->arrow read path Key: ARROW-6273 URL: https://issues.apache.org/jira/browse/ARROW-6273 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Marco Neumann Assignee: Marco Neumann The parquet-to-arrow read path is likely the most commonly used one (esp. by pyarrow) and is a self-contained step, which should allow us to fuzz the reading of untrusted parquet files into memory. This complements the existing arrow IPC fuzzer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6270) [C++][Fuzzing] IPC reads do not check buffer indices
Marco Neumann created ARROW-6270: Summary: [C++][Fuzzing] IPC reads do not check buffer indices Key: ARROW-6270 URL: https://issues.apache.org/jira/browse/ARROW-6270 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Marco Neumann Assignee: Marco Neumann Attachments: crash-bd7e00178af2d236fdf041fcc1fb30975bf8fbca The attached crash was found by {{arrow-ipc-fuzzing-test}} and indicates that the IPC reader does not check the flatbuffer-encoded buffers for length and can produce out-of-bounds reads. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
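The fix amounts to validating every buffer's (offset, length) pair from the untrusted flatbuffer metadata against the actual message body before dereferencing it. An illustrative stdlib-only sketch of that validation (hypothetical helper name; not Arrow's actual C++ code):

```python
def validate_buffers(buffers, body_len):
    """Check that each metadata-declared buffer lies fully inside the
    message body.

    `buffers` is a list of (offset, length) pairs taken from untrusted
    flatbuffer metadata; `body_len` is the number of bytes actually
    received. Raises ValueError on any out-of-bounds buffer.
    """
    for offset, length in buffers:
        if offset < 0 or length < 0:
            raise ValueError(f"negative buffer bounds: ({offset}, {length})")
        # In C++, compare via subtraction (length > body_len - offset)
        # so a hostile offset + length cannot overflow; Python's big
        # integers make either form safe here.
        if offset > body_len or length > body_len - offset:
            raise ValueError(
                f"buffer ({offset}, {length}) exceeds body of {body_len} bytes"
            )
```

Run once over all declared buffers before any array is constructed, this turns the out-of-bounds read into a clean error on malformed input.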
[jira] [Created] (ARROW-6269) [C++][Fuzzing] IPC reads do not check decimal precision
Marco Neumann created ARROW-6269: Summary: [C++][Fuzzing] IPC reads do not check decimal precision Key: ARROW-6269 URL: https://issues.apache.org/jira/browse/ARROW-6269 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Marco Neumann Assignee: Marco Neumann Attachments: crash-5e88bae6ac5250714e8c8bc73b9d67b949fadbb4 The fuzzit runs found the attached crash. The underlying issue is that {{Decimal}} {{precision}} values are checked too late (in the {{Decimal}} constructor instead of in the IPC code). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5959) [C++][CI] Fuzzit does not know about branch + commit hash
[ https://issues.apache.org/jira/browse/ARROW-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann reassigned ARROW-5959: Assignee: Marco Neumann (was: Yevgeny Pats) > [C++][CI] Fuzzit does not know about branch + commit hash > - > > Key: ARROW-5959 > URL: https://issues.apache.org/jira/browse/ARROW-5959 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Minor > Labels: CI, fuzzer > > Reported > [here|https://github.com/apache/arrow/pull/4504#issuecomment-509932673], > fuzzit does not seem to retrieve the branch + commit hash, which is bad for > tracking. > h2. AC > * Fix CI setup > ([hint|https://github.com/apache/arrow/pull/4504#issuecomment-510415931]) > * Use {{set -euxo pipefail}} in > [{{docker_build_and_fuzzit.sh}}|https://github.com/apache/arrow/blob/master/ci/docker_build_and_fuzzit.sh] > to prevent this issue in the future -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5990) RowGroupMetaData.column misses bounds check
Marco Neumann created ARROW-5990: Summary: RowGroupMetaData.column misses bounds check Key: ARROW-5990 URL: https://issues.apache.org/jira/browse/ARROW-5990 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.0 Reporter: Marco Neumann Assignee: Marco Neumann {{RowGroupMetaData.column}} currently does not check for negative or too-large positive indices, leading to a potential interpreter crash. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
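The fix is a plain index check before the value is handed to native code, where an out-of-range index can segfault the interpreter. An illustrative stdlib-only sketch of the normalization such an accessor needs (hypothetical helper; not pyarrow's actual implementation):

```python
def checked_column_index(i, num_columns):
    """Normalize and bounds-check a column index before passing it to
    native code, supporting Python-style negative indexing."""
    if i < 0:
        i += num_columns  # allow e.g. -1 for the last column
    if not 0 <= i < num_columns:
        raise IndexError(
            f"column index {i} out of range ({num_columns} columns)"
        )
    return i
```

Raising {{IndexError}} in the Python layer replaces the crash with the exception callers of a sequence-like API would expect.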
[jira] [Resolved] (ARROW-5987) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
[ https://issues.apache.org/jira/browse/ARROW-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann resolved ARROW-5987. -- Resolution: Cannot Reproduce > [C++][Fuzzing] arrow-ipc-fuzzing-test crash > 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 > > > Key: ARROW-5987 > URL: https://issues.apache.org/jira/browse/ARROW-5987 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Major > Labels: fuzzer > Attachments: crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 > > > {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with > {code} > arrow-ipc-fuzzing-test crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5987) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
[ https://issues.apache.org/jira/browse/ARROW-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888688#comment-16888688 ] Marco Neumann commented on ARROW-5987: -- I swear this was an issue earlier and has now magically resolved itself on master. I'll keep the arrow-testing PR open so we can at least include the crashing test file for further testing. > [C++][Fuzzing] arrow-ipc-fuzzing-test crash > 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 > > > Key: ARROW-5987 > URL: https://issues.apache.org/jira/browse/ARROW-5987 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Major > Labels: fuzzer > Attachments: crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 > > > {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with > {code} > arrow-ipc-fuzzing-test crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5987) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
Marco Neumann created ARROW-5987: Summary: [C++][Fuzzing] arrow-ipc-fuzzing-test crash 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 Key: ARROW-5987 URL: https://issues.apache.org/jira/browse/ARROW-5987 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Marco Neumann Assignee: Marco Neumann Attachments: crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with {code} arrow-ipc-fuzzing-test crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564 {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5959) [C++][CI] Fuzzit does not know about branch + commit hash
Marco Neumann created ARROW-5959: Summary: [C++][CI] Fuzzit does not know about branch + commit hash Key: ARROW-5959 URL: https://issues.apache.org/jira/browse/ARROW-5959 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Marco Neumann Assignee: Yevgeny Pats Reported [here|https://github.com/apache/arrow/pull/4504#issuecomment-509932673], fuzzit does not seem to retrieve the branch + commit hash, which is bad for tracking. h2. AC * Fix CI setup ([hint|https://github.com/apache/arrow/pull/4504#issuecomment-510415931]) * Use {{set -euxo pipefail}} in [{{docker_build_and_fuzzit.sh}}|https://github.com/apache/arrow/blob/master/ci/docker_build_and_fuzzit.sh] to prevent this issue in the future -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-5921) [C++][Fuzzing] Missing nullptr checks in IPC
Marco Neumann created ARROW-5921: Summary: [C++][Fuzzing] Missing nullptr checks in IPC Key: ARROW-5921 URL: https://issues.apache.org/jira/browse/ARROW-5921 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.14.0 Reporter: Marco Neumann Assignee: Marco Neumann Attachments: crash-09f72ba2a52b80366ab676364abec850fc668168, crash-607e9caa76863a97f2694a769a1ae2fb83c55e02, crash-cb8cedb6ff8a6f164210c497d91069812ef5d6f8, crash-f37e71777ad0324b55b99224f2c7ffb0107bdfa2, crash-fd237566879dc60fff4d956d5fe3533d74a367f3 {{arrow-ipc-fuzzing-test}} found the attached crashes. Reproduce with {code} arrow-ipc-fuzzing-test crash-xxx {code} The attached crashes all have distinct sources and are all related to missing nullptr checks. I have a fix basically ready. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881852#comment-16881852 ] Marco Neumann edited comment on ARROW-5028 at 7/10/19 8:37 AM: --- *You need a massive machine (>10GB RAM) to run this!* [^dct.json.gz]
{code:python}
import io
import json
import os.path

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def dct_to_table(index_dct):
    labeled_array = pa.array(np.array(list(index_dct.keys())))
    partition_array = pa.array(np.array(list(index_dct.values())))
    return pa.Table.from_arrays(
        [labeled_array, partition_array], names=['a', 'b']
    )


def check_pq_nulls(data):
    fp = io.BytesIO(data)
    pfile = pq.ParquetFile(fp)
    assert pfile.num_row_groups == 1
    md = pfile.metadata.row_group(0)
    col = md.column(1)
    assert col.path_in_schema == 'b.list.item'
    assert col.statistics.null_count == 0  # fails


def roundtrip(table):
    buf = pa.BufferOutputStream()
    pq.write_table(table, buf)
    data = buf.getvalue().to_pybytes()
    # this fails:
    # check_pq_nulls(data)
    reader = pa.BufferReader(data)
    return pq.read_table(reader)


with open(os.path.join(os.path.dirname(__file__), 'dct.json'), 'rb') as fp:
    dct = json.load(fp)

# this does NOT help:
# pa.set_cpu_count(1)
# import gc; gc.disable()

table = dct_to_table(dct)

# this fixes the issue:
# table = pa.Table.from_pandas(table.to_pandas())

table2 = roundtrip(table)

assert table.column('b').null_count == 0
assert table2.column('b').null_count == 0  # fails

# if table2 is converted to pandas, you can also observe that some values at
# the end of column b are `['']`, which clearly is not present in the
# original data
{code}
The content is the same as in the pickle file, but due to missing object de-duplication you need far more memory. Luckily, object de-duplication does not seem to be the underlying issue and the bug is still reproducible.
was (Author: marco.neumann.by): *You need a massive machine (>10GB RAM) to run this!* [^dct.json.gz] {code:python} import io import json import os.path import numpy as np import pyarrow as pa import pyarrow.parquet as pq def dct_to_table(index_dct): labeled_array = pa.array(np.array(list(index_dct.keys( partition_array = pa.array(np.array(list(index_dct.values( return pa.Table.from_arrays( [labeled_array, partition_array], names=['a', 'b'] ) def check_pq_nulls(data): fp = io.BytesIO(data) pfile = pq.ParquetFile(fp) assert pfile.num_row_groups == 1 md = pfile.metadata.row_group(0) col = md.column(1) assert col.path_in_schema == 'b.list.item' assert col.statistics.null_count == 0 # fails def roundtrip(table): buf = pa.BufferOutputStream() pq.write_table(table, buf) data = buf.getvalue().to_pybytes() # this fails: # check_pq_nulls(data) reader = pa.BufferReader(data) return pq.read_table(reader) with open(os.path.join(os.path.dirname(__file__), 'dct.json'), 'rb') as fp: dct = json.load(fp) # this does NOT help: # pa.set_cpu_count(1) # import gc; gc.disable() table = dct_to_table(dct) # this fixes the issue: # table = pa.Table.from_pandas(table.to_pandas()) table2 = roundtrip(table) assert table.column('b').null_count == 0 assert table2.column('b').null_count == 0 # fails # if table2 is converted to pandas, you can also observe that some values at the end of column b are `['']` which clearly is not present in the original data {code} > [Python][C++] Arrow to Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Labels: parquet > Fix For: 1.0.0 > > Attachments: dct.json.gz, dct.pickle.gz > > > I am sorry if this bugs feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still 
triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. > {code:python} > import io > import os.path > import pickle > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > def dct_to_table(index_dct): > labeled_array = pa.array(np.array(list(index_dct.keys( > partition_array = pa.array(np.array(list(index_dct.values( > return pa.Table.from_arrays( > [labeled_array, partition_array], names=['a', 'b'] > ) > def check_pq_nulls(data): > fp = io.BytesIO(data) > pfile = pq.ParquetFile(fp) > assert
[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881852#comment-16881852 ] Marco Neumann commented on ARROW-5028: -- *You need a massive machine (>10GB RAM) to run this!* [^dct.json.gz]
{code:python}
import io
import json
import os.path

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def dct_to_table(index_dct):
    labeled_array = pa.array(np.array(list(index_dct.keys())))
    partition_array = pa.array(np.array(list(index_dct.values())))
    return pa.Table.from_arrays(
        [labeled_array, partition_array], names=['a', 'b']
    )


def check_pq_nulls(data):
    fp = io.BytesIO(data)
    pfile = pq.ParquetFile(fp)
    assert pfile.num_row_groups == 1
    md = pfile.metadata.row_group(0)
    col = md.column(1)
    assert col.path_in_schema == 'b.list.item'
    assert col.statistics.null_count == 0  # fails


def roundtrip(table):
    buf = pa.BufferOutputStream()
    pq.write_table(table, buf)
    data = buf.getvalue().to_pybytes()
    # this fails:
    # check_pq_nulls(data)
    reader = pa.BufferReader(data)
    return pq.read_table(reader)


with open(os.path.join(os.path.dirname(__file__), 'dct.json'), 'rb') as fp:
    dct = json.load(fp)

# this does NOT help:
# pa.set_cpu_count(1)
# import gc; gc.disable()

table = dct_to_table(dct)

# this fixes the issue:
# table = pa.Table.from_pandas(table.to_pandas())

table2 = roundtrip(table)

assert table.column('b').null_count == 0
assert table2.column('b').null_count == 0  # fails

# if table2 is converted to pandas, you can also observe that some values at
# the end of column b are `['']`, which clearly is not present in the
# original data
{code}
> [Python][C++] Arrow to Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Labels: parquet > Fix For: 1.0.0 > > Attachments: dct.json.gz, 
dct.pickle.gz > > > I am sorry if this bugs feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. > {code:python} > import io > import os.path > import pickle > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > def dct_to_table(index_dct): > labeled_array = pa.array(np.array(list(index_dct.keys( > partition_array = pa.array(np.array(list(index_dct.values( > return pa.Table.from_arrays( > [labeled_array, partition_array], names=['a', 'b'] > ) > def check_pq_nulls(data): > fp = io.BytesIO(data) > pfile = pq.ParquetFile(fp) > assert pfile.num_row_groups == 1 > md = pfile.metadata.row_group(0) > col = md.column(1) > assert col.path_in_schema == 'b.list.item' > assert col.statistics.null_count == 0 # fails > def roundtrip(table): > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > data = buf.getvalue().to_pybytes() > # this fails: > # check_pq_nulls(data) > reader = pa.BufferReader(data) > return pq.read_table(reader) > with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp: > dct = pickle.load(fp) > # this does NOT help: > # pa.set_cpu_count(1) > # import gc; gc.disable() > table = dct_to_table(dct) > # this fixes the issue: > # table = pa.Table.from_pandas(table.to_pandas()) > table2 = roundtrip(table) > assert table.column('b').null_count == 0 > assert table2.column('b').null_count == 0 # fails > # if table2 is converted to pandas, you can also observe that some values at > the end of column b are `['']` which clearly is not present in the original > data > {code} > I would also be thankful for any pointers on where the bug comes from or on > who to reduce the test case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann updated ARROW-5028: - Attachment: dct.json.gz > [Python][C++] Arrow to Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Labels: parquet > Fix For: 1.0.0 > > Attachments: dct.json.gz, dct.pickle.gz > > > I am sorry if this bugs feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. > {code:python} > import io > import os.path > import pickle > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > def dct_to_table(index_dct): > labeled_array = pa.array(np.array(list(index_dct.keys( > partition_array = pa.array(np.array(list(index_dct.values( > return pa.Table.from_arrays( > [labeled_array, partition_array], names=['a', 'b'] > ) > def check_pq_nulls(data): > fp = io.BytesIO(data) > pfile = pq.ParquetFile(fp) > assert pfile.num_row_groups == 1 > md = pfile.metadata.row_group(0) > col = md.column(1) > assert col.path_in_schema == 'b.list.item' > assert col.statistics.null_count == 0 # fails > def roundtrip(table): > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > data = buf.getvalue().to_pybytes() > # this fails: > # check_pq_nulls(data) > reader = pa.BufferReader(data) > return pq.read_table(reader) > with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp: > dct = pickle.load(fp) > # this does NOT help: > # pa.set_cpu_count(1) > # import gc; gc.disable() > table = dct_to_table(dct) > # this fixes the issue: > # table = pa.Table.from_pandas(table.to_pandas()) > table2 = roundtrip(table) > 
assert table.column('b').null_count == 0 > assert table2.column('b').null_count == 0 # fails > # if table2 is converted to pandas, you can also observe that some values at > the end of column b are `['']` which clearly is not present in the original > data > {code} > I would also be thankful for any pointers on where the bug comes from or on > how to reduce the test case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880106#comment-16880106 ] Marco Neumann commented on ARROW-5028: -- [~emkornfi...@gmail.com] sorry for the late reply. I was building the code myself. You can use master or one of the mentioned versions ({{0.11.0}} or {{0.13.0}}). Regarding the file format: I've tried to dump this whole thing as JSON, but parsing it requires excessive amounts of memory (due to the missing string-instance deduplication used by pickle) and I wasn't able to read it back. If you have another idea, please let me know. > [Python][C++] Arrow to Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Labels: parquet > Fix For: 1.0.0 > > Attachments: dct.pickle.gz > > > I am sorry if this bug feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. 
> {code:python} > import io > import os.path > import pickle > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > def dct_to_table(index_dct): > labeled_array = pa.array(np.array(list(index_dct.keys( > partition_array = pa.array(np.array(list(index_dct.values( > return pa.Table.from_arrays( > [labeled_array, partition_array], names=['a', 'b'] > ) > def check_pq_nulls(data): > fp = io.BytesIO(data) > pfile = pq.ParquetFile(fp) > assert pfile.num_row_groups == 1 > md = pfile.metadata.row_group(0) > col = md.column(1) > assert col.path_in_schema == 'b.list.item' > assert col.statistics.null_count == 0 # fails > def roundtrip(table): > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > data = buf.getvalue().to_pybytes() > # this fails: > # check_pq_nulls(data) > reader = pa.BufferReader(data) > return pq.read_table(reader) > with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp: > dct = pickle.load(fp) > # this does NOT help: > # pa.set_cpu_count(1) > # import gc; gc.disable() > table = dct_to_table(dct) > # this fixes the issue: > # table = pa.Table.from_pandas(table.to_pandas()) > table2 = roundtrip(table) > assert table.column('b').null_count == 0 > assert table2.column('b').null_count == 0 # fails > # if table2 is converted to pandas, you can also observe that some values at > the end of column b are `['']` which clearly is not present in the original > data > {code} > I would also be thankful for any pointers on where the bug comes from or on > who to reduce the test case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5607) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 607e9caa76863a97f2694a769a1ae2fb83c55e02
Marco Neumann created ARROW-5607: Summary: [C++][Fuzzing] arrow-ipc-fuzzing-test crash 607e9caa76863a97f2694a769a1ae2fb83c55e02 Key: ARROW-5607 URL: https://issues.apache.org/jira/browse/ARROW-5607 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.13.0 Reporter: Marco Neumann Attachments: crash-607e9caa76863a97f2694a769a1ae2fb83c55e02 {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with {code} arrow-ipc-fuzzing-test crash-607e9caa76863a97f2694a769a1ae2fb83c55e02 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-5605) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c
[ https://issues.apache.org/jira/browse/ARROW-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann reassigned ARROW-5605: Assignee: Marco Neumann > [C++][Fuzzing] arrow-ipc-fuzzing-test crash > 74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c > > > Key: ARROW-5605 > URL: https://issues.apache.org/jira/browse/ARROW-5605 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0 >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Major > Labels: fuzzer > Attachments: crash-74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c > > > {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with > {code} > arrow-ipc-fuzzing-test crash-74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5605) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c
Marco Neumann created ARROW-5605: Summary: [C++][Fuzzing] arrow-ipc-fuzzing-test crash 74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c Key: ARROW-5605 URL: https://issues.apache.org/jira/browse/ARROW-5605 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.13.0 Reporter: Marco Neumann Attachments: crash-74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with {code} arrow-ipc-fuzzing-test crash-74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5593) [C++][Fuzzing] Test fuzzers against arrow-testing corpus
Marco Neumann created ARROW-5593: Summary: [C++][Fuzzing] Test fuzzers against arrow-testing corpus Key: ARROW-5593 URL: https://issues.apache.org/jira/browse/ARROW-5593 Project: Apache Arrow Issue Type: Test Components: C++ Reporter: Marco Neumann All fuzzers should be run against the corpus in [arrow-testing|https://github.com/apache/arrow-testing] to prevent regressions. The arrow CI should download the current corpus and run the fuzzers exactly once against each applicable corpus file. The fuzzers must be built with address sanitizer enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
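A minimal sketch of such a regression check, assuming a built fuzzer binary and a local corpus checkout (the binary name and corpus path below are hypothetical):

```python
import pathlib
import subprocess

def run_fuzzer_over_corpus(fuzzer, corpus_dir):
    """Run a libFuzzer-style binary exactly once per corpus file.

    libFuzzer binaries replay any file paths passed as arguments instead
    of fuzzing; a non-zero exit status (e.g. an ASAN abort) marks that
    corpus file as a regression.
    """
    failures = []
    for path in sorted(pathlib.Path(corpus_dir).iterdir()):
        result = subprocess.run([fuzzer, str(path)])
        if result.returncode != 0:
            failures.append(path.name)
    return failures

# Example invocation (hypothetical paths):
# failures = run_fuzzer_over_corpus("./arrow-ipc-fuzzing-test",
#                                   "arrow-testing/data/arrow-ipc")
```

Replaying each file individually (rather than passing the whole directory at once) ensures one crashing file cannot mask another and yields a per-file failure list for the CI log.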
[jira] [Updated] (ARROW-5589) arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713
[ https://issues.apache.org/jira/browse/ARROW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann updated ARROW-5589: - Description: {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with {code} arrow-ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713 {code} was: {{ipc-fuzzing-test}} found the attached crash. Reproduce with {code:bash} ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713 {code} > arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713 > - > > Key: ARROW-5589 > URL: https://issues.apache.org/jira/browse/ARROW-5589 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0 >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Major > Labels: fuzzer > Attachments: crash-2354085db0125113f04f7bd23f54b85cca104713 > > > {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with > {code} > arrow-ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5589) arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713
[ https://issues.apache.org/jira/browse/ARROW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann updated ARROW-5589: - Summary: arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713 (was: ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713) > arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713 > - > > Key: ARROW-5589 > URL: https://issues.apache.org/jira/browse/ARROW-5589 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.13.0 >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Major > Labels: fuzzer > Attachments: crash-2354085db0125113f04f7bd23f54b85cca104713 > > > {{ipc-fuzzing-test}} found the attached crash. Reproduce with > {code:bash} > ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5589) ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713
Marco Neumann created ARROW-5589: Summary: ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713 Key: ARROW-5589 URL: https://issues.apache.org/jira/browse/ARROW-5589 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.13.0 Reporter: Marco Neumann Assignee: Marco Neumann Attachments: crash-2354085db0125113f04f7bd23f54b85cca104713 {{ipc-fuzzing-test}} found the attached crash. Reproduce with {code:bash} ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5525) Enable continuous fuzzing
Marco Neumann created ARROW-5525: Summary: Enable continuous fuzzing Key: ARROW-5525 URL: https://issues.apache.org/jira/browse/ARROW-5525 Project: Apache Arrow Issue Type: Test Components: C++ Reporter: Marco Neumann Since fuzzing is only really effective when run as a continuous background job, we should find a way of doing so. This likely requires a service other than Travis. Basic requirements are: * master builds should be submitted for fuzzing * project members should be informed about new crashes (ideally not via a public issue, due to the potential security impact) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2256) [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos
[ https://issues.apache.org/jira/browse/ARROW-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858504#comment-16858504 ] Marco Neumann commented on ARROW-2256: -- I can confirm that and have a fix ready to commit. > [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos > > > Key: ARROW-2256 > URL: https://issues.apache.org/jira/browse/ARROW-2256 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Marco Neumann >Priority: Major > > I did a clean upgrade to 16.04 on one of my machines and ran into the problem > described here: > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866087 > I think this can be resolved temporarily by symlinking the static library, > but we should document the problem so other devs know what to do when it > happens -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2256) [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos
[ https://issues.apache.org/jira/browse/ARROW-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann reassigned ARROW-2256: Assignee: Marco Neumann > [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos > > > Key: ARROW-2256 > URL: https://issues.apache.org/jira/browse/ARROW-2256 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Marco Neumann >Priority: Major > > I did a clean upgrade to 16.04 on one of my machines and ran into the problem > described here: > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866087 > I think this can be resolved temporarily by symlinking the static library, > but we should document the problem so other devs know what to do when it > happens -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854341#comment-16854341 ] Marco Neumann commented on ARROW-5028: -- Sadly not, since the debugging is quite complicated and I feel like I'm blindly digging through the code base. > [Python][C++] Arrow to Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Labels: parquet > Fix For: 0.14.0 > > Attachments: dct.pickle.gz > > > I am sorry if this bugs feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. > {code:python} > import io > import os.path > import pickle > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > def dct_to_table(index_dct): > labeled_array = pa.array(np.array(list(index_dct.keys( > partition_array = pa.array(np.array(list(index_dct.values( > return pa.Table.from_arrays( > [labeled_array, partition_array], names=['a', 'b'] > ) > def check_pq_nulls(data): > fp = io.BytesIO(data) > pfile = pq.ParquetFile(fp) > assert pfile.num_row_groups == 1 > md = pfile.metadata.row_group(0) > col = md.column(1) > assert col.path_in_schema == 'b.list.item' > assert col.statistics.null_count == 0 # fails > def roundtrip(table): > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > data = buf.getvalue().to_pybytes() > # this fails: > # check_pq_nulls(data) > reader = pa.BufferReader(data) > return pq.read_table(reader) > with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp: > dct = pickle.load(fp) > # this does NOT help: > # pa.set_cpu_count(1) > # import gc; gc.disable() > table = dct_to_table(dct) 
> # this fixes the issue: > # table = pa.Table.from_pandas(table.to_pandas()) > table2 = roundtrip(table) > assert table.column('b').null_count == 0 > assert table2.column('b').null_count == 0 # fails > # if table2 is converted to pandas, you can also observe that some values at > the end of column b are `['']` which clearly is not present in the original > data > {code} > I would also be thankful for any pointers on where the bug comes from or on > who to reduce the test case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5166) [Python] Statistics for uint64 columns may overflow
Marco Neumann created ARROW-5166: Summary: [Python] Statistics for uint64 columns may overflow Key: ARROW-5166 URL: https://issues.apache.org/jira/browse/ARROW-5166 Project: Apache Arrow Issue Type: Bug Environment: python 3.6 pyarrow 0.13.0 Reporter: Marco Neumann Attachments: int64_statistics_overflow.parquet See the attached parquet file, where the statistics max value is smaller than the min value. You can roundtrip that file through pandas and store it back to provoke the same bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
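A plausible mechanism for statistics like these — an assumption on the editor's part, not a confirmed root cause — is that the unsigned 64-bit values are reinterpreted as signed int64 somewhere in the statistics path, at which point any value above 2^63 compares as negative and the recorded "max" sorts below the "min":

```python
import struct

def as_signed_i64(value_u64):
    # Reinterpret the raw little-endian bytes of a uint64 as an int64.
    return struct.unpack("<q", struct.pack("<Q", value_u64))[0]

u64_min, u64_max = 0, 2**64 - 1
assert u64_min < u64_max                                # unsigned ordering: fine
assert as_signed_i64(u64_max) < as_signed_i64(u64_min)  # signed: "max" < "min"
```

Under this reading, u64::MAX becomes -1, which would produce exactly the inverted min/max pair seen in the attached file.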
[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810953#comment-16810953 ] Marco Neumann commented on ARROW-5028: -- So the original table seems to be broken, because the mentioned offset array jumps backwards. The following Python code can be used to test this: {code} def get_offset(chunk, pos): b = chunk.buffers()[1] x = 0 for i in range(4): x = (x << 8) + b[pos * 4 + 3 - i] return x def check_table(table): assert table.num_columns == 2 column = table.column(1) assert column.data.num_chunks == 1 chunk = column.data.chunk(0) assert get_offset(chunk, 734168) < get_offset(chunk, 734169) # fails {code} [~wesmckinn] is it guaranteed that the offset should only go forwards? The current data looks like some kind of overflow to me, although it overflows at around 24 bits which is weird. > [Python][C++] Arrow to Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Fix For: 0.14.0 > > Attachments: dct.pickle.gz > > > I am sorry if this bug report feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. 
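The byte-assembling loop in the comment above decodes a little-endian 32-bit offset by hand; the same monotonicity check can be written more directly with {{struct}}. A sketch over a raw offsets buffer (editor's illustration, assuming the little-endian int32 layout Arrow uses for variable-size offsets):

```python
import struct

def get_offsets(buf, count):
    # Decode `count` little-endian int32 offsets from the start of the buffer.
    return list(struct.unpack_from("<%di" % count, buf, 0))

def offsets_monotonic(buf, count):
    # A valid Arrow offsets buffer must be non-decreasing: any backwards
    # jump (as observed in this issue) means the array is corrupt.
    offsets = get_offsets(buf, count)
    return all(a <= b for a, b in zip(offsets, offsets[1:]))
```

Running such a check over every pair of adjacent slots, rather than the single hard-coded pair above, would locate all backwards jumps in the buffer.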
[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810752#comment-16810752 ] Marco Neumann commented on ARROW-5028: -- More debugging results: * {{def_levels}} and {{rep_levels}} have different lengths (the first one is 1 element too short), leading to an out-of-bounds / uninitialized read, which explains the {{0}} seen in the last report * the place where a {{rep_levels}} entry is created but no data for {{def_levels}} is {{HandleNonNullList}} in {{writer.cc}} * the reason for that is that {{inner_length}} is negative. It seems to jump from a large number ({{16268812}}) to a small number ({{2}}) and then continues from there (6, 13, 17, ...) > [Python][C++] Arrow to Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Fix For: 0.14.0 > > Attachments: dct.pickle.gz > > > I am sorry if this bug report feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. 
[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804793#comment-16804793 ] Marco Neumann commented on ARROW-5028: -- Short update: The error also occurs when: * Converting the arrow table to a batch, serializing the batch to bytes, deserializing it and converting it back to the table. This is in contrast to the Pandas roundtrip. * using parquet 2.0 * disabling dictionary encoding * disabling compression (default is SNAPPY) Digging deeper, I found out that the NULL value is created in {{column_writer.cc}} {{WriteMiniBatch}} due to the condition {{def_levels[i] == descr_->max_definition_level()}} . The max def level is 3, but for the last entry, the entry in {{def_levels}} is 0, which seems wrong. The origin of this data is {{GenerateLevels}} in {{writer.cc}}, but I haven't figured out what is going on there. > [Python][C++] Arrow to Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Fix For: 0.14.0 > > Attachments: dct.pickle.gz > > > I am sorry if this bug report feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. 
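For readers unfamiliar with the level encoding discussed here: in Parquet's Dremel model, every item slot a writer emits contributes exactly one (repetition level, definition level) pair, so the two arrays must always have equal length — the invariant violated above. A toy sketch for a simplified {{list<int>}} column (editor's illustration with non-nullable column and items, unlike the max definition level of 3 in this issue):

```python
def list_levels(rows):
    """Compute (rep_levels, def_levels) for a list<int> column.
    Simplified: column and items are non-nullable; empty lists allowed."""
    rep_levels, def_levels = [], []
    for row in rows:
        if not row:
            # An empty list still occupies one slot: def level 0 marks "no item".
            rep_levels.append(0)
            def_levels.append(0)
            continue
        for i, _ in enumerate(row):
            rep_levels.append(0 if i == 0 else 1)  # rep 0 starts a new row
            def_levels.append(1)                   # item is present
    # The invariant the comment above says was violated:
    assert len(rep_levels) == len(def_levels)
    return rep_levels, def_levels
```

A writer that appends a repetition level without the matching definition level — as described for {{HandleNonNullList}} above — breaks this one-pair-per-slot invariant and leaves the reader consuming uninitialized level data.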
[jira] [Updated] (ARROW-5028) Arrow->Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann updated ARROW-5028: - Environment: python 3.6 > Arrow->Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python 3.6 >Reporter: Marco Neumann >Priority: Major > Attachments: dct.pickle.gz > > > I am sorry if this bugs feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. > {code:python} > import io > import os.path > import pickle > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > def dct_to_table(index_dct): > labeled_array = pa.array(np.array(list(index_dct.keys( > partition_array = pa.array(np.array(list(index_dct.values( > return pa.Table.from_arrays( > [labeled_array, partition_array], names=['a', 'b'] > ) > def check_pq_nulls(data): > fp = io.BytesIO(data) > pfile = pq.ParquetFile(fp) > assert pfile.num_row_groups == 1 > md = pfile.metadata.row_group(0) > col = md.column(1) > assert col.path_in_schema == 'b.list.item' > assert col.statistics.null_count == 0 # fails > def roundtrip(table): > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > data = buf.getvalue().to_pybytes() > # this fails: > # check_pq_nulls(data) > reader = pa.BufferReader(data) > return pq.read_table(reader) > with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp: > dct = pickle.load(fp) > # this does NOT help: > # pa.set_cpu_count(1) > # import gc; gc.disable() > table = dct_to_table(dct) > # this fixes the issue: > # table = pa.Table.from_pandas(table.to_pandas()) > table2 = roundtrip(table) > assert table.column('b').null_count == 0 > assert 
table2.column('b').null_count == 0 # fails > # if table2 is converted to pandas, you can also observe that some values at > the end of column b are `['']` which clearly is not present in the original > data > {code} > I would also be thankful for any pointers on where the bug comes from or on > who to reduce the test case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5028) Arrow->Parquet conversion drops and corrupts values
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann updated ARROW-5028: - Summary: Arrow->Parquet conversion drops and corrupts values (was: Arrow->Parquet store drops and corrupts values) > Arrow->Parquet conversion drops and corrupts values > --- > > Key: ARROW-5028 > URL: https://issues.apache.org/jira/browse/ARROW-5028 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1, 0.13.0 >Reporter: Marco Neumann >Priority: Major > Attachments: dct.pickle.gz > > > I am sorry if this bugs feels rather long and the reproduction data is large, > but I was not able to reduce the data even further while still triggering the > problem. I was able to trigger this behavior on master and on {{0.11.1}}. > {code:python} > import io > import os.path > import pickle > import numpy as np > import pyarrow as pa > import pyarrow.parquet as pq > def dct_to_table(index_dct): > labeled_array = pa.array(np.array(list(index_dct.keys( > partition_array = pa.array(np.array(list(index_dct.values( > return pa.Table.from_arrays( > [labeled_array, partition_array], names=['a', 'b'] > ) > def check_pq_nulls(data): > fp = io.BytesIO(data) > pfile = pq.ParquetFile(fp) > assert pfile.num_row_groups == 1 > md = pfile.metadata.row_group(0) > col = md.column(1) > assert col.path_in_schema == 'b.list.item' > assert col.statistics.null_count == 0 # fails > def roundtrip(table): > buf = pa.BufferOutputStream() > pq.write_table(table, buf) > data = buf.getvalue().to_pybytes() > # this fails: > # check_pq_nulls(data) > reader = pa.BufferReader(data) > return pq.read_table(reader) > with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp: > dct = pickle.load(fp) > # this does NOT help: > # pa.set_cpu_count(1) > # import gc; gc.disable() > table = dct_to_table(dct) > # this fixes the issue: > # table = pa.Table.from_pandas(table.to_pandas()) > table2 = roundtrip(table) > 
assert table.column('b').null_count == 0 > assert table2.column('b').null_count == 0 # fails > # if table2 is converted to pandas, you can also observe that some values at > the end of column b are `['']` which clearly is not present in the original > data > {code} > I would also be thankful for any pointers on where the bug comes from or on > who to reduce the test case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5028) Arrow->Parquet store drops and corrupts values
Marco Neumann created ARROW-5028: Summary: Arrow->Parquet store drops and corrupts values Key: ARROW-5028 URL: https://issues.apache.org/jira/browse/ARROW-5028 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.11.1, 0.13.0 Reporter: Marco Neumann Attachments: dct.pickle.gz I am sorry if this bug report feels rather long and the reproduction data is large, but I was not able to reduce the data even further while still triggering the problem. I was able to trigger this behavior on master and on {{0.11.1}}. {code:python} import io import os.path import pickle import numpy as np import pyarrow as pa import pyarrow.parquet as pq def dct_to_table(index_dct): labeled_array = pa.array(np.array(list(index_dct.keys()))) partition_array = pa.array(np.array(list(index_dct.values()))) return pa.Table.from_arrays( [labeled_array, partition_array], names=['a', 'b'] ) def check_pq_nulls(data): fp = io.BytesIO(data) pfile = pq.ParquetFile(fp) assert pfile.num_row_groups == 1 md = pfile.metadata.row_group(0) col = md.column(1) assert col.path_in_schema == 'b.list.item' assert col.statistics.null_count == 0 # fails def roundtrip(table): buf = pa.BufferOutputStream() pq.write_table(table, buf) data = buf.getvalue().to_pybytes() # this fails: # check_pq_nulls(data) reader = pa.BufferReader(data) return pq.read_table(reader) with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp: dct = pickle.load(fp) # this does NOT help: # pa.set_cpu_count(1) # import gc; gc.disable() table = dct_to_table(dct) # this fixes the issue: # table = pa.Table.from_pandas(table.to_pandas()) table2 = roundtrip(table) assert table.column('b').null_count == 0 assert table2.column('b').null_count == 0 # fails # if table2 is converted to pandas, you can also observe that some values at the end of column b are `['']` which are clearly not present in the original data {code} I would also be thankful for any pointers on where the bug comes from or on how to reduce the test case. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2963) [Python] Deadlock during fork-join and use_threads=True
[ https://issues.apache.org/jira/browse/ARROW-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566987#comment-16566987 ] Marco Neumann commented on ARROW-2963: -- The problem is that using threads worked in {{0.9.0}}, because (I think) there was no pool involved. > [Python] Deadlock during fork-join and use_threads=True > --- > > Key: ARROW-2963 > URL: https://issues.apache.org/jira/browse/ARROW-2963 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 > Environment: pandas==0.23.3 > pyarrow==0.10.0rc0 >Reporter: Marco Neumann >Assignee: Antoine Pitrou >Priority: Major > > The following code passes: > {noformat} > import os > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'x': [1]}) > table = pa.Table.from_pandas(df) > df = table.to_pandas(use_threads=False) > pid = os.fork() > if pid != 0: > os.waitpid(pid, 0) > {noformat} > but the following code will never finish (the {{waitpid}} call blocks > forever; it seems that the child process is frozen): > {noformat} > import os > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'x': [1]}) > table = pa.Table.from_pandas(df) > df = table.to_pandas(use_threads=True) > pid = os.fork() > if pid != 0: > os.waitpid(pid, 0) > {noformat} > (the only difference is {{use_threads=True}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2963) [Python] Deadlock during fork-join and use_threads=True
Marco Neumann created ARROW-2963: Summary: [Python] Deadlock during fork-join and use_threads=True Key: ARROW-2963 URL: https://issues.apache.org/jira/browse/ARROW-2963 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.10.0 Environment: pandas==0.23.3 pyarrow==0.10.0rc0 Reporter: Marco Neumann The following code passes: {noformat} import os import pandas as pd import pyarrow as pa df = pd.DataFrame({'x': [1]}) table = pa.Table.from_pandas(df) df = table.to_pandas(use_threads=False) pid = os.fork() if pid != 0: os.waitpid(pid, 0) {noformat} but the following code will never finish (the {{waitpid}} call blocks forever; it seems that the child process is frozen): {noformat} import os import pandas as pd import pyarrow as pa df = pd.DataFrame({'x': [1]}) table = pa.Table.from_pandas(df) df = table.to_pandas(use_threads=True) pid = os.fork() if pid != 0: os.waitpid(pid, 0) {noformat} (the only difference is {{use_threads=True}}) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2554) pa.array type inference bug when using NS-timestamp
[ https://issues.apache.org/jira/browse/ARROW-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann reassigned ARROW-2554: Assignee: Marco Neumann > pa.array type inference bug when using NS-timestamp > --- > > Key: ARROW-2554 > URL: https://issues.apache.org/jira/browse/ARROW-2554 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Major > Fix For: 0.10.0 > > > The following fails: > {noformat} > import pandas as pd > import pyarrow as pa > pa.array([pd.Timestamp('now').to_datetime64()]) > {noformat} > with {{ArrowNotImplementedError: Cannot convert NumPy datetime64 objects with > differing unit}}, but when you provide the correct type information directly, > it works: > {noformat} > import pandas as pd > import pyarrow as pa > pa.array([pd.Timestamp('now').to_datetime64()], type=pa.timestamp('ns')) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2554) pa.array type inference bug when using NS-timestamp
Marco Neumann created ARROW-2554: Summary: pa.array type inference bug when using NS-timestamp Key: ARROW-2554 URL: https://issues.apache.org/jira/browse/ARROW-2554 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Marco Neumann The following fails: {noformat} import pandas as pd import pyarrow as pa pa.array([pd.Timestamp('now').to_datetime64()]) {noformat} with {{ArrowNotImplementedError: Cannot convert NumPy datetime64 objects with differing unit}}, but when you provide the correct type information directly, it works: {noformat} import pandas as pd import pyarrow as pa pa.array([pd.Timestamp('now').to_datetime64()], type=pa.timestamp('ns')) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2513) [Python] DictionaryType should give access to index type and dictionary array
[ https://issues.apache.org/jira/browse/ARROW-2513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann reassigned ARROW-2513: Assignee: Marco Neumann > [Python] DictionaryType should give access to index type and dictionary array > - > > Key: ARROW-2513 > URL: https://issues.apache.org/jira/browse/ARROW-2513 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.9.0 >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Minor > > Currently, only {{ordered}} is mapped from C Type to Python, but index type > and dictionary array are not accessible from Python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2513) [Python] DictionaryType should give access to index type and dictionary array
Marco Neumann created ARROW-2513: Summary: [Python] DictionaryType should give access to index type and dictionary array Key: ARROW-2513 URL: https://issues.apache.org/jira/browse/ARROW-2513 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.9.0 Reporter: Marco Neumann Currently, only {{ordered}} is mapped from C Type to Python, but index type and dictionary array are not accessible from Python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
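What the issue asks to expose can be illustrated with a plain-Python model of dictionary encoding (editor's sketch; the function name is illustrative, not the eventual pyarrow API): a dictionary array is an index array plus a dictionary of unique values, and both halves — along with the index type — should be inspectable from Python.

```python
def dictionary_encode(values):
    """Split a sequence into (indices, dictionary), mirroring the two
    components an Arrow DictionaryArray carries."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            # First occurrence: assign the next dictionary slot.
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return indices, dictionary
```

Exposing only {{ordered}} hides both of these components; the improvement requested here is to surface them (and the integer type of the indices) on the Python `DictionaryType`.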
[jira] [Commented] (ARROW-1589) [C++] Fuzzing for certain input formats
[ https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343242#comment-16343242 ] Marco Neumann commented on ARROW-1589: -- So the "empty input" is one of them. The fuzzing process is still failing there when address sanitizer is enabled, since the {{BufferReader}} produces an out-of-bounds read. So even though you're testing this case in PR1503, the current code on master results in undefined behavior. > [C++] Fuzzing for certain input formats > --- > > Key: ARROW-1589 > URL: https://issues.apache.org/jira/browse/ARROW-1589 > Project: Apache Arrow > Issue Type: Test >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Major > Labels: pull-request-available > > The arrow lib should have fuzzing tests for certain input formats, e.g. for > reading record batches from streams. Ideally, malformed input must not crash > the system but must report a proper error. This could easily be implemented > e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with > address sanitizer (that's already implemented by Arrow's build system). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
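The failure mode described — an out-of-bounds read on empty input — is the classic case a length check prevents: validate the requested range against the buffer size before touching memory, and report an error instead of reading past the end. A language-agnostic sketch of the idea in Python (editor's illustration, not Arrow's actual {{BufferReader}} API):

```python
def read_slice(buf, offset, length):
    # Validate the requested range first; an empty buffer then rejects
    # any non-empty read instead of producing an out-of-bounds access.
    if offset < 0 or length < 0 or offset + length > len(buf):
        raise ValueError(
            "out-of-bounds read: offset=%d length=%d size=%d"
            % (offset, length, len(buf))
        )
    return buf[offset:offset + length]
```

In C++ the equivalent check would return a `Status` rather than raise, which is exactly the "report a proper error" behavior the issue description asks for.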
[jira] [Commented] (ARROW-1589) [C++] Fuzzing for certain input formats
[ https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315898#comment-16315898 ] Marco Neumann commented on ARROW-1589: -- I'll open a PR by the end of January; sorry for the delay. The code is nearly ready, but I've had some problems with the compilation workflow. > [C++] Fuzzing for certain input formats > --- > > Key: ARROW-1589 > URL: https://issues.apache.org/jira/browse/ARROW-1589 > Project: Apache Arrow > Issue Type: Test >Reporter: Marco Neumann >Assignee: Marco Neumann > > The arrow lib should have fuzzing tests for certain input formats, e.g. for > reading record batches from streams. Ideally, malformed input must not crash > the system but must report a proper error. This could easily be implemented > e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with > address sanitizer (that's already implemented by Arrow's build system). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1589) [C++] Fuzzing for certain input formats
[ https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179136#comment-16179136 ] Marco Neumann commented on ARROW-1589: -- {quote}Please understand that this software we are discussing is primarily the work of a single volunteer developer (me)...{quote} I am very thankful for your work. Arrow and parquet are absolutely amazing. I just want to help out. Integrating an automatic fuzzing solution is rather trivial (I actually have the corresponding PR nearly ready, just the usage docs are missing) and can prevent so many silly bugs (produced by smart people). I do NOT expect you to fix all bugs and problems found by the fuzzer, but it can help find missing test coverage and could, in the long term, improve the stability and security of the library. > [C++] Fuzzing for certain input formats > --- > > Key: ARROW-1589 > URL: https://issues.apache.org/jira/browse/ARROW-1589 > Project: Apache Arrow > Issue Type: Test >Reporter: Marco Neumann >Assignee: Marco Neumann > > The arrow lib should have fuzzing tests for certain input formats, e.g. for > reading record batches from streams. Ideally, malformed input must not crash > the system but must report a proper error. This could easily be implemented > e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with > address sanitizer (that's already implemented by Arrow's build system). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1589) [C++] Fuzzing for certain input formats
[ https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178675#comment-16178675 ] Marco Neumann commented on ARROW-1589: -- Currently it is not clearly stated that the message stream is trusted, so developers will assume the opposite. Also, the naming you are proposing will very likely mislead people: since the current naming within the library does not contain any information about trust ("trusted" or "untrusted"), users' minds will likely default to "trusted". So the current methods should rather be prefixed w/ "trusted"/"unsafe"/"fast". A tiny example that already segfaults is the creation and read-out of an empty stream, which IMHO should not happen. The reason why unit testing is not sufficient is that the same kind of devs who are writing the code are also writing the unit tests and therefore won't be able to think outside the box (that's not meant as an offense; it's just human behavior and applies to all developers). > [C++] Fuzzing for certain input formats > --- > > Key: ARROW-1589 > URL: https://issues.apache.org/jira/browse/ARROW-1589 > Project: Apache Arrow > Issue Type: Test >Reporter: Marco Neumann >Assignee: Marco Neumann > > The arrow lib should have fuzzing tests for certain input formats, e.g. for > reading record batches from streams. Ideally, malformed input must not crash > the system but must report a proper error. This could easily be implemented > e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with > address sanitizer (that's already implemented by Arrow's build system). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1589) Fuzzing for certain input formats
Marco Neumann created ARROW-1589: Summary: Fuzzing for certain input formats Key: ARROW-1589 URL: https://issues.apache.org/jira/browse/ARROW-1589 Project: Apache Arrow Issue Type: Test Reporter: Marco Neumann Assignee: Marco Neumann The arrow lib should have fuzzing tests for certain input formats, e.g. for reading record batches from streams. Ideally, malformed input must not crash the system but must report a proper error. This could easily be implemented e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with address sanitizer (that's already implemented by Arrow's build system). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
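The approach described in the ticket can be sketched as a simple random-input loop (a conceptual stand-in for a libfuzzer harness; {{parse_stream}} is a hypothetical toy parser, not Arrow's stream reader): malformed bytes may raise a proper error, but must never crash the process.

```python
import random


def parse_stream(data: bytes) -> int:
    """Hypothetical stand-in for reading record batches from a stream:
    validates a 4-byte little-endian length prefix before trusting it."""
    if len(data) < 4:
        raise ValueError("truncated header")
    length = int.from_bytes(data[:4], "little")
    if length > len(data) - 4:
        raise ValueError("declared length exceeds available bytes")
    return length


def fuzz(parser, iterations=1000, seed=42):
    """Feed random malformed inputs; reported errors are fine, crashes are not."""
    rng = random.Random(seed)
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parser(data)
        except ValueError:
            pass  # a proper error report is the desired behavior


fuzz(parse_stream)
print("no crashes")
```

A real libfuzzer setup additionally runs under address sanitizer, so that out-of-bounds reads surface as hard failures instead of silent undefined behavior.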
[jira] [Assigned] (ARROW-1276) Cannot serialize empty DataFrame to parquet
[ https://issues.apache.org/jira/browse/ARROW-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Neumann reassigned ARROW-1276: Assignee: Marco Neumann > Cannot serialize empty DataFrame to parquet > > > Key: ARROW-1276 > URL: https://issues.apache.org/jira/browse/ARROW-1276 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 >Reporter: Marco Neumann >Assignee: Marco Neumann >Priority: Minor > > The following code fails with {{pyarrow.lib.ArrowInvalid: Invalid: chunk size > per row_group must be greater than 0}} but should not: > {noformat} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > df = pd.DataFrame({'x': pd.Series([], dtype=int)}) > table = pa.Table.from_pandas(df) > buf = pa.InMemoryOutputStream() > pq.write_table(table, buf) > {noformat} > I have a test and a fix prepared and will upstream both in the upcoming days. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1276) Cannot serialize empty DataFrame to parquet
Marco Neumann created ARROW-1276: Summary: Cannot serialize empty DataFrame to parquet Key: ARROW-1276 URL: https://issues.apache.org/jira/browse/ARROW-1276 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.5.0 Reporter: Marco Neumann Priority: Minor The following code fails with {{pyarrow.lib.ArrowInvalid: Invalid: chunk size per row_group must be greater than 0}} but should not: {noformat} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq df = pd.DataFrame({'x': pd.Series([], dtype=int)}) table = pa.Table.from_pandas(df) buf = pa.InMemoryOutputStream() pq.write_table(table, buf) {noformat} I have a test and a fix prepared and will upstream both in the upcoming days. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
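The fix amounts to treating the empty table as a special case instead of insisting on a positive chunk size. A sketch of that logic in plain Python (illustrative only, not parquet-cpp's actual code; {{row_group_sizes}} is a hypothetical helper):

```python
def row_group_sizes(num_rows: int, chunk_size: int) -> list:
    """Split a table of num_rows rows into row groups of at most
    chunk_size rows, allowing the empty table instead of failing
    the "chunk size must be greater than 0" check."""
    if num_rows == 0:
        return []  # an empty table simply has no row groups
    if chunk_size <= 0:
        raise ValueError("chunk size per row_group must be greater than 0")
    full, rest = divmod(num_rows, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


print(row_group_sizes(0, 0))    # empty table: no row groups, no error
print(row_group_sizes(10, 4))   # [4, 4, 2]
```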
[jira] [Created] (ARROW-1083) Object categoricals are not serialized when only None is present
Marco Neumann created ARROW-1083: Summary: Object categoricals are not serialized when only None is present Key: ARROW-1083 URL: https://issues.apache.org/jira/browse/ARROW-1083 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.4.0 Reporter: Marco Neumann Priority: Minor The following code sample fails with {{pyarrow.lib.ArrowNotImplementedError: NotImplemented: unhandled type}} but should not: {noformat} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq df = pd.DataFrame({'x': [None]}) df['x'] = df['x'].astype('category') table = pa.Table.from_pandas(df) buf = pa.InMemoryOutputStream() pq.write_table(table, buf) {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
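The edge case can be modeled in plain Python (a conceptual sketch, not pyarrow's serializer): dictionary-encoding an all-None column yields an empty dictionary and all-null indices, which is exactly the shape the type inference above fails to handle.

```python
def dictionary_encode(values):
    """Encode values as (indices, dictionary); None becomes a null index."""
    dictionary, seen, indices = [], {}, []
    for v in values:
        if v is None:
            indices.append(None)  # null slot, no dictionary entry
            continue
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return indices, dictionary


# All-None categorical: all-null indices over an EMPTY dictionary.
print(dictionary_encode([None]))  # ([None], [])
```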