[jira] [Created] (ARROW-12635) [RUST] U64::MAX does not roundtrip through parquet

2021-05-03 Thread Marco Neumann (Jira)
Marco Neumann created ARROW-12635:
-

 Summary: [RUST] U64::MAX does not roundtrip through parquet
 Key: ARROW-12635
 URL: https://issues.apache.org/jira/browse/ARROW-12635
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Marco Neumann


Use the following test to reproduce:
{code:rust}
#[test]
fn u64_min_max() {
    let values = Arc::new(UInt64Array::from_iter_values(vec![u64::MIN, u64::MAX]));
    one_column_roundtrip("u64_min_max_single_column", values, false);
}
{code}
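For context, a sketch (in Python, not from the original report) of why this value is special: parquet has no unsigned 64-bit physical type, so u64 columns are stored as INT64 plus an unsigned logical-type annotation, and {{u64::MAX}} shares its bit pattern with the signed value -1. Any conversion that goes through the signed value without preserving the bits breaks the roundtrip.

{code:python}
import struct

# reinterpret the u64::MAX bit pattern as a signed int64 (little-endian)
u64_max = 2**64 - 1
as_i64 = struct.unpack("<q", struct.pack("<Q", u64_max))[0]
assert as_i64 == -1
{code}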
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7712) [CI][Crossbow] Fix or delete fuzzit jobs

2020-01-29 Thread Marco Neumann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025739#comment-17025739
 ] 

Marco Neumann commented on ARROW-7712:
--

[~apitrou] I think we should focus on a single solution. I don't have a very 
strong opinion on that. Fuzzit was nice because they approached me and offered 
their solution including some assistance, but OSS-fuzz is the de facto standard 
for OSS.

> [CI][Crossbow] Fix or delete fuzzit jobs
> 
>
> Key: ARROW-7712
> URL: https://issues.apache.org/jira/browse/ARROW-7712
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Neal Richardson
>Priority: Major
>
> Not sure we need them now that we're using the OSS-Fuzz project, but they're 
> broken. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6872) [C++][Python] Empty table with dictionary-columns raises ArrowNotImplementedError

2019-10-14 Thread Marco Neumann (Jira)
Marco Neumann created ARROW-6872:


 Summary: [C++][Python] Empty table with dictionary-columns raises 
ArrowNotImplementedError
 Key: ARROW-6872
 URL: https://issues.apache.org/jira/browse/ARROW-6872
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.15.0
Reporter: Marco Neumann


h2. Abstract
As a pyarrow user, I would expect that I can create an empty table out of every 
schema that I created via pandas. This does not work for dictionary types (e.g. 
{{"category"}} dtypes).

h2. Test Case
This code:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": pd.Series(["x", "y"], dtype="category")})
table = pa.Table.from_pandas(df)
schema = table.schema
table_empty = schema.empty_table()  # boom
{code}

produces this exception:

{noformat}
Traceback (most recent call last):
  File "arrow_bug.py", line 8, in 
table_empty = schema.empty_table()
  File "pyarrow/types.pxi", line 860, in __iter__
  File "pyarrow/array.pxi", line 211, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Sequence converter for type 
dictionary not implemented
{noformat}
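A possible workaround, sketched below (my addition, assuming {{Table.from_batches}} accepts an empty batch list with an explicit schema): build the empty table from batches instead of {{Schema.empty_table}}, which sidesteps the per-type sequence converters.

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": pd.Series(["x", "y"], dtype="category")})
schema = pa.Table.from_pandas(df).schema

# an empty batch list plus an explicit schema avoids the (unimplemented)
# sequence converter for dictionary types
table_empty = pa.Table.from_batches([], schema=schema)
assert table_empty.num_rows == 0
{code}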



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5525) [C++][CI] Enable continuous fuzzing

2019-09-17 Thread Marco Neumann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931568#comment-16931568
 ] 

Marco Neumann commented on ARROW-5525:
--

{quote}[~marco.neumann.by] you are admin in the organisation.
{quote}
Didn't know that. [~pitrou] which mail address do you use for GitHub so I can 
add you to the Org?

 
{quote}As far as I remember the fuzzing was a bit stalled as the 
arrow-ipc-fuzzing target was crashing constantly and it wasn't fix so it 
doesn't really accumulate any interesting corpus.
{quote}
I have tried to fix all known bugs and fixed the CI, so for some weeks now it 
has been running more or less smoothly again. One thing we might change is to 
add some known arrow files to the seed corpus, so we don't rely solely on the 
fuzzer to find valid files during exploration.
{quote}Also a lot was changed since we first integrated apache-arrow so if 
fuzzing is a again a priority I would love to help - transfer apache/arrow to 
new organisation (the old one was deprecated.) and update the Fuzzit CLI to 
latest version.
{quote}
That would help a lot I think.

> [C++][CI] Enable continuous fuzzing
> ---
>
> Key: ARROW-5525
> URL: https://issues.apache.org/jira/browse/ARROW-5525
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++
>Reporter: Marco Neumann
>Assignee: Yevgeny Pats
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Since fuzzing only really works when done as a continuous background job, we 
> should find a way to do so. This likely requires a service other than 
> Travis. Basic requirements are:
>  * master builds should be submitted for fuzzing
>  * project members should be informed about new crashes (ideally not via 
> public issue due to the potential security impact)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5525) [C++][CI] Enable continuous fuzzing

2019-09-17 Thread Marco Neumann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931519#comment-16931519
 ] 

Marco Neumann commented on ARROW-5525:
--

There's [https://fuzzit.dev/] where you can log in via GitHub, but I think your 
account must be linked to the {{apache/arrow}} organization (on Fuzzit, not on 
GitHub). That (to my understanding) must be done by the Fuzzit support team 
([~yevgenyp] ?).

> [C++][CI] Enable continuous fuzzing
> ---
>
> Key: ARROW-5525
> URL: https://issues.apache.org/jira/browse/ARROW-5525
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++
>Reporter: Marco Neumann
>Assignee: Yevgeny Pats
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Since fuzzing only really works when done as a continuous background job, we 
> should find a way to do so. This likely requires a service other than 
> Travis. Basic requirements are:
>  * master builds should be submitted for fuzzing
>  * project members should be informed about new crashes (ideally not via 
> public issue due to the potential security impact)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6424) [C++][Fuzzing] Fuzzit nightly is broken

2019-09-03 Thread Marco Neumann (Jira)
Marco Neumann created ARROW-6424:


 Summary: [C++][Fuzzing] Fuzzit nightly is broken
 Key: ARROW-6424
 URL: https://issues.apache.org/jira/browse/ARROW-6424
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Marco Neumann
Assignee: Marco Neumann


We don't get any new fuzzit uploads anymore, see 
https://circleci.com/gh/ursa-labs/crossbow/2296 for details. Seems like the 
binary is not found anymore:

{noformat}
...
+ pushd /build/cpp
/build/cpp /
+ mkdir ./relwithdebinfo/out
+ cp ./relwithdebinfo/arrow-ipc-fuzzing-test ./relwithdebinfo/out/fuzzer
cp: cannot stat './relwithdebinfo/arrow-ipc-fuzzing-test': No such file or 
directory
Exited with code 1
{noformat}

Looking at 
https://github.com/ursa-labs/crossbow/branches/all?utf8=%E2%9C%93=fuzzit 
it seems it has been broken since the 19th of August, very likely due to 
[438a140142be423b1b2af2399567a0a8aeba9aa1|https://github.com/apache/arrow/commit/438a140142be423b1b2af2399567a0a8aeba9aa1].



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6273) [C++][Fuzzing] Add fuzzer for parquet->arrow read path

2019-08-16 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-6273:


 Summary: [C++][Fuzzing] Add fuzzer for parquet->arrow read path
 Key: ARROW-6273
 URL: https://issues.apache.org/jira/browse/ARROW-6273
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Marco Neumann
Assignee: Marco Neumann


The parquet-to-arrow read path is likely the most commonly used one (esp. by 
pyarrow) and is a self-contained step, which should allow us to fuzz the 
reading of untrusted parquet files into memory. This complements the existing 
arrow ipc fuzzer.
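A rough sketch of the shape such a target takes (a Python analogue for illustration; the real fuzzer would be a C++ libFuzzer entry point):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq


def fuzz_one(data: bytes) -> None:
    # feed arbitrary bytes into the parquet->arrow read path; anything other
    # than a clean error (crash, hang, sanitizer report) is a finding
    try:
        pq.read_table(pa.BufferReader(data))
    except pa.ArrowException:
        pass  # malformed input rejected cleanly
{code}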



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6270) [C++][Fuzzing] IPC reads do not check buffer indices

2019-08-16 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-6270:


 Summary: [C++][Fuzzing] IPC reads do not check buffer indices
 Key: ARROW-6270
 URL: https://issues.apache.org/jira/browse/ARROW-6270
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Marco Neumann
Assignee: Marco Neumann
 Attachments: crash-bd7e00178af2d236fdf041fcc1fb30975bf8fbca

The attached crash was found by {{arrow-ipc-fuzzing-test}} and indicates that 
the IPC reader is not checking the flatbuffer-encoded buffers for length and 
can produce out-of-bounds reads.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6269) [C++][Fuzzing] IPC reads do not check decimal precision

2019-08-16 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-6269:


 Summary: [C++][Fuzzing] IPC reads do not check decimal precision
 Key: ARROW-6269
 URL: https://issues.apache.org/jira/browse/ARROW-6269
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Marco Neumann
Assignee: Marco Neumann
 Attachments: crash-5e88bae6ac5250714e8c8bc73b9d67b949fadbb4

The fuzzit runs found the attached crash. The underlying issue is that 
{{Decimal}} {{precision}} values are checked too late (in the {{Decimal}} 
constructor instead of in the IPC code).
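For reference, a sketch (my illustration, assuming current pyarrow behaviour) of the precision bound that should be enforced before the type is constructed:

{code:python}
import pyarrow as pa

# decimal128 precision must lie in 1..38; the IPC reader should validate
# this range itself instead of relying on the Decimal constructor
pa.decimal128(precision=38)  # ok
try:
    pa.decimal128(precision=39)
except ValueError:
    pass  # out-of-range precision rejected
{code}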



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5959) [C++][CI] Fuzzit does not know about branch + commit hash

2019-07-26 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann reassigned ARROW-5959:


Assignee: Marco Neumann  (was: Yevgeny Pats)

> [C++][CI] Fuzzit does not know about branch + commit hash
> -
>
> Key: ARROW-5959
> URL: https://issues.apache.org/jira/browse/ARROW-5959
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Minor
>  Labels: CI, fuzzer
>
> Reported 
> [here|https://github.com/apache/arrow/pull/4504#issuecomment-509932673], 
> fuzzit does not seem to retrieve the branch + commit hash, which is bad for 
> tracking.
> h2. AC
>  * Fix CI setup 
> ([hint|https://github.com/apache/arrow/pull/4504#issuecomment-510415931])
>  * Use {{set -euxo pipefail}} in 
> [{{docker_build_and_fuzzit.sh}}|https://github.com/apache/arrow/blob/master/ci/docker_build_and_fuzzit.sh]
>  to prevent this issue in the future



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5990) RowGroupMetaData.column misses bounds check

2019-07-19 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5990:


 Summary: RowGroupMetaData.column misses bounds check
 Key: ARROW-5990
 URL: https://issues.apache.org/jira/browse/ARROW-5990
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
Reporter: Marco Neumann
Assignee: Marco Neumann


{{RowGroupMetaData.column}} currently does not check for negative or too large 
positive indices, leading to a potential interpreter crash.
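A minimal sketch of the failure mode (my illustration; the file name is a placeholder):

{code:python}
import pyarrow.parquet as pq

md = pq.ParquetFile("some.parquet").metadata.row_group(0)

# without a bounds check, an out-of-range index reaches into native memory
# instead of raising IndexError, which can bring down the interpreter
md.column(md.num_columns)
{code}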



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5987) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564

2019-07-19 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann resolved ARROW-5987.
--
Resolution: Cannot Reproduce

> [C++][Fuzzing] arrow-ipc-fuzzing-test crash 
> 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
> 
>
> Key: ARROW-5987
> URL: https://issues.apache.org/jira/browse/ARROW-5987
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Major
>  Labels: fuzzer
> Attachments: crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
>
>
> {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with
> {code}
> arrow-ipc-fuzzing-test crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5987) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564

2019-07-19 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888688#comment-16888688
 ] 

Marco Neumann commented on ARROW-5987:
--

I swear this was an issue earlier and has now been magically resolved on 
master. I'll keep the arrow-testing PR open so we can at least include the 
crashing test file for further testing.

> [C++][Fuzzing] arrow-ipc-fuzzing-test crash 
> 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
> 
>
> Key: ARROW-5987
> URL: https://issues.apache.org/jira/browse/ARROW-5987
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Major
>  Labels: fuzzer
> Attachments: crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
>
>
> {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with
> {code}
> arrow-ipc-fuzzing-test crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5987) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 3c3f1b74f347ec6c8b0905e7126b9074b9dc5564

2019-07-19 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5987:


 Summary: [C++][Fuzzing] arrow-ipc-fuzzing-test crash 
3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
 Key: ARROW-5987
 URL: https://issues.apache.org/jira/browse/ARROW-5987
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Marco Neumann
Assignee: Marco Neumann
 Attachments: crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564

{{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with
{code}
arrow-ipc-fuzzing-test crash-3c3f1b74f347ec6c8b0905e7126b9074b9dc5564
{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5959) [C++][CI] Fuzzit does not know about branch + commit hash

2019-07-16 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5959:


 Summary: [C++][CI] Fuzzit does not know about branch + commit hash
 Key: ARROW-5959
 URL: https://issues.apache.org/jira/browse/ARROW-5959
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Marco Neumann
Assignee: Yevgeny Pats


Reported 
[here|https://github.com/apache/arrow/pull/4504#issuecomment-509932673], fuzzit 
does not seem to retrieve the branch + commit hash, which is bad for tracking.
h2. AC
 * Fix CI setup 
([hint|https://github.com/apache/arrow/pull/4504#issuecomment-510415931])
 * Use {{set -euxo pipefail}} in 
[{{docker_build_and_fuzzit.sh}}|https://github.com/apache/arrow/blob/master/ci/docker_build_and_fuzzit.sh]
 to prevent this issue in the future



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5921) [C++][Fuzzing] Missing nullptr checks in IPC

2019-07-12 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5921:


 Summary: [C++][Fuzzing] Missing nullptr checks in IPC
 Key: ARROW-5921
 URL: https://issues.apache.org/jira/browse/ARROW-5921
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.0
Reporter: Marco Neumann
Assignee: Marco Neumann
 Attachments: crash-09f72ba2a52b80366ab676364abec850fc668168, 
crash-607e9caa76863a97f2694a769a1ae2fb83c55e02, 
crash-cb8cedb6ff8a6f164210c497d91069812ef5d6f8, 
crash-f37e71777ad0324b55b99224f2c7ffb0107bdfa2, 
crash-fd237566879dc60fff4d956d5fe3533d74a367f3

{{arrow-ipc-fuzzing-test}} found the attached crashes. Reproduce with
{code}
arrow-ipc-fuzzing-test crash-xxx
{code}

The attached crashes all have distinct root causes and are all related to 
missing nullptr checks. I have a fix basically ready.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-07-10 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881852#comment-16881852
 ] 

Marco Neumann edited comment on ARROW-5028 at 7/10/19 8:37 AM:
---

*You need a massive machine (>10GB RAM) to run this!*

 [^dct.json.gz] 

{code:python}
import io
import json
import os.path

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def dct_to_table(index_dct):
    labeled_array = pa.array(np.array(list(index_dct.keys())))
    partition_array = pa.array(np.array(list(index_dct.values())))

    return pa.Table.from_arrays(
        [labeled_array, partition_array], names=['a', 'b']
    )


def check_pq_nulls(data):
    fp = io.BytesIO(data)
    pfile = pq.ParquetFile(fp)
    assert pfile.num_row_groups == 1
    md = pfile.metadata.row_group(0)
    col = md.column(1)
    assert col.path_in_schema == 'b.list.item'
    assert col.statistics.null_count == 0  # fails


def roundtrip(table):
    buf = pa.BufferOutputStream()
    pq.write_table(table, buf)

    data = buf.getvalue().to_pybytes()

    # this fails:
    #   check_pq_nulls(data)

    reader = pa.BufferReader(data)
    return pq.read_table(reader)


with open(os.path.join(os.path.dirname(__file__), 'dct.json'), 'rb') as fp:
    dct = json.load(fp)


# this does NOT help:
#   pa.set_cpu_count(1)
#   import gc; gc.disable()

table = dct_to_table(dct)

# this fixes the issue:
#   table = pa.Table.from_pandas(table.to_pandas())

table2 = roundtrip(table)

assert table.column('b').null_count == 0
assert table2.column('b').null_count == 0  # fails

# if table2 is converted to pandas, you can also observe that some values at
# the end of column b are `['']` which clearly is not present in the original data
{code}

The content is the same as in the pickle file, but due to the missing object 
de-duplication you need far more memory. Luckily, object de-duplication does 
not seem to be the underlying issue and the bug is still reproducible.


was (Author: marco.neumann.by):
*You need a massive machine (>10GB RAM) to run this!*

 [^dct.json.gz] 

{code:python}
import io
import json
import os.path

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def dct_to_table(index_dct):
    labeled_array = pa.array(np.array(list(index_dct.keys())))
    partition_array = pa.array(np.array(list(index_dct.values())))

    return pa.Table.from_arrays(
        [labeled_array, partition_array], names=['a', 'b']
    )


def check_pq_nulls(data):
    fp = io.BytesIO(data)
    pfile = pq.ParquetFile(fp)
    assert pfile.num_row_groups == 1
    md = pfile.metadata.row_group(0)
    col = md.column(1)
    assert col.path_in_schema == 'b.list.item'
    assert col.statistics.null_count == 0  # fails


def roundtrip(table):
    buf = pa.BufferOutputStream()
    pq.write_table(table, buf)

    data = buf.getvalue().to_pybytes()

    # this fails:
    #   check_pq_nulls(data)

    reader = pa.BufferReader(data)
    return pq.read_table(reader)


with open(os.path.join(os.path.dirname(__file__), 'dct.json'), 'rb') as fp:
    dct = json.load(fp)


# this does NOT help:
#   pa.set_cpu_count(1)
#   import gc; gc.disable()

table = dct_to_table(dct)

# this fixes the issue:
#   table = pa.Table.from_pandas(table.to_pandas())

table2 = roundtrip(table)

assert table.column('b').null_count == 0
assert table2.column('b').null_count == 0  # fails

# if table2 is converted to pandas, you can also observe that some values at
# the end of column b are `['']` which clearly is not present in the original data
{code}

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
> Attachments: dct.json.gz, dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert 

[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-07-10 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881852#comment-16881852
 ] 

Marco Neumann commented on ARROW-5028:
--

*You need a massive machine (>10GB RAM) to run this!*

 [^dct.json.gz] 

{code:python}
import io
import json
import os.path

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def dct_to_table(index_dct):
    labeled_array = pa.array(np.array(list(index_dct.keys())))
    partition_array = pa.array(np.array(list(index_dct.values())))

    return pa.Table.from_arrays(
        [labeled_array, partition_array], names=['a', 'b']
    )


def check_pq_nulls(data):
    fp = io.BytesIO(data)
    pfile = pq.ParquetFile(fp)
    assert pfile.num_row_groups == 1
    md = pfile.metadata.row_group(0)
    col = md.column(1)
    assert col.path_in_schema == 'b.list.item'
    assert col.statistics.null_count == 0  # fails


def roundtrip(table):
    buf = pa.BufferOutputStream()
    pq.write_table(table, buf)

    data = buf.getvalue().to_pybytes()

    # this fails:
    #   check_pq_nulls(data)

    reader = pa.BufferReader(data)
    return pq.read_table(reader)


with open(os.path.join(os.path.dirname(__file__), 'dct.json'), 'rb') as fp:
    dct = json.load(fp)


# this does NOT help:
#   pa.set_cpu_count(1)
#   import gc; gc.disable()

table = dct_to_table(dct)

# this fixes the issue:
#   table = pa.Table.from_pandas(table.to_pandas())

table2 = roundtrip(table)

assert table.column('b').null_count == 0
assert table2.column('b').null_count == 0  # fails

# if table2 is converted to pandas, you can also observe that some values at
# the end of column b are `['']` which clearly is not present in the original data
{code}

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
> Attachments: dct.json.gz, dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at 
> # the end of column b are `['']` which clearly is not present in the original 
> # data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-07-10 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann updated ARROW-5028:
-
Attachment: dct.json.gz

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
> Attachments: dct.json.gz, dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at 
> # the end of column b are `['']` which clearly is not present in the original 
> # data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-07-08 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880106#comment-16880106
 ] 

Marco Neumann commented on ARROW-5028:
--

[~emkornfi...@gmail.com] sorry for the late reply. I was building the code 
myself. You can use master or one of the mentioned versions ({{0.11.0}} or 
{{0.13.0}}). Regarding the file format: I've tried to dump this whole thing as 
json, but parsing it back requires excessive amounts of memory (due to json 
lacking the string-instance deduplication that pickle provides) and I wasn't 
able to read it back. If you have another idea, please let me know.

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
> Attachments: dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at 
> # the end of column b are `['']` which clearly is not present in the original 
> # data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5607) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 607e9caa76863a97f2694a769a1ae2fb83c55e02

2019-06-14 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5607:


 Summary: [C++][Fuzzing] arrow-ipc-fuzzing-test crash 
607e9caa76863a97f2694a769a1ae2fb83c55e02
 Key: ARROW-5607
 URL: https://issues.apache.org/jira/browse/ARROW-5607
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Marco Neumann
 Attachments: crash-607e9caa76863a97f2694a769a1ae2fb83c55e02

{{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with
{code}
arrow-ipc-fuzzing-test crash-607e9caa76863a97f2694a769a1ae2fb83c55e02
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5605) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c

2019-06-14 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann reassigned ARROW-5605:


Assignee: Marco Neumann

> [C++][Fuzzing] arrow-ipc-fuzzing-test crash 
> 74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c
> 
>
> Key: ARROW-5605
> URL: https://issues.apache.org/jira/browse/ARROW-5605
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Major
>  Labels: fuzzer
> Attachments: crash-74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c
>
>
> {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with
> {code}
> arrow-ipc-fuzzing-test crash-74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5605) [C++][Fuzzing] arrow-ipc-fuzzing-test crash 74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c

2019-06-14 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5605:


 Summary: [C++][Fuzzing] arrow-ipc-fuzzing-test crash 
74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c
 Key: ARROW-5605
 URL: https://issues.apache.org/jira/browse/ARROW-5605
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Marco Neumann
 Attachments: crash-74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c

{{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with
{code}
arrow-ipc-fuzzing-test crash-74aec871d14bb6b07c72ea8f0e8c9f72cbe6b73c
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5593) [C++][Fuzzing] Test fuzzers against arrow-testing corpus

2019-06-13 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5593:


 Summary: [C++][Fuzzing] Test fuzzers against arrow-testing corpus
 Key: ARROW-5593
 URL: https://issues.apache.org/jira/browse/ARROW-5593
 Project: Apache Arrow
  Issue Type: Test
  Components: C++
Reporter: Marco Neumann


All fuzzers should be run against the corpus in 
[arrow-testing|https://github.com/apache/arrow-testing] to prevent regressions. 
The arrow CI should download the current corpus and run the fuzzers exactly 
once against each applicable corpus file. The fuzzers must be built with the 
address sanitizer enabled.
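A sketch of that CI step (my illustration; the corpus path and binary name are assumptions):

{code:python}
import pathlib
import subprocess

# a libFuzzer binary invoked with file arguments runs each input exactly
# once and exits; check=True fails the CI job on any crash
corpus = pathlib.Path("arrow-testing/data/arrow-ipc-stream")
for corpus_file in sorted(corpus.glob("crash-*")):
    subprocess.run(["./arrow-ipc-fuzzing-test", str(corpus_file)], check=True)
{code}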



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5589) arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713

2019-06-13 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann updated ARROW-5589:
-
Description: 
{{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with
{code}
arrow-ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713
{code}

  was:
{{ipc-fuzzing-test}} found the attached crash. Reproduce with

{code:bash}
ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713
{code}



> arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713
> -
>
> Key: ARROW-5589
> URL: https://issues.apache.org/jira/browse/ARROW-5589
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Major
>  Labels: fuzzer
> Attachments: crash-2354085db0125113f04f7bd23f54b85cca104713
>
>
> {{arrow-ipc-fuzzing-test}} found the attached crash. Reproduce with
> {code}
> arrow-ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5589) arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713

2019-06-13 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann updated ARROW-5589:
-
Summary: arrow-ipc-fuzzing-test crash 
2354085db0125113f04f7bd23f54b85cca104713  (was: ipc-fuzzing-test crash 
2354085db0125113f04f7bd23f54b85cca104713)

> arrow-ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713
> -
>
> Key: ARROW-5589
> URL: https://issues.apache.org/jira/browse/ARROW-5589
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Major
>  Labels: fuzzer
> Attachments: crash-2354085db0125113f04f7bd23f54b85cca104713
>
>
> {{ipc-fuzzing-test}} found the attached crash. Reproduce with
> {code:bash}
> ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5589) ipc-fuzzing-test crash 2354085db0125113f04f7bd23f54b85cca104713

2019-06-13 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5589:


 Summary: ipc-fuzzing-test crash 
2354085db0125113f04f7bd23f54b85cca104713
 Key: ARROW-5589
 URL: https://issues.apache.org/jira/browse/ARROW-5589
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Marco Neumann
Assignee: Marco Neumann
 Attachments: crash-2354085db0125113f04f7bd23f54b85cca104713

{{ipc-fuzzing-test}} found the attached crash. Reproduce with

{code:bash}
ipc-fuzzing-test crash-2354085db0125113f04f7bd23f54b85cca104713
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5525) Enable continuous fuzzing

2019-06-07 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5525:


 Summary: Enable continuous fuzzing
 Key: ARROW-5525
 URL: https://issues.apache.org/jira/browse/ARROW-5525
 Project: Apache Arrow
  Issue Type: Test
  Components: C++
Reporter: Marco Neumann


Since fuzzing only really works when done as a continuous background job, we 
should find a way to do so. This likely requires a service other than Travis. 
Basic requirements are:
 * master builds should be submitted for fuzzing
 * project members should be informed about new crashes (ideally not via public 
issue due to the potential security impact)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2256) [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos

2019-06-07 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858504#comment-16858504
 ] 

Marco Neumann commented on ARROW-2256:
--

I can confirm that and have a fix ready to commit.

> [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos
> 
>
> Key: ARROW-2256
> URL: https://issues.apache.org/jira/browse/ARROW-2256
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Marco Neumann
>Priority: Major
>
> I did a clean upgrade to 16.04 on one of my machines and ran into the problem 
> described here:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866087
> I think this can be resolved temporarily by symlinking the static library, 
> but we should document the problem so other devs know what to do when it 
> happens



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2256) [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos

2019-06-07 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann reassigned ARROW-2256:


Assignee: Marco Neumann

> [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos
> 
>
> Key: ARROW-2256
> URL: https://issues.apache.org/jira/browse/ARROW-2256
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Marco Neumann
>Priority: Major
>
> I did a clean upgrade to 16.04 on one of my machines and ran into the problem 
> described here:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866087
> I think this can be resolved temporarily by symlinking the static library, 
> but we should document the problem so other devs know what to do when it 
> happens



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-06-03 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854341#comment-16854341
 ] 

Marco Neumann commented on ARROW-5028:
--

Sadly not, since the debugging is quite complicated and I feel like I'm blindly 
digging through the code base.

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
> Attachments: dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at 
> # the end of column b are `['']` which clearly is not present in the original 
> # data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5166) [Python] Statistics for uint64 columns may overflow

2019-04-12 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5166:


 Summary: [Python] Statistics for uint64 columns may overflow
 Key: ARROW-5166
 URL: https://issues.apache.org/jira/browse/ARROW-5166
 Project: Apache Arrow
  Issue Type: Bug
 Environment: python 3.6
pyarrow 0.13.0

Reporter: Marco Neumann
 Attachments: int64_statistics_overflow.parquet

See the attached parquet file, where the statistics max value is smaller than 
the min value.

You can roundtrip that file through pandas and store it back to provoke the 
same bug.
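A minimal sketch of such a roundtrip (my illustration, not from the original report; the file name is a placeholder):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# 2**64 - 1 overflows once the statistics are interpreted as signed int64
arr = pa.array([1, 2**64 - 1], type=pa.uint64())
pq.write_table(pa.Table.from_arrays([arr], names=['x']), 'u64_stats.parquet')

stats = pq.ParquetFile('u64_stats.parquet').metadata.row_group(0).column(0).statistics
print(stats.min, stats.max)  # max comes back smaller than min
{code}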



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-04-05 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810953#comment-16810953
 ] 

Marco Neumann commented on ARROW-5028:
--

So the original table seems to be broken, because the mentioned offset array 
appears to jump backwards. The following python code can be used to test this:
{code}
def get_offset(chunk, pos):
    # reassemble the little-endian int32 offset at position `pos`
    b = chunk.buffers()[1]
    x = 0
    for i in range(4):
        x = (x << 8) + b[pos * 4 + 3 - i]
    return x


def check_table(table):
    assert table.num_columns == 2
    column = table.column(1)

    assert column.data.num_chunks == 1
    chunk = column.data.chunk(0)

    assert get_offset(chunk, 734168) < get_offset(chunk, 734169)  # fails
{code}
[~wesmckinn] is it guaranteed that the offsets only ever go forwards? The 
current data looks like some kind of overflow to me, although it overflows at 
around 24 bits, which is weird.

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
> Fix For: 0.14.0
>
> Attachments: dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at 
> # the end of column b are `['']` which clearly is not present in the original 
> # data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-04-05 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810752#comment-16810752
 ] 

Marco Neumann commented on ARROW-5028:
--

More debugging results:
 * {{def_levels}} and {{rep_levels}} have different lengths (the first one is 1 
element too short), leading to an out-of-bounds / uninitialized read, which 
explains the {{0}} seen in the last report
 * the place where a {{rep_levels}} entry is created without a matching 
{{def_levels}} entry is {{HandleNonNullList}} in {{writer.cc}}
 * the reason for that is that {{inner_length}} is negative. It seems to jump 
from a large number ({{16268812}}) to a small number ({{2}}) and then continues 
from there (6, 13, 17, ...)

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
> Fix For: 0.14.0
>
> Attachments: dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at 
> # the end of column b are `['']` which clearly is not present in the original 
> # data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

2019-03-29 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804793#comment-16804793
 ] 

Marco Neumann commented on ARROW-5028:
--

Short update:

The error also occurs when:
 * converting the arrow table to a batch, serializing the batch to bytes, 
deserializing it and converting it back to a table. This is in contrast to 
the pandas roundtrip.
 * using parquet 2.0
 * disabling dictionary encoding
 * disabling compression (default is SNAPPY)

Digging deeper, I found out that the NULL value is created in 
{{column_writer.cc}} {{WriteMiniBatch}} due to the condition {{def_levels[i] == 
descr_->max_definition_level()}}. The max def level is 3, but for the last 
entry the value in {{def_levels}} is 0, which seems wrong. The origin of this 
data is {{GenerateLevels}} in {{writer.cc}}, but I haven't figured out what is 
going on there.

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
> Fix For: 0.14.0
>
> Attachments: dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at 
> # the end of column b are `['']` which clearly is not present in the original 
> # data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5028) Arrow->Parquet conversion drops and corrupts values

2019-03-27 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann updated ARROW-5028:
-
Environment: python 3.6

> Arrow->Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
>Reporter: Marco Neumann
>Priority: Major
> Attachments: dct.pickle.gz
>
>
> I am sorry if this bug feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at 
> # the end of column b are `['']` which clearly is not present in the original 
> # data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5028) Arrow->Parquet conversion drops and corrupts values

2019-03-27 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann updated ARROW-5028:
-
Summary: Arrow->Parquet conversion drops and corrupts values  (was: 
Arrow->Parquet store drops and corrupts values)

> Arrow->Parquet conversion drops and corrupts values
> ---
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
>Reporter: Marco Neumann
>Priority: Major
> Attachments: dct.pickle.gz
>
>
> I am sorry if this bugs feels rather long and the reproduction data is large, 
> but I was not able to reduce the data even further while still triggering the 
> problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values
> # at the end of column b are `['']`, which are clearly not present in the
> # original data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on 
> how to reduce the test case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5028) Arrow->Parquet store drops and corrupts values

2019-03-27 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-5028:


 Summary: Arrow->Parquet store drops and corrupts values
 Key: ARROW-5028
 URL: https://issues.apache.org/jira/browse/ARROW-5028
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.11.1, 0.13.0
Reporter: Marco Neumann
 Attachments: dct.pickle.gz

I am sorry if this bug report feels rather long and the reproduction data is 
large, but I was not able to reduce the data any further while still triggering 
the problem. I was able to trigger this behavior on master and on {{0.11.1}}.

{code:python}
import io
import os.path
import pickle

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def dct_to_table(index_dct):
    labeled_array = pa.array(np.array(list(index_dct.keys())))
    partition_array = pa.array(np.array(list(index_dct.values())))

    return pa.Table.from_arrays(
        [labeled_array, partition_array], names=['a', 'b']
    )


def check_pq_nulls(data):
    fp = io.BytesIO(data)
    pfile = pq.ParquetFile(fp)
    assert pfile.num_row_groups == 1
    md = pfile.metadata.row_group(0)
    col = md.column(1)
    assert col.path_in_schema == 'b.list.item'
    assert col.statistics.null_count == 0  # fails


def roundtrip(table):
    buf = pa.BufferOutputStream()
    pq.write_table(table, buf)

    data = buf.getvalue().to_pybytes()

    # this fails:
    #   check_pq_nulls(data)

    reader = pa.BufferReader(data)
    return pq.read_table(reader)


with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
    dct = pickle.load(fp)


# this does NOT help:
#   pa.set_cpu_count(1)
#   import gc; gc.disable()

table = dct_to_table(dct)

# this fixes the issue:
#   table = pa.Table.from_pandas(table.to_pandas())

table2 = roundtrip(table)

assert table.column('b').null_count == 0
assert table2.column('b').null_count == 0  # fails

# if table2 is converted to pandas, you can also observe that some values at
# the end of column b are `['']`, which are clearly not present in the
# original data
{code}

I would also be thankful for any pointers on where the bug comes from or on how 
to reduce the test case.
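
As a starting point for reducing it, here is a sketch of the self-contained 
repro I would aim for, assuming the corruption is tied to the list-typed column 
{{b}} (the array contents below are made up; the real trigger may depend on the 
exact sizes in {{dct.pickle}}):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Shape the table like the pickle: an int64 column 'a' and a
# list<string> column 'b', large enough to span several data pages.
n = 100000
table = pa.Table.from_arrays(
    [
        pa.array(list(range(n)), type=pa.int64()),
        pa.array([['part-%d' % i] for i in range(n)], type=pa.list_(pa.string())),
    ],
    names=['a', 'b'],
)

buf = pa.BufferOutputStream()
pq.write_table(table, buf)
table2 = pq.read_table(pa.BufferReader(buf.getvalue().to_pybytes()))

# Both should hold; the second one fails on the attached data.
assert table.column('b').null_count == 0
assert table2.column('b').null_count == 0
{code}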



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2963) [Python] Deadlock during fork-join and use_threads=True

2018-08-02 Thread Marco Neumann (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566987#comment-16566987
 ] 

Marco Neumann commented on ARROW-2963:
--

The problem is that using threads worked in {{0.9.0}}, because (I think) there 
was no pool involved.

> [Python] Deadlock during fork-join and use_threads=True
> ---
>
> Key: ARROW-2963
> URL: https://issues.apache.org/jira/browse/ARROW-2963
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
> Environment: pandas==0.23.3
> pyarrow==0.10.0rc0
>Reporter: Marco Neumann
>Assignee: Antoine Pitrou
>Priority: Major
>
> The following code passes:
> {noformat}
> import os
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [1]})
> table = pa.Table.from_pandas(df)
> df = table.to_pandas(use_threads=False)
> pid = os.fork()
> if pid != 0:
>     os.waitpid(pid, 0)
> {noformat}
> but the following code never finishes (the {{waitpid}} call blocks forever; 
> it seems the child process is frozen):
> {noformat}
> import os
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [1]})
> table = pa.Table.from_pandas(df)
> df = table.to_pandas(use_threads=True)
> pid = os.fork()
> if pid != 0:
>     os.waitpid(pid, 0)
> {noformat}
> (the only difference is {{use_threads=True}})



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2963) [Python] Deadlock during fork-join and use_threads=True

2018-08-02 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-2963:


 Summary: [Python] Deadlock during fork-join and use_threads=True
 Key: ARROW-2963
 URL: https://issues.apache.org/jira/browse/ARROW-2963
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.10.0
 Environment: pandas==0.23.3
pyarrow==0.10.0rc0
Reporter: Marco Neumann


The following code passes:

{noformat}
import os
import pandas as pd
import pyarrow as pa


df = pd.DataFrame({'x': [1]})
table = pa.Table.from_pandas(df)
df = table.to_pandas(use_threads=False)

pid = os.fork()
if pid != 0:
    os.waitpid(pid, 0)
{noformat}

but the following code never finishes (the {{waitpid}} call blocks forever; it 
seems the child process is frozen):

{noformat}
import os
import pandas as pd
import pyarrow as pa


df = pd.DataFrame({'x': [1]})
table = pa.Table.from_pandas(df)
df = table.to_pandas(use_threads=True)

pid = os.fork()
if pid != 0:
    os.waitpid(pid, 0)
{noformat}

(the only difference is {{use_threads=True}})
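
A workaround sketch for anyone hitting this (my assumption: the forked child 
inherits the thread pool's lock state, so starting a fresh process instead of 
forking side-steps it) is to use the {{spawn}} start method from 
{{multiprocessing}} instead of a raw {{os.fork()}}:

{code:python}
import multiprocessing as mp

import pandas as pd
import pyarrow as pa


def child():
    # Fresh interpreter, so no inherited thread-pool state.
    print('child alive')


if __name__ == '__main__':
    df = pd.DataFrame({'x': [1]})
    table = pa.Table.from_pandas(df)
    df = table.to_pandas(use_threads=True)

    # 'spawn' launches a brand-new process instead of forking the
    # current one, so the (possibly locked) pool is not inherited.
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=child)
    p.start()
    p.join()
{code}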



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2554) pa.array type inference bug when using NS-timestamp

2018-06-08 Thread Marco Neumann (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann reassigned ARROW-2554:


Assignee: Marco Neumann

> pa.array type inference bug when using NS-timestamp
> ---
>
> Key: ARROW-2554
> URL: https://issues.apache.org/jira/browse/ARROW-2554
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Major
> Fix For: 0.10.0
>
>
> The following fails:
> {noformat}
> import pandas as pd
> import pyarrow as pa
> pa.array([pd.Timestamp('now').to_datetime64()])
> {noformat}
> with {{ArrowNotImplementedError: Cannot convert NumPy datetime64 objects with 
> differing unit}}, but when you provide the correct type information directly, 
> it works:
> {noformat}
> import pandas as pd
> import pyarrow as pa
> pa.array([pd.Timestamp('now').to_datetime64()], type=pa.timestamp('ns'))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2554) pa.array type inference bug when using NS-timestamp

2018-05-08 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-2554:


 Summary: pa.array type inference bug when using NS-timestamp
 Key: ARROW-2554
 URL: https://issues.apache.org/jira/browse/ARROW-2554
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Marco Neumann


The following fails:
{noformat}
import pandas as pd
import pyarrow as pa

pa.array([pd.Timestamp('now').to_datetime64()])
{noformat}

with {{ArrowNotImplementedError: Cannot convert NumPy datetime64 objects with 
differing unit}}, but when you provide the correct type information directly, 
it works:

{noformat}
import pandas as pd
import pyarrow as pa

pa.array([pd.Timestamp('now').to_datetime64()], type=pa.timestamp('ns'))
{noformat}
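
Until the inference is fixed, normalizing the value to a fixed unit before the 
call may also work; a sketch (the {{astype}} route is an assumption of mine and 
untested on 0.9.0, the explicit type is the path from above):

{code:python}
import pandas as pd
import pyarrow as pa

ts = pd.Timestamp('now').to_datetime64()  # numpy datetime64[ns]

# Reliable path (as above): state the type explicitly.
a = pa.array([ts], type=pa.timestamp('ns'))

# Possible alternative (untested on 0.9.0): cast to a fixed unit first,
# so the inference never sees a unit mix it cannot resolve.
b = pa.array([ts.astype('datetime64[us]')])

print(a.type, b.type)
{code}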



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2513) [Python] DictionaryType should give access to index type and dictionary array

2018-04-26 Thread Marco Neumann (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann reassigned ARROW-2513:


Assignee: Marco Neumann

> [Python] DictionaryType should give access to index type and dictionary array
> -
>
> Key: ARROW-2513
> URL: https://issues.apache.org/jira/browse/ARROW-2513
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Minor
>
> Currently, only {{ordered}} is mapped from the C++ type to Python; the index 
> type and the dictionary array are not accessible from Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2513) [Python] DictionaryType should give access to index type and dictionary array

2018-04-26 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-2513:


 Summary: [Python] DictionaryType should give access to index type 
and dictionary array
 Key: ARROW-2513
 URL: https://issues.apache.org/jira/browse/ARROW-2513
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.9.0
Reporter: Marco Neumann


Currently, only {{ordered}} is mapped from the C++ type to Python; the index 
type and the dictionary array are not accessible from Python.
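
A sketch of the Python surface I have in mind (the property names below are my 
proposal, not existing API):

{code:python}
import pyarrow as pa

arr = pa.array(['a', 'b', 'a']).dictionary_encode()
typ = arr.type  # DictionaryType

# Available today:
print(typ.ordered)

# Proposed additions (names are suggestions only):
#   typ.index_type  -> the integer type of the indices, e.g. int8
#   typ.dictionary  -> the dictionary values as a pyarrow Array
{code}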



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1589) [C++] Fuzzing for certain input formats

2018-01-29 Thread Marco Neumann (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343242#comment-16343242
 ] 

Marco Neumann commented on ARROW-1589:
--

So the "empty input" is one of them. The fuzzing process is still failing there 
when address sanitizer is enabled since the {{BufferReader}} produces a out of 
bounce read. So even though you're testing this case in PR1503, the current 
code on master results in undefined behavior.
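
For reference, the empty-input case can be exercised from Python in a couple of 
lines (using the current reader entry point; the expectation is a clean 
{{ArrowInvalid}} instead of a crash or a sanitizer report):

{code:python}
import pyarrow as pa

try:
    # An empty buffer is not a valid stream; this must raise, not crash.
    pa.ipc.open_stream(pa.BufferReader(b''))
except pa.ArrowInvalid as exc:
    print('rejected cleanly:', exc)
{code}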

> [C++] Fuzzing for certain input formats
> ---
>
> Key: ARROW-1589
> URL: https://issues.apache.org/jira/browse/ARROW-1589
> Project: Apache Arrow
>  Issue Type: Test
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Major
>  Labels: pull-request-available
>
> The arrow lib should have fuzzing tests for certain input formats, e.g. for 
> reading record batches from streams. Ideally, malformed input must not crash 
> the system but must report a proper error. This could easily be implemented 
> e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with 
> address sanitizer (that's already implemented by Arrow's build system).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1589) [C++] Fuzzing for certain input formats

2018-01-08 Thread Marco Neumann (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315898#comment-16315898
 ] 

Marco Neumann commented on ARROW-1589:
--

I'll open a PR by the end of January, sorry for the delay. The code is nearly 
ready, but I've had some problems with the compilation workflow.

> [C++] Fuzzing for certain input formats
> ---
>
> Key: ARROW-1589
> URL: https://issues.apache.org/jira/browse/ARROW-1589
> Project: Apache Arrow
>  Issue Type: Test
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>
> The arrow lib should have fuzzing tests for certain input formats, e.g. for 
> reading record batches from streams. Ideally, malformed input must not crash 
> the system but must report a proper error. This could easily be implemented 
> e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with 
> address sanitizer (that's already implemented by Arrow's build system).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1589) [C++] Fuzzing for certain input formats

2017-09-25 Thread Marco Neumann (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179136#comment-16179136
 ] 

Marco Neumann commented on ARROW-1589:
--

{quote}Please understand that this software we are discussing is primarily the 
work of a single volunteer developer (me)...{quote}

I am very thankful for your work. Arrow and parquet are absolutely amazing; I 
just want to help out. Integrating an automatic fuzzing solution is rather 
trivial (I actually have the corresponding PR nearly ready, only the usage docs 
are missing) and can prevent many silly bugs (produced by smart people). I do 
NOT expect you to fix all bugs and problems found by the fuzzer, but it can 
help find missing test coverage and could, in the long term, improve the 
stability and security of the library.

> [C++] Fuzzing for certain input formats
> ---
>
> Key: ARROW-1589
> URL: https://issues.apache.org/jira/browse/ARROW-1589
> Project: Apache Arrow
>  Issue Type: Test
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>
> The arrow lib should have fuzzing tests for certain input formats, e.g. for 
> reading record batches from streams. Ideally, malformed input must not crash 
> the system but must report a proper error. This could easily be implemented 
> e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with 
> address sanitizer (that's already implemented by Arrow's build system).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1589) [C++] Fuzzing for certain input formats

2017-09-25 Thread Marco Neumann (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178675#comment-16178675
 ] 

Marco Neumann commented on ARROW-1589:
--

Currently it is not clearly stated that the message stream is trusted, so 
developers will assume the opposite. Also, the naming you are proposing will 
very likely mislead people: the current naming within the library does not 
contain any information about trust ("trusted" or "untrusted"), so users' minds 
will likely default to "trusted". The current methods should therefore rather 
be prefixed with "trusted"/"unsafe"/"fast".

A tiny example that already segfaults is the creation and read-out of an empty 
stream, which IMHO should not happen. The reason why unit testing alone is not 
sufficient is that the same developers who write the code also write the unit 
tests, and therefore won't be able to think outside their own box (that's not 
meant as an offense, it's just human behavior and applies to all developers).

> [C++] Fuzzing for certain input formats
> ---
>
> Key: ARROW-1589
> URL: https://issues.apache.org/jira/browse/ARROW-1589
> Project: Apache Arrow
>  Issue Type: Test
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>
> The arrow lib should have fuzzing tests for certain input formats, e.g. for 
> reading record batches from streams. Ideally, malformed input must not crash 
> the system but must report a proper error. This could easily be implemented 
> e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with 
> address sanitizer (that's already implemented by Arrow's build system).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1589) Fuzzing for certain input formats

2017-09-21 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-1589:


 Summary: Fuzzing for certain input formats
 Key: ARROW-1589
 URL: https://issues.apache.org/jira/browse/ARROW-1589
 Project: Apache Arrow
  Issue Type: Test
Reporter: Marco Neumann
Assignee: Marco Neumann


The arrow lib should have fuzzing tests for certain input formats, e.g. for 
reading record batches from streams. Ideally, malformed input must not crash 
the system but must report a proper error. This could easily be implemented 
e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with 
address sanitizer (that's already implemented by Arrow's build system).
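
The real harness would be a C++ libfuzzer target, but the idea can be sketched 
from Python: start from a valid stream, flip bytes, and treat anything other 
than a clean Arrow error as a finding (a toy sketch using today's 
{{pyarrow.ipc}} API, not the actual harness):

{code:python}
import random

import pyarrow as pa


def one_case(data):
    # Malformed input must surface as a regular Arrow error,
    # never as a crash or a sanitizer report.
    try:
        reader = pa.ipc.open_stream(pa.BufferReader(data))
        for _ in reader:
            pass
    except (pa.ArrowException, OSError):
        pass  # clean rejection is the expected outcome


# Build a small valid stream to use as the seed input.
table = pa.table({'x': [1, 2, 3]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
seed = bytearray(sink.getvalue().to_pybytes())

for _ in range(1000):
    mutated = bytearray(seed)
    mutated[random.randrange(len(mutated))] ^= 0xFF
    one_case(bytes(mutated))
{code}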



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ARROW-1276) Cannot serialize empty DataFrame to parquet

2017-07-26 Thread Marco Neumann (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Neumann reassigned ARROW-1276:


Assignee: Marco Neumann

> Cannot serialize empty DataFrame to parquet
> 
>
> Key: ARROW-1276
> URL: https://issues.apache.org/jira/browse/ARROW-1276
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
>Reporter: Marco Neumann
>Assignee: Marco Neumann
>Priority: Minor
>
> The following code fails with {{pyarrow.lib.ArrowInvalid: Invalid: chunk size 
> per row_group must be greater than 0}} but should not:
> {noformat}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'x': pd.Series([], dtype=int)})
> table = pa.Table.from_pandas(df)
> buf = pa.InMemoryOutputStream()
> pq.write_table(table, buf)
> {noformat}
> I have a test and a fix prepared and will upstream both in the upcoming days.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1276) Cannot serialize empty DataFrame to parquet

2017-07-26 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-1276:


 Summary: Cannot serialize empty DataFrame to parquet
 Key: ARROW-1276
 URL: https://issues.apache.org/jira/browse/ARROW-1276
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.5.0
Reporter: Marco Neumann
Priority: Minor


The following code fails with {{pyarrow.lib.ArrowInvalid: Invalid: chunk size 
per row_group must be greater than 0}} but should not:

{noformat}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'x': pd.Series([], dtype=int)})
table = pa.Table.from_pandas(df)
buf = pa.InMemoryOutputStream()
pq.write_table(table, buf)
{noformat}

I have a test and a fix prepared and will upstream both in the upcoming days.
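
The test I have in mind is roughly the following sketch (written against the 
current pyarrow names, where the in-memory sink is called 
{{BufferOutputStream}}):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def test_empty_dataframe_roundtrip():
    df = pd.DataFrame({'x': pd.Series([], dtype=int)})
    table = pa.Table.from_pandas(df)

    buf = pa.BufferOutputStream()
    pq.write_table(table, buf)  # must not raise on zero rows

    result = pq.read_table(pa.BufferReader(buf.getvalue().to_pybytes()))
    assert result.num_rows == 0
    assert result.schema.equals(table.schema)
{code}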



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1083) Object categoricals are not serialized when only None is present

2017-06-02 Thread Marco Neumann (JIRA)
Marco Neumann created ARROW-1083:


 Summary: Object categoricals are not serialized when only None is 
present
 Key: ARROW-1083
 URL: https://issues.apache.org/jira/browse/ARROW-1083
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.4.0
Reporter: Marco Neumann
Priority: Minor


The following code sample fails with {{pyarrow.lib.ArrowNotImplementedError: 
NotImplemented: unhandled type}} but should not:

{noformat}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'x': [None]})
df['x'] = df['x'].astype('category')

table = pa.Table.from_pandas(df)
buf = pa.InMemoryOutputStream()

pq.write_table(table, buf)
{noformat}
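
Until this is fixed, a possible workaround sketch is to skip the pandas dtype 
inference and declare the dictionary type explicitly (an assumption of mine, 
written against a recent pyarrow API, not verified on 0.4.0):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Declare the dictionary type up front so the all-None column never
# reaches the unhandled-type path in the pandas converter.
arr = pa.array([None], type=pa.dictionary(pa.int8(), pa.string()))
table = pa.Table.from_arrays([arr], names=['x'])

buf = pa.BufferOutputStream()
pq.write_table(table, buf)
{code}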



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)