[jira] [Created] (ARROW-17110) Move away from C++11
H. Vetinari created ARROW-17110: --- Summary: Move away from C++11 Key: ARROW-17110 URL: https://issues.apache.org/jira/browse/ARROW-17110 Project: Apache Arrow Issue Type: Task Reporter: H. Vetinari The upcoming abseil release has dropped support for C++11, so {_}eventually{_}, Arrow will have to follow. More details [here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37]. Relatedly, when I [tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch abseil to a newer C++ version on Windows, things apparently broke in Arrow CI. This is because the ABI of abseil is sensitive to the C++ standard it is compiled with, and Google only supports compiling all artefacts in a stack with one homogeneous standard version. This creates some friction with conda-forge (where the compilers are generally much newer than what Arrow might be willing to impose). For now, things seem to have worked out with Arrow [specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124] C++11 while conda-forge moved to C++17 - at least on Unix, but Windows was not so lucky. Perhaps people would therefore also be interested in collaborating on (or at least commenting on) this [issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which should permit more flexibility by making it possible to opt into a given standard version from conda-forge as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
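The ABI concern above comes down to which C++ standard each translation unit was compiled with; the standard C++ macro `__cplusplus` encodes this as a year/month value. As a quick reference (this lookup table is illustrative, not Arrow code; values are taken from the C++ standards):

```python
# Values of the __cplusplus macro per language standard. Comparing this
# value across the libraries in a stack is one way to spot the kind of
# standard-version mismatch discussed in this issue.
CPP_STANDARD = {
    199711: "C++98/03",
    201103: "C++11",
    201402: "C++14",
    201703: "C++17",
    202002: "C++20",
}

def standard_name(cplusplus_value: int) -> str:
    # Fall back to a readable marker for unknown or vendor-specific values.
    return CPP_STANDARD.get(cplusplus_value, f"unknown ({cplusplus_value})")

print(standard_name(201103))  # C++11
print(standard_name(201703))  # C++17
```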
[jira] [Commented] (ARROW-7830) [C++] Parquet library version doesn't change with releases
[ https://issues.apache.org/jira/browse/ARROW-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055702#comment-17055702 ] H. Vetinari commented on ARROW-7830: OK, I didn't want to come off as demanding (anything, really), just trying to understand the situation. I had only ever seen PARQUET- associated with parquet-mr; I had never stumbled on the [parquet-cpp repo|https://github.com/apache/parquet-cpp]. Nevertheless, the head of that repo's README starts off as:
> *Note: Development for Apache Parquet in C++ has moved*
>
> The Apache Arrow and Parquet have merged development process and build
> systems in the Arrow repository. Please submit pull requests in
> https://github.com/apache/arrow.
So wouldn't it be a reasonable way of looking at things that the Arrow project can now set the corresponding version number (or absorb the project completely, for example)?
> [C++] Parquet library version doesn't change with releases
>
> Key: ARROW-7830
> URL: https://issues.apache.org/jira/browse/ARROW-7830
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Neal Richardson
> Priority: Major
> Labels: parquet
>
> [~jeroenooms] pointed this out to me.
> {code}
> $ pkg-config --modversion arrow
> 0.16.0
> $ pkg-config --modversion arrow-dataset
> 0.16.0
> $ pkg-config --modversion parquet
> 1.5.1-SNAPSHOT
> {code}
> I get that parquet-cpp is technically not part of Apache Arrow, but if we're
> releasing a libparquet with libarrow at our release time, wouldn't it make
> sense to at least bump the parquet version at the same time, even if the
> version numbers aren't the same?
-- This message was sent by Atlassian Jira (v8.3.4#803005)
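The mismatch reported above (arrow and arrow-dataset at 0.16.0, parquet at 1.5.1-SNAPSHOT) is easy to check mechanically once the `pkg-config --modversion` output is in hand. A minimal sketch; `parse_version` is a hypothetical helper for illustration, not an Arrow or pkg-config API:

```python
# Compare version strings as reported by `pkg-config --modversion`.
# A pre-release suffix like "-SNAPSHOT" is dropped before the numeric compare.
def parse_version(v: str) -> tuple:
    core = v.split("-")[0]  # "1.5.1-SNAPSHOT" -> "1.5.1"
    return tuple(int(p) for p in core.split("."))

# The values observed in the issue description.
reported = {
    "arrow": "0.16.0",
    "arrow-dataset": "0.16.0",
    "parquet": "1.5.1-SNAPSHOT",
}
versions = {pkg: parse_version(v) for pkg, v in reported.items()}
in_sync = len(set(versions.values())) == 1
print(in_sync)  # False: parquet carries its own, independent version
```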
[jira] [Comment Edited] (ARROW-7830) [C++] Parquet library version doesn't change with releases
[ https://issues.apache.org/jira/browse/ARROW-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055049#comment-17055049 ] H. Vetinari edited comment on ARROW-7830 at 3/9/20, 3:08 PM: - What does the parquet-version used here (1.5.1) stand for, actually? Upstream parquet [never had|https://github.com/apache/parquet-mr/releases?after=parquet-1.6.0rc1] a 1.5.1 release, and if it had, it would be 5-6 years old. I mean, I get that it's set [here|https://github.com/apache/arrow/tree/master/cpp/src/parquet], but the relation to arrow and/or parquet versions (or what 1.0.0 meant for that number, for that matter) is not apparent to me. was (Author: h-vetinari): What does the parquet-version used here (1.5.1) stand for, actually? Upstream parquet [never had|https://github.com/apache/parquet-mr/releases?after=parquet-1.6.0rc1] a 1.5.1 release, and if it had, it would be 5-6 years old.
[jira] [Commented] (ARROW-7830) [C++] Parquet library version doesn't change with releases
[ https://issues.apache.org/jira/browse/ARROW-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055049#comment-17055049 ] H. Vetinari commented on ARROW-7830: What does the parquet-version used here (1.5.1) stand for, actually? Upstream parquet [never had|https://github.com/apache/parquet-mr/releases?after=parquet-1.6.0rc1] a 1.5.1 release, and if it had, it would be 5-6 years old.
[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352 ] H. Vetinari edited comment on ARROW-5965 at 7/18/19 6:36 AM: - [~wesmckinn] Would like to provide it, but would only be able to install through conda (which has a hole in the firewall). Unfortunately,
{code}
# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:
  - pip -> python[version='>=3.7,<3.8.0a0']
{code}
which, I believe, is due to the fact that gdb has [not yet|https://github.com/conda-forge/gdb-feedstock/pull/12] been built for python 3.7. (although, just as I was preparing this message, I triggered a rerender there and this has caused some further action and the first passing 3.7 build; not yet merged because 2.7 is failing). In the meantime I tried downgrading my whole environment to 3.6, where the program also crashes or hangs on v0.14. However, I haven't yet been able to get a gdb output. Might need some more reading of the GDB manual...
was (Author: h-vetinari): [~wesmckinn] Would like to provide it, but would only be able to install through conda (which has a hole in the firewall). Unfortunately, {{# conda install pyarrow=0.14 gdb}} {{Collecting package metadata (current_repodata.json): done}} {{Solving environment: failed}} {{Collecting package metadata (repodata.json): done}} {{Solving environment: failed}} {{UnsatisfiableError: The following specifications were found to be incompatible with each other:}} {{ - pip -> python[version='>=3.7,<3.8.0a0']}} which, I believe, is due to the fact that gdb has [not yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for python 3.7. 
(although, just as I was preparing this message, I triggered a rerender there and this has caused some further action and the first passing 3.7 build; not yet merged because 2.7 is failing). In the meantime I tried downgrading my whole environment to 3.6, where the program also crashes or hangs on v0.14. However, I haven't yet been able to get a gdb output. Might need some more reading of the GDB manual... EDIT: can't seem to format the code-block correctly, sorry.
> [Python] Regression: segfault when reading hive table with v0.14
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.0
> Reporter: H. Vetinari
> Priority: Critical
> Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow
> installed in a conda env. The data I'm reading is a hive(-registered) table
> written as parquet, and with v0.13, reading this table (that is partitioned)
> does not cause any issues. The code that worked before and now crashes with
> v0.14 is simply:
> ```
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> ```
> Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I
> cannot report much more, but this is a pretty severe usability restriction.
> So far the solution is to enforce `pyarrow<0.14`
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887676#comment-16887676 ] H. Vetinari edited comment on ARROW-5965 at 7/18/19 6:35 AM: - [~wesmckinn] Thanks for the tips. Unfortunately, I can't follow that example because the code does not generate a core-dump but only prints "Killed". I found some ways to run it in gdb that *should* work (best as I can tell), like {{gdb -ex r --args python fail.py}} or interactively: {{gdb python}} {{(gdb) run fail.py}} but I always get: {{[...]}} {{warning: Could not trace the inferior process}} {{Error:}} {{warning: ptrace: Operation not permitted}} {{During startup program exited with code 127.}} Not sure if that's a mistake on my side or something in the setup/interplay of conda-gdb. was (Author: h-vetinari): [~wesmckinn] Thanks for the tips. Unfortunately, I can't follow that example because the code does not generate a core-dump but only prints "Killed". I found some ways to run it in gdb that *should* work (best as I can tell), like "gdb -ex r --args python fail.py" or interactively: " gdb python (gdb) run fail.py " but I always get: " [...] warning: Could not trace the inferior process Error: warning: ptrace: Operation not permitted During startup program exited with code 127. " Not sure if that's a mistake on my side or something in the setup/interplay of conda-gdb.
[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887676#comment-16887676 ] H. Vetinari commented on ARROW-5965: [~wesmckinn] Thanks for the tips. Unfortunately, I can't follow that example because the code does not generate a core-dump but only prints "Killed". I found some ways to run it in gdb that *should* work (best as I can tell), like "gdb -ex r --args python fail.py" or interactively: " gdb python (gdb) run fail.py " but I always get: " [...] warning: Could not trace the inferior process Error: warning: ptrace: Operation not permitted During startup program exited with code 127. " Not sure if that's a mistake on my side or something in the setup/interplay of conda-gdb.
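One plausible cause of the "ptrace: Operation not permitted" failure above is kernel-level ptrace hardening (Yama), which is common on locked-down hosts and in containers; this is a guess, since the reporter's environment is unknown. A small sketch that inspects the relevant setting where it exists (Linux only):

```python
# Read the Yama ptrace_scope setting, if present. A value > 0 restricts
# which processes a debugger may trace, which can surface as
# "ptrace: Operation not permitted" when running a program under gdb.
from pathlib import Path
from typing import Optional

def ptrace_scope() -> Optional[int]:
    p = Path("/proc/sys/kernel/yama/ptrace_scope")
    if p.exists():
        return int(p.read_text().strip())
    return None  # not Linux, or Yama not enabled

scope = ptrace_scope()
if scope is None:
    print("Yama ptrace_scope not available on this system")
else:
    print(f"ptrace_scope = {scope} (0 means unrestricted tracing)")
```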
[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352 ] H. Vetinari commented on ARROW-5965: [~wesmckinn] Would like to provide it, but would only be able to install through conda (which has a hole in the firewall). Unfortunately, ``` # conda install pyarrow=0.14 gdb Collecting package metadata (current_repodata.json): done Solving environment: failed Collecting package metadata (repodata.json): done Solving environment: failed UnsatisfiableError: The following specifications were found to be incompatible with each other: - pip -> python[version='>=3.7,<3.8.0a0'] ``` which, I believe, is due to the fact that gdb has [not yet](https://github.com/conda-forge/gdb-feedstock/pull/12) been built for python 3.7. (although, just as I was preparing this message, I triggered a rerender there and this has caused some further action and the first passing 3.7 build; not yet merged because 2.7 is failing). In the meantime I tried downgrading my whole environment to 3.6, where the program also crashes or hangs on v0.14. However, I haven't yet been able to get a gdb output. Might need some more reading of the GDB manual...
[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14
[ https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887129#comment-16887129 ] H. Vetinari commented on ARROW-5965: Hey Neal, I tried a couple of times before filing the report, and all (~5) invocations on 0.14 crashed, and all invocations on 0.13 worked. The machine itself has lots of memory, so I don't think it's that. Not sure I'll be able to pare this down to a minimal reproducing parquet file. I'll try.
[jira] [Created] (ARROW-5965) Regression: segfault when reading hive table with v0.14
H. Vetinari created ARROW-5965: -- Summary: Regression: segfault when reading hive table with v0.14 Key: ARROW-5965 URL: https://issues.apache.org/jira/browse/ARROW-5965 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.14.0 Reporter: H. Vetinari I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow installed in a conda env. The data I'm reading is a hive(-registered) table written as parquet, and with v0.13, reading this table (that is partitioned) does not cause any issues. The code that worked before and now crashes with v0.14 is simply:
```
import pyarrow.parquet as pq
pq.ParquetDataset('hdfs:///data/raw/source/table').read()
```
Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I cannot report much more, but this is a pretty severe usability restriction. So far the solution is to enforce `pyarrow<0.14`