[jira] [Created] (ARROW-17110) Move away from C++11

2022-07-18 Thread H. Vetinari (Jira)
H. Vetinari created ARROW-17110:
---

 Summary: Move away from C++11
 Key: ARROW-17110
 URL: https://issues.apache.org/jira/browse/ARROW-17110
 Project: Apache Arrow
  Issue Type: Task
Reporter: H. Vetinari


The upcoming abseil release has dropped support for C++11, so {_}eventually{_} Arrow 
will have to follow. More details 
[here|https://github.com/conda-forge/abseil-cpp-feedstock/issues/37].

Relatedly, when I 
[tried|https://github.com/conda-forge/abseil-cpp-feedstock/pull/25] to switch 
abseil to a newer C++ version on Windows, things apparently broke in Arrow CI. 
This is because abseil's ABI is sensitive to the C++ standard used to compile 
it, and Google only supports compiling all artefacts in a stack with one 
homogeneous standard version. This creates some friction with conda-forge 
(where the compilers are generally much newer than what Arrow might be willing 
to impose). For now, things seem to have worked out with Arrow 
[specifying|https://github.com/apache/arrow/blob/897a4c0ce73c3fe07872beee2c1d2128e44f6dd4/cpp/cmake_modules/SetupCxxFlags.cmake#L121-L124]
 C++11 while conda-forge moved to C++17 - at least on Unix; Windows was not 
so lucky.

Perhaps people would therefore also be interested in collaborating on (or at 
least commenting on) this 
[issue|https://github.com/conda-forge/abseil-cpp-feedstock/issues/29], which 
should permit more flexibility by making it possible to opt into a given 
standard version from conda-forge as well.
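For illustration, the kind of standard-version opt-in discussed above boils down to a CMake selection like the following (a minimal sketch under stated assumptions, not Arrow's actual SetupCxxFlags.cmake logic; the minimum version of 17 is illustrative):

```cmake
# Minimal sketch: require at least C++17 unless the caller already chose
# a newer standard. Illustrative only, not Arrow's actual build logic.
cmake_minimum_required(VERSION 3.16)
project(cxx_standard_sketch CXX)

if(NOT DEFINED CMAKE_CXX_STANDARD OR CMAKE_CXX_STANDARD LESS 17)
  set(CMAKE_CXX_STANDARD 17)
endif()
set(CMAKE_CXX_STANDARD_REQUIRED ON)  # fail instead of silently degrading
set(CMAKE_CXX_EXTENSIONS OFF)        # portable -std=c++17, not gnu++17
```

Letting a cache variable override the default is what would give downstream packagers (such as conda-forge) the flexibility to compile the whole stack with one homogeneous standard.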



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-7830) [C++] Parquet library version doesn't change with releases

2020-03-10 Thread H. Vetinari (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055702#comment-17055702
 ] 

H. Vetinari commented on ARROW-7830:


OK, I didn't want to come off as demanding (anything, really); I was just 
trying to understand the situation.

I had only ever seen PARQUET- associated with parquet-mr; I had never 
stumbled upon the [parquet-cpp repo|https://github.com/apache/parquet-cpp].

Nevertheless, that repo's README starts off with:
> *Note: Development for Apache Parquet in C++ has moved*
> 
> The Apache Arrow and Parquet have merged development process and build 
> systems in the Arrow repository. Please submit pull requests in 
> https://github.com/apache/arrow.

So wouldn't it be reasonable to take the view that the Arrow project can now 
set the corresponding version number (or absorb the project completely, for 
example)?

> [C++] Parquet library version doesn't change with releases
> --
>
> Key: ARROW-7830
> URL: https://issues.apache.org/jira/browse/ARROW-7830
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>  Labels: parquet
>
> [~jeroenooms] pointed this out to me. 
> {code}
> $ pkg-config --modversion arrow
> 0.16.0
> $ pkg-config --modversion arrow-dataset
> 0.16.0
> $ pkg-config --modversion parquet
> 1.5.1-SNAPSHOT
> {code}
> I get that parquet-cpp is technically not part of Apache Arrow, but if we're 
> releasing a libparquet with libarrow at our release time, wouldn't it make 
> sense to at least bump the parquet version at the same time, even if the 
> version numbers aren't the same?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7830) [C++] Parquet library version doesn't change with releases

2020-03-09 Thread H. Vetinari (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055049#comment-17055049
 ] 

H. Vetinari edited comment on ARROW-7830 at 3/9/20, 3:08 PM:
-

What does the parquet version used here (1.5.1) actually stand for? Upstream 
parquet [never 
had|https://github.com/apache/parquet-mr/releases?after=parquet-1.6.0rc1] a 
1.5.1 release, and if it had, it would be 5-6 years old.

I mean, I get that it's set 
[here|https://github.com/apache/arrow/tree/master/cpp/src/parquet], but its 
relation to the arrow and/or parquet versions (or what 1.0.0 meant for that 
number, for that matter) is not apparent to me.










[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-18 Thread H. Vetinari (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887352#comment-16887352
 ] 

H. Vetinari edited comment on ARROW-5965 at 7/18/19 6:36 AM:
-

[~wesmckinn]
I'd like to provide it, but I can only install through conda (which has a hole 
in the firewall). Unfortunately:
{code}
# conda install pyarrow=0.14 gdb
Collecting package metadata (current_repodata.json): done
Solving environment: failed
Collecting package metadata (repodata.json): done
Solving environment: failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:

  - pip -> python[version='>=3.7,<3.8.0a0']
{code}
This is, I believe, because gdb has [not 
yet|https://github.com/conda-forge/gdb-feedstock/pull/12] been built for 
Python 3.7 (although, just as I was preparing this message, I triggered a 
rerender there, which caused some further action and the first passing 
3.7 build; not yet merged because 2.7 is failing).

In the meantime I tried downgrading my whole environment to 3.6, where the 
program also crashes or hangs on v0.14. However, I haven't yet been able to get 
gdb output. Might need some more reading of the GDB manual...



> [Python] Regression: segfault when reading hive table with v0.14
> 
>
> Key: ARROW-5965
> URL: https://issues.apache.org/jira/browse/ARROW-5965
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: H. Vetinari
>Priority: Critical
>  Labels: parquet
>
> I'm working with pyarrow on a cloudera cluster (CDH 6.1.1), with pyarrow 
> installed in a conda env.
> The data I'm reading is a hive(-registered) table written as parquet, and 
> with v0.13, reading this table (that is partitioned) does not cause any 
> issues.
> The code that worked before and now crashes with v0.14 is simply:
> ```
> import pyarrow.parquet as pq
> pq.ParquetDataset('hdfs:///data/raw/source/table').read()
> ```
> Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I 
> cannot report much more, but this is a pretty severe usability restriction. 
> So far the solution is to enforce `pyarrow<0.14`



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)




[jira] [Comment Edited] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-18 Thread H. Vetinari (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887676#comment-16887676
 ] 

H. Vetinari edited comment on ARROW-5965 at 7/18/19 6:35 AM:
-

[~wesmckinn]
 Thanks for the tips. Unfortunately, I can't follow that example because the 
code does not generate a core dump but only prints "Killed". I found some ways 
to run it in gdb that *should* work (as best I can tell), like {{gdb -ex r 
--args python fail.py}} or interactively:
{code}
gdb python
(gdb) run fail.py
{code}
but I always get:
{code}
[...]
warning: Could not trace the inferior process
Error:
warning: ptrace: Operation not permitted
During startup program exited with code 127.
{code}
Not sure if that's a mistake on my side or something in the setup/interplay of 
conda and gdb.
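For what it's worth, {{ptrace: Operation not permitted}} is often caused by kernel hardening rather than by gdb itself; on Linux, the YAMA ptrace restriction level can be inspected from procfs (a diagnostic sketch; whether this is the cause on the cluster here is an assumption):

```shell
# Read the YAMA ptrace restriction level (Linux only):
#   0 = same-user ptrace attach allowed, 1+ = increasingly restricted.
# In containers, gdb additionally needs the SYS_PTRACE capability
# (e.g. docker run --cap-add=SYS_PTRACE).
cat /proc/sys/kernel/yama/ptrace_scope 2>/dev/null \
  || echo "no YAMA setting found (not Linux, or YAMA not enabled)"
```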














[jira] [Commented] (ARROW-5965) [Python] Regression: segfault when reading hive table with v0.14

2019-07-17 Thread H. Vetinari (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887129#comment-16887129
 ] 

H. Vetinari commented on ARROW-5965:


Hey Neal,

I tried a couple of times before filing the report, and all (~5) invocations on 
0.14 crashed, and all invocations on 0.13 worked. The machine itself has lots 
of memory, so I don't think it's that. Not sure I'll be able to pare this down 
to a minimal reproducing parquet file. I'll try.






[jira] [Created] (ARROW-5965) Regression: segfault when reading hive table with v0.14

2019-07-17 Thread H. Vetinari (JIRA)
H. Vetinari created ARROW-5965:
--

 Summary: Regression: segfault when reading hive table with v0.14
 Key: ARROW-5965
 URL: https://issues.apache.org/jira/browse/ARROW-5965
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.0
Reporter: H. Vetinari


I'm working with pyarrow on a Cloudera cluster (CDH 6.1.1), with pyarrow 
installed in a conda env.

The data I'm reading is a hive(-registered) table written as parquet, and with 
v0.13, reading this table (which is partitioned) does not cause any issues.

The code that worked before and now crashes with v0.14 is simply:
{code}
import pyarrow.parquet as pq
pq.ParquetDataset('hdfs:///data/raw/source/table').read()
{code}
Since it completely crashes my notebook (resp. my REPL ends with "Killed"), I 
cannot report much more, but this is a pretty severe usability restriction. So 
far the workaround is to enforce {{pyarrow<0.14}}.
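The workaround pin can be expressed, for example, in a conda environment file (a sketch; the file name, environment name, and Python version are illustrative, not from the report):

```yaml
# environment.yml - illustrative; pins pyarrow below the regressed release
name: arrow-workaround
channels:
  - conda-forge
dependencies:
  - python=3.7
  - pyarrow<0.14
```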


