[jira] [Commented] (ARROW-4917) [C++] orc_ep fails in cpp-alpine docker

2019-09-18 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932315#comment-16932315
 ] 

Uwe L. Korn commented on ARROW-4917:


[~mdeepak] [~owen.omalley] might care. I guess ORC is not tested on Alpine.

> [C++] orc_ep fails in cpp-alpine docker
> ---
>
> Key: ARROW-4917
> URL: https://issues.apache.org/jira/browse/ARROW-4917
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Uwe L. Korn
>Priority: Major
>
> Failure:
> {code:java}
> FAILED: c++/src/CMakeFiles/orc.dir/Timezone.cc.o
> /usr/bin/g++ -Ic++/include -I/build/cpp/orc_ep-prefix/src/orc_ep/c++/include 
> -I/build/cpp/orc_ep-prefix/src/orc_ep/c++/src -Ic++/src -isystem 
> /build/cpp/snappy_ep/src/snappy_ep-install/include -isystem 
> c++/libs/thirdparty/zlib_ep-install/include -isystem 
> c++/libs/thirdparty/lz4_ep-install/include -isystem 
> /arrow/cpp/thirdparty/protobuf_ep-install/include -fdiagnostics-color=always 
> -ggdb -O0 -g -fPIC -std=c++11 -Wall -Wno-unknown-pragmas -Wconversion -Werror 
> -std=c++11 -Wall -Wno-unknown-pragmas -Wconversion -Werror -O0 -g -MD -MT 
> c++/src/CMakeFiles/orc.dir/Timezone.cc.o -MF 
> c++/src/CMakeFiles/orc.dir/Timezone.cc.o.d -o 
> c++/src/CMakeFiles/orc.dir/Timezone.cc.o -c 
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc: In member function 
> 'void orc::TimezoneImpl::parseTimeVariants(const unsigned char*, uint64_t, 
> uint64_t, uint64_t, uint64_t)':
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:748:7: error: 'uint' 
> was not declared in this scope
> uint nameStart = ptr[variantOffset + 6 * variant + 5];
> ^~~~
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:748:7: note: 
> suggested alternative: 'rint'
> uint nameStart = ptr[variantOffset + 6 * variant + 5];
> ^~~~
> rint
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:749:11: error: 
> 'nameStart' was not declared in this scope
> if (nameStart >= nameCount) {
> ^
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:749:11: note: 
> suggested alternative: 'nameCount'
> if (nameStart >= nameCount) {
> ^
> nameCount
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:756:59: error: 
> 'nameStart' was not declared in this scope
> + nameOffset + nameStart);
> ^
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:756:59: note: 
> suggested alternative: 'nameCount'
> + nameOffset + nameStart);
> ^
> nameCount{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6585) [C++] Create "ARROW_LIBRARIES" argument to pass list of desired components to build

2019-09-18 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932110#comment-16932110
 ] 

Uwe L. Korn commented on ARROW-6585:


FTR: there is a related ML discussion about this: "[DISCUSS] Changing C++ build 
system default options to produce more barebones builds"

> [C++] Create "ARROW_LIBRARIES"  argument to pass list of desired components 
> to build
> 
>
> Key: ARROW-6585
> URL: https://issues.apache.org/jira/browse/ARROW-6585
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Our current {{-DARROW_*}} flag system strikes me as a little bit tedious. 
> When invoking Boost's build system, you can pass the argument 
> {{--with-libraries=filesystem,regex,system}} to indicate which components you 
> want to see built. 
> I think we should do a couple of things. First, declare all component dependencies in 
> a central place. Presently we have many "if" statements toggling on 
> dependencies on an ad hoc basis. The code looks like this:
> {code}
> if(ARROW_FLIGHT OR ARROW_PARQUET OR ARROW_BUILD_TESTS)
>   set(ARROW_IPC ON)
> endif()
> if(ARROW_IPC AND NOT ARROW_JSON)
>   message(FATAL_ERROR "JSON support is required for Arrow IPC")
> endif()
> {code}
> I don't think this is going to be scalable. 
> Secondly, I think we should make it easier to ask for a comprehensive build. 
> E.g. {{-DARROW_LIBRARIES=everything}} or similar



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Uwe L. Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-6577.

Resolution: Not A Bug

Closing as this is not an Arrow bug.

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Uwe L. Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-6577:
--

Assignee: Uwe L. Korn

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Assignee: Uwe L. Korn
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931360#comment-16931360
 ] 

Uwe L. Korn commented on ARROW-6577:


[~suvayu] Otherwise this should be solved by using {{conda install ipython 
pyarrow>=0.14}}. As this is a conda issue, it's probably better to ask on the 
conda tracker, but I guess you will get the same answer there: update conda.

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931350#comment-16931350
 ] 

Uwe L. Korn commented on ARROW-6577:


The problem seems to be that package resolution in conda 4.6 has some 
issues. This is fixed in conda 4.7; please upgrade.

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6339) [Python][C++] Rowgroup statistics for pd.NaT array ill defined

2019-09-17 Thread Uwe L. Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-6339:
--

Assignee: Uwe L. Korn  (was: Florian Jetter)

> [Python][C++] Rowgroup statistics for pd.NaT array ill defined
> --
>
> Key: ARROW-6339
> URL: https://issues.apache.org/jira/browse/ARROW-6339
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Florian Jetter
>Assignee: Uwe L. Korn
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When initialising an array with NaT-only values, the row group statistics are 
> corrupt, returning either random values or raising integer-out-of-bounds 
> exceptions.
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({"t": pd.Series([pd.NaT], dtype="datetime64[ns]")})
> buf = pa.BufferOutputStream()
> pq.write_table(pa.Table.from_pandas(df), buf, version="2.0")
> buf = io.BytesIO(buf.getvalue().to_pybytes())
> parquet_file = pq.ParquetFile(buf)
> # Asserting behaviour is difficult since it is random and the state is ill 
> defined. 
> # After a few iterations an exception is raised.
> while True:
> parquet_file.metadata.row_group(0).column(0).statistics.max
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6339) [Python][C++] Rowgroup statistics for pd.NaT array ill defined

2019-09-17 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931175#comment-16931175
 ] 

Uwe L. Korn commented on ARROW-6339:


The problem here is that 
{{parquet_file.metadata.row_group(0).column(0).statistics.has_min_max}} is 
{{False}} and thus {{.max}} should never be accessed. Instead of returning 
undefined data, we should raise an exception.
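
Until such an exception exists, callers can guard on {{has_min_max}} themselves. A 
minimal sketch, reusing the reproducer from the report below (assumes the pyarrow 
ParquetFile/statistics API shown there):

{code:python}
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a Parquet file whose only column contains nothing but NaT values.
df = pd.DataFrame({"t": pd.Series([pd.NaT], dtype="datetime64[ns]")})
buf = pa.BufferOutputStream()
pq.write_table(pa.Table.from_pandas(df), buf, version="2.0")
parquet_file = pq.ParquetFile(io.BytesIO(buf.getvalue().to_pybytes()))

stats = parquet_file.metadata.row_group(0).column(0).statistics
# Guard: only read min/max when the writer actually recorded them.
if stats is not None and stats.has_min_max:
    print("max:", stats.max)
else:
    print("no min/max statistics recorded for this column chunk")
{code}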

> [Python][C++] Rowgroup statistics for pd.NaT array ill defined
> --
>
> Key: ARROW-6339
> URL: https://issues.apache.org/jira/browse/ARROW-6339
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Florian Jetter
>Assignee: Florian Jetter
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When initialising an array with NaT-only values, the row group statistics are 
> corrupt, returning either random values or raising integer-out-of-bounds 
> exceptions.
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({"t": pd.Series([pd.NaT], dtype="datetime64[ns]")})
> buf = pa.BufferOutputStream()
> pq.write_table(pa.Table.from_pandas(df), buf, version="2.0")
> buf = io.BytesIO(buf.getvalue().to_pybytes())
> parquet_file = pq.ParquetFile(buf)
> # Asserting behaviour is difficult since it is random and the state is ill 
> defined. 
> # After a few iterations an exception is raised.
> while True:
> parquet_file.metadata.row_group(0).column(0).statistics.max
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931172#comment-16931172
 ] 

Uwe L. Korn commented on ARROW-6577:


I cannot replicate this locally. Can you share some details:

 
 * What is your conda version?
 * What is in your .condarc?

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6509) [CI] Java test failures on Travis

2019-09-12 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928555#comment-16928555
 ] 

Uwe L. Korn commented on ARROW-6509:


Oh, I wasn't aware of that :( I only thought of pyarrow.jvm

> [CI] Java test failures on Travis
> -
>
> Key: ARROW-6509
> URL: https://issues.apache.org/jira/browse/ARROW-6509
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: 0.15.0
>
>
> This seems to happen more or less frequently on the Python - Java build (with 
> jpype enabled).
> See warnings and errors starting from 
> https://travis-ci.org/apache/arrow/jobs/583069089#L6662



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6509) [CI] Java test failures on Travis

2019-09-12 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928548#comment-16928548
 ] 

Uwe L. Korn commented on ARROW-6509:


We should simply skip tests in the Java build that is done in the Python job. 
They consume precious runtime for something that is tested in another job 
already.

> [CI] Java test failures on Travis
> -
>
> Key: ARROW-6509
> URL: https://issues.apache.org/jira/browse/ARROW-6509
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Java
>Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: 0.15.0
>
>
> This seems to happen more or less frequently on the Python - Java build (with 
> jpype enabled).
> See warnings and errors starting from 
> https://travis-ci.org/apache/arrow/jobs/583069089#L6662



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6228) [C++] Add context lines to Diff formatting

2019-09-12 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928547#comment-16928547
 ] 

Uwe L. Korn commented on ARROW-6228:


I would prefer the hunk headers as this gives at least some information about 
the position.

> [C++] Add context lines to Diff formatting
> --
>
> Key: ARROW-6228
> URL: https://issues.apache.org/jira/browse/ARROW-6228
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Trivial
>
> Diff currently renders only inserted or deleted elements, but context lines 
> can be helpful to viewers of the diff. Add an option for configurable context 
> line count to Diff and EqualOptions



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6504) [Python][Packaging] Add mimalloc to conda packages for better performance

2019-09-12 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928501#comment-16928501
 ] 

Uwe L. Korn commented on ARROW-6504:


In the case of jemalloc, one of the main reasons is that, for older glibc 
versions, we need the latest commit on the stable-4 branch. This was never 
released, and as conda-forge only builds releases, we couldn't have a build 
for it there.

> [Python][Packaging] Add mimalloc to conda packages for better performance
> -
>
> Key: ARROW-6504
> URL: https://issues.apache.org/jira/browse/ARROW-6504
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6326) [C++] Nullable fields when converting std::tuple to Table

2019-09-08 Thread Uwe L. Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-6326.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5171
[https://github.com/apache/arrow/pull/5171]

> [C++] Nullable fields when converting std::tuple to Table
> -
>
> Key: ARROW-6326
> URL: https://issues.apache.org/jira/browse/ARROW-6326
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Omer Ozarslan
>Assignee: Omer Ozarslan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> {{std::optional}} isn't used for representing nullable fields in Arrow's 
> current STL conversion API since it requires C++17. Also there are other ways 
> to represent an optional field other than {{std::optional}} such as using 
> pointers or external implementations of optional ({{boost::optional}}, 
> {{type_safe::optional}} and alike). 
> Since it is hard to maintain so many different kinds of specializations, 
> introducing an {{Optional}} concept covering these classes could solve this 
> issue and allow implementing nullable fields consistently.
> So, the gist of proposed change will be something along the lines of:
> {code:cpp}
> template
> constexpr bool is_optional_like_v = ...;
> template
> struct CTypeTraits>> {
>//...
> }
> template
> struct ConversionTraits>> 
> : public CTypeTraits {
>//...
> }
> {code}
> For a type {{T}} to be considered as an {{Optional}}:
> 1) It should be convertible (implicitly or explicitly)  to {{bool}}, i.e. it 
> implements {{[explicit] operator bool()}},
> 2) It should be dereferencable, i.e. it implements {{operator*()}}.
> These two requirements provide a generalized way of templating nullable 
> fields based on pointers, {{std::optional}}, {{boost::optional}} etc. 
> However, it would be better (necessary?) if this implementation should act as 
> a default while not breaking existing specializations of users (e.g. an 
> existing  implementation in which {{std::optional}} is specialized by user).
> Is there any issues this approach may cause that I may have missed?
> I will open a draft PR for working on that meanwhile.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6326) [C++] Nullable fields when converting std::tuple to Table

2019-09-08 Thread Uwe L. Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-6326:
--

Assignee: Omer Ozarslan

> [C++] Nullable fields when converting std::tuple to Table
> -
>
> Key: ARROW-6326
> URL: https://issues.apache.org/jira/browse/ARROW-6326
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Omer Ozarslan
>Assignee: Omer Ozarslan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> {{std::optional}} isn't used for representing nullable fields in Arrow's 
> current STL conversion API since it requires C++17. Also there are other ways 
> to represent an optional field other than {{std::optional}} such as using 
> pointers or external implementations of optional ({{boost::optional}}, 
> {{type_safe::optional}} and alike). 
> Since it is hard to maintain so many different kinds of specializations, 
> introducing an {{Optional}} concept covering these classes could solve this 
> issue and allow implementing nullable fields consistently.
> So, the gist of proposed change will be something along the lines of:
> {code:cpp}
> template
> constexpr bool is_optional_like_v = ...;
> template
> struct CTypeTraits>> {
>//...
> }
> template
> struct ConversionTraits>> 
> : public CTypeTraits {
>//...
> }
> {code}
> For a type {{T}} to be considered as an {{Optional}}:
> 1) It should be convertible (implicitly or explicitly)  to {{bool}}, i.e. it 
> implements {{[explicit] operator bool()}},
> 2) It should be dereferencable, i.e. it implements {{operator*()}}.
> These two requirements provide a generalized way of templating nullable 
> fields based on pointers, {{std::optional}}, {{boost::optional}} etc. 
> However, it would be better (necessary?) if this implementation should act as 
> a default while not breaking existing specializations of users (e.g. an 
> existing  implementation in which {{std::optional}} is specialized by user).
> Is there any issues this approach may cause that I may have missed?
> I will open a draft PR for working on that meanwhile.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6456) [C++] Possible to reduce object code generated in compute/kernels/take.cc?

2019-09-05 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923407#comment-16923407
 ] 

Uwe L. Korn commented on ARROW-6456:


We should investigate building with link-time optimization. Besides improving 
performance, another great benefit is that it reduces binary size, especially 
in cases like this one where there is a lot of similar code.

The drawback is that link times will increase, so we should enable it only in a 
single CI job but use it in release builds.

> [C++] Possible to reduce object code generated in compute/kernels/take.cc?
> --
>
> Key: ARROW-6456
> URL: https://issues.apache.org/jira/browse/ARROW-6456
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> According to 
> https://gist.github.com/wesm/90f73d050a81cbff6772aea2203cdf93
> take.cc is our largest piece of object code in the codebase. This is a pretty 
> important function but I wonder if it's possible to make the implementation 
> "leaner" than it is currently to reduce generated code, without sacrificing 
> performance. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6277) [C++][Parquet] Support reading/writing other Parquet primitive types to DictionaryArray

2019-09-05 Thread Uwe L. Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923401#comment-16923401
 ] 

Uwe L. Korn commented on ARROW-6277:


This could be interesting for date columns when working together with pandas. 
To correctly round-trip date columns through the cycle Parquet -> Arrow -> pandas -> 
Arrow -> Parquet, you need to use object columns in pandas holding datetime.date 
objects. These can be quite repetitive, so dictionary encoding helps 
a lot here. I would see the same use case for float columns, but that 
isn't something I have used yet, mostly because pandas doesn't work 
well with float categories.
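
To make the repetitiveness concrete, a minimal sketch (assuming a pyarrow version 
where {{Table.column}} returns a ChunkedArray and {{Array.dictionary_encode()}} is 
available; the column name is made up):

{code:python}
import datetime

import pandas as pd
import pyarrow as pa

# Round-tripping a date column through pandas yields an object column of
# datetime.date values, which tend to be highly repetitive.
dates = [datetime.date(2019, 1, 1), datetime.date(2019, 1, 2)] * 1000
df = pd.DataFrame({"d": pd.Series(dates, dtype=object)})

table = pa.Table.from_pandas(df)
col = table.column("d")
# Dictionary-encode the repetitive date values: with only two distinct dates,
# the dictionary stays tiny regardless of the column length.
encoded = pa.chunked_array([chunk.dictionary_encode() for chunk in col.chunks])
print(encoded.type)   # dictionary<values=date32[day], indices=..., ordered=0>
print(len(encoded))   # 2000
{code}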

> [C++][Parquet] Support reading/writing other Parquet primitive types to 
> DictionaryArray
> ---
>
> Key: ARROW-6277
> URL: https://issues.apache.org/jira/browse/ARROW-6277
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> As follow up to ARROW-3246, we should support direct read/write of the other 
> Parquet primitive types. Currently only BYTE_ARRAY is implemented as it 
> provides the most performance benefit.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6403) [Python] Expose FileReader::ReadRowGroups() to Python

2019-09-01 Thread Uwe L. Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-6403.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5241
[https://github.com/apache/arrow/pull/5241]

> [Python] Expose FileReader::ReadRowGroups() to Python
> -
>
> Key: ARROW-6403
> URL: https://issues.apache.org/jira/browse/ARROW-6403
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Arik Funke
>Assignee: Arik Funke
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Expose ReadRowGroups to Python to allow efficient filtered reading 
> implementations, as suggested by @xhochy in 
> https://github.com/apache/arrow/issues/2491#issuecomment-416958663
> Without this PR users would have to re-implement threaded reads in Python.
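
For reference, a minimal sketch of what this enables at the Python level 
(assumes the ParquetFile.read_row_groups method added by this change; the file 
path and column names are placeholders):

{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")

# Pick row groups based on their metadata, e.g. row-group statistics; the
# predicate below is just a stand-in for a real filter.
wanted = [
    i for i in range(pf.num_row_groups)
    if pf.metadata.row_group(i).num_rows > 0
]
table = pf.read_row_groups(wanted, columns=["a", "b"], use_threads=True)
print(table.num_rows)
{code}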



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6403) [Python] Expose FileReader::ReadRowGroups() to Python

2019-09-01 Thread Uwe L. Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-6403:
--

Assignee: Arik Funke

> [Python] Expose FileReader::ReadRowGroups() to Python
> -
>
> Key: ARROW-6403
> URL: https://issues.apache.org/jira/browse/ARROW-6403
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Arik Funke
>Assignee: Arik Funke
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Expose ReadRowGroups to Python to allow efficient filtered reading 
> implementations, as suggested by @xhochy in 
> https://github.com/apache/arrow/issues/2491#issuecomment-416958663
> Without this PR users would have to re-implement threaded reads in Python.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-05 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900260#comment-16900260
 ] 

Uwe L. Korn commented on ARROW-6132:


+1, not getting segfaults or delayed errors on Python APIs is essential.
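
A minimal sketch of the explicit check from the report below (it uses pyarrow's 
Array.validate(), which raised ArrowInvalid on inconsistent offsets in this 
release; newer versions may need validate(full=True) for the same level of 
checking):

{code:python}
import numpy as np
import pyarrow as pa

# The offsets claim 10 values but only 5 are provided; the constructor does
# not check this, so validate explicitly before using the array.
arr = pa.ListArray.from_arrays([1, 3, 10], np.arange(5))
try:
    arr.validate()
except pa.ArrowInvalid as exc:
    print("invalid ListArray:", exc)
{code}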

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in python, there is no 
> validation of the offsets that it starts with 0 and ends with the length of 
> the array (but is that required? the docs seem to indicate that: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  ("The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` method don't seem to explicitly do this. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3054) [Packaging] Tooling to enable nightly conda packages to be updated to some anaconda.org channel

2019-08-05 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900208#comment-16900208
 ] 

Uwe L. Korn commented on ARROW-3054:


We should be able to build nightly packages by taking the recipes that are 
checked in for the crossbow tasks and doing a {{conda smithy rerender}} before 
running the build. {{conda-smithy}} should also be able to upload to a 
different channel nowadays through a simple config option.

> [Packaging] Tooling to enable nightly conda packages to be updated to some 
> anaconda.org channel
> ---
>
> Key: ARROW-3054
> URL: https://issues.apache.org/jira/browse/ARROW-3054
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Affects Versions: 0.10.0
>Reporter: Phillip Cloud
>Assignee: Krisztian Szucs
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7

2019-08-02 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899091#comment-16899091
 ] 

Uwe L. Korn commented on ARROW-6119:


How did you install this? Did you use conda (preferred) or pip or did you 
compile it yourself?

> [Python] PyArrow import fails on Windows Python 3.7
> ---
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module>
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6096) [C++] Remove dependency on boost regex library

2019-08-01 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898110#comment-16898110
 ] 

Uwe L. Korn commented on ARROW-6096:


[~hatem] This was in the past the C++ regex library but we had some issues. 
[~wesmckinn] [~mdeepak] Do you remember the problems with that?

> [C++] Remove dependency on boost regex library
> --
>
> Key: ARROW-6096
> URL: https://issues.apache.org/jira/browse/ARROW-6096
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>
> There appears to be only one place where the boost regex library is used:
> [cpp/src/parquet/metadata.cc|https://github.com/apache/arrow/blob/eb73b962e42b5ae6983bf026ebf825f1f707e245/cpp/src/parquet/metadata.cc#L32]
> I think this can be replaced by the C++11 regex library.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5757) [Python] Stop supporting Python 2.7

2019-07-31 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897332#comment-16897332
 ] 

Uwe L. Korn commented on ARROW-5757:


Release 1.0 with Python 2 support and then drop immediately?

> [Python] Stop supporting Python 2.7
> ---
>
> Key: ARROW-5757
> URL: https://issues.apache.org/jira/browse/ARROW-5757
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> By the end of 2019 many scientific Python projects will stop supporting 
> Python 2 altogether:
> https://python3statement.org/
> We'll certainly support Python 2 in Arrow 1.0 but we could perhaps drop 
> support in 1.1.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5994) [CI] [Rust] Create nightly releases of the Rust implementation

2019-07-23 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890743#comment-16890743
 ] 

Uwe L. Korn commented on ARROW-5994:


{quote} having published nightly releases
{quote}
 

No, we cannot have them. All releases need to be voted on, so there won't be an 
apache-arrow-nightly on crates.io. You can, however, have a CI job that uploads 
andys-private-arrow-nightlies; just make sure that it is not in any way 
official. Also be careful about depending on this private fork in released 
artifacts; this may lead to very complicated situations when you want to 
integrate with other libraries that also use Arrow. Rather, invest some effort 
in making Arrow releases more frequent in general.

> [CI] [Rust] Create nightly releases of the Rust implementation
> --
>
> Key: ARROW-5994
> URL: https://issues.apache.org/jira/browse/ARROW-5994
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> I would like to work on this but I'm not currently sure where to start. I 
> will follow up on the mailing list.
> I am interested in this so I can use Arrow in my new PoC and I know of 
> another project that is now using Arrow and will likely benefit from nightly 
> releases.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5994) [CI] [Rust] Create nightly releases of the Rust implementation

2019-07-22 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890017#comment-16890017
 ] 

Uwe L. Korn commented on ARROW-5994:


[~andygrove] Is there a place where you could upload these nightlies where it 
is clearly visible that they are not meant for public consumption?

> [CI] [Rust] Create nightly releases of the Rust implementation
> --
>
> Key: ARROW-5994
> URL: https://issues.apache.org/jira/browse/ARROW-5994
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> I would like to work on this but I'm not currently sure where to start. I 
> will follow up on the mailing list.
> I am interested in this so I can use Arrow in my new PoC and I know of 
> another project that is now using Arrow and will likely benefit from nightly 
> releases.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5956) [R] Ability for R to link to C++ libraries from pyarrow Wheel

2019-07-16 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886285#comment-16886285
 ] 

Uwe L. Korn commented on ARROW-5956:


{quote}[https://twitter.com/xhochy/status/114029791272079] describes a 
similar wish. 
{quote}
 

In that case I actually had R use the same lib as pyarrow. This works fine in a 
conda-provided environment. I amended the title to clarify that this issue is 
about using the libraries from the {{pyarrow}} *wheel*.

> [R] Ability for R to link to C++ libraries from pyarrow Wheel
> -
>
> Key: ARROW-5956
> URL: https://issues.apache.org/jira/browse/ARROW-5956
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
> Environment: Ubuntu 16.04, R 3.4.4, python 3.6.5
>Reporter: Jeffrey Wong
>Priority: Major
>
> I have installed pyarrow 0.14.0 and want to be able to also use R arrow. In 
> my work I use rpy2 a lot to exchange python data structures with R data 
> structures, so would like R arrow to link against the exact same .so files 
> found in pyarrow
>  
>  
> When I pass in include_dir and lib_dir to R's configure, pointing to 
> pyarrow's include and pyarrow's root directories, I am able to compile R's 
> arrow.so file. However, I am unable to load it in an R session, getting the 
> error:
>  
> {code:java}
> > dyn.load('arrow.so')
> Error in dyn.load("arrow.so") :
>  unable to load shared object '/tmp/arrow2/r/src/arrow.so':
>  /tmp/arrow2/r/src/arrow.so: undefined symbol: 
> _ZNK5arrow11StructArray14GetFieldByNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE{code}
>  
>  
> Steps to reproduce:
>  
> Install pyarrow, which also ships libarrow.so and libparquet.so
>  
> {code:java}
> pip3 install pyarrow --upgrade --user
> PY_ARROW_PATH=$(python3 -c "import pyarrow, os; 
> print(os.path.dirname(pyarrow.__file__))")
> PY_ARROW_VERSION=$(python3 -c "import pyarrow; print(pyarrow.__version__)")
> ln -s $PY_ARROW_PATH/libarrow.so.14 $PY_ARROW_PATH/libarrow.so
> ln -s $PY_ARROW_PATH/libparquet.so.14 $PY_ARROW_PATH/libparquet.so
> {code}
>  
>  
> Add to LD_LIBRARY_PATH
>  
> {code:java}
> sudo tee -a /usr/lib/R/etc/ldpaths <<LINES
> LD_LIBRARY_PATH="\${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> export LD_LIBRARY_PATH
> LINES
> sudo tee -a /usr/lib/rstudio-server/bin/r-ldpath <<LINES
> LD_LIBRARY_PATH="\${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> export LD_LIBRARY_PATH
> LINES
> export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> {code}
>  
>  
> Install r arrow from source
> {code:java}
> git clone https://github.com/apache/arrow.git /tmp/arrow2
> cd /tmp/arrow2/r
> git checkout tags/apache-arrow-0.14.0
> R CMD INSTALL ./ --configure-vars="INCLUDE_DIR=$PY_ARROW_PATH/include 
> LIB_DIR=$PY_ARROW_PATH"{code}
>  
> I have noticed that the R package for arrow no longer has an RcppExports, but 
> instead an arrowExports. Could it be that the lack of RcppExports has made it 
> difficult to find GetFieldByName?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5956) [R] Ability for R to link to C++ libraries from pyarrow Wheel

2019-07-16 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-5956:
---
Summary: [R] Ability for R to link to C++ libraries from pyarrow Wheel  
(was: [R] Ability for R to link to C++ libraries from pyarrow)

> [R] Ability for R to link to C++ libraries from pyarrow Wheel
> -
>
> Key: ARROW-5956
> URL: https://issues.apache.org/jira/browse/ARROW-5956
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
> Environment: Ubuntu 16.04, R 3.4.4, python 3.6.5
>Reporter: Jeffrey Wong
>Priority: Major
>
> I have installed pyarrow 0.14.0 and want to be able to also use R arrow. In 
> my work I use rpy2 a lot to exchange python data structures with R data 
> structures, so would like R arrow to link against the exact same .so files 
> found in pyarrow
>  
>  
> When I pass in include_dir and lib_dir to R's configure, pointing to 
> pyarrow's include and pyarrow's root directories, I am able to compile R's 
> arrow.so file. However, I am unable to load it in an R session, getting the 
> error:
>  
> {code:java}
> > dyn.load('arrow.so')
> Error in dyn.load("arrow.so") :
>  unable to load shared object '/tmp/arrow2/r/src/arrow.so':
>  /tmp/arrow2/r/src/arrow.so: undefined symbol: 
> _ZNK5arrow11StructArray14GetFieldByNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE{code}
>  
>  
> Steps to reproduce:
>  
> Install pyarrow, which also ships libarrow.so and libparquet.so
>  
> {code:java}
> pip3 install pyarrow --upgrade --user
> PY_ARROW_PATH=$(python3 -c "import pyarrow, os; 
> print(os.path.dirname(pyarrow.__file__))")
> PY_ARROW_VERSION=$(python3 -c "import pyarrow; print(pyarrow.__version__)")
> ln -s $PY_ARROW_PATH/libarrow.so.14 $PY_ARROW_PATH/libarrow.so
> ln -s $PY_ARROW_PATH/libparquet.so.14 $PY_ARROW_PATH/libparquet.so
> {code}
>  
>  
> Add to LD_LIBRARY_PATH
>  
> {code:java}
> sudo tee -a /usr/lib/R/etc/ldpaths <<LINES
> LD_LIBRARY_PATH="\${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> export LD_LIBRARY_PATH
> LINES
> sudo tee -a /usr/lib/rstudio-server/bin/r-ldpath <<LINES
> LD_LIBRARY_PATH="\${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> export LD_LIBRARY_PATH
> LINES
> export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$PY_ARROW_PATH"
> {code}
>  
>  
> Install r arrow from source
> {code:java}
> git clone https://github.com/apache/arrow.git /tmp/arrow2
> cd /tmp/arrow2/r
> git checkout tags/apache-arrow-0.14.0
> R CMD INSTALL ./ --configure-vars="INCLUDE_DIR=$PY_ARROW_PATH/include 
> LIB_DIR=$PY_ARROW_PATH"{code}
>  
> I have noticed that the R package for arrow no longer has an RcppExports, but 
> instead an arrowExports. Could it be that the lack of RcppExports has made it 
> difficult to find GetFieldByName?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5919) [R] Add nightly tests for building r-arrow with dependencies from conda-forge

2019-07-15 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5919.

Resolution: Fixed

Issue resolved by pull request 4855
[https://github.com/apache/arrow/pull/4855]

> [R] Add nightly tests for building r-arrow with dependencies from conda-forge
> -
>
> Key: ARROW-5919
> URL: https://issues.apache.org/jira/browse/ARROW-5919
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5914) [CI] Build bundled dependencies in docker build step

2019-07-12 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883823#comment-16883823
 ] 

Uwe L. Korn commented on ARROW-5914:


{quote}Yeah, so an alternative is that we use conda only for the dependencies 
that don't work from the system package manager. I guess that's about as good 
as building the dependency in the image
{quote}
No, it is either all-from-conda or none; mixing does not work due to the 
different toolchains.

> [CI] Build bundled dependencies in docker build step
> 
>
> Key: ARROW-5914
> URL: https://issues.apache.org/jira/browse/ARROW-5914
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 1.0.0
>
>
> In the recently introduced ARROW-5803, some heavy dependencies (thrift, 
> protobuf, flatbufers, grpc) are build at each invocation of docker-compose 
> build (thus each travis test).
> We should aim to build the third party dependencies in docker build phase 
> instead, to exploit caching and docker-compose pull so that the CI step 
> doesn't need to build said dependencies each time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5919) [R] Add nightly tests for building r-arrow with dependencies from conda-forge

2019-07-12 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-5919:
--

 Summary: [R] Add nightly tests for building r-arrow with 
dependencies from conda-forge
 Key: ARROW-5919
 URL: https://issues.apache.org/jira/browse/ARROW-5919
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5914) [CI] Build bundled dependencies in docker build step

2019-07-12 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883681#comment-16883681
 ] 

Uwe L. Korn commented on ARROW-5914:


[~fsaintjacques] [~kszucs] [~wesmckinn] This is why we have used conda in these 
builds. I have the great fear that we rely more and more on manual building 
of third-party dependencies in our build scripts, which just adds more 
maintenance overhead. I was so frustrated with the manual scripts in the 
manylinux1 case that I was considering making a manylinux1 conda channel 
to build the dependencies. That would have greatly reduced my pain in 
maintaining the manylinux1 container.

We need to test against system dependencies, but then we should do it as we 
did previously in the nightlies.

> [CI] Build bundled dependencies in docker build step
> 
>
> Key: ARROW-5914
> URL: https://issues.apache.org/jira/browse/ARROW-5914
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 1.0.0
>
>
> In the recently introduced ARROW-5803, some heavy dependencies (thrift, 
> protobuf, flatbufers, grpc) are build at each invocation of docker-compose 
> build (thus each travis test).
> We should aim to build the third party dependencies in docker build phase 
> instead, to exploit caching and docker-compose pull so that the CI step 
> doesn't need to build said dependencies each time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5885) [Python] Support optional arrow components via extras_require

2019-07-09 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881284#comment-16881284
 ] 

Uwe L. Korn commented on ARROW-5885:


This will only work if you split the {{pyarrow}} package into multiple 
packages: extras cannot select parts of a single wheel. At the current stage this 
would require a big refactoring of the Python package.
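
For illustration, a minimal setup.py sketch of the mechanism: setuptools extras 
can only pull in additional distributions, they cannot toggle parts of one built 
wheel, so each optional component would have to become its own package. This is 
not the actual pyarrow packaging; the split-out package names are hypothetical:

{code:python}
from setuptools import setup

setup(
    name="pyarrow",
    version="0.15.0",
    packages=["pyarrow"],
    extras_require={
        # `pip install pyarrow[flight]` would then install an additional,
        # separately built distribution on top of the core wheel.
        "flight": ["pyarrow-flight"],
        "gandiva": ["pyarrow-gandiva"],
        "plasma": ["pyarrow-plasma"],
    },
)
{code}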

> [Python] Support optional arrow components via extras_require
> -
>
> Key: ARROW-5885
> URL: https://issues.apache.org/jira/browse/ARROW-5885
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: George Sakkis
>Priority: Minor
>
> Since Arrow (and pyarrow) have several independent optional components, 
> instead of installing all of them it would be convenient if these could be 
> opt-in from pip like 
> {{pip install pyarrow[gandiva,flight,plasma]}}
> or opt-out like
> {{pip install pyarrow[no-gandiva,no-flight,no-plasma]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5885) [Python] Support optional arrow components via extras_require

2019-07-09 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-5885:
---
Summary: [Python] Support optional arrow components via extras_require  
(was: Support optional arrow components via extras_require)

> [Python] Support optional arrow components via extras_require
> -
>
> Key: ARROW-5885
> URL: https://issues.apache.org/jira/browse/ARROW-5885
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: George Sakkis
>Priority: Minor
>
> Since Arrow (and pyarrow) have several independent optional components, 
> instead of installing all of them it would be convenient if these could be 
> opt-in from pip like 
> {{pip install pyarrow[gandiva,flight,plasma]}}
> or opt-out like
> {{pip install pyarrow[no-gandiva,no-flight,no-plasma]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5886) [Python][Packaging] Manylinux1/2010 compliance issue with libz

2019-07-09 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881282#comment-16881282
 ] 

Uwe L. Korn commented on ARROW-5886:


Actually, the job of {{auditwheel repair}} is to rename these libs and use 
{{patchelf}} on all binaries so that they link to the renamed ones. If 
there is a binary that still links to the old name, then there is a bug in 
{{auditwheel repair}}.

> [Python][Packaging] Manylinux1/2010 compliance issue with libz
> --
>
> Key: ARROW-5886
> URL: https://issues.apache.org/jira/browse/ARROW-5886
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Affects Versions: 0.14.0
>Reporter: Krisztian Szucs
>Priority: Major
>
> So we statically link liblz4 in the manylinux1 wheels
> {code}
> # ldd pyarrow-manylinux1/libarrow.so.14 | grep z
> libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7fc28cef4000)
> {code}
> but dynamically in the manylinux2010 wheels
> {code}
> # ldd pyarrow-manylinux2010/libarrow.so.14 | grep z
> liblz4.so.1 => not found  (already deleted to reproduce the issue)
> libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f56f744)
> {code}
> this what this PR resolves.
> What I'm finding strange, that auditwheel seems to bundle libz for manylinux1:
> {code}
> # ls -lah pyarrow-manylinux1/*z*so.*
> -rwxr-xr-x 1 root root 115K Jun 29 00:14 
> pyarrow-manylinux1/libz-7f57503f.so.1.2.11
> {code}
> while ldd still uses the system libz:
> {code}
> # ldd pyarrow-manylinux1/libarrow.so.14 | grep z
> libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f91fcf3f000)
> {code}
> For manylinux2010 we also have liblz4:
> {code}
> #  ls -lah pyarrow-manylinux2010/*z*so.*
> -rwxr-xr-x 1 root root 191K Jun 28 23:38 
> pyarrow-manylinux2010/liblz4-8cb8bdde.so.1.8.3
> -rwxr-xr-x 1 root root 115K Jun 28 23:38 
> pyarrow-manylinux2010/libz-c69b9943.so.1.2.11
> {code}
> and ldd similarly tries to load the system libs:
> {code}
> # ldd pyarrow-manylinux2010/libarrow.so.14 | grep z
> liblz4.so.1 => not found
> libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7fd72764e000)
> {code}
> Inspecting manylinux1 with `LD_DEBUG=files,libs ldd libarrow.so.14`, it seems 
> to search the right path, but cannot find the hashed version of libz 
> `libz-7f57503f.so.1.2.11`
> {code}
>463: file=libz.so.1 [0];  needed by ./libarrow.so.14 [0]
>463: find library=libz.so.1 [0]; searching
>463:  search path=/tmp/pyarrow-manylinux1/.  (RPATH from 
> file ./libarrow.so.14)
>463:   trying file=/tmp/pyarrow-manylinux1/./libz.so.1
>463:  search cache=/etc/ld.so.cache
>463:   trying file=/lib/x86_64-linux-gnu/libz.so.1
> {code}
> There is no `libz.so.1` just `libz-7f57503f.so.1.2.11`.
> Similarly for manylinux2010 and libz:
> {code}
>470: file=libz.so.1 [0];  needed by ./libarrow.so.14 [0]
>470: find library=libz.so.1 [0]; searching
>470:  search path=/tmp/pyarrow-manylinux2010/.   
> (RPATH from file ./libarrow.so.14)
>470:   trying file=/tmp/pyarrow-manylinux2010/./libz.so.1
>470:  search cache=/etc/ld.so.cache
>470:   trying file=/lib/x86_64-linux-gnu/libz.so.1
> {code}
> for liblz4 (again, I've deleted the system one):
> {code}
>470: file=liblz4.so.1 [0];  needed by ./libarrow.so.14 [0]
>470: find library=liblz4.so.1 [0]; searching
>470:  search path=/tmp/pyarrow-manylinux2010/.   
> (RPATH from file ./libarrow.so.14)
>470:   trying file=/tmp/pyarrow-manylinux2010/./liblz4.so.1
>470:  search cache=/etc/ld.so.cache
>470:  search 
> path=/lib/x86_64-linux-gnu/tls/x86_64:/lib/x86_64-linux-gnu/tls:/lib/x86_64-linux-gnu/x86_64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/tls/x86_64:/usr/lib/x86_64-linux-gnu/tls:/usr/lib/x86_64-linux-gnu/x86_6$
> :/usr/lib/x86_64-linux-gnu:/lib/tls/x86_64:/lib/tls:/lib/x86_64:/lib:/usr/lib/tls/x86_64:/usr/lib/tls:/usr/lib/x86_64:/usr/lib
>   (system search path)
> {code}
> There are no `libz.so.1` nor `liblz4.so.1`, just `libz-c69b9943.so.1.2.11` 
> and `liblz4-8cb8bdde.so.1.8.3`
> According to https://www.python.org/dev/peps/pep-0571/ neither `liblz4` nor 
> `libz` is part of the whitelist, and while these are bundled with the wheel, 
> they seemingly cannot be found - perhaps because of the hash in the library name?
> I've tried to inspect the wheels with `auditwheel show` with version `2` and 
> `1.10`, both says the following:
> {code}
> # auditwheel show pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
> pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl is consistent with
> the following platform tag: 

[jira] [Assigned] (ARROW-5874) [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under /usr/local/opt

2019-07-08 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-5874:
--

Assignee: Krisztian Szucs

> [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under 
> /usr/local/opt
> ---
>
> Key: ARROW-5874
> URL: https://issues.apache.org/jira/browse/ARROW-5874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: macOS 10.14.5
> Anaconda Python 3.7.3
>Reporter: Michael Anselmi
>Assignee: Krisztian Szucs
>Priority: Critical
>  Labels: pull-request-available, pyarrow, wheel
> Fix For: 0.14.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hello, and congrats on the recent release of Apache Arrow 0.14.0!
> This morning I installed pyarrow 0.14.0 on my macOS 10.14.5 system like so:
> {code:java}
> python3.7 -m venv ~/virtualenv/pyarrow-0.14.0
> source ~/virtualenv/pyarrow-0.14.0/bin/activate
> pip install --upgrade pip setuptools
> pip install pyarrow  # installs 
> pyarrow-0.14.0-cp37-cp37m-macosx_10_6_intel.whl
> pip freeze --all
> # numpy==1.16.4
> # pip==19.1.1
> # pyarrow==0.14.0
> # setuptools==41.0.1
> # six==1.12.0
> {code}
> However I am unable to import pyarrow:
> {code:java}
> python -c 'import pyarrow'
> # Traceback (most recent call last):
> #   File "<string>", line 1, in <module>
> #   File 
> "/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in <module>
> # from pyarrow.lib import cpu_count, set_cpu_count
> # ImportError: 
> dlopen(/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-darwin.so,
>  2): Library not loaded: /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #   Referenced from: 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> #   Reason: image not found
> {code}
> pyarrow is trying to load a shared library (OpenSSL in this case) from a path 
> under {{/usr/local/opt}} that doesn't exist; perhaps that OpenSSL had been 
> provided by Homebrew as part of your build process?  Unfortunately this makes 
> the pyarrow 0.14.0 wheel completely unusable on my system or any system that 
> doesn't have OpenSSL installed in that location.  This is a regression from 
> pyarrow 0.13.0 as those wheels "just worked".
> Additional diagnostic output below.  I ran {{otool -L}} on each {{.dylib}} 
> and {{.so}} file in 
> {{/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow}}
>  and included the output for those with dependencies under {{/usr/local/opt}}:
> {code:java}
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib:
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 
> 1.2.8)
> # @rpath/libarrow_boost_system.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # @rpath/libarrow_boost_filesystem.dylib (compatibility version 
> 0.0.0, current version 0.0.0)
> # @rpath/libarrow_boost_regex.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current 
> version 307.5.0)
> # /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
> version 1238.50.2)
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_flight.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_flight.14.dylib:
> # @rpath/libarrow_flight.14.dylib (compatibility version 14.0.0, 
> current version 14.0.0)
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current 
> version 307.5.0)
> # /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
> version 1238.50.2)
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_python.14.dylib
> # 
> 

[jira] [Resolved] (ARROW-5874) [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under /usr/local/opt

2019-07-08 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5874.

Resolution: Fixed

Issue resolved by pull request 4823
[https://github.com/apache/arrow/pull/4823]

> [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under 
> /usr/local/opt
> ---
>
> Key: ARROW-5874
> URL: https://issues.apache.org/jira/browse/ARROW-5874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: macOS 10.14.5
> Anaconda Python 3.7.3
>Reporter: Michael Anselmi
>Priority: Critical
>  Labels: pull-request-available, pyarrow, wheel
> Fix For: 0.14.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hello, and congrats on the recent release of Apache Arrow 0.14.0!
> This morning I installed pyarrow 0.14.0 on my macOS 10.14.5 system like so:
> {code:java}
> python3.7 -m venv ~/virtualenv/pyarrow-0.14.0
> source ~/virtualenv/pyarrow-0.14.0/bin/activate
> pip install --upgrade pip setuptools
> pip install pyarrow  # installs 
> pyarrow-0.14.0-cp37-cp37m-macosx_10_6_intel.whl
> pip freeze --all
> # numpy==1.16.4
> # pip==19.1.1
> # pyarrow==0.14.0
> # setuptools==41.0.1
> # six==1.12.0
> {code}
> However I am unable to import pyarrow:
> {code:java}
> python -c 'import pyarrow'
> # Traceback (most recent call last):
> #   File "<string>", line 1, in <module>
> #   File 
> "/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in <module>
> # from pyarrow.lib import cpu_count, set_cpu_count
> # ImportError: 
> dlopen(/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-darwin.so,
>  2): Library not loaded: /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #   Referenced from: 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> #   Reason: image not found
> {code}
> pyarrow is trying to load a shared library (OpenSSL in this case) from a path 
> under {{/usr/local/opt}} that doesn't exist; perhaps that OpenSSL had been 
> provided by Homebrew as part of your build process?  Unfortunately this makes 
> the pyarrow 0.14.0 wheel completely unusable on my system or any system that 
> doesn't have OpenSSL installed in that location.  This is a regression from 
> pyarrow 0.13.0 as those wheels "just worked".
> Additional diagnostic output below.  I ran {{otool -L}} on each {{.dylib}} 
> and {{.so}} file in 
> {{/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow}}
>  and included the output for those with dependencies under {{/usr/local/opt}}:
> {code:java}
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib:
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 
> 1.2.8)
> # @rpath/libarrow_boost_system.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # @rpath/libarrow_boost_filesystem.dylib (compatibility version 
> 0.0.0, current version 0.0.0)
> # @rpath/libarrow_boost_regex.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current 
> version 307.5.0)
> # /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
> version 1238.50.2)
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_flight.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_flight.14.dylib:
> # @rpath/libarrow_flight.14.dylib (compatibility version 14.0.0, 
> current version 14.0.0)
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current 
> version 307.5.0)
> # /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
> version 1238.50.2)
> otool -L 
> 

[jira] [Comment Edited] (ARROW-5874) [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under /usr/local/opt

2019-07-08 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880457#comment-16880457
 ] 

Uwe L. Korn edited comment on ARROW-5874 at 7/8/19 3:17 PM:


We should bundle OpenSSL with the wheel and declare it as unsafe to use in 
production. Either compile from source when using {{pip}} in your production 
environment or use {{conda}}. This is roughly the same way {{psycopg2}} went. 
You cannot manage this type of binary dependency with {{pip}}; this is why 
{{conda}} was created.

I'm aware of {{delocate}}, but we are explicitly not using it as we rely on 
CMake and {{setup.py}} to bundle all required libraries. In our case it might 
be better to statically link OpenSSL so as not to pollute the global namespace 
with our shipped version of OpenSSL.
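For illustration only - not what the build currently does - bundling the dylibs 
and rewriting the absolute Homebrew paths by hand would look roughly like this, 
assuming the wheel's {{pyarrow/}} directory as the target:
{code}
# copy the OpenSSL dylibs next to libarrow.14.dylib
cp /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib pyarrow/
cp /usr/local/opt/openssl/lib/libssl.1.0.0.dylib pyarrow/

# point libarrow at the bundled copies instead of /usr/local/opt
install_name_tool -change /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib \
    @loader_path/libcrypto.1.0.0.dylib pyarrow/libarrow.14.dylib
install_name_tool -change /usr/local/opt/openssl/lib/libssl.1.0.0.dylib \
    @loader_path/libssl.1.0.0.dylib pyarrow/libarrow.14.dylib
# (libssl's own reference to libcrypto would need the same treatment)

# verify that no /usr/local/opt references remain
otool -L pyarrow/libarrow.14.dylib
{code}
This is essentially what {{delocate}} automates; doing it in CMake/{{setup.py}} 
or statically linking OpenSSL avoids the post-processing step entirely.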


was (Author: xhochy):
We should bundle OpenSSL with the wheel and declare it as unsafe to use in 
production. Either compile from source when using {{pip}} in your production 
environment or use {{conda}}. This is roughly the same way {{psycopg2}} went. 
You cannot 

I'm aware of {{delocate}}, but we are explicitly not using it as we rely on 
CMake and {{setup.py}} to bundle all required libraries. In our case it might 
be better to statically link OpenSSL so as not to pollute the global namespace 
with our shipped version of OpenSSL.

> [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under 
> /usr/local/opt
> ---
>
> Key: ARROW-5874
> URL: https://issues.apache.org/jira/browse/ARROW-5874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: macOS 10.14.5
> Anaconda Python 3.7.3
>Reporter: Michael Anselmi
>Priority: Critical
>  Labels: pyarrow, wheel
>
> Hello, and congrats on the recent release of Apache Arrow 0.14.0!
> This morning I installed pyarrow 0.14.0 on my macOS 10.14.5 system like so:
> {code:java}
> python3.7 -m venv ~/virtualenv/pyarrow-0.14.0
> source ~/virtualenv/pyarrow-0.14.0/bin/activate
> pip install --upgrade pip setuptools
> pip install pyarrow  # installs 
> pyarrow-0.14.0-cp37-cp37m-macosx_10_6_intel.whl
> pip freeze --all
> # numpy==1.16.4
> # pip==19.1.1
> # pyarrow==0.14.0
> # setuptools==41.0.1
> # six==1.12.0
> {code}
> However I am unable to import pyarrow:
> {code:java}
> python -c 'import pyarrow'
> # Traceback (most recent call last):
> #   File "<string>", line 1, in <module>
> #   File 
> "/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in <module>
> # from pyarrow.lib import cpu_count, set_cpu_count
> # ImportError: 
> dlopen(/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-darwin.so,
>  2): Library not loaded: /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #   Referenced from: 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> #   Reason: image not found
> {code}
> pyarrow is trying to load a shared library (OpenSSL in this case) from a path 
> under {{/usr/local/opt}} that doesn't exist; perhaps that OpenSSL had been 
> provided by Homebrew as part of your build process?  Unfortunately this makes 
> the pyarrow 0.14.0 wheel completely unusable on my system or any system that 
> doesn't have OpenSSL installed in that location.  This is a regression from 
> pyarrow 0.13.0 as those wheels "just worked".
> Additional diagnostic output below.  I ran {{otool -L}} on each {{.dylib}} 
> and {{.so}} file in 
> {{/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow}}
>  and included the output for those with dependencies under {{/usr/local/opt}}:
> {code:java}
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib:
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 
> 1.2.8)
> # @rpath/libarrow_boost_system.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # @rpath/libarrow_boost_filesystem.dylib (compatibility version 
> 0.0.0, current version 0.0.0)
> # @rpath/libarrow_boost_regex.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current 
> version 307.5.0)
> # /usr/lib/libSystem.B.dylib 

[jira] [Commented] (ARROW-5874) [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under /usr/local/opt

2019-07-08 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880457#comment-16880457
 ] 

Uwe L. Korn commented on ARROW-5874:


We should bundle OpenSSL with the wheel and declare it as unsafe to use in 
production. Either compile from source when using {{pip}} in your production 
environment or use {{conda}}. This is roughly the same way {{psycopg2}} went. 
You cannot 

I'm aware of {{delocate}}, but we are explicitly not using it as we rely on 
CMake and {{setup.py}} to bundle all required libraries. In our case it might 
be better to statically link OpenSSL so as not to pollute the global namespace 
with our shipped version of OpenSSL.

> [Python] pyarrow 0.14.0 macOS wheels depend on shared libs under 
> /usr/local/opt
> ---
>
> Key: ARROW-5874
> URL: https://issues.apache.org/jira/browse/ARROW-5874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: macOS 10.14.5
> Anaconda Python 3.7.3
>Reporter: Michael Anselmi
>Priority: Critical
>  Labels: pyarrow, wheel
>
> Hello, and congrats on the recent release of Apache Arrow 0.14.0!
> This morning I installed pyarrow 0.14.0 on my macOS 10.14.5 system like so:
> {code:java}
> python3.7 -m venv ~/virtualenv/pyarrow-0.14.0
> source ~/virtualenv/pyarrow-0.14.0/bin/activate
> pip install --upgrade pip setuptools
> pip install pyarrow  # installs 
> pyarrow-0.14.0-cp37-cp37m-macosx_10_6_intel.whl
> pip freeze --all
> # numpy==1.16.4
> # pip==19.1.1
> # pyarrow==0.14.0
> # setuptools==41.0.1
> # six==1.12.0
> {code}
> However I am unable to import pyarrow:
> {code:java}
> python -c 'import pyarrow'
> # Traceback (most recent call last):
> #   File "<string>", line 1, in <module>
> #   File 
> "/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in <module>
> # from pyarrow.lib import cpu_count, set_cpu_count
> # ImportError: 
> dlopen(/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-darwin.so,
>  2): Library not loaded: /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib
> #   Referenced from: 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> #   Reason: image not found
> {code}
> pyarrow is trying to load a shared library (OpenSSL in this case) from a path 
> under {{/usr/local/opt}} that doesn't exist; perhaps that OpenSSL had been 
> provided by Homebrew as part of your build process?  Unfortunately this makes 
> the pyarrow 0.14.0 wheel completely unusable on my system or any system that 
> doesn't have OpenSSL installed in that location.  This is a regression from 
> pyarrow 0.13.0 as those wheels "just worked".
> Additional diagnostic output below.  I ran {{otool -L}} on each {{.dylib}} 
> and {{.so}} file in 
> {{/Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow}}
>  and included the output for those with dependencies under {{/usr/local/opt}}:
> {code:java}
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow.14.dylib:
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 
> 1.2.8)
> # @rpath/libarrow_boost_system.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # @rpath/libarrow_boost_filesystem.dylib (compatibility version 
> 0.0.0, current version 0.0.0)
> # @rpath/libarrow_boost_regex.dylib (compatibility version 0.0.0, 
> current version 0.0.0)
> # /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current 
> version 307.5.0)
> # /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
> version 1238.50.2)
> otool -L 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_flight.14.dylib
> # 
> /Users/manselmi/virtualenv/pyarrow-0.14.0/lib/python3.7/site-packages/pyarrow/libarrow_flight.14.dylib:
> # @rpath/libarrow_flight.14.dylib (compatibility version 14.0.0, 
> current version 14.0.0)
> # @rpath/libarrow.14.dylib (compatibility version 14.0.0, current 
> version 14.0.0)
> # /usr/local/opt/openssl/lib/libssl.1.0.0.dylib (compatibility 
> version 1.0.0, current version 1.0.0)
> # /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib (compatibility 
> 

[jira] [Commented] (ARROW-5133) [Integration] Update turbodbc integration test to install a pinned version in the Dockerfile

2019-06-29 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875551#comment-16875551
 ] 

Uwe L. Korn commented on ARROW-5133:


[~kszucs] I don't think that this would work well, given how often we break 
things in Arrow. I would rather keep building against master.

> [Integration] Update turbodbc integration test to install a pinned version in 
> the Dockerfile
> 
>
> Key: ARROW-5133
> URL: https://issues.apache.org/jira/browse/ARROW-5133
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: turbodbc
>
> integration/turbodbc/runtest.sh currently installs and tests the integration 
> with
> a fork's branch.
> We should test against the official turbodbc release once Uwe's PR gets 
> merged.
> The turbodbc install step should be run during the docker image build (
> in the Dockerfile) instead of the runtest.sh script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5609) [C++] Set CMP0068 CMake policy to avoid macOS warnings

2019-06-29 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-5609:
--

Assignee: Uwe L. Korn

> [C++] Set CMP0068 CMake policy to avoid macOS warnings
> --
>
> Key: ARROW-5609
> URL: https://issues.apache.org/jira/browse/ARROW-5609
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>
> These warnings are appearing in the build on macOS
> {code}
> CMake Warning (dev):
>   Policy CMP0068 is not set: RPATH settings on macOS do not affect
>   install_name.  Run "cmake --help-policy CMP0068" for policy details.  Use
>   the cmake_policy command to set the policy and suppress this warning.
>   For compatibility with older versions of CMake, the install_name fields for
>   the following targets are still affected by RPATH settings:
>arrow_dataset_shared
>arrow_python_shared
>arrow_shared
>arrow_testing_shared
>parquet_shared
>plasma_shared
> This warning is for project developers.  Use -Wno-dev to suppress it.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5731) [CI] Turbodbc integration tests are failing

2019-06-29 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-5731:
--

Assignee: Uwe L. Korn

> [CI] Turbodbc integration tests are failing 
> 
>
> Key: ARROW-5731
> URL: https://issues.apache.org/jira/browse/ARROW-5731
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 1.0.0
>
>
> Have not investigated yet, build: 
> https://circleci.com/gh/ursa-labs/crossbow/383
> cc [~xhochy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5735) [C++] Appveyor builds failing persistently in thrift_ep build

2019-06-26 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873092#comment-16873092
 ] 

Uwe L. Korn commented on ARROW-5735:


The problem here is that this is picking up Boost's new CMake config, which we 
cannot ingest. We need to disable this using {{-DBoost_NO_BOOST_CMAKE=ON}}.
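A minimal sketch of where that flag would go (the exact invocation is 
illustrative, not the real AppVeyor command line):
{code}
cmake -G Ninja -DBoost_NO_BOOST_CMAKE=ON -DARROW_PARQUET=ON ..
cmake --build .
{code}
The flag also needs to be forwarded to the {{thrift_ep}} configure step, since 
that is where Boost's CMake config is being picked up in the quoted log below.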

> [C++] Appveyor builds failing persistently in thrift_ep build
> -
>
> Key: ARROW-5735
> URL: https://issues.apache.org/jira/browse/ARROW-5735
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See
> {code}
> [72/541] Performing configure step for 'thrift_ep'
> FAILED: thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-configure 
> cmd.exe /C "cd /D 
> C:\projects\arrow\cpp\build\thrift_ep-prefix\src\thrift_ep-build && 
> "C:\Program Files (x86)\CMake\bin\cmake.exe" 
> -DFLEX_EXECUTABLE=C:/projects/arrow/cpp/build/winflexbison_ep/src/winflexbison_ep-install/win_flex.exe
>  
> -DBISON_EXECUTABLE=C:/projects/arrow/cpp/build/winflexbison_ep/src/winflexbison_ep-install/win_bison.exe
>  -DZLIB_INCLUDE_DIR= -DWITH_SHARED_LIB=OFF -DWITH_PLUGIN=OFF -DZLIB_LIBRARY= 
> "-DCMAKE_C_COMPILER=C:/Program Files (x86)/Microsoft Visual 
> Studio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx64/x64/cl.exe" 
> -DCMAKE_CXX_COMPILER=C:/Miniconda36-x64/Scripts/clcache.exe 
> -DCMAKE_BUILD_TYPE=RELEASE "-DCMAKE_C_FLAGS=/DWIN32 /D_WINDOWS /W3  /MD /O2 
> /Ob2 /DNDEBUG" "-DCMAKE_C_FLAGS_RELEASE=/DWIN32 /D_WINDOWS /W3  /MD /O2 /Ob2 
> /DNDEBUG" "-DCMAKE_CXX_FLAGS=/DWIN32 /D_WINDOWS  /GR /EHsc 
> /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING  /MD /Od /UNDEBUG" 
> "-DCMAKE_CXX_FLAGS_RELEASE=/DWIN32 /D_WINDOWS  /GR /EHsc 
> /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING  /MD /Od /UNDEBUG" 
> -DCMAKE_INSTALL_PREFIX=C:/projects/arrow/cpp/build/thrift_ep/src/thrift_ep-install
>  
> -DCMAKE_INSTALL_RPATH=C:/projects/arrow/cpp/build/thrift_ep/src/thrift_ep-install/lib
>  -DBUILD_SHARED_LIBS=OFF -DBUILD_TESTING=OFF -DBUILD_EXAMPLES=OFF 
> -DBUILD_TUTORIALS=OFF -DWITH_QT4=OFF -DWITH_C_GLIB=OFF -DWITH_JAVA=OFF 
> -DWITH_PYTHON=OFF -DWITH_HASKELL=OFF -DWITH_CPP=ON -DWITH_STATIC_LIB=ON 
> -DWITH_LIBEVENT=OFF -DWITH_MT=OFF -GNinja 
> C:/projects/arrow/cpp/build/thrift_ep-prefix/src/thrift_ep && "C:\Program 
> Files (x86)\CMake\bin\cmake.exe" -E touch 
> C:/projects/arrow/cpp/build/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-configure"
> -- The C compiler identification is MSVC 19.16.27030.1
> -- The CXX compiler identification is MSVC 19.16.27030.1
> -- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx64/x64/cl.exe
> -- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx64/x64/cl.exe -- 
> works
> -- Detecting C compiler ABI info
> -- Detecting C compiler ABI info - done
> -- Detecting C compile features
> -- Detecting C compile features - done
> -- Check for working CXX compiler: C:/Miniconda36-x64/Scripts/clcache.exe
> -- Check for working CXX compiler: C:/Miniconda36-x64/Scripts/clcache.exe -- 
> works
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> -- Parsed Thrift package version: 0.12.0
> -- Parsed Thrift version: 0.12.0 (0.2.0)
> -- Setting C++11 as the default language level.
> -- To specify a different C++ language level, set CMAKE_CXX_STANDARD
> CMake Warning (dev) at build/cmake/DefineOptions.cmake:63 (find_package):
>   Policy CMP0074 is not set: find_package uses _ROOT variables.
>   Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
>   command to set the policy and suppress this warning.
>   Environment variable Boost_ROOT is set to:
> C:\Miniconda36-x64\envs\arrow\Library
>   For compatibility, CMake is ignoring the variable.
> Call Stack (most recent call first):
>   CMakeLists.txt:52 (include)
> This warning is for project developers.  Use -Wno-dev to suppress it.
> -- Found Boost 1.70.0 at 
> C:/Miniconda36-x64/envs/arrow/Library/lib/cmake/Boost-1.70.0
> --   Requested configuration: QUIET
> -- Found boost_headers 1.70.0 at 
> C:/Miniconda36-x64/envs/arrow/Library/lib/cmake/boost_headers-1.70.0
> -- Boost 1.53 found.
> -- libevent NOT found.
> -- Could NOT find RUN_HASKELL (missing: RUN_HASKELL) 
> -- Could NOT find CABAL (missing: CABAL) 
> -- Looking for arpa/inet.h
> -- Looking for arpa/inet.h - not found
> -- Looking for fcntl.h
> -- Looking for fcntl.h - found
> -- Looking for getopt.h
> -- Looking for getopt.h - not 

[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-06-23 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870498#comment-16870498
 ] 

Uwe L. Korn commented on ARROW-5691:


I would be 100% fine with moving it into {{src/arrow/parquet}}, but I question 
making the Parquet adaptor a full subset of the dataset project. For me these 
are two different entities: the adaptor provides access to the Parquet file 
format, either standalone low-level access or high-level reads into Arrow, 
whereas the dataset project builds on top of various adaptors but is not 
required for simple interactions with the file formats it supports.

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5614) [R] Error: 'install_arrow' is not an exported object from 'namespace:arrow'

2019-06-14 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864463#comment-16864463
 ] 

Uwe L. Korn commented on ARROW-5614:


Building the R package using conda-forge-based packages is quite 
straightforward. All packages that are also on CRAN are available on 
conda-forge with an {{r-}} prefix. After installing them, you can use the same 
{{R}} commands as you would with any other setup.
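A rough sketch of such a setup (package names are illustrative, not an exact 
list):
{code}
# R toolchain plus the Arrow C++ libraries from conda-forge
conda create -n arrow-r -c conda-forge r-base r-remotes arrow-cpp parquet-cpp
conda activate arrow-r

# build the R package against the conda-provided headers and libraries
cd arrow/r
R -e 'remotes::install_deps()'
R CMD INSTALL \
  --configure-vars="INCLUDE_DIR=$CONDA_PREFIX/include LIB_DIR=$CONDA_PREFIX/lib" .
{code}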

> [R] Error: 'install_arrow' is not an exported object from 'namespace:arrow'
> ---
>
> Key: ARROW-5614
> URL: https://issues.apache.org/jira/browse/ARROW-5614
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Thomas Buhrmann
>Assignee: Neal Richardson
>Priority: Major
>
> I'm trying to get the R package installed in a Debian docker image that 
> already contains R and RStudio (via rocker/rstudio from dockerhub), as well 
> as arrow-cpp, parquet-cpp and pyarrow installed via conda. I.e. I should have 
> all required arrow dependencies in my conda environment's /lib and /include 
> folders.
> I then tried to install the R package in two ways (as stated in the README, 
> having devtools, and after managing to get git2r installed)
> 1/ via remotes
> {code:java}
> remotes::install_github("apache/arrow/r", 
> ref="76e1bc5dfb9d08e31eddd5cbcc0b1bab934da2c7"){code}
> 2/ from source
> {code:java}
> git clone https://github.com/apache/arrow.git
> cd arrow/r
> R -e 'remotes::install_deps()'
> R CMD INSTALL 
> --configure-vars='INCLUDE_DIR=/root/miniconda/envs/my_env/include
> LIB_DIR=/root/miniconda/envs/my_env/lib' .{code}
> In both cases the install seems to work fine:
> {code:java}
> ** building package indices
> ** testing if installed package can be loaded from temporary location
> ** checking absolute paths in shared objects and dynamic libraries
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (arrow)
> {code}
>  But when I then do the following as prompted:
> {code:java}
> library(arrow)
> arrow::install_arrow()
> {code}
> The result is
> {code:java}
> Error: 'install_arrow' is not an exported object from 'namespace:arrow'
> {code}
> And running the example without calling that non-existing function I get the 
> error
> {code:java}
> Error in Table__from_dots(dots, schema) : 
>   Cannot call Table__from_dots(). Please use arrow::install_arrow() to 
> install required runtime libraries. 
> {code}
> So I don't know if I'm doing something wrong or if the documentation isn't up 
> to date? Specifically, what is the arrow::install_arrow() function supposed 
> to install, given that I already have the arrow and parquet libs and headers 
> installed, and supposedly they've been used (linked to) when I installed the 
> R package?
> In general, is there any way to get this package installed in the above 
> context (arrow-cpp etc. installed via conda)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5565) [Python] Document how to use gdb when working on pyarrow

2019-06-14 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5565.

Resolution: Fixed

Issue resolved by pull request 4560
[https://github.com/apache/arrow/pull/4560]

> [Python] Document how to use gdb when working on pyarrow
> 
>
> Key: ARROW-5565
> URL: https://issues.apache.org/jira/browse/ARROW-5565
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It may not be obvious to new developers how to set breakpoints in the C++ 
> libraries when driven from Python. The incantation is slightly abstruse, for 
> example
> {code}
> $ gdb --args env py.test pyarrow/tests/test_array.py -k scalars_mixed_type
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5509) [R] write_parquet()

2019-06-11 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860771#comment-16860771
 ] 

Uwe L. Korn commented on ARROW-5509:


[~romainfrancois] see my PR; I'm already working on this and will continue 
today.

> [R] write_parquet()
> ---
>
> Key: ARROW-5509
> URL: https://issues.apache.org/jira/browse/ARROW-5509
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain François
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We can read but not yet write. The C++ library supports this and pyarrow does 
> it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5509) [R] write_parquet()

2019-06-06 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-5509:
--

Assignee: Uwe L. Korn

> [R] write_parquet()
> ---
>
> Key: ARROW-5509
> URL: https://issues.apache.org/jira/browse/ARROW-5509
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> We can read but not yet write. The C++ library supports this and pyarrow does 
> it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5436) [Python] expose filters argument in parquet.read_table

2019-06-06 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5436.

Resolution: Fixed

Issue resolved by pull request 4409
[https://github.com/apache/arrow/pull/4409]

> [Python] expose filters argument in parquet.read_table
> --
>
> Key: ARROW-5436
> URL: https://issues.apache.org/jira/browse/ARROW-5436
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently, the {{parquet.read_table}} function can be used both for reading a 
> single file (interface to ParquetFile) and a directory (interface to 
> ParquetDataset). 
> ParquetDataset has some extra keywords such as {{filters}} that would be nice 
> to expose through {{read_table}} as well.
> Of course one can always use {{ParquetDataset}} if you need its power, but 
> for pandas wrapping pyarrow it is easier to be able to pass through keywords 
> just to {{parquet.read_table}} instead of calling either {{read_table}} or 
> {{ParquetDataset}}. Context: https://github.com/pandas-dev/pandas/issues/26551
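> As a sketch of the difference (assuming a partitioned dataset under 
> {{dataset_dir/}} with a {{year}} partition column):
> {code:java}
> python - <<'EOF'
> import pyarrow.parquet as pq
> 
> # works today: filters are only reachable through ParquetDataset
> table = pq.ParquetDataset('dataset_dir/', filters=[('year', '=', 2019)]).read()
> 
> # proposed: pass the same keyword straight through read_table
> table = pq.read_table('dataset_dir/', filters=[('year', '=', 2019)])
> EOF
> {code}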



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5436) [Python] expose filters argument in parquet.read_table

2019-06-06 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-5436:
--

Assignee: Joris Van den Bossche

> [Python] expose filters argument in parquet.read_table
> --
>
> Key: ARROW-5436
> URL: https://issues.apache.org/jira/browse/ARROW-5436
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently, the {{parquet.read_table}} function can be used both for reading a 
> single file (interface to ParquetFile) and a directory (interface to 
> ParquetDataset). 
> ParquetDataset has some extra keywords such as {{filters}} that would be nice 
> to expose through {{read_table}} as well.
> Of course one can always use {{ParquetDataset}} if you need its power, but 
> for pandas wrapping pyarrow it is easier to be able to pass through keywords 
> just to {{parquet.read_table}} instead of calling either {{read_table}} or 
> {{ParquetDataset}}. Context: https://github.com/pandas-dev/pandas/issues/26551



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5521) [Packaging] License check fails with Apache RAT 0.13

2019-06-06 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-5521:
--

Assignee: Antoine Pitrou

> [Packaging] License check fails with Apache RAT 0.13
> 
>
> Key: ARROW-5521
> URL: https://issues.apache.org/jira/browse/ARROW-5521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We currently use version 0.12. With 0.13 I get:
> {code:java}
> NOT APPROVED: js/src/fb/File.ts (xx/js/src/fb/File.ts): false
> NOT APPROVED: js/src/fb/Message.ts (xx/js/src/fb/Message.ts): false
> NOT APPROVED: js/src/fb/Schema.ts (xx/js/src/fb/Schema.ts): false
> NOT APPROVED: js/test/inference/column.ts (xx/js/test/inference/column.ts): 
> false
> NOT APPROVED: js/test/inference/nested.ts (xx/js/test/inference/nested.ts): 
> false
> NOT APPROVED: js/test/inference/visitor/get.ts 
> (xx/js/test/inference/visitor/get.ts): false
> 6 unapproved licences. Check rat report: rat.txt
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5521) [Packaging] License check fails with Apache RAT 0.13

2019-06-06 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5521.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4486
[https://github.com/apache/arrow/pull/4486]

> [Packaging] License check fails with Apache RAT 0.13
> 
>
> Key: ARROW-5521
> URL: https://issues.apache.org/jira/browse/ARROW-5521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We currently use version 0.12. With 0.13 I get:
> {code:java}
> NOT APPROVED: js/src/fb/File.ts (xx/js/src/fb/File.ts): false
> NOT APPROVED: js/src/fb/Message.ts (xx/js/src/fb/Message.ts): false
> NOT APPROVED: js/src/fb/Schema.ts (xx/js/src/fb/Schema.ts): false
> NOT APPROVED: js/test/inference/column.ts (xx/js/test/inference/column.ts): 
> false
> NOT APPROVED: js/test/inference/nested.ts (xx/js/test/inference/nested.ts): 
> false
> NOT APPROVED: js/test/inference/visitor/get.ts 
> (xx/js/test/inference/visitor/get.ts): false
> 6 unapproved licences. Check rat report: rat.txt
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5449) [C++] Local filesystem implementation: investigate Windows UNC paths

2019-06-06 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5449.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4487
[https://github.com/apache/arrow/pull/4487]

> [C++] Local filesystem implementation: investigate Windows UNC paths
> 
>
> Key: ARROW-5449
> URL: https://issues.apache.org/jira/browse/ARROW-5449
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Followup to ARROW-5378: Windows paths to networked files (e.g. 
> "\\server\share\path\file.txt") and extended-length paths (e.g. 
> "\\?\c:\some\absolute\path.txt") should be checked for compatibility with the 
> LocalFileSystem implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5497) [R][Release] Build and publish R package docs

2019-06-05 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857095#comment-16857095
 ] 

Uwe L. Korn commented on ARROW-5497:


I'm not sure whether the JS and Java docs currently get built at all. The 
{{gen_apidocs}} setup broke at some point; the plan was to migrate everything 
to the main {{docker-compose.yml}}, but that just hasn't happened yet.

> [R][Release] Build and publish R package docs
> -
>
> Key: ARROW-5497
> URL: https://issues.apache.org/jira/browse/ARROW-5497
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Documentation, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.14.0
>
>
> https://issues.apache.org/jira/browse/ARROW-5452 added the R pkgdown site 
> config. Adding the wiring into the apidocs build scripts was deferred because 
> there was some discussion about which workflow was supported and which was 
> deprecated.  
> Uwe says: "Have a look at 
> [https://github.com/apache/arrow/blob/master/docs/Dockerfile] and 
> [https://github.com/apache/arrow/blob/master/ci/docker_build_sphinx.sh] Add 
> that and a docs-r entry in the main {{docker-compose.yml}} should be 
> sufficient to get it running in the docker setup. But actually I would rather 
> like to see that we also add the R build to the above linked files."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5474) [C++] What version of Boost do we require now?

2019-06-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854687#comment-16854687
 ] 

Uwe L. Korn commented on ARROW-5474:


For adoption reasons, it would be nice to use Ubuntu 16.04 as a baseline. This 
has Boost 1.58.

> [C++] What version of Boost do we require now?
> --
>
> Key: ARROW-5474
> URL: https://issues.apache.org/jira/browse/ARROW-5474
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> See debugging on https://issues.apache.org/jira/browse/ARROW-5470. One 
> possible cause for that error is that the local filesystem patch increased 
> the version of boost that we actually require. The boost version (1.54 vs 
> 1.58) was one difference between failure and success. 
> Another point of confusion was that CMake reported two different versions of 
> boost at different times. 
> If we require a minimum version of boost, can we document that better, check 
> for it more accurately in the build scripts, and fail with a useful message 
> if that minimum isn't met? Or something else helpful.
> If the actual cause of the failure was something else (e.g. compiler 
> version), we should figure that out too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5488) [R] Workaround when C++ lib not available

2019-06-03 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854674#comment-16854674
 ] 

Uwe L. Korn commented on ARROW-5488:


Would this involve compiling the C++ lib from source in that case?

> [R] Workaround when C++ lib not available
> -
>
> Key: ARROW-5488
> URL: https://issues.apache.org/jira/browse/ARROW-5488
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain François
>Priority: Major
>
> As a way to get to CRAN, we need some way for the package to still compile, 
> install and test (although do nothing useful) even when the C++ lib is not 
> available. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5452) [R] Add documentation website (pkgdown)

2019-05-30 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852150#comment-16852150
 ] 

Uwe L. Korn commented on ARROW-5452:


Not sure this is the best way to go. We merged the C++ and Python documentation 
some months ago as there was quite an overlap between them, and it is 
foreseeable that there will also be a lot of overlap between R, C++ and Python. 
While using the typical setup for each language may make it easier for 
contributors to contribute initially to the documentation, it will be harder 
for Arrow users to navigate across all the documentation for specific content. 

 

> [R] Add documentation website (pkgdown)
> ---
>
> Key: ARROW-5452
> URL: https://issues.apache.org/jira/browse/ARROW-5452
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.14.0
>
>
> pkgdown ([https://pkgdown.r-lib.org/]) is the standard for R package 
> documentation websites. Build this for arrow and deploy it at 
> https://arrow.apache.org/docs/r.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3729) [C++] Support for writing TIMESTAMP_NANOS Parquet metadata

2019-05-28 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849415#comment-16849415
 ] 

Uwe L. Korn commented on ARROW-3729:


[~tpboudreau] assigned this to you and also gave you the permissions so that 
you can do it yourself.

> [C++] Support for writing TIMESTAMP_NANOS Parquet metadata
> --
>
> Key: ARROW-3729
> URL: https://issues.apache.org/jira/browse/ARROW-3729
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: TP Boudreau
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> This was brought up on the mailing list.
> We also will need to do corresponding work in the parquet-cpp library to opt 
> in to writing nanosecond timestamps instead of casting to micro- or 
> millisecond.
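> For reference, a small sketch of the current behaviour (the 
> {{coerce_timestamps}} keyword exists today; the opt-in to keep nanoseconds is 
> what this issue asks for):
> {code:java}
> python - <<'EOF'
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> t = pa.Table.from_arrays([pa.array([1], type=pa.timestamp('ns'))], names=['ts'])
> # today the nanosecond values are cast down on write, e.g. explicitly:
> pq.write_table(t, 'ts.parquet', coerce_timestamps='us')
> EOF
> {code}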



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3729) [C++] Support for writing TIMESTAMP_NANOS Parquet metadata

2019-05-28 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-3729:
--

Assignee: TP Boudreau

> [C++] Support for writing TIMESTAMP_NANOS Parquet metadata
> --
>
> Key: ARROW-3729
> URL: https://issues.apache.org/jira/browse/ARROW-3729
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: TP Boudreau
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> This was brought up on the mailing list.
> We also will need to do corresponding work in the parquet-cpp library to opt 
> in to writing nanosecond timestamps instead of casting to micro- or 
> millisecond.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5349.

Resolution: Fixed

Issue resolved by pull request 4386
[https://github.com/apache/arrow/pull/4386]

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (how to 
> then write that metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]
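> A rough sketch of the second idea (the Python-side {{set_file_path}} call is 
> hypothetical here until it is actually exposed; it mirrors the existing C++ 
> method):
> {code:java}
> python - <<'EOF'
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['x'])
> pq.write_table(table, 'part-0.parquet')
> md = pq.read_metadata('part-0.parquet')  # column chunk file_path is empty here
> 
> # proposed: point the metadata at its location inside the partitioned dataset
> # before appending it to the combined _metadata file
> md.set_file_path('year=2019/part-0.parquet')
> EOF
> {code}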



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-5349:
--

Assignee: Wes McKinney

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (how to 
> then write that metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5245) [C++][CI] Unpin cmake_format

2019-05-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-5245:
--

Assignee: Micah Kornfield

> [C++][CI] Unpin cmake_format
> 
>
> Key: ARROW-5245
> URL: https://issues.apache.org/jira/browse/ARROW-5245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Once we either fix the cmake files or a newer version of cmake_format (> 
> 0.5.0) is released that continues to work with the existing files, we should 
> unpin the version from 0.4.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5245) [C++][CI] Unpin cmake_format

2019-05-26 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5245.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4388
[https://github.com/apache/arrow/pull/4388]

> [C++][CI] Unpin cmake_format
> 
>
> Key: ARROW-5245
> URL: https://issues.apache.org/jira/browse/ARROW-5245
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Once we either fix the cmake files or a newer version of cmake_format (> 
> 0.5.0) is released that continues to work with the existing files, we should 
> unpin the version from 0.4.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5410) Crash at arrow::internal::FileWrite

2019-05-24 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847375#comment-16847375
 ] 

Uwe L. Korn commented on ARROW-5410:


Can you share code to reproduce this crash?

> Crash at arrow::internal::FileWrite
> ---
>
> Key: ARROW-5410
> URL: https://issues.apache.org/jira/browse/ARROW-5410
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Windows version 10.0.14393.0 (rs1_release.160715-1616)
>Reporter: Tham
>Priority: Major
>
> My application is writing a bunch of parquet files and it often crashes. Most 
> of the time it crashes when writing the first file; sometimes it can write 
> the first file and crashes at the 2nd file. The file can always be opened. 
> It only crashes at WriteTable.
> As I tested, my application crashes when built in release mode, but doesn't 
> crash in debug mode. It crashed only on one Windows machine, not others.
> Here is the stack trace from the dump file:
> {code:java}
> STACK_TEXT:  
> 001e`10efd840 7ffc`0333d53f : ` 001e`10efe230 
> `0033 7ffc`032dbe21 : 
> CortexSync!google_breakpad::ExceptionHandler::HandleInvalidParameter+0x1a0
> 001e`10efe170 7ffc`0333d559 : `ff02 7ffc`032da63d 
> `0033 `0033 : ucrtbase!invalid_parameter+0x13f
> 001e`10efe1b0 7ffc`03318664 : 7ff7`7f7c8489 `ff02 
> 001e`10efe230 `0033 : ucrtbase!invalid_parameter_noinfo+0x9
> 001e`10efe1f0 7ffc`032d926d : ` `0140 
> `0005 0122`bbe61e30 : 
> ucrtbase!_acrt_uninitialize_command_line+0x6fd4
> 001e`10efe250 7ff7`7f66585e : 0010`0005 ` 
> 001e`10efe560 0122`b2337b88 : ucrtbase!write+0x8d
> 001e`10efe2a0 7ff7`7f632785 : 7ff7` 7ff7`7f7bb153 
> 0122`bbe890e0 001e`10efe634 : 
> CortexSync!arrow::internal::FileWrite+0x5e
> 001e`10efe360 7ff7`7f632442 : `348a `0004 
> 733f`5e86f38c 0122`bbe14c40 : 
> CortexSync!arrow::io::OSFile::Write+0x1d5
> 001e`10efe510 7ff7`7f71c1b9 : 001e`10efe738 7ff7`7f665522 
> 0122`bbffe6e0 ` : 
> CortexSync!arrow::io::FileOutputStream::Write+0x12
> 001e`10efe540 7ff7`7f79cb2f : 0122`bbe61e30 0122`bbffe6e0 
> `0013 001e`10efe730 : 
> CortexSync!parquet::ArrowOutputStream::Write+0x39
> 001e`10efe6e0 7ff7`7f7abbaf : 7ff7`7fd75b78 7ff7`7fd75b78 
> 001e`10efe9c0 ` : 
> CortexSync!parquet::ThriftSerializer::Serialize+0x11f
> 001e`10efe8c0 7ff7`7f7aaf93 : ` 0122`bbe3f450 
> `0002 0122`bc0218d0 : 
> CortexSync!parquet::SerializedPageWriter::WriteDictionaryPage+0x44f
> 001e`10efee20 7ff7`7f7a3707 : 0122`bbe3f450 001e`10eff250 
> ` 0122`b168 : 
> CortexSync!parquet::TypedColumnWriterImpl 
> >::WriteDictionaryPage+0x143
> 001e`10efeed0 7ff7`7f710480 : 001e`10eff1c0 ` 
> 0122`bbe3f540 0122`b2439998 : 
> CortexSync!parquet::ColumnWriterImpl::Close+0x47
> 001e`10efef60 7ff7`7f7154da : 0122`bbec3cd0 001e`10eff1c0 
> 0122`bbec4bb0 0122`b2439998 : 
> CortexSync!parquet::arrow::FileWriter::Impl::`vector deleting 
> destructor'+0x100
> 001e`10efefa0 7ff7`7f71619c : ` 001e`10eff1c0 
> 0122`bbe89390 ` : 
> CortexSync!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x6fa
> 001e`10eff150 7ff7`7f202de9 : `0001 001e`10eff430 
> `000f ` : 
> CortexSync!parquet::arrow::FileWriter::WriteTable+0x6cc
> 001e`10eff410 7ff7`7f18baf3 : 0122`bbec39b0 0122`b24c53f8 
> `3f80 ` : 
> CortexSync!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0x49{code}
> I tried a lot of ways to find out the root cause, but failed. Can anyone here 
> give me some information/advice please, so that I can investigate more? 
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5410) Crash at arrow::internal::FileWrite

2019-05-24 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-5410:
---
Labels: parquet  (was: )

> Crash at arrow::internal::FileWrite
> ---
>
> Key: ARROW-5410
> URL: https://issues.apache.org/jira/browse/ARROW-5410
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Windows version 10.0.14393.0 (rs1_release.160715-1616)
>Reporter: Tham
>Priority: Major
>  Labels: parquet
>
> My application is writing a bunch of parquet files and it often crashes. Most 
> of the time it crashes when writing the first file; sometimes it can write 
> the first file and crashes at the 2nd file. The file can always be opened. 
> It only crashes at writeTable.
> As I tested, my application crashes when built in release mode, but doesn't 
> crash in debug mode. It crashed only on one Windows machine, not on others.
> Here is stack trace from dump file:
> {code:java}
> STACK_TEXT:  
> 001e`10efd840 7ffc`0333d53f : ` 001e`10efe230 
> `0033 7ffc`032dbe21 : 
> CortexSync!google_breakpad::ExceptionHandler::HandleInvalidParameter+0x1a0
> 001e`10efe170 7ffc`0333d559 : `ff02 7ffc`032da63d 
> `0033 `0033 : ucrtbase!invalid_parameter+0x13f
> 001e`10efe1b0 7ffc`03318664 : 7ff7`7f7c8489 `ff02 
> 001e`10efe230 `0033 : ucrtbase!invalid_parameter_noinfo+0x9
> 001e`10efe1f0 7ffc`032d926d : ` `0140 
> `0005 0122`bbe61e30 : 
> ucrtbase!_acrt_uninitialize_command_line+0x6fd4
> 001e`10efe250 7ff7`7f66585e : 0010`0005 ` 
> 001e`10efe560 0122`b2337b88 : ucrtbase!write+0x8d
> 001e`10efe2a0 7ff7`7f632785 : 7ff7` 7ff7`7f7bb153 
> 0122`bbe890e0 001e`10efe634 : 
> CortexSync!arrow::internal::FileWrite+0x5e
> 001e`10efe360 7ff7`7f632442 : `348a `0004 
> 733f`5e86f38c 0122`bbe14c40 : 
> CortexSync!arrow::io::OSFile::Write+0x1d5
> 001e`10efe510 7ff7`7f71c1b9 : 001e`10efe738 7ff7`7f665522 
> 0122`bbffe6e0 ` : 
> CortexSync!arrow::io::FileOutputStream::Write+0x12
> 001e`10efe540 7ff7`7f79cb2f : 0122`bbe61e30 0122`bbffe6e0 
> `0013 001e`10efe730 : 
> CortexSync!parquet::ArrowOutputStream::Write+0x39
> 001e`10efe6e0 7ff7`7f7abbaf : 7ff7`7fd75b78 7ff7`7fd75b78 
> 001e`10efe9c0 ` : 
> CortexSync!parquet::ThriftSerializer::Serialize+0x11f
> 001e`10efe8c0 7ff7`7f7aaf93 : ` 0122`bbe3f450 
> `0002 0122`bc0218d0 : 
> CortexSync!parquet::SerializedPageWriter::WriteDictionaryPage+0x44f
> 001e`10efee20 7ff7`7f7a3707 : 0122`bbe3f450 001e`10eff250 
> ` 0122`b168 : 
> CortexSync!parquet::TypedColumnWriterImpl 
> >::WriteDictionaryPage+0x143
> 001e`10efeed0 7ff7`7f710480 : 001e`10eff1c0 ` 
> 0122`bbe3f540 0122`b2439998 : 
> CortexSync!parquet::ColumnWriterImpl::Close+0x47
> 001e`10efef60 7ff7`7f7154da : 0122`bbec3cd0 001e`10eff1c0 
> 0122`bbec4bb0 0122`b2439998 : 
> CortexSync!parquet::arrow::FileWriter::Impl::`vector deleting 
> destructor'+0x100
> 001e`10efefa0 7ff7`7f71619c : ` 001e`10eff1c0 
> 0122`bbe89390 ` : 
> CortexSync!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x6fa
> 001e`10eff150 7ff7`7f202de9 : `0001 001e`10eff430 
> `000f ` : 
> CortexSync!parquet::arrow::FileWriter::WriteTable+0x6cc
> 001e`10eff410 7ff7`7f18baf3 : 0122`bbec39b0 0122`b24c53f8 
> `3f80 ` : 
> CortexSync!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0x49{code}
> I tried a lot of ways to find out the root cause, but failed. Can anyone here 
> give me some information/advice please, so that I can investigate more? 
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5411) [C++][Python] Build error building on Mac OS Mojave

2019-05-24 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5411.

Resolution: Fixed
  Assignee: Uwe L. Korn

> [C++][Python] Build error building on Mac OS Mojave
> ---
>
> Key: ARROW-5411
> URL: https://issues.apache.org/jira/browse/ARROW-5411
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Mac OSX Mojave 10.14.5
> Anaconda 4.6.14
> XCode 10.2.1
> CLANGXX=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++
> CLANG=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang
>Reporter: Miguel Cabrera
>Assignee: Uwe L. Korn
>Priority: Major
>
> After following the instruction on the Python development and Building 
> instruction for C++, I get a linking error:
>  
> {code:java}
> $ pwd
> /Users/mcabrera/dev/arrow/cpp/release
> $ cmake -DARROW_BUILD_TESTS=ON  ..
> ()
> ld: warning: ignoring file 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd,
>  file was built for unsupported file format ( 0x2D 0x2D 0x2D 0x20 0x21 0x74 
> 0x61 0x70 0x69 0x2D 0x74 0x62 0x64 0x2D 0x76 0x33 ) which is not the 
> architecture being linked (x86_64): 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd}}
>  ld: dynamic main executables must link with libSystem.dylib for architecture 
> x86_64
> clang-4.0: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> make[1]: *** [cmTC_510d0] Error 1
> make: *** [cmTC_510d0/fast] Error 2
>  {code}
> Same issue if I follow the instructions on the Python Development 
> documentation
> {code:java}
>  mkdir arrow/cpp/build
>  pushd arrow/cpp/build
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>   -DCMAKE_INSTALL_LIBDIR=lib \
>   -DARROW_FLIGHT=ON \
>   -DARROW_GANDIVA=ON \
>   -DARROW_ORC=ON \
>   -DARROW_PARQUET=ON \
>   -DARROW_PYTHON=ON \
>   -DARROW_PLASMA=ON \
>   -DARROW_BUILD_TESTS=ON \
>   ..
> {code}
> The Python development documentation is not clear whether, in order to build 
> the Python (and C++) library, the brew dependencies are necessary (or whether 
> Anaconda alone is enough), so I installed them nonetheless. However, I get the 
> same issue.
> h2. Environment
> {code:java}
> Mac OSX Mojave 10.14.5
> Anaconda 4.6.14
> XCode 10.2.1
> CLANGXX=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++
> CLANG=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang{code}
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5411) [C++][Python] Build error building on Mac OS Mojave

2019-05-24 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847374#comment-16847374
 ] 

Uwe L. Korn commented on ARROW-5411:


This is missing the right MacOS SDK as outlined here in the "MacOS SDK" 
section: 
https://www.anaconda.com/utilizing-the-new-compilers-in-anaconda-distribution-5/

To use this SDK, you need to set CONDA_BUILD_SYSROOT as an environment variable 
after you have activated the conda environment.

> [C++][Python] Build error building on Mac OS Mojave
> ---
>
> Key: ARROW-5411
> URL: https://issues.apache.org/jira/browse/ARROW-5411
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Mac OSX Mojave 10.14.5
> Anaconda 4.6.14
> XCode 10.2.1
> CLANGXX=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++
> CLANG=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang
>Reporter: Miguel Cabrera
>Priority: Major
>
> After following the instruction on the Python development and Building 
> instruction for C++, I get a linking error:
>  
> {code:java}
> $ pwd
> /Users/mcabrera/dev/arrow/cpp/release
> $ cmake -DARROW_BUILD_TESTS=ON  ..
> ()
> ld: warning: ignoring file 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd,
>  file was built for unsupported file format ( 0x2D 0x2D 0x2D 0x20 0x21 0x74 
> 0x61 0x70 0x69 0x2D 0x74 0x62 0x64 0x2D 0x76 0x33 ) which is not the 
> architecture being linked (x86_64): 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd}}
>  ld: dynamic main executables must link with libSystem.dylib for architecture 
> x86_64
> clang-4.0: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> make[1]: *** [cmTC_510d0] Error 1
> make: *** [cmTC_510d0/fast] Error 2
>  {code}
> Same issue if I follow the instructions on the Python Development 
> documentation
> {code:java}
>  mkdir arrow/cpp/build
>  pushd arrow/cpp/build
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>   -DCMAKE_INSTALL_LIBDIR=lib \
>   -DARROW_FLIGHT=ON \
>   -DARROW_GANDIVA=ON \
>   -DARROW_ORC=ON \
>   -DARROW_PARQUET=ON \
>   -DARROW_PYTHON=ON \
>   -DARROW_PLASMA=ON \
>   -DARROW_BUILD_TESTS=ON \
>   ..
> {code}
> The Python development documentation is not clear whether, in order to build 
> the Python (and C++) library, the brew dependencies are necessary (or whether 
> Anaconda alone is enough), so I installed them nonetheless. However, I get the 
> same issue.
> h2. Environment
> {code:java}
> Mac OSX Mojave 10.14.5
> Anaconda 4.6.14
> XCode 10.2.1
> CLANGXX=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++
> CLANG=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang{code}
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5403) [C++] Test failures not propagated in Windows shared builds

2019-05-23 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846684#comment-16846684
 ] 

Uwe L. Korn commented on ARROW-5403:


Sorry, I meant {{ASSERT_NO_THROW}}

> [C++] Test failures not propagated in Windows shared builds
> ---
>
> Key: ARROW-5403
> URL: https://issues.apache.org/jira/browse/ARROW-5403
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 0.14.0
>
>
> See https://github.com/google/googletest/issues/2261
> Try e.g. this change:
> {code}
> diff --git a/cpp/src/arrow/buffer-test.cc b/cpp/src/arrow/buffer-test.cc
> index 9b0530e5c..ce7628f55 100644
> --- a/cpp/src/arrow/buffer-test.cc
> +++ b/cpp/src/arrow/buffer-test.cc
> @@ -35,6 +35,10 @@
>  namespace arrow {
>  TEST(TestAllocate, Bitmap) {
> +  auto buf1 = Buffer::FromString("a");
> +  auto buf2 = Buffer::FromString("b");
> +  AssertBufferEqual(*buf1, *buf2);
> +
>   std::shared_ptr<Buffer> new_buffer;
>   ARROW_EXPECT_OK(AllocateBitmap(default_memory_pool(), 100, &new_buffer));
>   EXPECT_GE(new_buffer->size(), 13);
> {code}
> On a Windows shared library build, it outputs this:
> {code}
> [==] Running 31 tests from 11 test cases.
> [--] Global test environment set-up.
> [--] 2 tests from TestAllocate
> [ RUN  ] TestAllocate.Bitmap
> ..\src\arrow\testing\gtest_util.cc(120): error: Value of: 
> buffer.Equals(expected
> )
>   Actual: false
> Expected: true
> [   OK ] TestAllocate.Bitmap (0 ms)
> [ RUN  ] TestAllocate.EmptyBitmap
> [   OK ] TestAllocate.EmptyBitmap (0 ms)
> [--] 2 tests from TestAllocate (0 ms total)
> {code}
>  and the entire test file is marked passed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5403) [C++] Test failures not propagated in Windows shared builds

2019-05-23 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846675#comment-16846675
 ] 

Uwe L. Korn commented on ARROW-5403:


Maybe we need to add {{ASSERT_NO_RAISES}} around {{AssertBufferEqual}} as we 
did in some other tests?
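
For illustration, a minimal sketch of that wrapping (using {{ASSERT_NO_THROW}} as 
per the correction above; this assumes googletest plus the existing 
{{AssertBufferEqual}} helper with the header path guessed as 
{{arrow/testing/gtest_util.h}}, and is only a sketch, not the committed fix):

{code:java}
#include <gtest/gtest.h>

#include "arrow/buffer.h"
#include "arrow/testing/gtest_util.h"

namespace arrow {

TEST(TestAllocate, Bitmap) {
  auto buf1 = Buffer::FromString("a");
  auto buf2 = Buffer::FromString("b");
  // Wrap the helper call in a fatal assertion, as suggested in the comments.
  ASSERT_NO_THROW(AssertBufferEqual(*buf1, *buf2));
}

}  // namespace arrow
{code}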

> [C++] Test failures not propagated in Windows shared builds
> ---
>
> Key: ARROW-5403
> URL: https://issues.apache.org/jira/browse/ARROW-5403
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 0.14.0
>
>
> See https://github.com/google/googletest/issues/2261
> Try e.g. this change:
> {code}
> diff --git a/cpp/src/arrow/buffer-test.cc b/cpp/src/arrow/buffer-test.cc
> index 9b0530e5c..ce7628f55 100644
> --- a/cpp/src/arrow/buffer-test.cc
> +++ b/cpp/src/arrow/buffer-test.cc
> @@ -35,6 +35,10 @@
>  namespace arrow {
>  TEST(TestAllocate, Bitmap) {
> +  auto buf1 = Buffer::FromString("a");
> +  auto buf2 = Buffer::FromString("b");
> +  AssertBufferEqual(*buf1, *buf2);
> +
>   std::shared_ptr<Buffer> new_buffer;
>   ARROW_EXPECT_OK(AllocateBitmap(default_memory_pool(), 100, &new_buffer));
>   EXPECT_GE(new_buffer->size(), 13);
> {code}
> On a Windows shared library build, it outputs this:
> {code}
> [==] Running 31 tests from 11 test cases.
> [--] Global test environment set-up.
> [--] 2 tests from TestAllocate
> [ RUN  ] TestAllocate.Bitmap
> ..\src\arrow\testing\gtest_util.cc(120): error: Value of: 
> buffer.Equals(expected
> )
>   Actual: false
> Expected: true
> [   OK ] TestAllocate.Bitmap (0 ms)
> [ RUN  ] TestAllocate.EmptyBitmap
> [   OK ] TestAllocate.EmptyBitmap (0 ms)
> [--] 2 tests from TestAllocate (0 ms total)
> {code}
>  and the entire test file is marked passed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2981) [C++] Support scripts / documentation for running clang-tidy on codebase

2019-05-13 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838624#comment-16838624
 ] 

Uwe L. Korn commented on ARROW-2981:


[~bkietz] This is the intended behaviour. We also have a check-format command 
in CMake, but it is not yet exposed via docker-compose.

> [C++] Support scripts / documentation for running clang-tidy on codebase
> 
>
> Key: ARROW-2981
> URL: https://issues.apache.org/jira/browse/ARROW-2981
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Related to ARROW-2952, ARROW-2980



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5265) [Python/CI] Add integration test with kartothek

2019-05-06 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-5265:
--

 Summary: [Python/CI] Add integration test with kartothek
 Key: ARROW-5265
 URL: https://issues.apache.org/jira/browse/ARROW-5265
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Reporter: Uwe L. Korn
 Fix For: 0.15.0


https://github.com/JDASoftwareGroup/kartothek is a heavy user of Apache Arrow 
and thus a good indicator of whether we have introduced breakages in 
{{pyarrow}}. We should therefore run regular integration tests against it as we 
do with other libraries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5235) [C++] RAPIDJSON_INCLUDE_DIR not set (Windows + Anaconda)

2019-04-29 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-5235:
---
Component/s: Packaging

> [C++] RAPIDJSON_INCLUDE_DIR not set (Windows + Anaconda)
> 
>
> Key: ARROW-5235
> URL: https://issues.apache.org/jira/browse/ARROW-5235
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: Windows
>
> I'm trying to build Arrow in debug mode on Windows with some dependencies 
> installed via conda. Unfortunately I ultimately get the following error:
> {code}
> [...]
> -- RapidJSON found. Headers: C:/Miniconda3/envs/arrow/Library/include
> [...]
> -- Could NOT find Backtrace (missing: Backtrace_LIBRARY Backtrace_INCLUDE_DIR)
> CMake Error: The following variables are used in this project, but they are 
> set
> to NOTFOUND.
> Please set them or make sure they are set and tested correctly in the CMake 
> file
> s:
> RAPIDJSON_INCLUDE_DIR
>used as include directory in directory C:/t/arrow/cpp
> [ etc. ]
> {code}
> RapidJSON 1.1.0 is installed from Anaconda.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5235) [C++] RAPIDJSON_INCLUDE_DIR not set (Windows + Anaconda)

2019-04-29 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-5235:
---
Labels: Windows  (was: WIndows)

> [C++] RAPIDJSON_INCLUDE_DIR not set (Windows + Anaconda)
> 
>
> Key: ARROW-5235
> URL: https://issues.apache.org/jira/browse/ARROW-5235
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: Windows
>
> I'm trying to build Arrow in debug mode on Windows with some dependencies 
> installed via conda. Unfortunately I ultimately get the following error:
> {code}
> [...]
> -- RapidJSON found. Headers: C:/Miniconda3/envs/arrow/Library/include
> [...]
> -- Could NOT find Backtrace (missing: Backtrace_LIBRARY Backtrace_INCLUDE_DIR)
> CMake Error: The following variables are used in this project, but they are 
> set
> to NOTFOUND.
> Please set them or make sure they are set and tested correctly in the CMake 
> file
> s:
> RAPIDJSON_INCLUDE_DIR
>used as include directory in directory C:/t/arrow/cpp
> [ etc. ]
> {code}
> RapidJSON 1.1.0 is installed from Anaconda.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2079) [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't available

2019-04-29 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829365#comment-16829365
 ] 

Uwe L. Korn commented on ARROW-2079:


It does make sense to write {{_common_metadata}} without {{_metadata}} as this 
already gives you a global schema for all Parquet files in the dataset. In the 
presence of {{_metadata}}, {{_common_metadata}} is actually redundant as the 
schema is already contained in {{_metadata}}.

> [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't 
> available
> --
>
> Key: ARROW-2079
> URL: https://issues.apache.org/jira/browse/ARROW-2079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Minor
>  Labels: parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not 
> `_metadata`. From what I understand these are intended to contain the dataset 
> schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:java}
> if self.metadata_path is not None:
> with self.fs.open(self.metadata_path) as f:
> self.common_metadata = ParquetFile(f).metadata
> else:
> self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`, 
> as the latter is never written by `pyarrow`, and is given by the `_metadata` 
> file instead of `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the 
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently 
> only writes the former, and as far as I can tell `_common_metadata` does 
> include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:java}
> if self.schema is not None:
> pass  # schema explicitly provided
> elif self.metadata is not None:
> self.schema = self.metadata.schema
> elif self.common_metadata is not None:
> self.schema = self.common_metadata.schema
> else:
> self.schema = self.pieces[0].get_metadata(open_file).schema{code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear 
> to me the difference between `_common_metadata` and `_metadata`, but I 
> believe the schema in both should be the same. Figured I'd open this for 
> discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4935) [C++] Errors from jemalloc when building pyarrow from source on OSX and Debian

2019-04-25 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-4935.

Resolution: Fixed

> [C++] Errors from jemalloc when building pyarrow from source on OSX and Debian
> --
>
> Key: ARROW-4935
> URL: https://issues.apache.org/jira/browse/ARROW-4935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: OSX, Debian, Python==3.6.7
>Reporter: Gregory Hayes
>Priority: Critical
>  Labels: build, newbie
>
> My attempts to build pyarrow from source are failing. I've set up the conda 
> environment using the instructions provided in the Develop instructions, and 
> have tried this on both Debian and OSX. When I run CMAKE in debug mode on 
> OSX, the output is:
> {code:java}
> -- Building using CMake version: 3.14.0
> -- Arrow version: 0.13.0 (full: '0.13.0-SNAPSHOT')
> -- clang-tidy not found
> -- clang-format not found
> -- infer found at /usr/local/bin/infer
> -- Using ccache: /usr/local/bin/ccache
> -- Found cpplint executable at 
> /Users/Greg/documents/repos/arrow/cpp/build-support/cpplint.py
> -- Compiler command: env LANG=C 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>  -v
> -- Compiler version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
> Target: x86_64-apple-darwin18.2.0
> Thread model: posix
> InstalledDir: 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
> -- Compiler id: AppleClang
> Selected compiler clang 4.1.0svn
> -- Arrow build warning level: CHECKIN
> Configured for DEBUG build (set with cmake 
> -DCMAKE_BUILD_TYPE={release,debug,...})
> -- Build Type: DEBUG
> -- BOOST_VERSION: 1.67.0
> -- BROTLI_VERSION: v0.6.0
> -- CARES_VERSION: 1.15.0
> -- DOUBLE_CONVERSION_VERSION: v3.1.1
> -- FLATBUFFERS_VERSION: v1.10.0
> -- GBENCHMARK_VERSION: v1.4.1
> -- GFLAGS_VERSION: v2.2.0
> -- GLOG_VERSION: v0.3.5
> -- GRPC_VERSION: v1.18.0
> -- GTEST_VERSION: 1.8.1
> -- JEMALLOC_VERSION: 17c897976c60b0e6e4f4a365c751027244dada7a
> -- LZ4_VERSION: v1.8.3
> -- ORC_VERSION: 1.5.4
> -- PROTOBUF_VERSION: v3.6.1
> -- RAPIDJSON_VERSION: v1.1.0
> -- RE2_VERSION: 2018-10-01
> -- SNAPPY_VERSION: 1.1.3
> -- THRIFT_VERSION: 0.11.0
> -- ZLIB_VERSION: 1.2.8
> -- ZSTD_VERSION: v1.3.7
> -- Boost version: 1.68.0
> -- Found the following Boost libraries:
> --   regex
> --   system
> --   filesystem
> -- Boost include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Boost libraries: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_system_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib
> Added shared library dependency boost_filesystem_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_regex_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib
> Added static library dependency double-conversion_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- double-conversion include dir: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- double-conversion static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- GFLAGS_HOME: /Users/Greg/anaconda3/envs/pyarrow-dev
> -- GFlags include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- GFlags static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> Added static library dependency gflags_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> -- RapidJSON include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Found the Flatbuffers library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libflatbuffers.a
> -- Flatbuffers include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Flatbuffers compiler: /Users/Greg/anaconda3/envs/pyarrow-dev/bin/flatc
> Added static library dependency jemalloc_static: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a
> Added shared library dependency jemalloc_shared: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc.dylib
> -- Found hdfs.h at: 
> /Users/Greg/documents/repos/arrow/cpp/thirdparty/hadoop/include/hdfs.h
> -- Found the ZLIB shared library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libz.dylib
> Added shared library dependency zlib_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libz.dylib
> -- SNAPPY_HOME: /Users/Greg/anaconda3/envs/pyarrow-dev
> -- Found 

[jira] [Resolved] (ARROW-5167) [C++] Upgrade string-view-light to latest

2019-04-22 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-5167.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4182
[https://github.com/apache/arrow/pull/4182]

> [C++] Upgrade string-view-light to latest
> -
>
> Key: ARROW-5167
> URL: https://issues.apache.org/jira/browse/ARROW-5167
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Lawrence Chan
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> string-view-lite had a typo in one of its macros (fixed in 
> https://github.com/martinmoene/string-view-lite/commit/2f2cce35293b0027056e5449b2c05b5f9c3e89ff).
>   We should vendor the latest version in the next Arrow release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4824) [Python] read_csv should accept io.StringIO objects

2019-04-22 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-4824.

Resolution: Fixed

Issue resolved by pull request 4183
[https://github.com/apache/arrow/pull/4183]

> [Python] read_csv should accept io.StringIO objects
> ---
>
> Key: ARROW-4824
> URL: https://issues.apache.org/jira/browse/ARROW-4824
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Dave Hirschfeld
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It would be nice/convenient if `read_csv` also supported `io.StringIO` 
> objects:
>  
> {{In [59]: csv = 
> io.StringIO('''issue_date_utc,variable_name,station_name,station_id,value_date_utc,value}}
> {{    ...: 2019-02-26 22:00:00,TEMPERATURE,ARCHERFIELD,040211,2019-02-27 
> 03:00,29.1}}
> {{    ...: ''')}}
> {{In [60]: pd.read_csv(csv)}}
> {{Out[60]: }}
> {{    issue_date_utc variable_name  ...    value_date_utc  value}}
> {{0  2019-02-26 22:00:00   TEMPERATURE  ...  2019-02-27 03:00   29.1}}
> {{[1 rows x 6 columns]}}
> {{In [61]: pa.csv.read_csv(csv)}}
> {{Traceback (most recent call last):}}
> {{  File "", line 1, in }}
> {{    pa.csv.read_csv(csv)}}
> {{SystemError:  returned NULL without setting an 
> error}}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4935) [C++] Errors from jemalloc when building pyarrow from source on OSX and Debian

2019-04-22 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823401#comment-16823401
 ] 

Uwe L. Korn commented on ARROW-4935:


The {{conda_build_config.yaml}} only works when you build with {{conda build}}, 
but in this case we aren't using it and are calling cmake / the compiler 
directly, so you need to {{export CONDA_BUILD_SYSROOT=}} yourself.

> [C++] Errors from jemalloc when building pyarrow from source on OSX and Debian
> --
>
> Key: ARROW-4935
> URL: https://issues.apache.org/jira/browse/ARROW-4935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: OSX, Debian, Python==3.6.7
>Reporter: Gregory Hayes
>Priority: Critical
>  Labels: build, newbie
>
> My attempts to build pyarrow from source are failing. I've set up the conda 
> environment using the instructions provided in the Develop instructions, and 
> have tried this on both Debian and OSX. When I run CMAKE in debug mode on 
> OSX, the output is:
> {code:java}
> -- Building using CMake version: 3.14.0
> -- Arrow version: 0.13.0 (full: '0.13.0-SNAPSHOT')
> -- clang-tidy not found
> -- clang-format not found
> -- infer found at /usr/local/bin/infer
> -- Using ccache: /usr/local/bin/ccache
> -- Found cpplint executable at 
> /Users/Greg/documents/repos/arrow/cpp/build-support/cpplint.py
> -- Compiler command: env LANG=C 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>  -v
> -- Compiler version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
> Target: x86_64-apple-darwin18.2.0
> Thread model: posix
> InstalledDir: 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
> -- Compiler id: AppleClang
> Selected compiler clang 4.1.0svn
> -- Arrow build warning level: CHECKIN
> Configured for DEBUG build (set with cmake 
> -DCMAKE_BUILD_TYPE={release,debug,...})
> -- Build Type: DEBUG
> -- BOOST_VERSION: 1.67.0
> -- BROTLI_VERSION: v0.6.0
> -- CARES_VERSION: 1.15.0
> -- DOUBLE_CONVERSION_VERSION: v3.1.1
> -- FLATBUFFERS_VERSION: v1.10.0
> -- GBENCHMARK_VERSION: v1.4.1
> -- GFLAGS_VERSION: v2.2.0
> -- GLOG_VERSION: v0.3.5
> -- GRPC_VERSION: v1.18.0
> -- GTEST_VERSION: 1.8.1
> -- JEMALLOC_VERSION: 17c897976c60b0e6e4f4a365c751027244dada7a
> -- LZ4_VERSION: v1.8.3
> -- ORC_VERSION: 1.5.4
> -- PROTOBUF_VERSION: v3.6.1
> -- RAPIDJSON_VERSION: v1.1.0
> -- RE2_VERSION: 2018-10-01
> -- SNAPPY_VERSION: 1.1.3
> -- THRIFT_VERSION: 0.11.0
> -- ZLIB_VERSION: 1.2.8
> -- ZSTD_VERSION: v1.3.7
> -- Boost version: 1.68.0
> -- Found the following Boost libraries:
> --   regex
> --   system
> --   filesystem
> -- Boost include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Boost libraries: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_system_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib
> Added shared library dependency boost_filesystem_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_regex_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib
> Added static library dependency double-conversion_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- double-conversion include dir: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- double-conversion static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- GFLAGS_HOME: /Users/Greg/anaconda3/envs/pyarrow-dev
> -- GFlags include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- GFlags static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> Added static library dependency gflags_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> -- RapidJSON include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Found the Flatbuffers library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libflatbuffers.a
> -- Flatbuffers include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Flatbuffers compiler: /Users/Greg/anaconda3/envs/pyarrow-dev/bin/flatc
> Added static library dependency jemalloc_static: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a
> Added shared library dependency jemalloc_shared: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc.dylib
> -- Found hdfs.h at: 
> /Users/Greg/documents/repos/arrow/cpp/thirdparty/hadoop/include/hdfs.h
> -- Found the ZLIB shared library: 

[jira] [Commented] (ARROW-5176) [Python] Automate formatting of python files

2019-04-22 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823033#comment-16823033
 ] 

Uwe L. Korn commented on ARROW-5176:


I have used black in many other projects and am very happy with it. One of the 
major benefits is that it is not configurable and thus also saves us from long 
discussions about styling.

> [Python] Automate formatting of python files
> 
>
> Key: ARROW-5176
> URL: https://issues.apache.org/jira/browse/ARROW-5176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Benjamin Kietzman
>Priority: Minor
>
> [Black](https://github.com/ambv/black) is a tool for automatically formatting 
> python code in ways which flake8 and our other linters approve of. Adding it 
> to the project will allow more reliably formatted python code and fill a 
> similar role to {{clang-format}} for c++ and {{cmake-format}} for cmake



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4139) [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is set

2019-04-22 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823025#comment-16823025
 ] 

Uwe L. Korn commented on ARROW-4139:


Have a look at https://github.com/apache/arrow/pull/2623/files which implements 
predicate pushdown in Arrow on the Python side. There should also be some code 
in there that handles the physical to logical conversion.

> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is 
> set
> ---
>
> Key: ARROW-4139
> URL: https://issues.apache.org/jira/browse/ARROW-4139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Matthew Rocklin
>Priority: Minor
>  Labels: parquet, python
> Fix For: 0.14.0
>
>
> When writing Pandas data to Parquet format and reading it back again I find 
> that the statistics of text columns are stored as byte arrays rather than as 
> unicode text. 
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding 
> of how best to manage statistics.  (I'd be quite happy to learn that it was 
> the latter).
> Here is a minimal example
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> rg = piece.row_group(0)
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
> >>> c
> 
>   file_offset: 63
>   file_path: 
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
> 
>   has_min_max: True
>   min: b'a'
>   max: b'a'
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
> >>> type(c.statistics.min)
> bytes
> {code}
> My guess is that we would want to store a logical type in the statistics like 
> UNICODE, though I don't have enough experience with Parquet data types to 
> know if this is a good idea or possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5102) [C++] Reduce header dependencies

2019-04-07 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811814#comment-16811814
 ] 

Uwe L. Korn commented on ARROW-5102:


"avoid including" means in this case to foward declare and then include in the 
{{.cc}}?
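
As a rough sketch of that pattern (hypothetical {{Foo}} class, not an actual 
Arrow header): the public header only forward declares and keeps the heavy 
includes in the {{.cc}} behind a pimpl:

{code:java}
// foo.h -- no <unordered_map> or <sstream> include needed here.
#include <memory>

namespace arrow {

class MemoryPool;  // forward declaration instead of including the full header

class Foo {
 public:
  explicit Foo(MemoryPool* pool);
  ~Foo();  // defined in foo.cc, where Impl is a complete type

 private:
  class Impl;                   // implementation details stay out of the header
  std::unique_ptr<Impl> impl_;
};

}  // namespace arrow

// foo.cc -- the heavy includes live only in the translation unit:
// #include <unordered_map>
// #include "arrow/memory_pool.h"
{code}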

> [C++] Reduce header dependencies
> 
>
> Key: ARROW-5102
> URL: https://issues.apache.org/jira/browse/ARROW-5102
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> To tame C++ compile times, we should try to reduce the number of heavy 
> dependencies in our .h files.
> Two possible avenues come to mind:
> * avoid including `unordered_map` and friends
> * avoid including C++ stream libraries (such as `iostream`, `ios`, 
> `sstream`...)
> Unfortunately we're currently including `sstream` in `status.h` for some 
> template APIs. We may move those to a separate include file (e.g. 
> `status-builder.h`).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages

2019-04-07 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811813#comment-16811813
 ] 

Uwe L. Korn commented on ARROW-5101:


I would actually vote for only distributing dynamic libs on all platforms in 
conda packages. This seems to be the default for almost all conda packages and 
the preferred way for Linux distributions. We mostly added static libraries to 
our dependencies in conda-forge at the beginning of Arrow development because 
Arrow wanted to have static libs.

> [Packaging] Avoid bundling static libraries in Windows conda packages
> -
>
> Key: ARROW-5101
> URL: https://issues.apache.org/jira/browse/ARROW-5101
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Packaging
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> We're currently bundling static libraries in Windows conda packages. 
> Unfortunately, it causes these to be quite large:
> {code:bash}
> $ ls -la ./Library/lib
> total 507808
> drwxrwxr-x 4 antoine antoine  4096 avril  3 10:28 .
> drwxrwxr-x 5 antoine antoine  4096 avril  3 10:28 ..
> -rw-rw-r-- 1 antoine antoine   1507048 avril  1 20:58 arrow.lib
> -rw-rw-r-- 1 antoine antoine 76184 avril  1 20:59 arrow_python.lib
> -rw-rw-r-- 1 antoine antoine  61323846 avril  1 21:00 arrow_python_static.lib
> -rw-rw-r-- 1 antoine antoine 32809 avril  1 21:02 arrow_static.lib
> drwxrwxr-x 3 antoine antoine  4096 avril  3 10:28 cmake
> -rw-rw-r-- 1 antoine antoine491292 avril  1 21:02 parquet.lib
> -rw-rw-r-- 1 antoine antoine 128473780 avril  1 21:03 parquet_static.lib
> drwxrwxr-x 2 antoine antoine  4096 avril  3 10:27 pkgconfig
> {code}
> (see files in https://anaconda.org/conda-forge/arrow-cpp/files )
> We should probably only ship dynamic libraries under Windows, as those are 
> reasonably small.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python

2019-04-07 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811810#comment-16811810
 ] 

Uwe L. Korn commented on ARROW-3543:


[~Olafsson] For now, I think your best approach is to have a look and try to 
fix it by yourself. Feel free to reach out on the mailing lists when you have 
questions on debugging the issue.

> [R] Time zone adjustment issue when reading Feather file written by Python
> --
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Olaf
>Priority: Critical
> Fix For: 0.14.0
>
>
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
> {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
> pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
> 14:01:02.200')]}
> )
> df['timestamp_est'] = 
> pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
>  Out[17]: 
>  string_time_utc timestamp_est
>  0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
>  1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in 
> `UTC` time).
> Now saving the dataframe to `csv` or to `feather` will generate two 
> completely different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
> `timestamp_est`. No big deal, I can always use `with_tz` or even better: 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
>  Parsed with column specification:
>  cols(
>  X1 = col_integer(),
>  string_time_utc = col_datetime(format = ""),
>  timestamp_est = col_datetime(format = "")
>  )
>  Warning message:
>  Missing column names filled in: 'X1' [1] 
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 4
>  X1 string_time_utc timestamp_est 
> 
>  1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530
>  2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
>  mytimezone
>   
>  1 UTC 
>  2 UTC 
>  3 UTC  {code}
> {code:java}
> #Now look at what happens with feather:
>  
>  > dataframe <- read_feather('P://testing.feather')
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 3
>  string_time_utc timestamp_est mytimezone
> 
>  1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 "" 
>  2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 "" 
>  3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 "" {code}
> My timestamps have been converted!!! pure insanity. 
>  Am I missing something here?
> Thanks!!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5074) [C++/Python] When installing into a SYSTEM prefix, RPATHs are not correctly set

2019-03-31 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-5074:
--

 Summary: [C++/Python] When installing into a SYSTEM prefix, RPATHs 
are not correctly set
 Key: ARROW-5074
 URL: https://issues.apache.org/jira/browse/ARROW-5074
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging, Python
Reporter: Uwe L. Korn


When installing the Arrow libraries into a system with a prefix (mostly a conda 
env), the RPATHs are not correctly set by CMake (there is no RPATH). Thus we 
need to use {{LD_LIBRARY_PATH}} in consumers. When packages are built using 
{{conda-build}}, this is taken care of by its post-processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-465) [C++] Investigate usage of madvise

2019-03-31 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806204#comment-16806204
 ] 

Uwe L. Korn commented on ARROW-465:
---

The context of this ticket was that I was browsing the {{jemalloc}} source code 
and performance traces. We spend a lot of time in the builders repeatedly 
allocating new pages when writing to a newly allocated memory segment. By 
specifying {{MADV_WILLNEED}} we might reduce the time we wait for new pages. In 
the end, I want the OS to allocate all pages I have requested with 
{{(je_)malloc}} immediately and not every page on first access (when a page 
fault occurs).
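
A minimal standalone sketch of the idea (plain {{mmap}}/{{madvise}} on Linux, 
not the actual jemalloc integration):

{code:java}
#include <sys/mman.h>

#include <cstddef>
#include <cstring>

int main() {
  const std::size_t length = std::size_t(64) * 1024 * 1024;  // 64 MiB region
  void* addr = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (addr == MAP_FAILED) {
    return 1;
  }
  // Hint that the whole region will be used soon, so the kernel can populate
  // pages ahead of the first write instead of faulting them in one by one.
  madvise(addr, length, MADV_WILLNEED);
  std::memset(addr, 0, length);  // first touch of every page
  munmap(addr, length);
  return 0;
}
{code}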

> [C++] Investigate usage of madvise 
> ---
>
> Key: ARROW-465
> URL: https://issues.apache.org/jira/browse/ARROW-465
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> In some usecases (e.g. Pandas->Arrow conversion) our main constraint is page 
> faulting not yet accessed pages. 
> With {{madvise}} we can indicate our planned actions to the OS and may 
> improve the performance a bit in these cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5017) [C++] [CI] Thrift not found on Azure Pipelines

2019-03-27 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802602#comment-16802602
 ] 

Uwe L. Korn commented on ARROW-5017:


This looks like the same problem that conda-forge is experiencing: when using 
{{Ninja}} as the generator, CMake erroneously detects gcc as the compiler, see 
[https://github.com/conda-forge/conda-forge.github.io/issues/714]

> [C++] [CI] Thrift not found on Azure Pipelines
> --
>
> Key: ARROW-5017
> URL: https://issues.apache.org/jira/browse/ARROW-5017
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 0.14.0
>
>
> I don't understand why this happens. The conda-forge package for 
> {{thrift-cpp}} is the same as installed on AppVeyor, yet on Azure Pipelines 
> the static library isn't found:
> https://dev.azure.com/pitrou/arrow/_build/results?buildId=70
> {code}
> -- Checking for module 'thrift'
> --   No package 'thrift' found
> CMake Error at 
> D:/a/1/conda-envs/arrow/Library/share/cmake-3.14/Modules/FindPackageHandleStandardArgs.cmake:137
>  (message):
>   Could NOT find Thrift (missing: THRIFT_STATIC_LIB)
> Call Stack (most recent call first):
>   
> D:/a/1/conda-envs/arrow/Library/share/cmake-3.14/Modules/FindPackageHandleStandardArgs.cmake:378
>  (_FPHSA_FAILURE_MESSAGE)
>   cmake_modules/FindThrift.cmake:94 (find_package_handle_standard_args)
>   cmake_modules/ThirdpartyToolchain.cmake:146 (find_package)
>   cmake_modules/ThirdpartyToolchain.cmake:1076 (resolve_dependency)
>   CMakeLists.txt:544 (include)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4608) [C++] cmake script assumes that double-conversion installs static libs

2019-03-25 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-4608.

Resolution: Fixed

Yes, this is fixed. We now use what is provided and preferred.

> [C++] cmake script assumes that double-conversion installs static libs
> --
>
> Key: ARROW-4608
> URL: https://issues.apache.org/jira/browse/ARROW-4608
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Yuri
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.13.0
>
>
> This line: 
> [https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake#L580]
> The {{double-conversion}} project can alternatively build shared libraries, 
> when {{BUILD_SHARED_LIBS=ON}} is used.
> You should only use libraries that {{double-conversion}} cmake script 
> provides, which is in the {{double-conversion_LIBRARIES}} variable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4987) [C++] Use orc conda-package on Linux and OSX

2019-03-21 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-4987:
---
Description: Instead of always building our vendored ORC from source, we 
should use the result of https://github.com/conda-forge/orc-feedstock. It would 
be even better if there were also Windows builds for this, but currently ORC 
doesn't build on Windows.

> [C++] Use orc conda-package on Linux and OSX
> 
>
> Key: ARROW-4987
> URL: https://issues.apache.org/jira/browse/ARROW-4987
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.15.0
>
>
> Instead of always building our vendored ORC from source, we should use the 
> result of https://github.com/conda-forge/orc-feedstock. It would be even 
> better if there were also Windows builds for this, but currently ORC doesn't 
> build on Windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-4987) [C++] Use orc conda-package on Linux and OSX

2019-03-21 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-4987:
--

Assignee: (was: Uwe L. Korn)

> [C++] Use orc conda-package on Linux and OSX
> 
>
> Key: ARROW-4987
> URL: https://issues.apache.org/jira/browse/ARROW-4987
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4987) [C++] Use orc conda-package on Linux and OSX

2019-03-21 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-4987:
---
Fix Version/s: 0.15.0

> [C++] Use orc conda-package on Linux and OSX
> 
>
> Key: ARROW-4987
> URL: https://issues.apache.org/jira/browse/ARROW-4987
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Packaging
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4987) [C++] Use orc conda-package on Linux and OSX

2019-03-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4987:
--

 Summary: [C++] Use orc conda-package on Linux and OSX
 Key: ARROW-4987
 URL: https://issues.apache.org/jira/browse/ARROW-4987
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4985) [C++] arrow/testing headers are not installed

2019-03-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4985:
--

 Summary: [C++] arrow/testing headers are not installed
 Key: ARROW-4985
 URL: https://issues.apache.org/jira/browse/ARROW-4985
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4688) [C++][Parquet] 16MB limit on (nested) column chunk prevents tuning row_group_size

2019-03-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797199#comment-16797199
 ] 

Uwe L. Korn commented on ARROW-4688:


To understand how this bug actually occurred:
 * Why is there a limit of 16MB in the chunking, wouldn't 2G be better (i.e. 
fewer chunks)?
 * The chunks are currently split on the lowest nesting level. As far as I can 
see, a higher-level list could be spread across chunks. Is this correct?

> [C++][Parquet] 16MB limit on (nested) column chunk prevents tuning 
> row_group_size
> -
>
> Key: ARROW-4688
> URL: https://issues.apache.org/jira/browse/ARROW-4688
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Remek Zajac
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: parquet
> Fix For: 0.13.0
>
>
> We are working on parquet files that involve nested lists. At most they are 
> multi-dimensional lists of simple types (never structs), but i understand, 
> for Parquet, they're still nested columns and involve repetition levels. 
> Some of these columns hold lists of rather large byte arrays (that dominate 
> the overall size of the row). When we bump the `row_group_size` to above 16MB 
> we see: 
>  
> {code:java}
> File "pyarrow/_parquet.pyx", line 700, in 
> pyarrow._parquet.ParquetReader.read_row_group
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented 
> for chunked array outputs{code}
>  
> I conclude it's 
> [this|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L848]
>  bit complaining:
>  
> {code:java}
> template 
>   Status PrimitiveImpl::WrapIntoListArray(Datum* inout_array) {
>   if (descr_->max_repetition_level() == 0) {
> // Flat, no action
> return Status::OK();
>   }
>   
>   std::shared_ptr flat_array;
>   
>   // ARROW-3762(wesm): If inout_array is a chunked array, we reject as 
> this is
>   // not yet implemented
>   if (inout_array->kind() == Datum::CHUNKED_ARRAY) {
> if (inout_array->chunked_array()->num_chunks() > 1) {
>   return Status::NotImplemented(
> "Nested data conversions not implemented for "
> "chunked array outputs");{code}
>  
> This appears to happen in the callstack of 
> ColumnReader::ColumnReaderImpl::NextBatch 
> and it appears to be provoked by 
> [this|https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/arrow/record_reader.cc#L604]
>  constant:
> {code:java}
> template <>     
> void TypedRecordReader::InitializeBuilder() {     
>   // Maximum of 16MB chunks     
>   constexpr int32_t kBinaryChunksize = 1 << 24;     
>   DCHECK_EQ(descr_->physical_type(), Type::BYTE_ARRAY);       
>   builder_.reset(
> new::arrow::internal::ChunkedBinaryBuilder(kBinaryChunksize, pool_));  }  
>  {code}
> Which appears to imply that the column chunk data, if larger than 
> kBinaryChunksize (hardcoded to 16MB), is returned as a Datum::CHUNKED_ARRAY 
> of more than one (16MB) chunk, which ultimately leads to the 
> Status::NotImplemented error.
> We have no influence over what data we ingest, we have some influence in how 
> we flatten it and we need to tune our row_group_size to something sensibly 
> larger than 16MB. 
> We see no obvious workaround for this and so we need to ask (1) if the 
> above diagnosis appears to be correct, (2) whether people see any sensible 
> workarounds, and (3) whether there is an imminent intention to fix this in the 
> Arrow community and, if not, how difficult it would be to fix this (in case we 
> can afford to help)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4688) [C++][Parquet] 16MB limit on (nested) column chunk prevents tuning row_group_size

2019-03-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797103#comment-16797103
 ] 

Uwe L. Korn commented on ARROW-4688:


Code to reproduce:
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

df = pd.DataFrame({
"a": np.arange(3)
})
df['b'] = df['a'].apply(lambda x: list(map(str, range(x))))
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(table, buf)
pq.read_table(buf.getvalue()){code}

> [C++][Parquet] 16MB limit on (nested) column chunk prevents tuning 
> row_group_size
> -
>
> Key: ARROW-4688
> URL: https://issues.apache.org/jira/browse/ARROW-4688
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Remek Zajac
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: parquet
> Fix For: 0.13.0
>
>
> We are working on parquet files that involve nested lists. At most they are 
> multi-dimensional lists of simple types (never structs), but i understand, 
> for Parquet, they're still nested columns and involve repetition levels. 
> Some of these columns hold lists of rather large byte arrays (that dominate 
> the overall size of the row). When we bump the `row_group_size` to above 16MB 
> we see: 
>  
> {code:java}
> File "pyarrow/_parquet.pyx", line 700, in 
> pyarrow._parquet.ParquetReader.read_row_group
>  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented 
> for chunked array outputs{code}
>  
> I conclude it's 
> [this|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L848]
>  bit complaining:
>  
> {code:java}
> template 
> Status PrimitiveImpl::WrapIntoListArray(Datum* inout_array) {
>   if (descr_->max_repetition_level() == 0) {
>     // Flat, no action
>     return Status::OK();
>   }
>
>   std::shared_ptr<Array> flat_array;
>
>   // ARROW-3762(wesm): If inout_array is a chunked array, we reject as this is
>   // not yet implemented
>   if (inout_array->kind() == Datum::CHUNKED_ARRAY) {
>     if (inout_array->chunked_array()->num_chunks() > 1) {
>       return Status::NotImplemented(
>           "Nested data conversions not implemented for "
>           "chunked array outputs");{code}
>  
> This appears to happen in the callstack of 
> ColumnReader::ColumnReaderImpl::NextBatch 
> and it appears to be provoked by 
> [this|https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/arrow/record_reader.cc#L604]
>  constant:
> {code:java}
> template <>
> void TypedRecordReader<ByteArrayType>::InitializeBuilder() {
>   // Maximum of 16MB chunks
>   constexpr int32_t kBinaryChunksize = 1 << 24;
>   DCHECK_EQ(descr_->physical_type(), Type::BYTE_ARRAY);
>   builder_.reset(
>       new ::arrow::internal::ChunkedBinaryBuilder(kBinaryChunksize, pool_));
> }
> {code}
> Which appears to imply that the column chunk data, if larger than 
> kBinaryChunksize (hardcoded to 16MB, i.e. 1 << 24 bytes), is returned as a 
> Datum::CHUNKED_ARRAY of more than one 16MB chunk, which ultimately leads to 
> the Status::NotImplemented error.
> We have no influence over what data we ingest, we have some influence over how 
> we flatten it, and we need to tune our row_group_size to something sensibly 
> larger than 16MB.
> We see no obvious workaround for this, so we need to ask: (1) does the above 
> diagnosis appear to be correct, (2) do people see any sensible workarounds, and 
> (3) is there an imminent intention to fix this in the Arrow community and, if 
> not, how difficult would it be to fix (in case we can afford to help)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4935) Errors from jemalloc when building pyarrow from source on OSX and Debian

2019-03-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796998#comment-16796998
 ] 

Uwe L. Korn commented on ARROW-4935:


[~hayesgb] The {{CMakeError.log}} is fine; these are simply errors from checks 
where we test your system for features that it doesn't have. No need to worry 
about this. {{Configuring incomplete, errors occurred!}} indicates that CMake 
had errors and should have displayed them somewhere in the command line output 
(CMake doesn't stop at the first error, so the message may be further up).

> Errors from jemalloc when building pyarrow from source on OSX and Debian
> 
>
> Key: ARROW-4935
> URL: https://issues.apache.org/jira/browse/ARROW-4935
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: OSX, Debian, Python==3.6.7
>Reporter: Gregory Hayes
>Priority: Critical
>  Labels: build, newbie
>
> My attempts to build pyarrow from source are failing. I've set up the conda 
> environment using the provided development instructions, and have tried this 
> on both Debian and OSX. When I run CMake in debug mode on OSX, the output is:
> {code:java}
> -- Building using CMake version: 3.14.0
> -- Arrow version: 0.13.0 (full: '0.13.0-SNAPSHOT')
> -- clang-tidy not found
> -- clang-format not found
> -- infer found at /usr/local/bin/infer
> -- Using ccache: /usr/local/bin/ccache
> -- Found cpplint executable at 
> /Users/Greg/documents/repos/arrow/cpp/build-support/cpplint.py
> -- Compiler command: env LANG=C 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>  -v
> -- Compiler version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
> Target: x86_64-apple-darwin18.2.0
> Thread model: posix
> InstalledDir: 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
> -- Compiler id: AppleClang
> Selected compiler clang 4.1.0svn
> -- Arrow build warning level: CHECKIN
> Configured for DEBUG build (set with cmake 
> -DCMAKE_BUILD_TYPE={release,debug,...})
> -- Build Type: DEBUG
> -- BOOST_VERSION: 1.67.0
> -- BROTLI_VERSION: v0.6.0
> -- CARES_VERSION: 1.15.0
> -- DOUBLE_CONVERSION_VERSION: v3.1.1
> -- FLATBUFFERS_VERSION: v1.10.0
> -- GBENCHMARK_VERSION: v1.4.1
> -- GFLAGS_VERSION: v2.2.0
> -- GLOG_VERSION: v0.3.5
> -- GRPC_VERSION: v1.18.0
> -- GTEST_VERSION: 1.8.1
> -- JEMALLOC_VERSION: 17c897976c60b0e6e4f4a365c751027244dada7a
> -- LZ4_VERSION: v1.8.3
> -- ORC_VERSION: 1.5.4
> -- PROTOBUF_VERSION: v3.6.1
> -- RAPIDJSON_VERSION: v1.1.0
> -- RE2_VERSION: 2018-10-01
> -- SNAPPY_VERSION: 1.1.3
> -- THRIFT_VERSION: 0.11.0
> -- ZLIB_VERSION: 1.2.8
> -- ZSTD_VERSION: v1.3.7
> -- Boost version: 1.68.0
> -- Found the following Boost libraries:
> --   regex
> --   system
> --   filesystem
> -- Boost include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Boost libraries: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib/Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_system_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_system.dylib
> Added shared library dependency boost_filesystem_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_filesystem.dylib
> Added shared library dependency boost_regex_shared: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libboost_regex.dylib
> Added static library dependency double-conversion_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- double-conversion include dir: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- double-conversion static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libdouble-conversion.a
> -- GFLAGS_HOME: /Users/Greg/anaconda3/envs/pyarrow-dev
> -- GFlags include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- GFlags static library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> Added static library dependency gflags_static: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libgflags.a
> -- RapidJSON include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Found the Flatbuffers library: 
> /Users/Greg/anaconda3/envs/pyarrow-dev/lib/libflatbuffers.a
> -- Flatbuffers include dir: /Users/Greg/anaconda3/envs/pyarrow-dev/include
> -- Flatbuffers compiler: /Users/Greg/anaconda3/envs/pyarrow-dev/bin/flatc
> Added static library dependency jemalloc_static: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a
> Added shared library dependency jemalloc_shared: 
> /Users/Greg/documents/repos/arrow/cpp/build/jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc.dylib
> -- Found hdfs.h at: 

[jira] [Resolved] (ARROW-3208) [C++] Segmentation fault when casting dictionary to numeric with nullptr valid_bitmap

2019-03-20 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-3208.

Resolution: Fixed

Issue resolved by pull request 3978
[https://github.com/apache/arrow/pull/3978]

> [C++] Segmentation fault when casting dictionary to numeric with nullptr 
> valid_bitmap 
> --
>
> Key: ARROW-3208
> URL: https://issues.apache.org/jira/browse/ARROW-3208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.9.0
> Environment: Ubuntu 16.04 LTS; System76 Oryx Pro
>Reporter: Ying Wang
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
>  # Create a partitioned dataset with the following code:
> ```python
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({ 'one': [-1, 10, 2.5, 100, 1000, 1, 29.2], 'two': [-1, 10, 
> 2, 100, 1000, 1, 11], 'three': [0, 0, 0, 0, 0, 0, 0] })
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/home/yingw787/misc/example_dataset', 
> partition_cols=['one', 'two'])
> ```
>  # Create a Parquet file from a PyArrow Table created from the partitioned 
> Parquet dataset:
> ```python
> import pyarrow.parquet as pq
> table = pq.ParquetDataset('/path/to/dataset').read()
> pq.write_table(table, '/path/to/example.parquet')
> ```
> EXPECTED:
>  * Successful write
> GOT:
>  * Segmentation fault
> Issue reference on GitHub mirror: https://github.com/apache/arrow/issues/2511



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4697) [C++] Add URI parsing facility

2019-03-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796925#comment-16796925
 ] 

Uwe L. Korn commented on ARROW-4697:


Only resolve issues, don't close them. We need this during our release process.

> [C++] Add URI parsing facility
> --
>
> Key: ARROW-4697
> URL: https://issues.apache.org/jira/browse/ARROW-4697
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This is a prerequisite for ARROW-4651.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4697) [C++] Add URI parsing facility

2019-03-20 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-4697.

Resolution: Fixed

> [C++] Add URI parsing facility
> --
>
> Key: ARROW-4697
> URL: https://issues.apache.org/jira/browse/ARROW-4697
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> This is a prerequisite for ARROW-4651.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4356) [CI] Add integration (docker) test for turbodbc

2019-03-19 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796220#comment-16796220
 ] 

Uwe L. Korn commented on ARROW-4356:


I have an integration test running locally but it fails due to code changes. It 
requires [https://github.com/blue-yonder/turbodbc/pull/205], 
[https://github.com/apache/arrow/pull/3980], and adapting the includes in 
turbodbc for test-util.h. For the latter I would start working once the 
version macro PR is merged.

> [CI] Add integration (docker) test for turbodbc
> ---
>
> Key: ARROW-4356
> URL: https://issues.apache.org/jira/browse/ARROW-4356
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.13.0
>
>
> We regularly break our API so that {{turbodbc}} needs to make minor changes 
> to support the new Arrow version. We should set up a small integration test to 
> check before a release that {{turbodbc}} can easily upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4568) [C++] Add version macros to headers

2019-03-19 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-4568:
---
Fix Version/s: (was: 0.14.0)
   0.13.0

> [C++] Add version macros to headers
> ---
>
> Key: ARROW-4568
> URL: https://issues.apache.org/jira/browse/ARROW-4568
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Lawrence Chan
>Assignee: Uwe L. Korn
>Priority: Minor
> Fix For: 0.13.0
>
>
> It would be useful to have compile-time macros in the headers specifying the 
> major/minor/patch versions, so that users can more easily maintain code that 
> can be built with a range of Arrow versions.
> Other nice-to-haves:
> - Maybe a "combiner" func that basically spits out the value as an 
> easy-to-compare integer, e.g. 12000 for 0.12.0 or something.
> - Git hash
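
A minimal sketch of what such macros and the "combiner" value could look like (hypothetical names, file path, and numbers, not the actual Arrow headers):
{code:java}
// arrow/util/version.h (hypothetical; would be generated by the build system)
#define ARROW_VERSION_MAJOR 0
#define ARROW_VERSION_MINOR 13
#define ARROW_VERSION_PATCH 0

// Single easy-to-compare integer: 0.12.0 -> 12000, 0.13.0 -> 13000
#define ARROW_VERSION (ARROW_VERSION_MAJOR * 1000000 + \
                       ARROW_VERSION_MINOR * 1000 + \
                       ARROW_VERSION_PATCH)

// Downstream code (e.g. turbodbc) could then guard API differences:
#if ARROW_VERSION >= 13000
// use the 0.13.0+ API
#else
// fall back to the older API
#endif
{code}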



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

