[jira] [Assigned] (ARROW-8509) GArrowRecordBatch <-> GArrowBuffer conversion functions

2020-04-27 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-8509:
---

Assignee: Kouhei Sutou  (was: Tanveer)

> GArrowRecordBatch <-> GArrowBuffer conversion functions
> ---
>
> Key: ARROW-8509
> URL: https://issues.apache.org/jira/browse/ARROW-8509
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Tanveer
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hi All,
> I am working on integrating two programs, both of which use the Plasma API. 
> For this purpose, I need to convert RecordBatches to Buffers to transfer 
> them to Plasma.
> I have created GArrowRecordBatch <-> GArrowBuffer conversion functions which 
> work for me locally, but I am not sure if I have adopted the correct 
> approach. I want them to be integrated into c_glib. Could you please check 
> these functions and update/accept the pull request?
>  
> https://github.com/apache/arrow/pull/6963
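For reference, a minimal pyarrow sketch of the same round trip. This is the 
Python analog, not the proposed GLib API (the GLib functions in the PR would 
wrap the equivalent C++ IPC serialization), and it assumes a pyarrow version 
where {{RecordBatch.serialize}} and {{pa.ipc.read_record_batch}} are available:

{code:python}
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

# RecordBatch -> Buffer: IPC-serialize the batch, e.g. to put it into Plasma.
buf = batch.serialize()

# Buffer -> RecordBatch: the schema is not part of the serialized batch and
# must be transferred separately.
restored = pa.ipc.read_record_batch(buf, batch.schema)
assert restored.equals(batch)
{code}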





[jira] [Updated] (ARROW-8610) [Rust] DivideByZero when running arrow crate when simd feature is disabled

2020-04-27 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8610:
-
Summary: [Rust] DivideByZero when running arrow crate when simd feature is 
disabled  (was: DivideByZero when running arrow crate when simd feature is 
disabled)

> [Rust] DivideByZero when running arrow crate when simd feature is disabled
> --
>
> Key: ARROW-8610
> URL: https://issues.apache.org/jira/browse/ARROW-8610
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: R. Tyler Croy
>Priority: Major
>
> This is reproducible when running without simd features, and also when 
> compiling on an {{aarch64}} machine.
>  
> {{% cargo test --no-default-features}}
>  
> {code:java}
> failures: 
> compute::kernels::arithmetic::tests::test_primitive_array_divide_with_nulls 
> stdout 
> thread 
> 'compute::kernels::arithmetic::tests::test_primitive_array_divide_with_nulls' 
> panicked at 'called `Result::unwrap()` on an `Err` value: DivideByZero', 
> src/libcore/result.rs:1187:5
> failures:
> 
> compute::kernels::arithmetic::tests::test_primitive_array_divide_with_nulls
> test result: FAILED. 312 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out 
> {code}
>  
> I tried to address this issue myself, and it looks like the {{divide}} 
> function without the {{simd}} feature doesn't work properly; something is up 
> with {{math_op}}, but I don't understand this well enough.
>  
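A minimal sketch in Python (not the arrow crate's Rust) of the invariant the 
failing test exercises: a null-aware divide kernel must skip invalid slots, 
because the values buffer behind a null slot can legitimately contain 0:

{code:python}
def divide(values_a, values_b, valid):
    # valid[i] is False for null slots; their entries in values_b may be 0.
    out = []
    for a, b, ok in zip(values_a, values_b, valid):
        if not ok:
            out.append(None)  # null in, null out; never touch the divisor
            continue
        if b == 0:
            raise ZeroDivisionError("DivideByZero")
        out.append(a // b)
    return out

# The second slot is null and its divisor buffer holds 0; a kernel that
# divides every slot unconditionally would raise DivideByZero here.
assert divide([15, 0], [5, 0], [True, False]) == [3, None]
{code}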





[jira] [Commented] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly

2020-04-27 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094192#comment-17094192
 ] 

Joris Van den Bossche commented on ARROW-7076:
--

[~ManthanAdmane] you will need to give more details (the exact commands you 
ran, the full output, which version you are installing, which platform, ...)

> `pip install pyarrow` with python 3.8 fail with message : Could not build 
> wheels for pyarrow which use PEP 517 and cannot be installed directly
> ---
>
> Key: ARROW-7076
> URL: https://issues.apache.org/jira/browse/ARROW-7076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Ubuntu 19.10 / Python 3.8.0
>Reporter: Fabien
>Priority: Minor
>
> When I install pyarrow in Python 3.7.5 with `pip install pyarrow`, it works.
> However, with Python 3.8.0 it fails with the following error:
> {noformat}
> 14:06 $ pip install pyarrow
> Collecting pyarrow
>  Using cached 
> https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz
>  Installing build dependencies ... done
>  Getting requirements to build wheel ... done
>  Preparing wheel metadata ... done
> Collecting numpy>=1.14
>  Using cached 
> https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl
> Collecting six>=1.0.0
>  Using cached 
> https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 
> /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py
>  build_wheel /tmp/tmp4gpyu82j
>  cwd: /tmp/pip-install-cj5ucedq/pyarrow
>  Complete output (490 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.linux-x86_64-3.8
>  creating build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/json.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/feather.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/types.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/fs.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/csv.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/__init__.py -> build/lib.linux-x86_64-3.8/pyarrow
>  creating build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_strategies.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_array.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_tensor.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_json.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_cython.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_deprecations.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/conftest.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_memory.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_io.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/pandas_examples.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_compute.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/util

[jira] [Created] (ARROW-8610) DivideByZero when running arrow crate when simd feature is disabled

2020-04-27 Thread R. Tyler Croy (Jira)
R. Tyler Croy created ARROW-8610:


 Summary: DivideByZero when running arrow crate when simd feature 
is disabled
 Key: ARROW-8610
 URL: https://issues.apache.org/jira/browse/ARROW-8610
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: R. Tyler Croy


This is reproducible when running without simd features, and also when 
compiling on an {{aarch64}} machine.

 

{{% cargo test --no-default-features}}

 
{code:java}
failures: 
compute::kernels::arithmetic::tests::test_primitive_array_divide_with_nulls 
stdout 
thread 
'compute::kernels::arithmetic::tests::test_primitive_array_divide_with_nulls' 
panicked at 'called `Result::unwrap()` on an `Err` value: DivideByZero', 
src/libcore/result.rs:1187:5
failures:

compute::kernels::arithmetic::tests::test_primitive_array_divide_with_nulls
test result: FAILED. 312 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out 
{code}
 

I tried to address this issue myself, and it looks like the {{divide}} 
function without the {{simd}} feature doesn't work properly; something is up 
with {{math_op}}, but I don't understand this well enough.

 





[jira] [Updated] (ARROW-8609) [C++]orc JNI bridge crashed on null arrow buffer

2020-04-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8609:
--
Labels: pull-request-available  (was: )

> [C++]orc JNI bridge crashed on null arrow buffer
> 
>
> Key: ARROW-8609
> URL: https://issues.apache.org/jira/browse/ARROW-8609
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yuan Zhou
>Assignee: Yuan Zhou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L278-L281
> We should check whether the arrow buffer is null and pass the right value to 
> the constructor.
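A minimal sketch of the guard being added, shown here in Python with a 
hypothetical {{buffer_parts}} helper (the actual fix lives in the C++ JNI 
wrapper):

{code:python}
import pyarrow as pa

def buffer_parts(buf):
    # Map a possibly-null buffer to (address, size) without dereferencing it:
    # a null buffer must become (0, 0) instead of crashing the bridge.
    if buf is None:
        return 0, 0
    return buf.address, buf.size

assert buffer_parts(None) == (0, 0)
assert buffer_parts(pa.py_buffer(b"abc"))[1] == 3
{code}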





[jira] [Updated] (ARROW-8609) [C++]orc JNI bridge crashed on null arrow buffer

2020-04-27 Thread Yuan Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuan Zhou updated ARROW-8609:
-
Summary: [C++]orc JNI bridge crashed on null arrow buffer  (was: orc JNI 
bridge crashed on null arrow buffer)

> [C++]orc JNI bridge crashed on null arrow buffer
> 
>
> Key: ARROW-8609
> URL: https://issues.apache.org/jira/browse/ARROW-8609
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yuan Zhou
>Assignee: Yuan Zhou
>Priority: Major
>
> https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L278-L281
> We should check whether the arrow buffer is null and pass the right value to 
> the constructor.





[jira] [Created] (ARROW-8609) orc JNI bridge crashed on null arrow buffer

2020-04-27 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-8609:


 Summary: orc JNI bridge crashed on null arrow buffer
 Key: ARROW-8609
 URL: https://issues.apache.org/jira/browse/ARROW-8609
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yuan Zhou
Assignee: Yuan Zhou


https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L278-L281
We should check whether the arrow buffer is null and pass the right value to 
the constructor.





[jira] [Created] (ARROW-8608) Update vendored mpark/variant.h to latest to fix NVCC compilation issues

2020-04-27 Thread Mark Harris (Jira)
Mark Harris created ARROW-8608:
--

 Summary: Update vendored mpark/variant.h to latest to fix NVCC 
compilation issues
 Key: ARROW-8608
 URL: https://issues.apache.org/jira/browse/ARROW-8608
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Mark Harris


Arrow vendors [https://github.com/mpark/variant]. The vendored version is from 
2019 and has issues compiling with NVCC (the CUDA compiler). Projects like 
cuDF that depend on Arrow are stuck on a version of Arrow from before this 
dependency was added because they can't compile.

mpark/variant's two most recent PRs are fixes for NVCC compilation.

We would like to move cuDF forward to Arrow 0.16 or 0.17 soon, so it would be 
great to update the version of mpark/variant in Arrow.





[jira] [Updated] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd

2020-04-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8556:
---
Summary: [R] zstd symbol not found if there are multiple installations of 
zstd  (was: [R] zstd symbol not found on Ubuntu 19.10)

> [R] zstd symbol not found if there are multiple installations of zstd
> -
>
> Key: ARROW-8556
> URL: https://issues.apache.org/jira/browse/ARROW-8556
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.17.0
> Environment: Ubuntu 19.10
> R 3.6.1
>Reporter: Karl Dunkle Werner
>Priority: Major
>
> I would like to install the `arrow` R package on my Ubuntu 19.10 system. 
> Prebuilt binaries are unavailable, and I want to enable compression, so I set 
> the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks 
> like the package is able to compile, but can't be loaded. I'm able to install 
> correctly if I don't set the {{LIBARROW_MINIMAL}} variable.
> Here's the error I get:
> {code:java}
> ** testing if installed package can be loaded from temporary location
> Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
> = DLLpath, ...):
>  unable to load shared object 
> '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
>   ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: 
> ZSTD_initCStream
> Error: loading failed
> Execution halted
> ERROR: loading failed
> * removing ‘~/.R/3.6/arrow’
> {code}
>  





[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10

2020-04-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094017#comment-17094017
 ] 

Neal Richardson commented on ARROW-8556:


Thanks, that makes some sense. Googling the original undefined symbol error 
message, all I found were issues caused by having multiple versions of zstd 
installed (e.g. https://github.com/facebook/wangle/issues/73), but since you 
said you didn't have it installed before, I didn't think it was relevant.

I wish there were a good way to make it not fail in that case, i.e. to make 
sure that if zstd is built from source in the R package build, that version 
gets picked up. Maybe someone else will have an idea on how to achieve that.

> [R] zstd symbol not found on Ubuntu 19.10
> -
>
> Key: ARROW-8556
> URL: https://issues.apache.org/jira/browse/ARROW-8556
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.17.0
> Environment: Ubuntu 19.10
> R 3.6.1
>Reporter: Karl Dunkle Werner
>Priority: Major
>
> I would like to install the `arrow` R package on my Ubuntu 19.10 system. 
> Prebuilt binaries are unavailable, and I want to enable compression, so I set 
> the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks 
> like the package is able to compile, but can't be loaded. I'm able to install 
> correctly if I don't set the {{LIBARROW_MINIMAL}} variable.
> Here's the error I get:
> {code:java}
> ** testing if installed package can be loaded from temporary location
> Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
> = DLLpath, ...):
>  unable to load shared object 
> '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
>   ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: 
> ZSTD_initCStream
> Error: loading failed
> Execution halted
> ERROR: loading failed
> * removing ‘~/.R/3.6/arrow’
> {code}
>  





[jira] [Commented] (ARROW-8586) [R] installation failure on CentOS 7

2020-04-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094013#comment-17094013
 ] 

Neal Richardson commented on ARROW-8586:


Thanks. A few thoughts. Apologies if this is confusing; we're going deep in 
some different directions:

* {{ARROW_R_DEV=true}} is for installation verbosity only, not for crash 
reporting, and from the install logs you shared, I can see that apparently 
thrift failed to build/install. I haven't seen it fail in that specific way 
before, I don't think. If you want to go deeper into the Matrix with me, try 
reinstalling with {{ARROW_R_DEV=true}} and 
{{EXTRA_CMAKE_ARGS="-DARROW_VERBOSE_THIRDPARTY_BUILD=ON"}} (but unset 
{{LIBARROW_BINARY}} so that we build from source) and maybe we'll see what's 
going on there.
* Alternatively, you could try installing {{thrift}} from {{yum}}, though I'm 
not sure that they have a new enough version (0.11 is the minimum).
* Odd that you got a segfault when reading a parquet file. Is there anything 
special about how your system is configured (compilers, toolchains, etc.) 
beyond a vanilla CentOS 7 environment? The centos-7 binary is built on a base 
centos image with this Dockerfile: 
https://github.com/ursa-labs/arrow-r-nightly/blob/master/linux/yum.Dockerfile 
So maybe see if setting {{CC=/usr/bin/gcc CXX=/usr/bin/g++}} before installing 
the R package (with {{LIBARROW_BINARY=centos-7}}) helps.
* If that makes a difference, I wonder if 
https://github.com/ursa-labs/arrow-r-nightly/blob/master/linux/yum.Dockerfile#L18-L20
 is what is needed to get the thrift compilation when building everything from 
source to work.
* Thanks for the {{lsb_release}} output. That confirms my suspicion about why 
it did not try to download the centos-7 binary to begin with (though obviously 
that's not desirable unless we get it not to segfault for you).

> [R] installation failure on CentOS 7
> 
>
> Key: ARROW-8586
> URL: https://issues.apache.org/jira/browse/ARROW-8586
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.17.0
> Environment: CentOS 7
>Reporter: Hei
>Priority: Major
>
> Hi,
> I am trying to install arrow via RStudio, but it seems like it is not 
> working: after I installed the package, it kept asking me to run 
> arrow::install_arrow() even after I did:
> {code}
> > install.packages("arrow")
> Installing package into ‘/home/hc/R/x86_64-redhat-linux-gnu-library/3.6’
> (as ‘lib’ is unspecified)
> trying URL 'https://cran.rstudio.com/src/contrib/arrow_0.17.0.tar.gz'
> Content type 'application/x-gzip' length 242534 bytes (236 KB)
> ==
> downloaded 236 KB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Successfully retrieved C++ source
> *** Building C++ libraries
>  cmake
>  arrow  
> ./configure: line 132: cd: libarrow/arrow-0.17.0/lib: Not a directory
> - NOTE ---
> After installation, please run arrow::install_arrow()
> for help installing required runtime libraries
> -
> ** libs
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG  
> -I"/home/hc/R/x86_64-redhat-linux-gnu-library/3.6/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c array.cpp -o array.o
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG  
> -I"/home/hc/R/x86_64-redhat-linux-gnu-library/3.6/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c array_from_vector.cpp -o 
> array_from_vector.o
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG  
> -I"/home/hc/R/x86_64-redhat-linux-gnu-library/3.6/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c array_to_vector.cpp -o 
> array_to_vector.o
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG  
> -I"/home/hc/R/x86_64-redhat-linux-gnu-library/3.6/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c arraydata.cpp -o arraydata.o
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG  
> -I"/home/hc/R/x86_64-redhat-linux-gnu-library/3.6/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wal

[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10

2020-04-27 Thread Karl Dunkle Werner (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094007#comment-17094007
 ] 

Karl Dunkle Werner commented on ARROW-8556:
---

Update: I remembered dev packages.

I had libzstd-dev 1.4.3 installed as a dependency of libgdal-dev. After 
uninstalling it, I was able to install arrow. Logs are below.

{noformat}
* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Generating code with data-raw/codegen.R
Fatal error: cannot open file 'data-raw/codegen.R': No such file or directory
trying URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip'
Error in download.file(from_url, to_file, quiet = quietly) : 
  cannot open URL 
'https://dl.bintray.com/ursalabs/arrow-r/libarrow/src/arrow-0.17.0.zip'
trying URL 
'https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/arrow-0.17.0/apache-arrow-0.17.0.tar.gz'
Content type 'application/x-gzip' length 6460548 bytes (6.2 MB)
==
downloaded 6.2 MB
*** Successfully retrieved C++ source
*** Building C++ libraries
rm: cannot remove 'src/*.o': No such file or directory
*** Building with MAKEFLAGS=  -j4 
 arrow with 
SOURCE_DIR=/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp 
BUILD_DIR=/tmp/Rtmp9loTsA/file46055b57ae53 DEST_DIR=libarrow/arrow-0.17.0 
CMAKE=/usr/bin/cmake 
++ pwd
+ : /tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow
+ : /tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp
+ : /tmp/Rtmp9loTsA/file46055b57ae53
+ : libarrow/arrow-0.17.0
+ : /usr/bin/cmake
++ cd /tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp
++ pwd
+ SOURCE_DIR=/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp
++ mkdir -p libarrow/arrow-0.17.0
++ cd libarrow/arrow-0.17.0
++ pwd
+ DEST_DIR=/tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow/libarrow/arrow-0.17.0
+ '[' '' = '' ']'
+ which ninja
+ CMAKE_GENERATOR=Ninja
+ '[' false = false ']'
+ ARROW_JEMALLOC=ON
+ ARROW_WITH_BROTLI=ON
+ ARROW_WITH_BZ2=ON
+ ARROW_WITH_LZ4=ON
+ ARROW_WITH_SNAPPY=ON
+ ARROW_WITH_ZLIB=ON
+ ARROW_WITH_ZSTD=ON
+ mkdir -p /tmp/Rtmp9loTsA/file46055b57ae53
+ pushd /tmp/Rtmp9loTsA/file46055b57ae53
/tmp/Rtmp9loTsA/file46055b57ae53 /tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow
+ /usr/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON 
-DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO 
-DARROW_FILESYSTEM=ON -DARROW_JEMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON 
-DARROW_WITH_BROTLI=ON -DARROW_WITH_BZ2=ON -DARROW_WITH_LZ4=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_LIBDIR=lib 
-DCMAKE_INSTALL_PREFIX=/tmp/Rtmppd6Y9y/R.INSTALL45dd4a4e6ea2/arrow/libarrow/arrow-0.17.0
 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON 
-DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=ON 
-DOPENSSL_USE_STATIC_LIBS=ON -G Ninja 
/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp
-- Building using CMake version: 3.13.4
-- The C compiler identification is GNU 9.2.1
-- The CXX compiler identification is GNU 9.2.1
-- Check for working C compiler: /usr/lib/ccache/cc
-- Check for working C compiler: /usr/lib/ccache/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/lib/ccache/c++
-- Check for working CXX compiler: /usr/lib/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 0.17.0 (full: '0.17.0')
-- Arrow SO version: 17 (full: 17.0.0)
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") 
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
-- Found Python3: /usr/bin/python3.7 (found version "3.7.5") found components:  
Interpreter 
-- Using ccache: /usr/bin/ccache
-- Found cpplint executable at 
/tmp/Rtmp9loTsA/file46054fc6ee7f/apache-arrow-0.17.0/cpp/build-support/cpplint.py
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Success
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Success
-- Arrow build warning level: PRODUCTION
Using ld linker
Configured for RELEASE build (set with cmake 
-DCMAKE_BUILD_TYPE={release,debug,...})
-- Build Type: RELEASE
-- Using AUTO approach to find dependencies
-- ARROW_AWSSDK_BUILD_VERSION: 1.7.160
-- ARROW_BOOST_BUILD_VERSION: 1.71.0
-- ARROW_BROTLI_BUILD_VERSION: v1.0.7
-- ARROW_BZIP2_BUILD_

[jira] [Commented] (ARROW-8556) [R] zstd symbol not found on Ubuntu 19.10

2020-04-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094002#comment-17094002
 ] 

Neal Richardson commented on ARROW-8556:


Any ideas [~fsaintjacques] [~bkietz]?

> [R] zstd symbol not found on Ubuntu 19.10
> -
>
> Key: ARROW-8556
> URL: https://issues.apache.org/jira/browse/ARROW-8556
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.17.0
> Environment: Ubuntu 19.10
> R 3.6.1
>Reporter: Karl Dunkle Werner
>Priority: Major
>
> I would like to install the `arrow` R package on my Ubuntu 19.10 system. 
> Prebuilt binaries are unavailable, and I want to enable compression, so I set 
> the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks 
> like the package is able to compile, but can't be loaded. I'm able to install 
> correctly if I don't set the {{LIBARROW_MINIMAL}} variable.
> Here's the error I get:
> {code:java}
> ** testing if installed package can be loaded from temporary location
> Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath 
> = DLLpath, ...):
>  unable to load shared object 
> '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
>   ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: 
> ZSTD_initCStream
> Error: loading failed
> Execution halted
> ERROR: loading failed
> * removing ‘~/.R/3.6/arrow’
> {code}
>  





[jira] [Commented] (ARROW-7076) `pip install pyarrow` with python 3.8 fail with message : Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly

2020-04-27 Thread Manthan Admane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093996#comment-17093996
 ] 

Manthan Admane commented on ARROW-7076:
---

I don't know how or why, but it gives me "ERROR: Could not build wheels for 
pyarrow which use PEP 517 and cannot be installed directly" when I am using a 
virtual env on Anaconda, yet it succeeds when I am using the (base) default 
environment.

> `pip install pyarrow` with python 3.8 fail with message : Could not build 
> wheels for pyarrow which use PEP 517 and cannot be installed directly
> ---
>
> Key: ARROW-7076
> URL: https://issues.apache.org/jira/browse/ARROW-7076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Ubuntu 19.10 / Python 3.8.0
>Reporter: Fabien
>Priority: Minor
>
> When I install pyarrow in Python 3.7.5 with `pip install pyarrow`, it works.
> However, with Python 3.8.0 it fails with the following error:
> {noformat}
> 14:06 $ pip install pyarrow
> Collecting pyarrow
>  Using cached 
> https://files.pythonhosted.org/packages/e0/e6/d14b4a2b54ef065b1a2c576537abe805c1af0c94caef70d365e2d78fc528/pyarrow-0.15.1.tar.gz
>  Installing build dependencies ... done
>  Getting requirements to build wheel ... done
>  Preparing wheel metadata ... done
> Collecting numpy>=1.14
>  Using cached 
> https://files.pythonhosted.org/packages/3a/8f/f9ee25c0ae608f86180c26a1e35fe7ea9d71b473ea7f54db20759ba2745e/numpy-1.17.3-cp38-cp38-manylinux1_x86_64.whl
> Collecting six>=1.0.0
>  Using cached 
> https://files.pythonhosted.org/packages/65/26/32b8464df2a97e6dd1b656ed26b2c194606c16fe163c695a992b36c11cdf/six-1.13.0-py2.py3-none-any.whl
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (PEP 517) ... error
>  ERROR: Command errored out with exit status 1:
>  command: /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/bin/python3.8 
> /home/fabien/.local/share/virtualenvs/pipenv-_eZlsrLD/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py
>  build_wheel /tmp/tmp4gpyu82j
>  cwd: /tmp/pip-install-cj5ucedq/pyarrow
>  Complete output (490 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.linux-x86_64-3.8
>  creating build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/json.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/feather.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/types.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/fs.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/csv.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/compat.py -> build/lib.linux-x86_64-3.8/pyarrow
>  copying pyarrow/__init__.py -> build/lib.linux-x86_64-3.8/pyarrow
>  creating build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_strategies.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_array.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_tensor.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_json.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_cython.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_deprecations.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/conftest.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_memory.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_io.py -> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/pandas_examples.py -> 
> build/lib.linux-x86_64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_compute.py ->

[jira] [Resolved] (ARROW-8607) [R][CI] Unbreak builds following R 4.0 release

2020-04-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-8607.

Resolution: Fixed

Issue resolved by pull request 7047
[https://github.com/apache/arrow/pull/7047]

> [R][CI] Unbreak builds following R 4.0 release
> --
>
> Key: ARROW-8607
> URL: https://issues.apache.org/jira/browse/ARROW-8607
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Just a tourniquet to get master passing again while I work on ARROW-8604.





[jira] [Issue Comment Deleted] (ARROW-5634) [C#] ArrayData.NullCount should be a property

2020-04-27 Thread Zachary Gramana (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zachary Gramana updated ARROW-5634:
---
Comment: was deleted

(was: [GitHub Pull Request #7032|https://github.com/apache/arrow/pull/7032] now 
properly computes the `NullCount` value and passes it to the `ArrayData` ctor 
in the `Slice` method.

`NullCount` should remain a readonly field, however, in order to preserve 
immutability.)

> [C#] ArrayData.NullCount should be a property 
> --
>
> Key: ARROW-5634
> URL: https://issues.apache.org/jira/browse/ARROW-5634
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C#
>Reporter: Prashanth Govindarajan
>Priority: Major
>
> ArrayData.NullCount should be a property so that it can be computed when 
> necessary: for example, after Slice(), NullCount is -1 and needs to be computed.
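A minimal sketch of the lazily computed property this asks for, in Python 
rather than C#; counting {{None}} values here is a hypothetical stand-in for 
scanning the real validity bitmap:

{code:python}
class ArrayData:
    def __init__(self, values, null_count=-1):
        self.values = values
        self._null_count = null_count  # -1 means "unknown, compute on demand"

    @property
    def null_count(self):
        if self._null_count < 0:
            # Stand-in for counting unset bits in the validity bitmap.
            self._null_count = sum(v is None for v in self.values)
        return self._null_count

# After Slice(), the count starts out unknown (-1) and is computed on access.
sliced = ArrayData([1, None, 3], null_count=-1)
assert sliced.null_count == 1
{code}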





[jira] [Resolved] (ARROW-8606) [CI] Don't trigger all builds on a change to any file in ci/

2020-04-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8606.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7046
[https://github.com/apache/arrow/pull/7046]

> [CI] Don't trigger all builds on a change to any file in ci/
> 
>
> Key: ARROW-8606
> URL: https://issues.apache.org/jira/browse/ARROW-8606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-8607) [R][CI] Unbreak builds following R 4.0 release

2020-04-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8607:
--
Labels: pull-request-available  (was: )

> [R][CI] Unbreak builds following R 4.0 release
> --
>
> Key: ARROW-8607
> URL: https://issues.apache.org/jira/browse/ARROW-8607
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Just a tourniquet to get master passing again while I work on ARROW-8604.





[jira] [Resolved] (ARROW-8603) [Documentation] Fix Sphinx doxygen comment

2020-04-27 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-8603.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7045
[https://github.com/apache/arrow/pull/7045]

> [Documentation] Fix Sphinx doxygen comment
> --
>
> Key: ARROW-8603
> URL: https://issues.apache.org/jira/browse/ARROW-8603
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/runs/622393532]





[jira] [Assigned] (ARROW-8603) [Documentation] Fix Sphinx doxygen comment

2020-04-27 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-8603:
---

Assignee: Francois Saint-Jacques

> [Documentation] Fix Sphinx doxygen comment
> --
>
> Key: ARROW-8603
> URL: https://issues.apache.org/jira/browse/ARROW-8603
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/runs/622393532]





[jira] [Created] (ARROW-8607) [R][CI] Unbreak builds following R 4.0 release

2020-04-27 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8607:
--

 Summary: [R][CI] Unbreak builds following R 4.0 release
 Key: ARROW-8607
 URL: https://issues.apache.org/jira/browse/ARROW-8607
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Just a tourniquet to get master passing again while I work on ARROW-8604.





[jira] [Resolved] (ARROW-7610) [Java] Finish support for 64 bit int allocations

2020-04-27 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7610.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6323
[https://github.com/apache/arrow/pull/6323]

> [Java] Finish support for 64 bit int allocations 
> -
>
> Key: ARROW-7610
> URL: https://issues.apache.org/jira/browse/ARROW-7610
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>
> 1.  Add an allocator capable of allocating more than 2GB of data.
> 2.  Do an end-to-end round trip on a larger vector/record batch size.





[jira] [Commented] (ARROW-7873) [Python] Segfault in pandas version 1.0.1, read_parquet after creating a clickhouse odbc connection

2020-04-27 Thread Matt Calder (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093843#comment-17093843
 ] 

Matt Calder commented on ARROW-7873:


No, we have so far kept pandas at version 0.25.3. We're transitioning away from 
the odbc driver to our own in-house version, so the issue may be moot for us.

Matt

> [Python] Segfault in pandas version 1.0.1, read_parquet after creating a 
> clickhouse odbc connection
> ---
>
> Key: ARROW-7873
> URL: https://issues.apache.org/jira/browse/ARROW-7873
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: Ubuntu 18.04
>Reporter: Matt Calder
>Priority: Minor
> Attachments: foo.pkl, foo.pq
>
>
> [I posted this issue to the pandas 
> github|https://github.com/pandas-dev/pandas/issues/31981].
> We get a segfault when making a call to pd.read_parquet after having made a 
> connection to clickhouse via odbc. Like so,
> {code:python}
> import pyodbc
> import pandas as pd
> con_str = "Driver=libclickhouseodbc.so;url=http://clickhouse/query;timeout=600"
> with pyodbc.connect(con_str, autocommit=True) as con:
>     pass
> df = pd.DataFrame({'A': [1,1,1], 'B': ['a', 'b', 'c']})
> df.to_parquet('/tmp/foo.pq')
> # This line core dumps:
> pd.read_parquet('/tmp/foo.pq')
> {code}
> This happens with pandas version 1.0.1 but not with pandas 0.25.3. Here's a 
> stacktrace:
> {code:java}
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x77a24801 in __GI_abort () at abort.c:79
> #2  0x763c1957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #3  0x763c7ab6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #4  0x763c7af1 in std::terminate() () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #5  0x763c7d24 in __cxa_throw () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #6  0x763c6a52 in __cxa_bad_cast () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #7  0x764131ec in std::__cxx11::collate const& 
> std::use_facet >(std::locale const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #8  0x7fffbe4b8279 in std::__cxx11::basic_string std::char_traits, std::allocator > 
> std::__cxx11::regex_traits::transform_primary(char const*, 
> char const*) const () from /usr/local/lib/libparquet.so.100
> #9  0x7fffbe4bd71c in 
> std::__detail::_BracketMatcher, false, 
> false>::_M_ready() () from /usr/local/lib/libparquet.so.100
> #10 0x7fffbe4bda9e in void 
> std::__detail::_Compiler 
> >::_M_insert_character_class_matcher() () from 
> /usr/local/lib/libparquet.so.100
> #11 0x7fffbe4c0569 in 
> std::__detail::_Compiler >::_M_atom() () 
> from /usr/local/lib/libparquet.so.100
> #12 0x7fffbe4c0ad8 in 
> std::__detail::_Compiler >::_M_alternative() 
> () from /usr/local/lib/libparquet.so.100
> #13 0x7fffbe4c0a43 in 
> std::__detail::_Compiler >::_M_alternative() 
> () from /usr/local/lib/libparquet.so.100
> #14 0x7fffbe4c0d1c in 
> std::__detail::_Compiler >::_M_disjunction() 
> () from /usr/local/lib/libparquet.so.100
> #15 0x7fffbe4c1469 in 
> std::__detail::_Compiler >::_Compiler(char 
> const*, char const*, std::locale const&, 
> std::regex_constants::syntax_option_type) () from 
> /usr/local/lib/libparquet.so.100
> #16 0x7fffbe4a93d1 in 
> parquet::ApplicationVersion::ApplicationVersion(std::__cxx11::basic_string  std::char_traits, std::allocator > const&) () from 
> /usr/local/lib/libparquet.so.100
> #17 0x7fffbe4c1c03 in 
> parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl(void const*, 
> unsigned int*, std::shared_ptr const&) () from 
> /usr/local/lib/libparquet.so.100
> #18 0x7fffbe4a9e62 in parquet::FileMetaData::FileMetaData(void const*, 
> unsigned int*, std::shared_ptr const&) () from 
> /usr/local/lib/libparquet.so.100
> #19 0x7fffbe4a9ec2 in parquet::FileMetaData::Make(void const*, unsigned 
> int*, std::shared_ptr const&) () from 
> /usr/local/lib/libparquet.so.100
> #20 0x7fffbe48acaf in 
> parquet::SerializedFile::ParseUnencryptedFileMetadata(std::shared_ptr
>  const&, long, long, std::shared_ptr*, unsigned int*, unsigned 
> int*) () from /usr/local/lib/libparquet.so.100
> #21 0x7fffbe492d75 in parquet::SerializedFile::ParseMetaData() () from 
> /usr/local/lib/libparquet.so.100
> #22 0x7fffbe48d8f8 in 
> parquet::ParquetFileReader::Contents::Open(std::shared_ptr,
>  parquet::ReaderProperties const&, std::shared_ptr) () 
> from /usr/local/lib/libparquet.so.100
> #23 0x7fffbe48e598 in 
> parquet::ParquetFileReader::Open(std::shared_ptr,
>  parquet::ReaderProperties const&, std::shared_ptr) () 
> from /usr/local/lib/libparquet.so.100
> #24 0x7fffbe3a89bd in 
>

[jira] [Updated] (ARROW-8606) [CI] Don't trigger all builds on a change to any file in ci/

2020-04-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8606:
--
Labels: pull-request-available  (was: )

> [CI] Don't trigger all builds on a change to any file in ci/
> 
>
> Key: ARROW-8606
> URL: https://issues.apache.org/jira/browse/ARROW-8606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-8606) [CI] Don't trigger all builds on a change to any file in ci/

2020-04-27 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8606:
--

 Summary: [CI] Don't trigger all builds on a change to any file in 
ci/
 Key: ARROW-8606
 URL: https://issues.apache.org/jira/browse/ARROW-8606
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Neal Richardson








[jira] [Commented] (ARROW-8605) [R] Add support for brotli to Windows build

2020-04-27 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093766#comment-17093766
 ] 

Neal Richardson commented on ARROW-8605:


You are correct. We do not build the Windows package with brotli. Here is what 
we do build with: 
https://github.com/apache/arrow/blob/master/ci/scripts/PKGBUILD#L28-L31

If you are interested in adding it, ARROW-6960 is the right model to follow.

> [R] Add support for brotli to Windows build
> ---
>
> Key: ARROW-8605
> URL: https://issues.apache.org/jira/browse/ARROW-8605
> Project: Apache Arrow
>  Issue Type: New Feature
>Affects Versions: 0.17.0
>Reporter: Hei
>Priority: Major
>
> Hi,
> My friend installed arrow and tried to open a parquet file with the brotli 
> codec. But then he got an error when calling read_parquet("my.parquet") on 
> Windows:
> {code}
> Error in parquet__arrow__FileReader__ReadTable(self) :
>IOError: NotImplemented: Brotli codec support not built
> {code}
> It sounds similar to ARROW-6960.
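For reference, a brotli-compressed Parquet file like the one described can be 
produced with pyarrow (a minimal sketch; reading it back requires an Arrow 
build compiled with brotli support):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
# A build without brotli support fails to read this back with
# "Brotli codec support not built".
pq.write_table(table, "my.parquet", compression="brotli")
{code}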





[jira] [Updated] (ARROW-8605) [R] Add support for brotli to Windows build

2020-04-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8605:
---
Summary: [R] Add support for brotli to Windows build  (was: Missing brotli 
Support in R Package?)

> [R] Add support for brotli to Windows build
> ---
>
> Key: ARROW-8605
> URL: https://issues.apache.org/jira/browse/ARROW-8605
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: Hei
>Priority: Major
>
> Hi,
> My friend installed arrow and tried to open a parquet file with the brotli 
> codec. But then he got an error when calling read_parquet("my.parquet") on 
> Windows:
> {code}
> Error in parquet__arrow__FileReader__ReadTable(self) :
>IOError: NotImplemented: Brotli codec support not built
> {code}
> It sounds similar to ARROW-6960.





[jira] [Updated] (ARROW-8605) [R] Add support for brotli to Windows build

2020-04-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8605:
---
Issue Type: New Feature  (was: Bug)

> [R] Add support for brotli to Windows build
> ---
>
> Key: ARROW-8605
> URL: https://issues.apache.org/jira/browse/ARROW-8605
> Project: Apache Arrow
>  Issue Type: New Feature
>Affects Versions: 0.17.0
>Reporter: Hei
>Priority: Major
>
> Hi,
> My friend installed arrow and tried to open a parquet file with the brotli 
> codec. But then he got an error when calling read_parquet("my.parquet") on 
> Windows:
> {code}
> Error in parquet__arrow__FileReader__ReadTable(self) :
>IOError: NotImplemented: Brotli codec support not built
> {code}
> It sounds similar to ARROW-6960.





[jira] [Resolved] (ARROW-7681) [Rust] Explicitly seeking a BufReader will discard the internal buffer

2020-04-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved ARROW-7681.
-
Resolution: Fixed

Issue resolved by pull request 6949
[https://github.com/apache/arrow/pull/6949]

> [Rust] Explicitly seeking a BufReader will discard the internal buffer
> --
>
> Key: ARROW-7681
> URL: https://issues.apache.org/jira/browse/ARROW-7681
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Max Burke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> This behavior was observed in the Parquet Rust file reader 
> (parquet/src/util/io.rs).
>  
> Pull request: [https://github.com/apache/arrow/pull/6280]
>  
> From the Rust documentation for BufReader:
>  
> "Seeking always discards the internal buffer, even if the seek position would 
> otherwise fall within it. This guarantees that calling {{.into_inner()}} 
> immediately after a seek yields the underlying reader at the same position."
>  
> [https://doc.rust-lang.org/std/io/struct.BufReader.html#impl-Seek]





[jira] [Updated] (ARROW-8074) [C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset?

2020-04-27 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8074:
-
Fix Version/s: 1.0.0

> [C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset?
> 
>
> Key: ARROW-8074
> URL: https://issues.apache.org/jira/browse/ARROW-8074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The current {{pyarrow.parquet.read_table}}/{{ParquetFile}} can work with 
> buffer (reader) objects (file-like objects, pyarrow.Buffer, 
> pyarrow.BufferReader) as input when dealing with single files. This 
> functionality is for example being used by pandas and kartothek (in addition 
> to being extensively used in our own tests as well).
> While we could keep the old implementation to handle single files (which is 
> different from the ParquetDataset logic), there are also some advantages of 
> being able to handle this in the Datasets API.  
> For example, this would enable the filtering functionality of the datasets 
> API for this single-file buffer use case as well, which would be a nice 
> enhancement (currently, {{read_table}} does not support {{filters}} for 
> single files, which is e.g. why kartothek implements this itself).
> Would this be possible to support?
> The {{arrow::dataset::FileSource}} already has PATH and BUFFER enum types 
> (https://github.com/apache/arrow/blob/08f8bff05af37921ff1e5a2b630ce1e7ec1c0ede/cpp/src/arrow/dataset/file_base.h#L46-L49),
>  so it seems in principle possible to create a FileSource (for a 
> FileSystemDataset / FileFragment) from a buffer instead of from a path?
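A minimal pyarrow sketch contrasting the single-file buffer path that works 
today with the dataset-style call this issue asks to support (the {{filters}} 
line is the unsupported case, left commented out):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Parquet "file".
sink = pa.BufferOutputStream()
pq.write_table(pa.table({"a": [1, 2, 3]}), sink)
buf = sink.getvalue()

# Works today: reading a single file from a buffer reader.
table = pq.read_table(pa.BufferReader(buf))

# Not supported at the time of this issue: filters on a single-file buffer.
# Routing buffers through the Datasets API would enable this:
# pq.read_table(pa.BufferReader(buf), filters=[("a", ">", 1)])
{code}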





[jira] [Updated] (ARROW-8596) [C++][Dataset] Add test case to check if all essential properties are preserved once ScannerBuilder::Project is called

2020-04-27 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8596:
--
Labels: dataset  (was: )

> [C++][Dataset] Add test case to check if all essential properties are 
> preserved once ScannerBuilder::Project is called
> -
>
> Key: ARROW-8596
> URL: https://issues.apache.org/jira/browse/ARROW-8596
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Hongze Zhang
>Assignee: Hongze Zhang
>Priority: Major
>  Labels: dataset
>
> This is a follow-up of ARROW-8499. It's better to provide a test around 
> ScanOptions::ReplaceSchema to check if all properties other than the 
> projector are copied when the function is called. 





[jira] [Comment Edited] (ARROW-8394) Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-04-27 Thread Phil Price (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093616#comment-17093616
 ] 

Phil Price edited comment on ARROW-8394 at 4/27/20, 3:15 PM:
-

This also happens with `0.17.0`. From what I can tell there are two things 
which I would consider adoption blockers:
 # apache-arrow compiles with typescript@3.5, but typescript@3.6+ has stricter 
type-checks and cannot consume the output. A couple of cases I've found that 
are invalid: 
 ## Extension of static `new()` methods with a differently typed parameter 
order (Column derives from Chunked but changes the signature of `new` to take 
`string | Field` as the first param instead of `Data`)
 ## Not passing template types to fields (e.g. `foo: Schema` to `foo: 
Schema`) 
 # Attempting to upgrade `apache-arrow` to typescript@3.8 (or 3.9-beta) falls 
afoul of a typescript compiler bug 
([https://github.com/microsoft/TypeScript/issues/35186]); from my 
understanding, the compiler is on an error path anyway but bails when trying 
to print error detail. This makes the upgrade difficult as it's a case of 
whack-a-mole. 


was (Author: pprice):
This also happens with `0.17.0`. From what I can tell there are two things 
which I would consider adoption blockers:
 # `apache-arrow` compiles with typescript@3.5, but typescript@3.6+ has 
stricter type-checks and cannot consume the output. A couple of cases I've 
found that are invalid: 
 ## Extension of static `new()` methods with a differently typed parameter 
order (Column derives from Chunked but changes the signature of `new` to take 
`string | Field` as the first param instead of `Data`)
 ## Not passing template types to fields (e.g. `foo: Schema` to `foo: 
Schema`) 
 # Attempting to upgrade `apache-arrow` to typescript@3.8 (or 3.9-beta) falls 
afoul of a typescript compiler bug 
([https://github.com/microsoft/TypeScript/issues/35186]); from my 
understanding, the compiler is on an error path anyway but bails when trying 
to print error detail. This makes the upgrade difficult as it's a case of 
whack-a-mole. 

> Typescript compiler errors for arrow d.ts files, when using es2015-esm package
> --
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> I am attempting to use apache-arrow within a web application, but the 
> typescript compiler throws the following errors in some of arrow's .d.ts 
> files:
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8394) Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-04-27 Thread Phil Price (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093616#comment-17093616
 ] 

Phil Price commented on ARROW-8394:
---

This also happens with `0.17.0`. From what I can tell, there are two things 
which I would consider adoption blockers:
 # `apache-arrow` compiles with typescript@3.5, but typescript@3.6+ has 
stricter type checks and cannot consume the output. A couple of cases I've 
found that are invalid: 
 ## Extension of static `new()` methods with a different typed parameter order 
(Column derives from Chunked but changes the signature of `new` to take 
`string | Field` as the first param instead of `Data`).
 ## Not passing template types to fields (e.g. `foo: Schema` to `foo: 
Schema`). 
 # Attempting to upgrade `apache-arrow` to typescript@3.8 (or 3.9-beta) falls 
afoul of a typescript compiler bug 
([https://github.com/microsoft/TypeScript/issues/35186]); from my 
understanding, the compiler is on an error path anyway but bails when trying 
to print error detail. This makes the upgrade difficult, as it's a case of 
whack-a-mole. 

> Typescript compiler errors for arrow d.ts files, when using es2015-esm package
> --
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7706) [Python] saving a dataframe to the same partitioned location silently doubles the data

2020-04-27 Thread Gregory Hayes (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093604#comment-17093604
 ] 

Gregory Hayes edited comment on ARROW-7706 at 4/27/20, 3:09 PM:


One additional thought -- Spark 2.3 also implements the ability to dynamically 
overwrite only the partitions that have changed, as described 
[here|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-dynamic-partition-inserts.html], 
by using 
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic"), which 
allows incrementally updating data in a pipeline; a rough emulation is 
sketched below.

It would be awesome to see this in the roadmap eventually.
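
For reference, a minimal sketch of what "dynamic" partition overwrite could 
look like on top of today's pandas/pyarrow API. The helper name and the 
Hive-style directory layout assumption are illustrative, not part of any 
library:
{code:python}
import shutil
from pathlib import Path

def to_parquet_dynamic_overwrite(df, root, partition_col):
    # Drop only the partition directories that the new frame touches,
    # then rewrite them -- mimicking Spark's "dynamic" overwrite mode.
    root = Path(root)
    for value in df[partition_col].unique():
        part_dir = root / f"{partition_col}={value}"  # Hive-style layout
        if part_dir.exists():
            shutil.rmtree(part_dir)
    df.to_parquet(root, partition_cols=[partition_col], engine="pyarrow")
{code}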


was (Author: hayesgb):
One additional thought -- Spark 2.3 also implements the ability to dynamically 
overwrite only partitions that have changed, as described 
[here|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-dynamic-partition-inserts.html]. 
It would be awesome to see this in the roadmap eventually.

> [Python] saving a dataframe to the same partitioned location silently doubles 
> the data
> --
>
> Key: ARROW-7706
> URL: https://issues.apache.org/jira/browse/ARROW-7706
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Tsvika Shapira
>Priority: Major
>  Labels: dataset, parquet
>
> When a user saves a dataframe:
> {code:python}
> df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')
> {code}
> it will create sub-directories named "{{a=val1}}", "{{a=val2}}" in 
> {{/tmp/table}}. Each of them will contain one (or more?) parquet files with 
> random filenames.
> If a user runs the same command again, the code will use the existing 
> sub-directories, but with different (random) filenames. As a result, any data 
> loaded from this folder will be wrong - each row will be present twice.
> For example, when using
> {code:python}
> df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')  # second time
> df2 = pd.read_parquet('/tmp/table', engine='pyarrow')
> assert len(df1) == len(df2)  # raises an error{code}
> This is a subtle change in the data that can pass unnoticed.
>  
> I would expect the code to prevent the user from using a non-empty 
> destination as a partitioned target. An overwrite flag could also be useful; 
> a sketch of such a guard follows below.
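
A minimal sketch of the requested guard, wrapping the pandas writer. The 
wrapper name and its overwrite flag are illustrative, not an existing 
pandas/pyarrow API:
{code:python}
import os
import shutil

def to_parquet_partitioned(df, root, partition_cols, overwrite=False):
    # Refuse to write into a non-empty target unless the caller
    # explicitly opts in with overwrite=True.
    if os.path.isdir(root) and os.listdir(root):
        if not overwrite:
            raise FileExistsError(
                f"{root} is not empty; pass overwrite=True to replace it")
        shutil.rmtree(root)
    df.to_parquet(root, partition_cols=partition_cols, engine="pyarrow")
{code}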



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7706) [Python] saving a dataframe to the same partitioned location silently doubles the data

2020-04-27 Thread Gregory Hayes (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093604#comment-17093604
 ] 

Gregory Hayes commented on ARROW-7706:
--

One additional thought -- Spark 2.3 also implements the ability to dynamically 
overwrite only partitions that have changed, as described 
[here|https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-dynamic-partition-inserts.html]. 
It would be awesome to see this in the roadmap eventually.

> [Python] saving a dataframe to the same partitioned location silently doubles 
> the data
> --
>
> Key: ARROW-7706
> URL: https://issues.apache.org/jira/browse/ARROW-7706
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Tsvika Shapira
>Priority: Major
>  Labels: dataset, parquet
>
> When a user saves a dataframe:
> {code:python}
> df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')
> {code}
> it will create sub-directories named "{{a=val1}}", "{{a=val2}}" in 
> {{/tmp/table}}. Each of them will contain one (or more?) parquet files with 
> random filenames.
> If a user runs the same command again, the code will use the existing 
> sub-directories, but with different (random) filenames. As a result, any data 
> loaded from this folder will be wrong - each row will be present twice.
> For example, when using
> {code:python}
> df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')  # second time
> df2 = pd.read_parquet('/tmp/table', engine='pyarrow')
> assert len(df1) == len(df2)  # raises an error{code}
> This is a subtle change in the data that can pass unnoticed.
>  
> I would expect the code to prevent the user from using a non-empty 
> destination as a partitioned target. An overwrite flag could also be useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8605) Missing brotli Support in R Package?

2020-04-27 Thread Hei (Jira)
Hei created ARROW-8605:
--

 Summary: Missing brotli Support in R Package?
 Key: ARROW-8605
 URL: https://issues.apache.org/jira/browse/ARROW-8605
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.17.0
Reporter: Hei


Hi,

My friend installed arrow and tried to open a parquet file that uses the 
brotli codec, but he got this error when calling read_parquet("my.parquet") on 
Windows:
{code}
Error in parquet__arrow__FileReader__ReadTable(self) :
   IOError: NotImplemented: Brotli codec support not built
{code}

It sounds similar to ARROW-6960.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8604) [R] Windows compilation failure

2020-04-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-8604:
--

Assignee: Neal Richardson

> [R] Windows compilation failure
> ---
>
> Key: ARROW-8604
> URL: https://issues.apache.org/jira/browse/ARROW-8604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Francois Saint-Jacques
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> [Master|https://github.com/apache/arrow/runs/622393526] fails to compile. 
> The C++ cmake build is not using the same 
> [compiler|https://github.com/apache/arrow/runs/622393526#step:8:807] as the 
> R extension 
> [compiler|https://github.com/apache/arrow/runs/622393526#step:11:141].
> {code:java}
> // Files installed here
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libarrow.a (deflated 85%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libarrow_dataset.a (deflated 82%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libparquet.a (deflated 84%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libsnappy.a (deflated 61%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libthrift.a (deflated 81%)
> // Linker is using `-L` paths under lib-8.3.0, but the files went to lib-4.9.3
> C:/Rtools/mingw_32/bin/g++ -shared -s -static-libgcc -o arrow.dll tmp.def 
> array.o array_from_vector.o array_to_vector.o arraydata.o arrowExports.o 
> buffer.o chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o io.o json.o memorypool.o 
> message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o schema.o symbols.o table.o threadpool.o 
> -L../windows/arrow-0.17.0.9000/lib-8.3.0/i386 
> -L../windows/arrow-0.17.0.9000/lib/i386 -lparquet -larrow_dataset -larrow 
> -lthrift -lsnappy -lz -lzstd -llz4 -lcrypto -lcrypt32 -lws2_32 
> -LC:/R/bin/i386 -lR
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lparquet
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -larrow_dataset
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -larrow
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lthrift
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lsnappy
> {code}
>  
> C++ developers, rejoice, this is almost the end of gcc-4.9.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8604) [R] Update CI to use R 4.0

2020-04-27 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8604:
---
Summary: [R] Update CI to use R 4.0  (was: [R] Windows compilation failure)

> [R] Update CI to use R 4.0
> --
>
> Key: ARROW-8604
> URL: https://issues.apache.org/jira/browse/ARROW-8604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Francois Saint-Jacques
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> [Master|https://github.com/apache/arrow/runs/622393526] fails to compile. 
> The C++ cmake build is not using the same 
> [compiler|https://github.com/apache/arrow/runs/622393526#step:8:807] as the 
> R extension 
> [compiler|https://github.com/apache/arrow/runs/622393526#step:11:141].
> {code:java}
> // Files installed here
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libarrow.a (deflated 85%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libarrow_dataset.a (deflated 82%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libparquet.a (deflated 84%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libsnappy.a (deflated 61%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libthrift.a (deflated 81%)
> // Linker is using `-L` paths under lib-8.3.0, but the files went to lib-4.9.3
> C:/Rtools/mingw_32/bin/g++ -shared -s -static-libgcc -o arrow.dll tmp.def 
> array.o array_from_vector.o array_to_vector.o arraydata.o arrowExports.o 
> buffer.o chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o io.o json.o memorypool.o 
> message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o schema.o symbols.o table.o threadpool.o 
> -L../windows/arrow-0.17.0.9000/lib-8.3.0/i386 
> -L../windows/arrow-0.17.0.9000/lib/i386 -lparquet -larrow_dataset -larrow 
> -lthrift -lsnappy -lz -lzstd -llz4 -lcrypto -lcrypt32 -lws2_32 
> -LC:/R/bin/i386 -lR
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lparquet
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -larrow_dataset
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -larrow
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lthrift
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lsnappy
> {code}
>  
> C++ developers, rejoice, this is almost the end of gcc-4.9.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7251) [Python] Open CSVs with different encodings

2020-04-27 Thread Sascha Hofmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093550#comment-17093550
 ] 

Sascha Hofmann commented on ARROW-7251:
---

Our current setup allows users to upload CSV files, which we parse to Arrow. 
Right now, we do not do any preprocessing of the CSV files, so we can receive 
arbitrarily weird files. I will propose your "recode on the fly" suggestion.

 

> [Python] Open CSVs with different encodings
> ---
>
> Key: ARROW-7251
> URL: https://issues.apache.org/jira/browse/ARROW-7251
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Sascha Hofmann
>Priority: Major
>
> I would like to open UTF-16 encoded CSVs (among others) without 
> preprocessing them first in, say, pandas. Is there already a way to do this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8604) [R] Windows compilation failure

2020-04-27 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8604:
--
Description: 
[Master|https://github.com/apache/arrow/runs/622393526] fails to compile. The 
C++ cmake build is not using the same 
[compiler|https://github.com/apache/arrow/runs/622393526#step:8:807] as the R 
extension 
[compiler|https://github.com/apache/arrow/runs/622393526#step:11:141].


{code:java}
// Files installed here
  adding: arrow-0.17.0.9000/lib-4.9.3/i386/libarrow.a (deflated 85%)
  adding: arrow-0.17.0.9000/lib-4.9.3/i386/libarrow_dataset.a (deflated 82%)
  adding: arrow-0.17.0.9000/lib-4.9.3/i386/libparquet.a (deflated 84%)
  adding: arrow-0.17.0.9000/lib-4.9.3/i386/libsnappy.a (deflated 61%)
  adding: arrow-0.17.0.9000/lib-4.9.3/i386/libthrift.a (deflated 81%)

// Linker is using `-L` paths under lib-8.3.0, but the files went to lib-4.9.3
C:/Rtools/mingw_32/bin/g++ -shared -s -static-libgcc -o arrow.dll tmp.def 
array.o array_from_vector.o array_to_vector.o arraydata.o arrowExports.o 
buffer.o chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
expression.o feather.o field.o filesystem.o io.o json.o memorypool.o message.o 
parquet.o py-to-r.o recordbatch.o recordbatchreader.o recordbatchwriter.o 
schema.o symbols.o table.o threadpool.o 
-L../windows/arrow-0.17.0.9000/lib-8.3.0/i386 
-L../windows/arrow-0.17.0.9000/lib/i386 -lparquet -larrow_dataset -larrow 
-lthrift -lsnappy -lz -lzstd -llz4 -lcrypto -lcrypt32 -lws2_32 -LC:/R/bin/i386 
-lR
C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
 cannot find -lparquet
C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
 cannot find -larrow_dataset
C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
 cannot find -larrow
C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
 cannot find -lthrift
C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
 cannot find -lsnappy
{code}
 
C++ developers, rejoice, this is almost the end of gcc-4.9.

 

  was:Master fails to compile.


> [R] Windows compilation failure
> ---
>
> Key: ARROW-8604
> URL: https://issues.apache.org/jira/browse/ARROW-8604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> [Master|https://github.com/apache/arrow/runs/622393526] fails to compile. 
> The C++ cmake build is not using the same 
> [compiler|https://github.com/apache/arrow/runs/622393526#step:8:807] as the 
> R extension 
> [compiler|https://github.com/apache/arrow/runs/622393526#step:11:141].
> {code:java}
> // Files installed here
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libarrow.a (deflated 85%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libarrow_dataset.a (deflated 82%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libparquet.a (deflated 84%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libsnappy.a (deflated 61%)
>   adding: arrow-0.17.0.9000/lib-4.9.3/i386/libthrift.a (deflated 81%)
> // Linker is using `-L` paths under lib-8.3.0, but the files went to lib-4.9.3
> C:/Rtools/mingw_32/bin/g++ -shared -s -static-libgcc -o arrow.dll tmp.def 
> array.o array_from_vector.o array_to_vector.o arraydata.o arrowExports.o 
> buffer.o chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o 
> expression.o feather.o field.o filesystem.o io.o json.o memorypool.o 
> message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o 
> recordbatchwriter.o schema.o symbols.o table.o threadpool.o 
> -L../windows/arrow-0.17.0.9000/lib-8.3.0/i386 
> -L../windows/arrow-0.17.0.9000/lib/i386 -lparquet -larrow_dataset -larrow 
> -lthrift -lsnappy -lz -lzstd -llz4 -lcrypto -lcrypt32 -lws2_32 
> -LC:/R/bin/i386 -lR
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lparquet
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -larrow_dataset
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -larrow
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lthrift
> C:/Rtools/mingw_32/bin/../lib/gcc/i686-w64-mingw32/4.9.3/../../../../i686-w64-mingw32/bin/ld.exe:
>  cannot find -lsnappy
> {code}
>  
> C++ developers, rejoice, this is almost the end of gcc-4.9.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8604) [R] Windows compilation failure

2020-04-27 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8604:
--
Description: Master fails to compile.

> [R] Windows compilation failure
> ---
>
> Key: ARROW-8604
> URL: https://issues.apache.org/jira/browse/ARROW-8604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Master fails to compile.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8604) [R] Windows compilation failure

2020-04-27 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8604:
-

 Summary: [R] Windows compilation failure
 Key: ARROW-8604
 URL: https://issues.apache.org/jira/browse/ARROW-8604
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Francois Saint-Jacques
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8603) [Documentation] Fix Sphinx doxygen comment

2020-04-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8603:
--
Labels: pull-request-available  (was: )

> [Documentation] Fix Sphinx doxygen comment
> --
>
> Key: ARROW-8603
> URL: https://issues.apache.org/jira/browse/ARROW-8603
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Reporter: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See [https://github.com/apache/arrow/runs/622393532]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8603) [Documentation] Fix Sphinx doxygen comment

2020-04-27 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8603:
-

 Summary: [Documentation] Fix Sphinx doxygen comment
 Key: ARROW-8603
 URL: https://issues.apache.org/jira/browse/ARROW-8603
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation
Reporter: Francois Saint-Jacques


See [https://github.com/apache/arrow/runs/622393532]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8602) [CMake] Fix ws2_32 link issue when cross-compiling on Linux

2020-04-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8602:
--
Labels: pull-request-available  (was: )

> [CMake] Fix ws2_32 link issue when cross-compiling on Linux
> ---
>
> Key: ARROW-8602
> URL: https://issues.apache.org/jira/browse/ARROW-8602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8602) [CMake] Fix ws2_32 link issue when cross-compiling on Linux

2020-04-27 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8602.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7001
[https://github.com/apache/arrow/pull/7001]

> [CMake] Fix ws2_32 link issue when cross-compiling on Linux
> ---
>
> Key: ARROW-8602
> URL: https://issues.apache.org/jira/browse/ARROW-8602
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Trivial
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8602) [CMake] Fix ws2_32 link issue when cross-compiling on Linux

2020-04-27 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8602:
-

 Summary: [CMake] Fix ws2_32 link issue when cross-compiling on 
Linux
 Key: ARROW-8602
 URL: https://issues.apache.org/jira/browse/ARROW-8602
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7251) [Python] Open CSVs with different encodings

2020-04-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093465#comment-17093465
 ] 

Antoine Pitrou edited comment on ARROW-7251 at 4/27/20, 12:45 PM:
--

cc [~saschahofmann] Is there anything that prevents you from recoding the CSV 
file before opening it with Arrow?
(what are your constraints? performance? file size?)

With some care, you could even implement a file-like object in Python that 
recodes data to UTF-8 on the fly. It should be accepted by {{csv.read_csv}}.
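
For illustration, a minimal sketch of that idea, re-encoding the whole file in 
memory for simplicity (a true on-the-fly version could be built with 
codecs.StreamRecoder); the helper name is made up:
{code:python}
import io

from pyarrow import csv

def read_csv_with_encoding(path, encoding):
    # Decode with the source encoding, re-encode as UTF-8, and hand
    # pyarrow a binary file-like object it can consume directly.
    with open(path, encoding=encoding) as f:
        utf8_bytes = f.read().encode("utf-8")
    return csv.read_csv(io.BytesIO(utf8_bytes))

# e.g. table = read_csv_with_encoding("data.csv", "utf-16")
{code}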


was (Author: pitrou):
cc [~saschahofmann] Is there anything that prevents you from recoding the CSV 
file before opening it with Arrow?
(what are you constraints? performance? file size?)

With some care, you could even implement a file-like object in Python that 
recodes data to UTF-8 on the fly. It should be accepted by {{csv.read_csv}}.

> [Python] Open CSVs with different encodings
> ---
>
> Key: ARROW-7251
> URL: https://issues.apache.org/jira/browse/ARROW-7251
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Sascha Hofmann
>Priority: Major
>
> I would like to open UTF-16 encoded CSVs (among others) without 
> preprocessing them first in, say, pandas. Is there already a way to do this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7251) [Python] Open CSVs with different encodings

2020-04-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093465#comment-17093465
 ] 

Antoine Pitrou commented on ARROW-7251:
---

cc [~saschahofmann] Is there anything that prevents you from recoding the CSV 
file before opening it with Arrow?
(what are your constraints? performance? file size?)

With some care, you could even implement a file-like object in Python that 
recodes data to UTF-8 on the fly. It should be accepted by {{csv.read_csv}}.

> [Python] Open CSVs with different encodings
> ---
>
> Key: ARROW-7251
> URL: https://issues.apache.org/jira/browse/ARROW-7251
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Sascha Hofmann
>Priority: Major
>
> I would like to open UTF-16 encoded CSVs (among others) without 
> preprocessing them first in, say, pandas. Is there already a way to do this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8601) [Go][Flight] Implement Flight Writer interface

2020-04-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8601:
--
Labels: pull-request-available  (was: )

> [Go][Flight] Implement Flight Writer interface
> --
>
> Key: ARROW-8601
> URL: https://issues.apache.org/jira/browse/ARROW-8601
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Go
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8601) [Go][Flight] Implement Flight Writer interface

2020-04-27 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8601:
-

 Summary: [Go][Flight] Implement Flight Writer interface
 Key: ARROW-8601
 URL: https://issues.apache.org/jira/browse/ARROW-8601
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Go
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7251) [Python] Open CSVs with different encodings

2020-04-27 Thread Sascha Hofmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093435#comment-17093435
 ] 

Sascha Hofmann commented on ARROW-7251:
---

For us, having support for different string encodings would be amazing. That 
being said, I admit other encodings are rare/dying out, but we stumble upon 
them once in a while. Of those, I don't know how many use a BOM to identify 
their encoding. We haven't actually tried it, but we might use pandas as 
mentioned above in cases where a file has a BOM other than UTF-8's (see 
comment above). I am not sure how you did the CSV reading in pandas, but I 
assume it might not be worth going through it again. In the end, it might be 
best to force people to use UTF-8. 

> [Python] Open CSVs with different encodings
> ---
>
> Key: ARROW-7251
> URL: https://issues.apache.org/jira/browse/ARROW-7251
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Reporter: Sascha Hofmann
>Priority: Major
>
> I would like to open UTF-16 encoded CSVs (among others) without 
> preprocessing them first in, say, pandas. Is there already a way to do this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8152) [C++] IO: split large coalesced reads into smaller ones

2020-04-27 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093429#comment-17093429
 ] 

David Li commented on ARROW-8152:
-

FWIW, I'm not sure this specific task is needed anymore - I originally didn't 
realize the Parquet reader issues individual reads for each column chunk, so 
splitting large reads isn't needed. It may still help for people who have very 
large column chunks, but that can be pursued separately if it comes up.

> [C++] IO: split large coalesced reads into smaller ones
> ---
>
> Key: ARROW-8152
> URL: https://issues.apache.org/jira/browse/ARROW-8152
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Priority: Major
> Fix For: 1.0.0
>
>
> We have a facility to coalesce small reads, but remote filesystems may also 
> benefit from splitting large reads to take advantage of concurrency; a 
> sketch of the idea follows below.
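
For illustration, a minimal Python sketch of the splitting idea; the helper, 
chunk size, and file-opening callback are assumptions, not the C++ API:
{code:python}
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 * 1024 * 1024  # assumed split size of 8 MiB

def read_range_split(open_file, offset, length):
    # Break one large read into CHUNK-sized pieces and fetch them in
    # parallel; each worker opens its own handle so seeks don't clash.
    ranges = [(offset + i, min(CHUNK, length - i))
              for i in range(0, length, CHUNK)]

    def fetch(rng):
        off, n = rng
        with open_file() as f:
            f.seek(off)
            return f.read(n)

    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(fetch, ranges))

# e.g. data = read_range_split(lambda: open("part.parquet", "rb"), 0, 1 << 26)
{code}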



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK

2020-04-27 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093421#comment-17093421
 ] 

Francois Saint-Jacques commented on ARROW-8565:
---

I'm not sure if you're aware, but the AWS SDK supports selectively building 
components in the bundled library. The following should make a smaller build 
that supports S3.
{code:java}
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3;core;config;transfer"
{code}

> [C++] Static build with AWS SDK
> ---
>
> Key: ARROW-8565
> URL: https://issues.apache.org/jira/browse/ARROW-8565
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Remi Dettai
>Priority: Major
>  Labels: aws-s3, build-problem
>
> I can't find my way around the build system when using the S3 client.
> It seems that only the shared target is allowed when the S3 feature is ON. 
> The thirdparty toolchain prints:
> ??FATAL_ERROR "FIXME: Building AWS C++ SDK from source will link with wrong 
> libcrypto"??
> What is actually meant is that a static build will not work, correct? If 
> that is the case, should libarrow.a be generated at all when the S3 feature 
> is on? 
> What can be done to fix this? What does it mean that the SDK links to the 
> wrong libcrypto? Is it fixable? Or is there a way to have a static build but 
> maintain a dynamic link to a shared version of the SDK?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8586) [R] installation failure on CentOS 7

2020-04-27 Thread Hei (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092950#comment-17092950
 ] 

Hei edited comment on ARROW-8586 at 4/27/20, 11:10 AM:
---

Hi Neal,

I tried out your suggestion by setting LIBARROW_BINARY=centos-7 and then 
running install.packages("arrow") to reinstall.

Then I tried:
{code}
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

timestamp

> df <- read_parquet('/home/hc/my.10.level.20200331.2book.parquet')
{code}

And then RStudio's session crashed with a popup saying, "R Session Abort.  R 
encountered a fatal error.  The session was terminated".  There is no extra 
info in the console even when I set ARROW_R_DEV=true.

Restarting the session doesn't help -- same error popup when loading the 
parquet file.

Restarting the RStudio client doesn't help either.

Here is my RStudio's version:
{code}
> RStudio.Version()
$citation

To cite RStudio in publications use:

  RStudio Team (2020). RStudio: Integrated Development for R. RStudio, Inc., 
Boston, MA URL http://www.rstudio.com/.

A BibTeX entry for LaTeX users is

  @Manual{,
title = {RStudio: Integrated Development Environment for R},
author = {{RStudio Team}},
organization = {RStudio, Inc.},
address = {Boston, MA},
year = {2020},
url = {http://www.rstudio.com/},
  }


$mode
[1] "desktop"

$version
[1] ‘1.2.5042’

$release_name
[1] "Double Marigold"
{code}

I tried loading the same parquet file with Python 3.6 to construct a pandas 
dataframe, and it works fine.

Any idea?


was (Author: hei):
Hi Neal,

I tried out your suggestion by setting LIBARROW_BINARY=centos-7 and then 
running install.packages("arrow") to reinstall.

Then I tried:
{code}
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

timestamp

> df <- read_parquet('/home/hc/my.10.level.20200331.2book.parquet')
{code}

And then RStudio's session crashed with a popup saying, "R Session Abort.  R 
encountered a fatal error.  The session was terminated".

Restarting the session doesn't help -- same error popup when loading the 
parquet file.

Restarting the RStudio client doesn't help either.

Here is my RStudio's version:
{code}
> RStudio.Version()
$citation

To cite RStudio in publications use:

  RStudio Team (2020). RStudio: Integrated Development for R. RStudio, Inc., 
Boston, MA URL http://www.rstudio.com/.

A BibTeX entry for LaTeX users is

  @Manual{,
title = {RStudio: Integrated Development Environment for R},
author = {{RStudio Team}},
organization = {RStudio, Inc.},
address = {Boston, MA},
year = {2020},
url = {http://www.rstudio.com/},
  }


$mode
[1] "desktop"

$version
[1] ‘1.2.5042’

$release_name
[1] "Double Marigold"
{code}

I tried loading the same parquet file with Python 3.6 to construct a pandas 
dataframe, and it works fine.

Any idea?

> [R] installation failure on CentOS 7
> 
>
> Key: ARROW-8586
> URL: https://issues.apache.org/jira/browse/ARROW-8586
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.17.0
> Environment: CentOS 7
>Reporter: Hei
>Priority: Major
>
> Hi,
> I am trying to install arrow via RStudio, but it seems like it is not 
> working: after I installed the package, it kept asking me to run 
> arrow::install_arrow(), even after I did:
> {code}
> > install.packages("arrow")
> Installing package into ‘/home/hc/R/x86_64-redhat-linux-gnu-library/3.6’
> (as ‘lib’ is unspecified)
> trying URL 'https://cran.rstudio.com/src/contrib/arrow_0.17.0.tar.gz'
> Content type 'application/x-gzip' length 242534 bytes (236 KB)
> ==
> downloaded 236 KB
> * installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Successfully retrieved C++ source
> *** Building C++ libraries
>  cmake
>  arrow  
> ./configure: line 132: cd: libarrow/arrow-0.17.0/lib: Not a directory
> - NOTE ---
> After installation, please run arrow::install_arrow()
> for help installing required runtime libraries
> -
> ** libs
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG  
> -I"/home/hc/R/x86_64-redhat-linux-gnu-library/3.6/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
> -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 
> -grecord-gcc-switches   -m64 -mtune=generic  -c array.cpp -o array.o
> g++ -m64 -std=gnu++11 -I"/usr/include/R" -DNDEBUG  
> -I"/home/hc/R/x86_64-redhat-linux-gnu-library/3.6/Rcpp/include" 
> -I/usr/local/include  -fpic  -O2 -g -pipe -Wall -Wp,-D_FO

[jira] [Commented] (ARROW-8565) [C++] Static build with AWS SDK

2020-04-27 Thread Remi Dettai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093185#comment-17093185
 ] 

Remi Dettai commented on ARROW-8565:


I finally managed to make a static build, but it requires changing the root 
CMakeLists to append AWSSDK_LINK_LIBRARIES to ARROW_STATIC_LINK_LIBS. I'm not 
sure this is safe to do, because I really don't understand how the Arrow and 
AWS SDK dependencies are competing.

In the end, the static build is barely more compact than the shared one, 
because the C++ AWS SDK build is a whole adventure. I'm continuing my 
investigation to see if I can come up with something nicer, but I really lack 
some cmake expertise :)

> [C++] Static build with AWS SDK
> ---
>
> Key: ARROW-8565
> URL: https://issues.apache.org/jira/browse/ARROW-8565
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Remi Dettai
>Priority: Major
>  Labels: aws-s3, build-problem
>
> I can't find my way around the build system when using the S3 client.
> It seems that only the shared target is allowed when the S3 feature is ON. 
> The thirdparty toolchain prints:
> ??FATAL_ERROR "FIXME: Building AWS C++ SDK from source will link with wrong 
> libcrypto"??
> What is actually meant is that a static build will not work, correct? If 
> that is the case, should libarrow.a be generated at all when the S3 feature 
> is on? 
> What can be done to fix this? What does it mean that the SDK links to the 
> wrong libcrypto? Is it fixable? Or is there a way to have a static build but 
> maintain a dynamic link to a shared version of the SDK?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)