[jira] [Resolved] (ARROW-8984) [R] Revise install guides now that Windows conda package exists
[ https://issues.apache.org/jira/browse/ARROW-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-8984. - Resolution: Fixed Issue resolved by pull request 7303 [https://github.com/apache/arrow/pull/7303] > [R] Revise install guides now that Windows conda package exists > --- > > Key: ARROW-8984 > URL: https://issues.apache.org/jira/browse/ARROW-8984 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8978) [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" Valgrind warning
[ https://issues.apache.org/jira/browse/ARROW-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8978. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7302 [https://github.com/apache/arrow/pull/7302] > [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" > Valgrind warning > > > Key: ARROW-8978 > URL: https://issues.apache.org/jira/browse/ARROW-8978 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kouhei Sutou >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > https://github.com/ursa-labs/crossbow/runs/715700830#step:6:4277 > {noformat} > [ RUN ] TestCallScalarFunction.PreallocationCases > ==5357== Conditional jump or move depends on uninitialised value(s) > ==5357==at 0x51D69A6: void arrow::internal::TransferBitmap true>(unsigned char const*, long, long, long, unsigned char*) > (bit_util.cc:176) > ==5357==by 0x51CE866: arrow::internal::CopyBitmap(unsigned char const*, > long, long, unsigned char*, long, bool) (bit_util.cc:208) > ==5357==by 0x52B6325: > arrow::compute::detail::NullPropagator::PropagateSingle() (exec.cc:295) > ==5357==by 0x52B36D1: Execute (exec.cc:378) > ==5357==by 0x52B36D1: > arrow::compute::detail::PropagateNulls(arrow::compute::KernelContext*, > arrow::compute::ExecBatch const&, arrow::ArrayData*) (exec.cc:412) > ==5357==by 0x52BA7F3: ExecuteBatch (exec.cc:586) > ==5357==by 0x52BA7F3: > arrow::compute::detail::ScalarExecutor::Execute(std::vector std::allocator > const&, arrow::compute::detail::ExecListener*) > (exec.cc:542) > ==5357==by 0x52BC21F: > arrow::compute::Function::Execute(std::vector std::allocator > const&, arrow::compute::FunctionOptions > const*, arrow::compute::ExecContext*) const (function.cc:94) > ==5357==by 0x52B141C: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) > (exec.cc:937) > ==5357==by 0x52B16F2: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::ExecContext*) (exec.cc:942) > ==5357==by 0x155515: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody()::{lambda(std::__cxx11::basic_string std::char_traits, std::allocator > >)#1}::operator()(std::__cxx11::basic_string, > std::allocator >) const (exec_test.cc:756) > ==5357==by 0x156AF2: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody() > (exec_test.cc:786) > ==5357==by 0x5BE4862: void > testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357==by 0x5BDEDE2: void > testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357== > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8878) [R] try_download is confused when download.file.method isn't default
[ https://issues.apache.org/jira/browse/ARROW-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8878: -- Labels: pull-request-available (was: ) > [R] try_download is confused when download.file.method isn't default > > > Key: ARROW-8878 > URL: https://issues.apache.org/jira/browse/ARROW-8878 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: r >Reporter: Olaf >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Hello there and thanks again for this beautiful package! > I am trying to install {{arrow}} on linux and I got a few problematic > warnings during the install. My computer is behind a firewall so not all the > connections coming from rstudio are allowed. > > {code:java} > > sessionInfo() > R version 3.6.1 (2019-07-05) > Platform: x86_64-ubuntu18-linux-gnu (64-bit) > Running under: Ubuntu 18.04.4 LTS > Matrix products: default > BLAS/LAPACK: > /apps/intel/2019.1/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] MKLthreads_0.1 > loaded via a namespace (and not attached): > [1] compiler_3.6.1 tools_3.6.1 > {code} > > after running {{install.packages("arrow")}} I get > > {code:java} > > installing *source* package 'arrow' ... > ** package 'arrow' successfully unpacked and MD5 sums checked > ** using staged installation > *** Successfully retrieved C++ source > *** Proceeding without C++ dependencies > Warning message: > In unzip(tf1, exdir = src_dir) : error 1 in extracting from zip file > ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or > directory > - NOTE --- > After installation, please run arrow::install_arrow() > for help installing required runtime libraries > - > {code} > > > However, the installation ends normally. > > {code:java} > ** R > ** inst > ** byte-compile and prepare package for lazy loading > ** help > *** installing help indices > ** building package indices > ** installing vignettes > ** testing if installed package can be loaded from temporary location > ** checking absolute paths in shared objects and dynamic libraries > ** testing if installed package can be loaded from final location > ** testing if installed package keeps a record of temporary installation path > * DONE (arrow) > {code} > > So I go ahead and try to run arrow::install_arrow() and get a similar warning. > > {code:java} > installing *source* package 'arrow' ... > ** package 'arrow' successfully unpacked and MD5 sums checked > ** using staged installation > *** Successfully retrieved C++ binaries for ubuntu-18.04 > Warning messages: > 1: In file(file, "rt") : > URL > 'https://raw.githubusercontent.com/ursa-labs/arrow-r-nightly/master/linux/distro-map.csv': > status was 'Couldn't connect to server' > 2: In unzip(bin_file, exdir = dst_dir) : > error 1 in extracting from zip file > ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or > directory > - NOTE --- > After installation, please run arrow::install_arrow() > for help installing required runtime libraries > {code} > And unfortunately I cannot read any parquet file.
> {noformat} > Error in fetch(key) : lazy-load database > '/mydata/R/x86_64-ubuntu18-linux-gnu-library/3.6/arrow/help/arrow.rdb' is > corrupt{noformat} > > Could you please tell me how to fix this? Can I just copy the zip from github > and do a manual install in Rstudio? > > Thanks! > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8878) [R] try_download is confused when download.file.method isn't default
[ https://issues.apache.org/jira/browse/ARROW-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8878: --- Summary: [R] try_download is confused when download.file.method isn't default (was: [R] how to install when behind a firewall?) > [R] try_download is confused when download.file.method isn't default > > > Key: ARROW-8878 > URL: https://issues.apache.org/jira/browse/ARROW-8878 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: r >Reporter: Olaf >Priority: Major > > Hello there and thanks again for this beautiful package! > I am trying to install {{arrow}} on linux and I got a few problematic > warnings during the install. My computer is behind a firewall so not all the > connections coming from rstudio are allowed. > > {code:java} > > sessionInfo() > R version 3.6.1 (2019-07-05) > Platform: x86_64-ubuntu18-linux-gnu (64-bit) > Running under: Ubuntu 18.04.4 LTS > Matrix products: default > BLAS/LAPACK: > /apps/intel/2019.1/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] MKLthreads_0.1 > loaded via a namespace (and not attached): > [1] compiler_3.6.1 tools_3.6.1 > {code} > > after running {{install.packages("arrow")}} I get > > {code:java} > > installing *source* package 'arrow' ... > ** package 'arrow' successfully unpacked and MD5 sums checked > ** using staged installation > *** Successfully retrieved C++ source > *** Proceeding without C++ dependencies > Warning message: > In unzip(tf1, exdir = src_dir) : error 1 in extracting from zip file > ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or > directory > - NOTE --- > After installation, please run arrow::install_arrow() > for help installing required runtime libraries > - > {code} > > > However, the installation ends normally. > > {code:java} > ** R > ** inst > ** byte-compile and prepare package for lazy loading > ** help > *** installing help indices > ** building package indices > ** installing vignettes > ** testing if installed package can be loaded from temporary location > ** checking absolute paths in shared objects and dynamic libraries > ** testing if installed package can be loaded from final location > ** testing if installed package keeps a record of temporary installation path > * DONE (arrow) > {code} > > So I go ahead and try to run arrow::install_arrow() and get a similar warning. > > {code:java} > installing *source* package 'arrow' ... > ** package 'arrow' successfully unpacked and MD5 sums checked > ** using staged installation > *** Successfully retrieved C++ binaries for ubuntu-18.04 > Warning messages: > 1: In file(file, "rt") : > URL > 'https://raw.githubusercontent.com/ursa-labs/arrow-r-nightly/master/linux/distro-map.csv': > status was 'Couldn't connect to server' > 2: In unzip(bin_file, exdir = dst_dir) : > error 1 in extracting from zip file > ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or > directory > - NOTE --- > After installation, please run arrow::install_arrow() > for help installing required runtime libraries > {code} > And unfortunately I cannot read any parquet file.
> {noformat} > Error in fetch(key) : lazy-load database > '/mydata/R/x86_64-ubuntu18-linux-gnu-library/3.6/arrow/help/arrow.rdb' is > corrupt{noformat} > > Could you please tell me how to fix this? Can I just copy the zip from github > and do a manual install in Rstudio? > > Thanks! > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8984) [R] Revise install guides now that Windows conda package exists
Neal Richardson created ARROW-8984: -- Summary: [R] Revise install guides now that Windows conda package exists Key: ARROW-8984 URL: https://issues.apache.org/jira/browse/ARROW-8984 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8984) [R] Revise install guides now that Windows conda package exists
[ https://issues.apache.org/jira/browse/ARROW-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8984: -- Labels: pull-request-available (was: ) > [R] Revise install guides now that Windows conda package exists > --- > > Key: ARROW-8984 > URL: https://issues.apache.org/jira/browse/ARROW-8984 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8982) [CI] Remove allow_failures for s390x in TravisCI
[ https://issues.apache.org/jira/browse/ARROW-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8982: -- Labels: pull-request-available (was: ) > [CI] Remove allow_failures for s390x in TravisCI > > > Key: ARROW-8982 > URL: https://issues.apache.org/jira/browse/ARROW-8982 > Project: Apache Arrow > Issue Type: Bug > Components: CI >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Now, all existing tests except Parquet pass on s390x. It is a good time to > remove {{allow_failures}} for s390x on TravisCI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8394) Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119867#comment-17119867 ] Paul Taylor commented on ARROW-8394: Thanks [~pprice], I'll look into this. I had to do a bunch of weird things to trick the 3.5 compiler into propagating the types, so I'm hoping I can back some of those out to get it working in 3.9 and simplify the typedefs along the way. > Typescript compiler errors for arrow d.ts files, when using es2015-esm package > -- > > Key: ARROW-8394 > URL: https://issues.apache.org/jira/browse/ARROW-8394 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.16.0 >Reporter: Shyamal Shukla >Priority: Blocker > > Attempting to use apache-arrow within a web application, but typescript > compiler throws the following errors in some of arrow's .d.ts files > import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow"; > export class SomeClass { > . > . > constructor() { > const t = Table.from(''); > } > *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: > Class static side 'typeof Column' incorrectly extends base class static side > 'typeof Chunked'. Types of property 'new' are incompatible. > *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: > Subsequent property declarations must have the same type. Property 'schema' > must be of type 'Schema', but here has type 'Schema'. > 238 schema: Schema; > *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error > TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. > The types of 'slice(...).clone' are incompatible between these types. > the tsconfig.json file looks like > { > "compilerOptions": { > "target":"ES6", > "outDir": "dist", > "baseUrl": "src/" > }, > "exclude": ["dist"], > "include": ["src/*.ts"] > } -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8738) [Java] Investigate adding a getUnsafe method to vectors
[ https://issues.apache.org/jira/browse/ARROW-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119860#comment-17119860 ] Micah Kornfield commented on ARROW-8738: I'm really not sure of the expected benefit here given the existing flag approach for turning off checks. I think we should wait until there is a clear use case > [Java] Investigate adding a getUnsafe method to vectors > --- > > Key: ARROW-8738 > URL: https://issues.apache.org/jira/browse/ARROW-8738 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Ryan Murray >Assignee: Ji Liu >Priority: Major > > As per: https://github.com/apache/arrow/pull/7095#issuecomment-625579459 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8983) Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
[ https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119813#comment-17119813 ] Valentyn Tymofieiev edited comment on ARROW-8983 at 5/29/20, 5:59 PM: -- Possibly this is related to numpy, a dependency of pyarrow, or there is a common root cause. On Py3: python -m pip download --dest /tmp numpy==1.17.5 --no-binary :all: finishes immediately, while: python -m pip download --dest /tmp numpy==1.18.0 --no-binary :all: takes a while to complete. was (Author: tvalentyn): Possibly this is related to numpy, a dependency of pyarrow, there is a common root cause. On Py3: python -m pip download --dest /tmp numpy==1.17.5 --no-binary :all: finishes immediately, while: python -m pip download --dest /tmp numpy==1.18.0 --no-binary :all: takes a while to complete. > Downloading sources of pyarrow and its requirements from pypi takes several > minutes starting from 0.16.0 > > > Key: ARROW-8983 > URL: https://issues.apache.org/jira/browse/ARROW-8983 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.16.0, 0.17.0, 0.17.1 >Reporter: Valentyn Tymofieiev >Priority: Minor > > It appears that > python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: > takes several minutes to execute. > There seems to be an increase in runtime starting from 0.16.0: on Python 2 > python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: > appears to be somewhat faster, but the same command is still slow on Py3. > The command is stuck for a while with "Installing build dependencies ... ", > and increased CPU usage. > The intent of this command is to download source tarball for a package and > its dependencies. > Some investigation was started on the mailing list: > https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
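A rough way to reproduce the timing gap described in the comment above is to wrap the two pip invocations in a small harness. This is a sketch only: the package pins come from the comment, everything else is illustrative and not part of pyarrow or pip.
{code:python}
import subprocess
import sys
import time

# Timing harness for the comparison in the comment above. Only the package
# pins are taken from the report; the rest is an illustrative sketch.
for spec in ("numpy==1.17.5", "numpy==1.18.0"):
    start = time.monotonic()
    subprocess.run(
        [sys.executable, "-m", "pip", "download", "--dest", "/tmp",
         spec, "--no-binary", ":all:"],
        check=True,
    )
    print(f"{spec} took {time.monotonic() - start:.1f}s")
{code}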
[jira] [Updated] (ARROW-8983) Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
[ https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev updated ARROW-8983: --- Description: It appears that python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: takes several minutes to execute. There seems to be an increase in runtime starting from 0.16.0: on Python 2 python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: appears to be somewhat faster, but the same command is still slow on Py3. The command is stuck for a while with "Installing build dependencies ... ", and increased CPU usage. The intent of this command is to download source tarball for a package and its dependencies. Some investigation was started on the mailing list: https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E was: It appears that python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: takes several minutes to execute. There seems to be an increase in runtime starting from 0.16.0: on Python 2 python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: appears to be somewhat faster, but the same command is still slow on Py3. The command is stuck for a while with "Installing build dependencies ... ", and increased CPU usage. The intent of this command is to download source tarball for a package and its dependencies. Some investigation was started on the mailing list: https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E > Downloading sources of pyarrow and its requirements from pypi takes several > minutes starting from 0.16.0 > > > Key: ARROW-8983 > URL: https://issues.apache.org/jira/browse/ARROW-8983 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.16.0, 0.17.0, 0.17.1 >Reporter: Valentyn Tymofieiev >Priority: Minor > > It appears that > python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: > takes several minutes to execute. > There seems to be an increase in runtime starting from 0.16.0: on Python 2 > python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: > appears to be somewhat faster, but the same command is still slow on Py3. > The command is stuck for a while with "Installing build dependencies ... ", > and increased CPU usage. > The intent of this command is to download source tarball for a package and > its dependencies. > Some investigation was started on the mailing list: > https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8983) Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
[ https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119813#comment-17119813 ] Valentyn Tymofieiev commented on ARROW-8983: Possibly this is related to numpy, a dependency of pyarrow, there is a common root cause. On Py3: python -m pip download --dest /tmp numpy==1.17.5 --no-binary :all: finishes immediately, while: python -m pip download --dest /tmp numpy==1.18.0 --no-binary :all: takes a while to complete. > Downloading sources of pyarrow and its requirements from pypi takes several > minutes starting from 0.16.0 > > > Key: ARROW-8983 > URL: https://issues.apache.org/jira/browse/ARROW-8983 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.16.0, 0.17.0, 0.17.1 >Reporter: Valentyn Tymofieiev >Priority: Minor > > It appears that > python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: > takes several minutes to execute. > There seems to be an increase in runtime starting from 0.16.0: on Python 2 > python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: > appears to be somewhat faster, but the same command is still slow on Py3. > The command is stuck for a while with "Installing build dependencies ... ", > and increased CPU usage. > The intent of this command is to download source tarball for a package and > its dependencies. > Some investigation was started on the mailing list: > https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8983) Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
Valentyn Tymofieiev created ARROW-8983: -- Summary: Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0 Key: ARROW-8983 URL: https://issues.apache.org/jira/browse/ARROW-8983 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.17.1, 0.17.0, 0.16.0 Reporter: Valentyn Tymofieiev It appears that python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: takes several minutes to execute. There seems to be an increase in runtime starting from 0.16.0: on Python 2 python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: appears to be somewhat faster, but the same command is still slow on Py3. The command is stuck for a while with "Installing build dependencies ... ", and increased CPU usage. The intent of this command is to download source tarball for a package and its dependencies. Some investigation was started on the mailing list: https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8978) [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" Valgrind warning
[ https://issues.apache.org/jira/browse/ARROW-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8978: -- Labels: pull-request-available (was: ) > [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" > Valgrind warning > > > Key: ARROW-8978 > URL: https://issues.apache.org/jira/browse/ARROW-8978 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kouhei Sutou >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/ursa-labs/crossbow/runs/715700830#step:6:4277 > {noformat} > [ RUN ] TestCallScalarFunction.PreallocationCases > ==5357== Conditional jump or move depends on uninitialised value(s) > ==5357==at 0x51D69A6: void arrow::internal::TransferBitmap true>(unsigned char const*, long, long, long, unsigned char*) > (bit_util.cc:176) > ==5357==by 0x51CE866: arrow::internal::CopyBitmap(unsigned char const*, > long, long, unsigned char*, long, bool) (bit_util.cc:208) > ==5357==by 0x52B6325: > arrow::compute::detail::NullPropagator::PropagateSingle() (exec.cc:295) > ==5357==by 0x52B36D1: Execute (exec.cc:378) > ==5357==by 0x52B36D1: > arrow::compute::detail::PropagateNulls(arrow::compute::KernelContext*, > arrow::compute::ExecBatch const&, arrow::ArrayData*) (exec.cc:412) > ==5357==by 0x52BA7F3: ExecuteBatch (exec.cc:586) > ==5357==by 0x52BA7F3: > arrow::compute::detail::ScalarExecutor::Execute(std::vector std::allocator > const&, arrow::compute::detail::ExecListener*) > (exec.cc:542) > ==5357==by 0x52BC21F: > arrow::compute::Function::Execute(std::vector std::allocator > const&, arrow::compute::FunctionOptions > const*, arrow::compute::ExecContext*) const (function.cc:94) > ==5357==by 0x52B141C: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) > (exec.cc:937) > ==5357==by 0x52B16F2: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::ExecContext*) (exec.cc:942) > ==5357==by 0x155515: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody()::{lambda(std::__cxx11::basic_string std::char_traits, std::allocator > >)#1}::operator()(std::__cxx11::basic_string, > std::allocator >) const (exec_test.cc:756) > ==5357==by 0x156AF2: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody() > (exec_test.cc:786) > ==5357==by 0x5BE4862: void > testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357==by 0x5BDEDE2: void > testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357== > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8981) [C++][Dataset] Add support for compressed FileSources
[ https://issues.apache.org/jira/browse/ARROW-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned ARROW-8981: --- Assignee: Kazuaki Ishizaki > [C++][Dataset] Add support for compressed FileSources > - > > Key: ARROW-8981 > URL: https://issues.apache.org/jira/browse/ARROW-8981 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.17.1 >Reporter: Ben Kietzman >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: dataset > Fix For: 1.0.0 > > > FileSource::compression_ is currently ignored. Ideally files/buffers which > are compressed could be decompressed on read. See ARROW-8942 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8981) [C++][Dataset] Add support for compressed FileSources
[ https://issues.apache.org/jira/browse/ARROW-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned ARROW-8981: --- Assignee: (was: Kazuaki Ishizaki) > [C++][Dataset] Add support for compressed FileSources > - > > Key: ARROW-8981 > URL: https://issues.apache.org/jira/browse/ARROW-8981 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.17.1 >Reporter: Ben Kietzman >Priority: Major > Labels: dataset > Fix For: 1.0.0 > > > FileSource::compression_ is currently ignored. Ideally files/buffers which > are compressed could be decompressed on read. See ARROW-8942 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8982) [CI] Remove allow_failures for s390x in TravisCI
[ https://issues.apache.org/jira/browse/ARROW-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned ARROW-8982: --- Assignee: Kazuaki Ishizaki > [CI] Remove allow_failures for s390x in TravisCI > > > Key: ARROW-8982 > URL: https://issues.apache.org/jira/browse/ARROW-8982 > Project: Apache Arrow > Issue Type: Bug > Components: CI >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > > Now, all existing tests except Parquet pass on s390x. It is a good time to > remove {{allow_failures}} for s390x on TravisCI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8982) [CI] Remove allow_failures for s390x in TravisCI
Kazuaki Ishizaki created ARROW-8982: --- Summary: [CI] Remove allow_failures for s390x in TravisCI Key: ARROW-8982 URL: https://issues.apache.org/jira/browse/ARROW-8982 Project: Apache Arrow Issue Type: Bug Components: CI Reporter: Kazuaki Ishizaki Now, all existing tests except Parquet pass on s390x. It is a good time to remove {{allow_failures}} for s390x on TravisCI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
[ https://issues.apache.org/jira/browse/ARROW-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-8647: --- Assignee: Ben Kietzman > [C++][Dataset] Optionally encode partition field values as dictionary type > -- > > Key: ARROW-8647 > URL: https://issues.apache.org/jira/browse/ARROW-8647 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, dataset-dask-integration > Fix For: 1.0.0 > > > In the Python ParquetDataset implementation, the partition fields are > returned as dictionary type columns. > In the new Dataset API, we now use a plain type (integer or string when > inferred). But, you can already manually specify that the partition keys > should be dictionary type by specifying the partitioning schema (in > {{Partitioning}} passed to the dataset factory). > Since using dictionary type can be more efficient (since partition keys will > typically be repeated values in the resulting table), it might be good to > still have an option in the DatasetFactory to use dictionary types for the > partition fields. > See also https://github.com/apache/arrow/pull/6303#discussion_r400622340 -- This message was sent by Atlassian Jira (v8.3.4#803005)
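For reference, the manual route the description mentions (declaring a partition field as dictionary type in an explicit partitioning schema) might look roughly like this in Python; the path and the field name are placeholders, not part of the issue:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition key as dictionary-typed up front. "key" and the
# dataset path are hypothetical placeholders.
part = ds.partitioning(
    pa.schema([("key", pa.dictionary(pa.int32(), pa.string()))]),
    flavor="hive",
)
dataset = ds.dataset("path/to/partitioned_data", format="parquet",
                     partitioning=part)
{code}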
[jira] [Commented] (ARROW-8942) [R] support read gzip csv files
[ https://issues.apache.org/jira/browse/ARROW-8942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119684#comment-17119684 ] Ben Kietzman commented on ARROW-8942: - Added https://issues.apache.org/jira/browse/ARROW-8981 to track this feature for datasets. > [R] support read gzip csv files > --- > > Key: ARROW-8942 > URL: https://issues.apache.org/jira/browse/ARROW-8942 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Dyfan Jones >Priority: Major > > Hi all, > Apologies if this has already been covered by another ticket. Is it possible > for arrow to read in compressed delimited files (for example gzip)? > Currently I get an error when trying to read in a compressed delimited file: > > {code:java} > vroom::vroom_write(iris, "iris.csv.gz", delim = ",") > arrow::read_csv_arrow("iris.csv.gz") > # Error in csv__TableReader_Read(self) : > # Invalid: CSV parse error: Expected 1 columns, got 4{code} > However, it can be read in by vroom and readr: > {code:java} > vroom::vroom("iris.csv.gz") > readr::read_csv("iris.csv.gz") > {code} > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8981) [C++][Dataset] Add support for compressed FileSources
Ben Kietzman created ARROW-8981: --- Summary: [C++][Dataset] Add support for compressed FileSources Key: ARROW-8981 URL: https://issues.apache.org/jira/browse/ARROW-8981 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.17.1 Reporter: Ben Kietzman Fix For: 1.0.0 FileSource::compression_ is currently ignored. Ideally files/buffers which are compressed could be decompressed on read. See ARROW-8942 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8942) [R] support read gzip csv files
[ https://issues.apache.org/jira/browse/ARROW-8942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119682#comment-17119682 ] Ben Kietzman commented on ARROW-8942: - [~npr] It's a goal to support compressed files but we haven't implemented anything yet. Files are tagged with a compression type tag (which is currently ignored) so that when they are opened/read they can be lazily decompressed. > [R] support read gzip csv files > --- > > Key: ARROW-8942 > URL: https://issues.apache.org/jira/browse/ARROW-8942 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Dyfan Jones >Priority: Major > > Hi all, > Apologies if this has already been covered by another ticket. Is it possible > for arrow to read in compressed delimited files (for example gzip)? > Currently I get an error when trying to read in a compressed delimited file: > > {code:java} > vroom::vroom_write(iris, "iris.csv.gz", delim = ",") > arrow::read_csv_arrow("iris.csv.gz") > # Error in csv__TableReader_Read(self) : > # Invalid: CSV parse error: Expected 1 columns, got 4{code} > However, it can be read in by vroom and readr: > {code:java} > vroom::vroom("iris.csv.gz") > readr::read_csv("iris.csv.gz") > {code} > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
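Until compressed inputs are handled natively, one workaround on the Python side is to decompress outside Arrow and hand the reader a file-like object. A minimal sketch, reusing the file name from the issue; the decompression here is done by Python's standard library, not by Arrow:
{code:python}
import gzip

import pyarrow.csv as pv

# Decompress with the standard library and pass the stream to the Arrow
# CSV reader, so the reader never sees the gzip framing.
with gzip.open("iris.csv.gz", "rb") as f:
    table = pv.read_csv(f)
print(table.num_rows)
{code}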
[jira] [Resolved] (ARROW-8975) [FlightRPC][C++] Fix flaky MacOS tests
[ https://issues.apache.org/jira/browse/ARROW-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8975. --- Resolution: Fixed Issue resolved by pull request 7298 [https://github.com/apache/arrow/pull/7298] > [FlightRPC][C++] Fix flaky MacOS tests > -- > > Key: ARROW-8975 > URL: https://issues.apache.org/jira/browse/ARROW-8975 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Affects Versions: 0.17.1 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > The gRPC MacOS tests have been flaking again. > Looking at [https://github.com/grpc/grpc/issues/20311] they may possibly have > been fixed except [https://github.com/grpc/grpc/issues/13856] reports they > haven't (in some configurations?) so I will try a few things in CI, or just > disable the tests on MacOS. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6945) [Rust] Enable integration tests
[ https://issues.apache.org/jira/browse/ARROW-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119631#comment-17119631 ] Andy Grove commented on ARROW-6945: --- [~npr] Here is a log file showing output from a test run using the command "docker-compose run conda-integration". [^rust-it.log.tgz] > [Rust] Enable integration tests > --- > > Key: ARROW-6945 > URL: https://issues.apache.org/jira/browse/ARROW-6945 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: rust-it.log.tgz > > Time Spent: 3.5h > Remaining Estimate: 0h > > Use docker-compose to generate test files using the Java implementation and > then have Rust tests read them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6945) [Rust] Enable integration tests
[ https://issues.apache.org/jira/browse/ARROW-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119632#comment-17119632 ] Andy Grove commented on ARROW-6945: --- cc [~vertexclique] > [Rust] Enable integration tests > --- > > Key: ARROW-6945 > URL: https://issues.apache.org/jira/browse/ARROW-6945 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: rust-it.log.tgz > > Time Spent: 3.5h > Remaining Estimate: 0h > > Use docker-compose to generate test files using the Java implementation and > then have Rust tests read them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6945) [Rust] Enable integration tests
[ https://issues.apache.org/jira/browse/ARROW-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-6945: -- Attachment: rust-it.log.tgz > [Rust] Enable integration tests > --- > > Key: ARROW-6945 > URL: https://issues.apache.org/jira/browse/ARROW-6945 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: rust-it.log.tgz > > Time Spent: 3.5h > Remaining Estimate: 0h > > Use docker-compose to generate test files using the Java implementation and > then have Rust tests read them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8914) [C++][Gandiva] Decimal128 related test failed on big-endian platforms
[ https://issues.apache.org/jira/browse/ARROW-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8914. --- Fix Version/s: 1.0.0 Resolution: Fixed > [C++][Gandiva] Decimal128 related test failed on big-endian platforms > - > > Key: ARROW-8914 > URL: https://issues.apache.org/jira/browse/ARROW-8914 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > These test failures in gandiva tests occur on big-endian platforms. An > example from https://travis-ci.org/github/apache/arrow/jobs/690006107#L2306 > {code} > ... > [==] 17 tests from 1 test case ran. (2334 ms total) > [ PASSED ] 7 tests. > [ FAILED ] 10 tests, listed below: > [ FAILED ] TestDecimal.TestSimple > [ FAILED ] TestDecimal.TestLiteral > [ FAILED ] TestDecimal.TestCompare > [ FAILED ] TestDecimal.TestRoundFunctions > [ FAILED ] TestDecimal.TestCastFunctions > [ FAILED ] TestDecimal.TestIsDistinct > [ FAILED ] TestDecimal.TestCastVarCharDecimal > [ FAILED ] TestDecimal.TestCastDecimalVarChar > [ FAILED ] TestDecimal.TestVarCharDecimalNestedCast > [ FAILED ] TestDecimal.TestCastDecimalOverflow > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8914) [C++][Gandiva] Decimal128 related test failed on big-endian platforms
[ https://issues.apache.org/jira/browse/ARROW-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-8914: -- Component/s: (was: C++ - Gandiva) C++ > [C++][Gandiva] Decimal128 related test failed on big-endian platforms > - > > Key: ARROW-8914 > URL: https://issues.apache.org/jira/browse/ARROW-8914 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > These test failures in gandiva tests occur on big-endian platforms. An > example from https://travis-ci.org/github/apache/arrow/jobs/690006107#L2306 > {code} > ... > [==] 17 tests from 1 test case ran. (2334 ms total) > [ PASSED ] 7 tests. > [ FAILED ] 10 tests, listed below: > [ FAILED ] TestDecimal.TestSimple > [ FAILED ] TestDecimal.TestLiteral > [ FAILED ] TestDecimal.TestCompare > [ FAILED ] TestDecimal.TestRoundFunctions > [ FAILED ] TestDecimal.TestCastFunctions > [ FAILED ] TestDecimal.TestIsDistinct > [ FAILED ] TestDecimal.TestCastVarCharDecimal > [ FAILED ] TestDecimal.TestCastDecimalVarChar > [ FAILED ] TestDecimal.TestVarCharDecimalNestedCast > [ FAILED ] TestDecimal.TestCastDecimalOverflow > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8964) Pyarrow: improve reading of partitioned parquet datasets whose schema changed
[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119598#comment-17119598 ] Ira Saktor commented on ARROW-8964: --- In an unrelated question: any chance you would also know how to write timestamps to parquet files so that I can then create an Impala table with timestamp-typed columns from the given parquet files? > Pyarrow: improve reading of partitioned parquet datasets whose schema changed > - > > Key: ARROW-8964 > URL: https://issues.apache.org/jira/browse/ARROW-8964 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow > 0.17.1 >Reporter: Ira Saktor >Priority: Major > > Hi there, I'm encountering the following issue when reading from HDFS: > > *My situation:* > I have a partitioned parquet dataset in HDFS, whose recent partitions contain > parquet files with more columns than the older ones. When I try to read data > using pyarrow.dataset.dataset and filter on recent data, I still get only the > columns that are also contained in the old parquet files. I'd like to somehow > merge the schema or use the schema from parquet files from which data ends up > being loaded. > *when using:* > `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', > filters = my_filter_expression).to_table().to_pandas()` > Is there a way to handle schema changes so that the read data > would contain all columns? > Everything works fine when I copy the needed parquet files into a separate > folder; however, it is a very inconvenient way of working. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
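On the side question about Impala: older Impala releases expect INT96-encoded timestamps, which the Parquet writer can still emit via a flag. A minimal sketch, assuming that is the Impala variant in use; the column and file names are placeholders:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write timestamps as INT96, the encoding older Impala releases read as
# TIMESTAMP. Whether the flag is needed depends on the Impala version.
df = pd.DataFrame({"ts": pd.to_datetime(["2020-05-29 12:00:00"])})
table = pa.Table.from_pandas(df)
pq.write_table(table, "ts.parquet", use_deprecated_int96_timestamps=True)
{code}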
[jira] [Commented] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119554#comment-17119554 ] Wes McKinney commented on ARROW-8976: - I plan to add Take and Filter "metafunctions" that deal with this and also Table/RecordBatch inputs > [C++] compute::CallFunction can't Filter/Take with ChunkedArray > --- > > Key: ARROW-8976 > URL: https://issues.apache.org/jira/browse/ARROW-8976 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Followup to ARROW-8938 > {{Invalid: Kernel does not support chunked array arguments}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
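Until such metafunctions exist, one workaround is to collapse the chunks into a single contiguous Array before invoking the kernel. A sketch at the Python level, not the planned C++ implementation; note it copies the data, which the metafunctions would presumably avoid:
{code:python}
import pyarrow as pa

# Concatenate the chunks so a single-array kernel can run on the result.
chunked = pa.chunked_array([[1, 2], [3, 4, 5]])
flat = pa.concat_arrays(chunked.chunks)
print(flat.take(pa.array([0, 4])))  # -> [1, 5]
{code}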
[jira] [Assigned] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8976: --- Assignee: Wes McKinney > [C++] compute::CallFunction can't Filter/Take with ChunkedArray > --- > > Key: ARROW-8976 > URL: https://issues.apache.org/jira/browse/ARROW-8976 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Followup to ARROW-8938 > {{Invalid: Kernel does not support chunked array arguments}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8980) [Python] Metadata grows exponentially when using schema from disk
[ https://issues.apache.org/jira/browse/ARROW-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Glasson updated ARROW-8980: - Description: When overwriting parquet files, we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (that I won't go into). Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed. Note: My solution was to remove `b'ARROW:schema'` data from the `schema.metadata`; this seems to stop the file size growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that the ARROW:schema is just the metadata serialised or something. I should also note that once the metadata gets too big, this leads to a buffer overflow in another part of the code ('thrift'), which was referenced here: https://issues.apache.org/jira/browse/PARQUET-1345
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pathlib
import sys


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 10})
    df.to_parquet(fname)

    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")
        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()
        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)
        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
{code}
{code:java}
(sdm) ➜ ~ python growing_metadata.py
python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Wrote test frame to test.pq
Size of test.pq: 1643
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 3637
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 8327
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 19301
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 44944
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 104815
{code}
was: When overwriting parquet files, we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (that I won't go into). Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed. Note: My solution was to remove `b'ARROW:schema'` data from the `schema.metadata`; this seems to stop the file size growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that the ARROW:schema is just the metadata serialised or something.
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pathlib
import sys


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 10})
    df.to_parquet(fname)

    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")
        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()
        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)
        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
{code}
{code:java}
(sdm) ➜ ~ python growing_metadata.py
[jira] [Created] (ARROW-8980) [Python] Metadata grows exponentially when using schema from disk
Kevin Glasson created ARROW-8980: Summary: [Python] Metadata grows exponentially when using schema from disk Key: ARROW-8980 URL: https://issues.apache.org/jira/browse/ARROW-8980 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Environment: python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57) [Clang 9.0.0 (tags/RELEASE_900/final)] pa version: 0.16.0 pd version: 0.25.2 Reporter: Kevin Glasson Attachments: growing_metadata.py, test.pq When overwriting parquet files, we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (that I won't go into). Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed. Note: My solution was to remove `b'ARROW:schema'` data from the `schema.metadata`; this seems to stop the file size growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that the ARROW:schema is just the metadata serialised or something.
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pathlib
import sys


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 10})
    df.to_parquet(fname)

    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")
        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()
        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)
        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
{code}
{code:java}
(sdm) ➜ ~ python growing_metadata.py
python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Wrote test frame to test.pq
Size of test.pq: 1643
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 3637
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 8327
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 19301
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 44944
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 104815
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
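The workaround the report describes (dropping the serialized b'ARROW:schema' entry before constructing the next writer) might look like this; a sketch only, keeping any other metadata such as the pandas block intact:
{code:python}
import pyarrow.parquet as pq

# Strip the serialized Arrow schema from the metadata so it is not
# re-embedded (and re-grown) on every rewrite; everything else is kept.
pf = pq.ParquetFile("test.pq")
schema = pf.schema.to_arrow_schema()
meta = {k: v for k, v in (schema.metadata or {}).items()
        if k != b"ARROW:schema"}
schema = schema.with_metadata(meta)
writer = pq.ParquetWriter("test.pq", schema=schema)
{code}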
[jira] [Comment Edited] (ARROW-8964) Pyarrow: improve reading of partitioned parquet datasets whose schema changed
[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119343#comment-17119343 ] Ira Saktor edited comment on ARROW-8964 at 5/29/20, 7:20 AM: - Awesome! Thank you, that's what I was looking for. Looking forward to ARROW-8221 being resolved :). Should I close this task, as it's pretty much a duplicate? was (Author: 1ira): Awesome! Thank you, that's what I was looking for. Looking forward to ARROW-8221 being resolved :) > Pyarrow: improve reading of partitioned parquet datasets whose schema changed > - > > Key: ARROW-8964 > URL: https://issues.apache.org/jira/browse/ARROW-8964 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow > 0.17.1 >Reporter: Ira Saktor >Priority: Major > > Hi there, I'm encountering the following issue when reading from HDFS: > > *My situation:* > I have a partitioned parquet dataset in HDFS, whose recent partitions contain > parquet files with more columns than the older ones. When I try to read data > using pyarrow.dataset.dataset and filter on recent data, I still get only the > columns that are also contained in the old parquet files. I'd like to somehow > merge the schema or use the schema from parquet files from which data ends up > being loaded. > *when using:* > `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', > filters = my_filter_expression).to_table().to_pandas()` > Is there a way to handle schema changes so that the read data > would contain all columns? > Everything works fine when I copy the needed parquet files into a separate > folder; however, it is a very inconvenient way of working. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8964) Pyarrow: improve reading of partitioned parquet datasets whose schema changed
[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119343#comment-17119343 ] Ira Saktor commented on ARROW-8964: --- Awesome! Thank you, that's what I was looking for. Looking forward to ARROW-8221 being resolved :) > Pyarrow: improve reading of partitioned parquet datasets whose schema changed > - > > Key: ARROW-8964 > URL: https://issues.apache.org/jira/browse/ARROW-8964 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow > 0.17.1 >Reporter: Ira Saktor >Priority: Major > > Hi there, I'm encountering the following issue when reading from HDFS: > > *My situation:* > I have a partitioned parquet dataset in HDFS, whose recent partitions contain > parquet files with more columns than the older ones. When I try to read data > using pyarrow.dataset.dataset and filter on recent data, I still get only the > columns that are also contained in the old parquet files. I'd like to somehow > merge the schema or use the schema from parquet files from which data ends up > being loaded. > *when using:* > `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', > filters = my_filter_expression).to_table().to_pandas()` > Is there a way to handle schema changes so that the read data > would contain all columns? > Everything works fine when I copy the needed parquet files into a separate > folder; however, it is a very inconvenient way of working. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
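Until ARROW-8221 lands, one interim route is to pass the unified schema explicitly to the dataset factory; columns absent from older files should then come back as nulls. A sketch with hand-written placeholder field names, reusing the path from the question:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder schema: list every column expected across old and new
# partitions; older files missing "new_col" should yield nulls for it.
schema = pa.schema([
    ("id", pa.int64()),
    ("value", pa.float64()),
    ("new_col", pa.string()),
])
dataset = ds.dataset("path_to_hdfs_directory", partitioning="hive",
                     schema=schema)
df = dataset.to_table().to_pandas()
{code}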