[jira] [Resolved] (ARROW-8984) [R] Revise install guides now that Windows conda package exists
[ https://issues.apache.org/jira/browse/ARROW-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-8984. - Resolution: Fixed Issue resolved by pull request 7303 [https://github.com/apache/arrow/pull/7303] > [R] Revise install guides now that Windows conda package exists > --- > > Key: ARROW-8984 > URL: https://issues.apache.org/jira/browse/ARROW-8984 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8978) [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" Valgrind warning
[ https://issues.apache.org/jira/browse/ARROW-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8978. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7302 [https://github.com/apache/arrow/pull/7302] > [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" > Valgrind warning > > > Key: ARROW-8978 > URL: https://issues.apache.org/jira/browse/ARROW-8978 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kouhei Sutou >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > https://github.com/ursa-labs/crossbow/runs/715700830#step:6:4277 > {noformat} > [ RUN ] TestCallScalarFunction.PreallocationCases > ==5357== Conditional jump or move depends on uninitialised value(s) > ==5357==at 0x51D69A6: void arrow::internal::TransferBitmap true>(unsigned char const*, long, long, long, unsigned char*) > (bit_util.cc:176) > ==5357==by 0x51CE866: arrow::internal::CopyBitmap(unsigned char const*, > long, long, unsigned char*, long, bool) (bit_util.cc:208) > ==5357==by 0x52B6325: > arrow::compute::detail::NullPropagator::PropagateSingle() (exec.cc:295) > ==5357==by 0x52B36D1: Execute (exec.cc:378) > ==5357==by 0x52B36D1: > arrow::compute::detail::PropagateNulls(arrow::compute::KernelContext*, > arrow::compute::ExecBatch const&, arrow::ArrayData*) (exec.cc:412) > ==5357==by 0x52BA7F3: ExecuteBatch (exec.cc:586) > ==5357==by 0x52BA7F3: > arrow::compute::detail::ScalarExecutor::Execute(std::vector std::allocator > const&, arrow::compute::detail::ExecListener*) > (exec.cc:542) > ==5357==by 0x52BC21F: > arrow::compute::Function::Execute(std::vector std::allocator > const&, arrow::compute::FunctionOptions > const*, arrow::compute::ExecContext*) const (function.cc:94) > ==5357==by 0x52B141C: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) > (exec.cc:937) > ==5357==by 0x52B16F2: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::ExecContext*) (exec.cc:942) > ==5357==by 0x155515: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody()::{lambda(std::__cxx11::basic_string std::char_traits, std::allocator > >)#1}::operator()(std::__cxx11::basic_string, > std::allocator >) const (exec_test.cc:756) > ==5357==by 0x156AF2: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody() > (exec_test.cc:786) > ==5357==by 0x5BE4862: void > testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357==by 0x5BDEDE2: void > testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357== > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8878) [R] try_download is confused when download.file.method isn't default
[ https://issues.apache.org/jira/browse/ARROW-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8878: -- Labels: pull-request-available (was: ) > [R] try_download is confused when download.file.method isn't default > > > Key: ARROW-8878 > URL: https://issues.apache.org/jira/browse/ARROW-8878 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: r >Reporter: Olaf >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Hello there and thanks again for this beautiful package! > I am trying to install {{arrow}} on linux and I got a few problematic > warnings during the install. My computer is behind a firewall so not all the > connections coming from rstudio are allowed. > > {code:java} > > sessionInfo() > R version 3.6.1 (2019-07-05) > Platform: x86_64-ubuntu18-linux-gnu (64-bit) > Running under: Ubuntu 18.04.4 LTS > Matrix products: default > BLAS/LAPACK: > /apps/intel/2019.1/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] MKLthreads_0.1 > loaded via a namespace (and not attached): > [1] compiler_3.6.1 tools_3.6.1 > {code} > > after running {{install.packages("arrow")}} I get > > {code:java} > > installing *source* package 'arrow' ... > ** package 'arrow' successfully unpacked and MD5 sums checked > ** using staged installation > *** Successfully retrieved C++ source > *** Proceeding without C++ dependencies > Warning message: > In unzip(tf1, exdir = src_dir) : error 1 in extracting from zip file > ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or > directory > - NOTE --- > After installation, please run arrow::install_arrow() > for help installing required runtime libraries > - > {code} > > > However, the installation ends normally. > > {code:java} > ** R > ** inst > ** byte-compile and prepare package for lazy loading > ** help > *** installing help indices > ** building package indices > ** installing vignettes > ** testing if installed package can be loaded from temporary location > ** checking absolute paths in shared objects and dynamic libraries > ** testing if installed package can be loaded from final location > ** testing if installed package keeps a record of temporary installation path > * DONE (arrow) > {code} > > So I go ahead and try to run arrow::install_arrow() and get a similar warning. > > {code:java} > installing *source* package 'arrow' ... > ** package 'arrow' successfully unpacked and MD5 sums checked > ** using staged installation > *** Successfully retrieved C++ binaries for ubuntu-18.04 > Warning messages: > 1: In file(file, "rt") : > URL > 'https://raw.githubusercontent.com/ursa-labs/arrow-r-nightly/master/linux/distro-map.csv': > status was 'Couldn't connect to server' > 2: In unzip(bin_file, exdir = dst_dir) : > error 1 in extracting from zip file > ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or > directory > - NOTE --- > After installation, please run arrow::install_arrow() > for help installing required runtime libraries > {code} > And unfortunately I cannot read any parquet file.
> {noformat} > Error in fetch(key) : lazy-load database > '/mydata/R/x86_64-ubuntu18-linux-gnu-library/3.6/arrow/help/arrow.rdb' is > corrupt{noformat} > > Could you please tell me how to fix this? Can I just copy the zip from github > and do a manual install in Rstudio? > > Thanks! > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8878) [R] try_download is confused when download.file.method isn't default
[ https://issues.apache.org/jira/browse/ARROW-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8878: --- Summary: [R] try_download is confused when download.file.method isn't default (was: [R] how to install when behind a firewall?) > [R] try_download is confused when download.file.method isn't default > > > Key: ARROW-8878 > URL: https://issues.apache.org/jira/browse/ARROW-8878 > Project: Apache Arrow > Issue Type: Bug > Components: R > Environment: r >Reporter: Olaf >Priority: Major > > Hello there and thanks again for this beautiful package! > I am trying to install {{arrow}} on linux and I got a few problematic > warnings during the install. My computer is behind a firewall so not all the > connections coming from rstudio are allowed. > > {code:java} > > sessionInfo() > R version 3.6.1 (2019-07-05) > Platform: x86_64-ubuntu18-linux-gnu (64-bit) > Running under: Ubuntu 18.04.4 LTS > Matrix products: default > BLAS/LAPACK: > /apps/intel/2019.1/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] MKLthreads_0.1 > loaded via a namespace (and not attached): > [1] compiler_3.6.1 tools_3.6.1 > {code} > > after running {{install.packages("arrow")}} I get > > {code:java} > > installing *source* package 'arrow' ... > ** package 'arrow' successfully unpacked and MD5 sums checked > ** using staged installation > *** Successfully retrieved C++ source > *** Proceeding without C++ dependencies > Warning message: > In unzip(tf1, exdir = src_dir) : error 1 in extracting from zip file > ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or > directory > - NOTE --- > After installation, please run arrow::install_arrow() > for help installing required runtime libraries > - > {code} > > > However, the installation ends normally. > > {code:java} > ** R > ** inst > ** byte-compile and prepare package for lazy loading > ** help > *** installing help indices > ** building package indices > ** installing vignettes > ** testing if installed package can be loaded from temporary location > ** checking absolute paths in shared objects and dynamic libraries > ** testing if installed package can be loaded from final location > ** testing if installed package keeps a record of temporary installation path > * DONE (arrow) > {code} > > So I go ahead and try to run arrow::install_arrow() and get a similar warning. > > {code:java} > installing *source* package 'arrow' ... > ** package 'arrow' successfully unpacked and MD5 sums checked > ** using staged installation > *** Successfully retrieved C++ binaries for ubuntu-18.04 > Warning messages: > 1: In file(file, "rt") : > URL > 'https://raw.githubusercontent.com/ursa-labs/arrow-r-nightly/master/linux/distro-map.csv': > status was 'Couldn't connect to server' > 2: In unzip(bin_file, exdir = dst_dir) : > error 1 in extracting from zip file > ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or > directory > - NOTE --- > After installation, please run arrow::install_arrow() > for help installing required runtime libraries > {code} > And unfortunately I cannot read any parquet file.
> {noformat} > Error in fetch(key) : lazy-load database > '/mydata/R/x86_64-ubuntu18-linux-gnu-library/3.6/arrow/help/arrow.rdb' is > corrupt{noformat} > > Could you please tell me how to fix this? Can I just copy the zip from github > and do a manual install in Rstudio? > > Thanks! > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8984) [R] Revise install guides now that Windows conda package exists
Neal Richardson created ARROW-8984: -- Summary: [R] Revise install guides now that Windows conda package exists Key: ARROW-8984 URL: https://issues.apache.org/jira/browse/ARROW-8984 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8984) [R] Revise install guides now that Windows conda package exists
[ https://issues.apache.org/jira/browse/ARROW-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8984: -- Labels: pull-request-available (was: ) > [R] Revise install guides now that Windows conda package exists > --- > > Key: ARROW-8984 > URL: https://issues.apache.org/jira/browse/ARROW-8984 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8982) [CI] Remove allow_failures for s390x in TravisCI
[ https://issues.apache.org/jira/browse/ARROW-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8982: -- Labels: pull-request-available (was: ) > [CI] Remove allow_failures for s390x in TravisCI > > > Key: ARROW-8982 > URL: https://issues.apache.org/jira/browse/ARROW-8982 > Project: Apache Arrow > Issue Type: Bug > Components: CI >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Now, all existing tests except Parquet pass on s390x. It is a good time to > remove {{allow_failures}} for s390x on TravisCI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8394) Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119867#comment-17119867 ] Paul Taylor commented on ARROW-8394: Thanks [~pprice], I'll look into this. I had to do a bunch of weird things to trick the 3.5 compiler into propagating the types, so I'm hoping I can back some of those out to get it working in 3.9 and simplify the typedefs along the way. > Typescript compiler errors for arrow d.ts files, when using es2015-esm package > -- > > Key: ARROW-8394 > URL: https://issues.apache.org/jira/browse/ARROW-8394 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.16.0 >Reporter: Shyamal Shukla >Priority: Blocker > > Attempting to use apache-arrow within a web application, but typescript > compiler throws the following errors in some of arrow's .d.ts files > import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow"; > export class SomeClass { > . > . > constructor() { > const t = Table.from(''); > } > *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: > Class static side 'typeof Column' incorrectly extends base class static side > 'typeof Chunked'. Types of property 'new' are incompatible. > *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: > Subsequent property declarations must have the same type. Property 'schema' > must be of type 'Schema', but here has type 'Schema'. > 238 schema: Schema; > *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error > TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. > The types of 'slice(...).clone' are incompatible between these types. > the tsconfig.json file looks like > { > "compilerOptions": { > "target":"ES6", > "outDir": "dist", > "baseUrl": "src/" > }, > "exclude": ["dist"], > "include": ["src/*.ts"] > } -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8738) [Java] Investigate adding a getUnsafe method to vectors
[ https://issues.apache.org/jira/browse/ARROW-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119860#comment-17119860 ] Micah Kornfield commented on ARROW-8738: I'm really not sure of the expected benefit here given the existing flag approach for turning off checks. I think we should wait until there is a clear use case > [Java] Investigate adding a getUnsafe method to vectors > --- > > Key: ARROW-8738 > URL: https://issues.apache.org/jira/browse/ARROW-8738 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Ryan Murray >Assignee: Ji Liu >Priority: Major > > As per: https://github.com/apache/arrow/pull/7095#issuecomment-625579459 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8983) Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
[ https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119813#comment-17119813 ] Valentyn Tymofieiev edited comment on ARROW-8983 at 5/29/20, 5:59 PM: -- Possibly this is related to numpy, a dependency of pyarrow, or there is a common root cause. On Py3: python -m pip download --dest /tmp numpy==1.17.5 --no-binary :all: finishes immediately, while: python -m pip download --dest /tmp numpy==1.18.0 --no-binary :all: takes a while to complete. was (Author: tvalentyn): Possibly this is related to numpy, a dependency of pyarrow, there is a common root cause. On Py3: python -m pip download --dest /tmp numpy==1.17.5 --no-binary :all: finishes immediately, while: python -m pip download --dest /tmp numpy==1.18.0 --no-binary :all: takes a while to complete. > Downloading sources of pyarrow and its requirements from pypi takes several > minutes starting from 0.16.0 > > > Key: ARROW-8983 > URL: https://issues.apache.org/jira/browse/ARROW-8983 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.16.0, 0.17.0, 0.17.1 >Reporter: Valentyn Tymofieiev >Priority: Minor > > It appears that > python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: > takes several minutes to execute. > There seems to be an increase in runtime starting from 0.16.0: on Python 2 > python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: > appears to be somewhat faster, but the same command is still slow on Py3. > The command is stuck for a while with "Installing build dependencies ... ", > and increased CPU usage. > The intent of this command is to download source tarball for a package and > its dependencies. > Some investigation was started on the mailing list: > https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
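A rough way to reproduce the timing gap described in the comment above is to wrap the two pip invocations in a small harness. This is a sketch only: the package pins come from the comment, everything else is illustrative and not part of pyarrow or pip.
{code:python}
import subprocess
import sys
import time

# Timing harness for the comparison in the comment above. Only the package
# pins are taken from the report; the rest is an illustrative sketch.
for spec in ("numpy==1.17.5", "numpy==1.18.0"):
    start = time.monotonic()
    subprocess.run(
        [sys.executable, "-m", "pip", "download", "--dest", "/tmp",
         spec, "--no-binary", ":all:"],
        check=True,
    )
    print(f"{spec} took {time.monotonic() - start:.1f}s")
{code}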
[jira] [Updated] (ARROW-8983) Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
[ https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev updated ARROW-8983: --- Description: It appears that python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: takes several minutes to execute. There seems to be an increase in runtime starting from 0.16.0: on Python 2 python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: appears to be somewhat faster, but the same command is still slow on Py3. The command is stuck for a while with "Installing build dependencies ... ", and increased CPU usage. The intent of this command is to download source tarball for a package and its dependencies. Some investigation was started on the mailing list: https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E was: It appears that python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: takes several minutes to execute. There seems to be an increase in runtime starting from 0.16.0: on Python 2 python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: appears to be somewhat faster, but the same command is still slow on Py3. The command is stuck for a while with "Installing build dependencies ... ", and increased CPU usage. The intent of this command is to download source tarball for a package and its dependencies. Some investigation was started on the mailing list: https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E > Downloading sources of pyarrow and its requirements from pypi takes several > minutes starting from 0.16.0 > > > Key: ARROW-8983 > URL: https://issues.apache.org/jira/browse/ARROW-8983 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.16.0, 0.17.0, 0.17.1 >Reporter: Valentyn Tymofieiev >Priority: Minor > > It appears that > python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: > takes several minutes to execute. > There seems to be an increase in runtime starting from 0.16.0: on Python 2 > python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: > appears to be somewhat faster, but the same command is still slow on Py3. > The command is stuck for a while with "Installing build dependencies ... ", > and increased CPU usage. > The intent of this command is to download source tarball for a package and > its dependencies. > Some investigation was started on the mailing list: > https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8983) Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
[ https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119813#comment-17119813 ] Valentyn Tymofieiev commented on ARROW-8983: Possibly this is related to numpy, a dependency of pyarrow, there is a common root cause. On Py3: python -m pip download --dest /tmp numpy==1.17.5 --no-binary :all: finishes immediately, while: python -m pip download --dest /tmp numpy==1.18.0 --no-binary :all: takes a while to complete. > Downloading sources of pyarrow and its requirements from pypi takes several > minutes starting from 0.16.0 > > > Key: ARROW-8983 > URL: https://issues.apache.org/jira/browse/ARROW-8983 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.16.0, 0.17.0, 0.17.1 >Reporter: Valentyn Tymofieiev >Priority: Minor > > It appears that > python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: > takes several minutes to execute. > There seems to be an increase in runtime starting from 0.16.0: on Python 2 > python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: > appears to be somewhat faster, but the same command is still slow on Py3. > The command is stuck for a while with "Installing build dependencies ... ", > and increased CPU usage. > The intent of this command is to download source tarball for a package and > its dependencies. > Some investigation was started on the mailing list: > https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8983) Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0
Valentyn Tymofieiev created ARROW-8983: -- Summary: Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0 Key: ARROW-8983 URL: https://issues.apache.org/jira/browse/ARROW-8983 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.17.1, 0.17.0, 0.16.0 Reporter: Valentyn Tymofieiev It appears that python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all: takes several minutes to execute. There seems to be an increase in runtime starting from 0.16.0: on Python 2 python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all: appears to be somewhat faster, but the same command is still slow on Py3. The command is stuck for a while with "Installing build dependencies ... ", and increased CPU usage. The intent of this command is to download source tarball for a package and its dependencies. Some investigation was started on the mailing list: https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8978) [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" Valgrind warning
[ https://issues.apache.org/jira/browse/ARROW-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8978: -- Labels: pull-request-available (was: ) > [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" > Valgrind warning > > > Key: ARROW-8978 > URL: https://issues.apache.org/jira/browse/ARROW-8978 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Kouhei Sutou >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/ursa-labs/crossbow/runs/715700830#step:6:4277 > {noformat} > [ RUN ] TestCallScalarFunction.PreallocationCases > ==5357== Conditional jump or move depends on uninitialised value(s) > ==5357==at 0x51D69A6: void arrow::internal::TransferBitmap true>(unsigned char const*, long, long, long, unsigned char*) > (bit_util.cc:176) > ==5357==by 0x51CE866: arrow::internal::CopyBitmap(unsigned char const*, > long, long, unsigned char*, long, bool) (bit_util.cc:208) > ==5357==by 0x52B6325: > arrow::compute::detail::NullPropagator::PropagateSingle() (exec.cc:295) > ==5357==by 0x52B36D1: Execute (exec.cc:378) > ==5357==by 0x52B36D1: > arrow::compute::detail::PropagateNulls(arrow::compute::KernelContext*, > arrow::compute::ExecBatch const&, arrow::ArrayData*) (exec.cc:412) > ==5357==by 0x52BA7F3: ExecuteBatch (exec.cc:586) > ==5357==by 0x52BA7F3: > arrow::compute::detail::ScalarExecutor::Execute(std::vector std::allocator > const&, arrow::compute::detail::ExecListener*) > (exec.cc:542) > ==5357==by 0x52BC21F: > arrow::compute::Function::Execute(std::vector std::allocator > const&, arrow::compute::FunctionOptions > const*, arrow::compute::ExecContext*) const (function.cc:94) > ==5357==by 0x52B141C: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) > (exec.cc:937) > ==5357==by 0x52B16F2: > arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::vector > const&, > arrow::compute::ExecContext*) (exec.cc:942) > ==5357==by 0x155515: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody()::{lambda(std::__cxx11::basic_string std::char_traits, std::allocator > >)#1}::operator()(std::__cxx11::basic_string, > std::allocator >) const (exec_test.cc:756) > ==5357==by 0x156AF2: > arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody() > (exec_test.cc:786) > ==5357==by 0x5BE4862: void > testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357==by 0x5BDEDE2: void > testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in > /opt/conda/envs/arrow/lib/libgtest.so) > ==5357== > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8981) [C++][Dataset] Add support for compressed FileSources
[ https://issues.apache.org/jira/browse/ARROW-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned ARROW-8981: --- Assignee: Kazuaki Ishizaki > [C++][Dataset] Add support for compressed FileSources > - > > Key: ARROW-8981 > URL: https://issues.apache.org/jira/browse/ARROW-8981 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.17.1 >Reporter: Ben Kietzman >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: dataset > Fix For: 1.0.0 > > > FileSource::compression_ is currently ignored. Ideally files/buffers which > are compressed could be decompressed on read. See ARROW-8942 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8981) [C++][Dataset] Add support for compressed FileSources
[ https://issues.apache.org/jira/browse/ARROW-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned ARROW-8981: --- Assignee: (was: Kazuaki Ishizaki) > [C++][Dataset] Add support for compressed FileSources > - > > Key: ARROW-8981 > URL: https://issues.apache.org/jira/browse/ARROW-8981 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.17.1 >Reporter: Ben Kietzman >Priority: Major > Labels: dataset > Fix For: 1.0.0 > > > FileSource::compression_ is currently ignored. Ideally files/buffers which > are compressed could be decompressed on read. See ARROW-8942 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8982) [CI] Remove allow_failures for s390x in TravisCI
[ https://issues.apache.org/jira/browse/ARROW-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned ARROW-8982: --- Assignee: Kazuaki Ishizaki > [CI] Remove allow_failures for s390x in TravisCI > > > Key: ARROW-8982 > URL: https://issues.apache.org/jira/browse/ARROW-8982 > Project: Apache Arrow > Issue Type: Bug > Components: CI >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > > Now, all existing tests except Parquet pass on s390x. It is a good time to > remove {{allow_failures}} for s390x on TravisCI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8982) [CI] Remove allow_failures for s390x in TravisCI
Kazuaki Ishizaki created ARROW-8982: --- Summary: [CI] Remove allow_failures for s390x in TravisCI Key: ARROW-8982 URL: https://issues.apache.org/jira/browse/ARROW-8982 Project: Apache Arrow Issue Type: Bug Components: CI Reporter: Kazuaki Ishizaki Now, all existing tests except Parquet pass on s390x. It is a good time to remove {{allow_failures}} for s390x on TravisCI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type
[ https://issues.apache.org/jira/browse/ARROW-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-8647: --- Assignee: Ben Kietzman > [C++][Dataset] Optionally encode partition field values as dictionary type > -- > > Key: ARROW-8647 > URL: https://issues.apache.org/jira/browse/ARROW-8647 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, dataset-dask-integration > Fix For: 1.0.0 > > > In the Python ParquetDataset implementation, the partition fields are > returned as dictionary type columns. > In the new Dataset API, we now use a plain type (integer or string when > inferred). But, you can already manually specify that the partition keys > should be dictionary type by specifying the partitioning schema (in > {{Partitioning}} passed to the dataset factory). > Since using dictionary type can be more efficient (since partition keys will > typically be repeated values in the resulting table), it might be good to > still have an option in the DatasetFactory to use dictionary types for the > partition fields. > See also https://github.com/apache/arrow/pull/6303#discussion_r400622340 -- This message was sent by Atlassian Jira (v8.3.4#803005)
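For reference, the manual route the description mentions (declaring a partition field as dictionary type in an explicit partitioning schema) might look roughly like this in Python; the path and the field name are placeholders, not part of the issue:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition key as dictionary-typed up front. "key" and the
# dataset path are hypothetical placeholders.
part = ds.partitioning(
    pa.schema([("key", pa.dictionary(pa.int32(), pa.string()))]),
    flavor="hive",
)
dataset = ds.dataset("path/to/partitioned_data", format="parquet",
                     partitioning=part)
{code}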
[jira] [Commented] (ARROW-8942) [R] support read gzip csv files
[ https://issues.apache.org/jira/browse/ARROW-8942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119684#comment-17119684 ] Ben Kietzman commented on ARROW-8942: - Added https://issues.apache.org/jira/browse/ARROW-8981 to track this feature for datasets. > [R] support read gzip csv files > --- > > Key: ARROW-8942 > URL: https://issues.apache.org/jira/browse/ARROW-8942 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Dyfan Jones >Priority: Major > > Hi all, > Apologies if this has already been covered by another ticket. Is it possible > for arrow to read in compressed delimited files (for example gzip)? > Currently I get an error when trying to read in a compressed delimited file: > > {code:java} > vroom::vroom_write(iris, "iris.csv.gz", delim = ",") > arrow::read_csv_arrow("iris.csv.gz") > # Error in csv__TableReader_Read(self) : > # Invalid: CSV parse error: Expected 1 columns, got 4{code} > However, it can be read in by vroom and readr: > {code:java} > vroom::vroom("iris.csv.gz") > readr::read_csv("iris.csv.gz") > {code} > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8981) [C++][Dataset] Add support for compressed FileSources
Ben Kietzman created ARROW-8981: --- Summary: [C++][Dataset] Add support for compressed FileSources Key: ARROW-8981 URL: https://issues.apache.org/jira/browse/ARROW-8981 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.17.1 Reporter: Ben Kietzman Fix For: 1.0.0 FileSource::compression_ is currently ignored. Ideally files/buffers which are compressed could be decompressed on read. See ARROW-8942 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8942) [R] support read gzip csv files
[ https://issues.apache.org/jira/browse/ARROW-8942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119682#comment-17119682 ] Ben Kietzman commented on ARROW-8942: - [~npr] It's a goal to support compressed files but we haven't implemented anything yet. Files are tagged with a compression type tag (which is currently ignored) so that when they are opened/read they can be lazily decompressed. > [R] support read gzip csv files > --- > > Key: ARROW-8942 > URL: https://issues.apache.org/jira/browse/ARROW-8942 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Dyfan Jones >Priority: Major > > Hi all, > Apologies if this has already been covered by another ticket. Is it possible > for arrow to read in compressed delimited files (for example gzip)? > Currently I get an error when trying to read in a compressed delimited file: > > {code:java} > vroom::vroom_write(iris, "iris.csv.gz", delim = ",") > arrow::read_csv_arrow("iris.csv.gz") > # Error in csv__TableReader_Read(self) : > # Invalid: CSV parse error: Expected 1 columns, got 4{code} > However, it can be read in by vroom and readr: > {code:java} > vroom::vroom("iris.csv.gz") > readr::read_csv("iris.csv.gz") > {code} > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
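Until compressed inputs are handled natively, one workaround on the Python side is to decompress outside Arrow and hand the reader a file-like object. A minimal sketch, reusing the file name from the issue; the decompression here is done by Python's standard library, not by Arrow:
{code:python}
import gzip

import pyarrow.csv as pv

# Decompress with the standard library and pass the stream to the Arrow
# CSV reader, so the reader never sees the gzip framing.
with gzip.open("iris.csv.gz", "rb") as f:
    table = pv.read_csv(f)
print(table.num_rows)
{code}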
[jira] [Resolved] (ARROW-8975) [FlightRPC][C++] Fix flaky MacOS tests
[ https://issues.apache.org/jira/browse/ARROW-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8975. --- Resolution: Fixed Issue resolved by pull request 7298 [https://github.com/apache/arrow/pull/7298] > [FlightRPC][C++] Fix flaky MacOS tests > -- > > Key: ARROW-8975 > URL: https://issues.apache.org/jira/browse/ARROW-8975 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC >Affects Versions: 0.17.1 >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > The gRPC MacOS tests have been flaking again. > Looking at [https://github.com/grpc/grpc/issues/20311] they may possibly have > been fixed except [https://github.com/grpc/grpc/issues/13856] reports they > haven't (in some configurations?) so I will try a few things in CI, or just > disable the tests on MacOS. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6945) [Rust] Enable integration tests
[ https://issues.apache.org/jira/browse/ARROW-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119631#comment-17119631 ] Andy Grove commented on ARROW-6945: --- [~npr] Here is a log file showing output from a test run using the command "docker-compose run conda-integration". [^rust-it.log.tgz] > [Rust] Enable integration tests > --- > > Key: ARROW-6945 > URL: https://issues.apache.org/jira/browse/ARROW-6945 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: rust-it.log.tgz > > Time Spent: 3.5h > Remaining Estimate: 0h > > Use docker-compose to generate test files using the Java implementation and > then have Rust tests read them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6945) [Rust] Enable integration tests
[ https://issues.apache.org/jira/browse/ARROW-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119632#comment-17119632 ] Andy Grove commented on ARROW-6945: --- cc [~vertexclique] > [Rust] Enable integration tests > --- > > Key: ARROW-6945 > URL: https://issues.apache.org/jira/browse/ARROW-6945 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: rust-it.log.tgz > > Time Spent: 3.5h > Remaining Estimate: 0h > > Use docker-compose to generate test files using the Java implementation and > then have Rust tests read them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6945) [Rust] Enable integration tests
[ https://issues.apache.org/jira/browse/ARROW-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-6945: -- Attachment: rust-it.log.tgz > [Rust] Enable integration tests > --- > > Key: ARROW-6945 > URL: https://issues.apache.org/jira/browse/ARROW-6945 > Project: Apache Arrow > Issue Type: Sub-task > Components: Integration, Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: rust-it.log.tgz > > Time Spent: 3.5h > Remaining Estimate: 0h > > Use docker-compose to generate test files using the Java implementation and > then have Rust tests read them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8914) [C++][Gandiva] Decimal128 related test failed on big-endian platforms
[ https://issues.apache.org/jira/browse/ARROW-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-8914. --- Fix Version/s: 1.0.0 Resolution: Fixed > [C++][Gandiva] Decimal128 related test failed on big-endian platforms > - > > Key: ARROW-8914 > URL: https://issues.apache.org/jira/browse/ARROW-8914 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > These test failures in gandiva tests occur on big-endian platforms. An > example from https://travis-ci.org/github/apache/arrow/jobs/690006107#L2306 > {code} > ... > [==] 17 tests from 1 test case ran. (2334 ms total) > [ PASSED ] 7 tests. > [ FAILED ] 10 tests, listed below: > [ FAILED ] TestDecimal.TestSimple > [ FAILED ] TestDecimal.TestLiteral > [ FAILED ] TestDecimal.TestCompare > [ FAILED ] TestDecimal.TestRoundFunctions > [ FAILED ] TestDecimal.TestCastFunctions > [ FAILED ] TestDecimal.TestIsDistinct > [ FAILED ] TestDecimal.TestCastVarCharDecimal > [ FAILED ] TestDecimal.TestCastDecimalVarChar > [ FAILED ] TestDecimal.TestVarCharDecimalNestedCast > [ FAILED ] TestDecimal.TestCastDecimalOverflow > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8914) [C++][Gandiva] Decimal128 related test failed on big-endian platforms
[ https://issues.apache.org/jira/browse/ARROW-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-8914: -- Component/s: (was: C++ - Gandiva) C++ > [C++][Gandiva] Decimal128 related test failed on big-endian platforms > - > > Key: ARROW-8914 > URL: https://issues.apache.org/jira/browse/ARROW-8914 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > These test failures in gandiva tests occur on big-endian platforms. An > example from https://travis-ci.org/github/apache/arrow/jobs/690006107#L2306 > {code} > ... > [==] 17 tests from 1 test case ran. (2334 ms total) > [ PASSED ] 7 tests. > [ FAILED ] 10 tests, listed below: > [ FAILED ] TestDecimal.TestSimple > [ FAILED ] TestDecimal.TestLiteral > [ FAILED ] TestDecimal.TestCompare > [ FAILED ] TestDecimal.TestRoundFunctions > [ FAILED ] TestDecimal.TestCastFunctions > [ FAILED ] TestDecimal.TestIsDistinct > [ FAILED ] TestDecimal.TestCastVarCharDecimal > [ FAILED ] TestDecimal.TestCastDecimalVarChar > [ FAILED ] TestDecimal.TestVarCharDecimalNestedCast > [ FAILED ] TestDecimal.TestCastDecimalOverflow > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8964) Pyarrow: improve reading of partitioned parquet datasets whose schema changed
[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119598#comment-17119598 ] Ira Saktor commented on ARROW-8964: --- In an unrelated question: any chance you would also know how to write timestamps to parquet files so that I can then create an Impala table with timestamp-typed columns from the given parquet files? > Pyarrow: improve reading of partitioned parquet datasets whose schema changed > - > > Key: ARROW-8964 > URL: https://issues.apache.org/jira/browse/ARROW-8964 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow > 0.17.1 >Reporter: Ira Saktor >Priority: Major > > Hi there, I'm encountering the following issue when reading from HDFS: > > *My situation:* > I have a partitioned parquet dataset in HDFS, whose recent partitions contain > parquet files with more columns than the older ones. When I try to read data > using pyarrow.dataset.dataset and filter on recent data, I still get only the > columns that are also contained in the old parquet files. I'd like to somehow > merge the schema or use the schema from parquet files from which data ends up > being loaded. > *when using:* > `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', > filters = my_filter_expression).to_table().to_pandas()` > Is there a way to handle schema changes so that the read data > would contain all columns? > Everything works fine when I copy the needed parquet files into a separate > folder; however, it is a very inconvenient way of working. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
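On the side question about Impala: older Impala releases expect INT96-encoded timestamps, which the Parquet writer can still emit via a flag. A minimal sketch, assuming that is the Impala variant in use; the column and file names are placeholders:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write timestamps as INT96, the encoding older Impala releases read as
# TIMESTAMP. Whether the flag is needed depends on the Impala version.
df = pd.DataFrame({"ts": pd.to_datetime(["2020-05-29 12:00:00"])})
table = pa.Table.from_pandas(df)
pq.write_table(table, "ts.parquet", use_deprecated_int96_timestamps=True)
{code}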
[jira] [Commented] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119554#comment-17119554 ] Wes McKinney commented on ARROW-8976: - I plan to add Take and Filter "metafunctions" that deal with this and also Table/RecordBatch inputs > [C++] compute::CallFunction can't Filter/Take with ChunkedArray > --- > > Key: ARROW-8976 > URL: https://issues.apache.org/jira/browse/ARROW-8976 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Followup to ARROW-8938 > {{Invalid: Kernel does not support chunked array arguments}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
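Until such metafunctions exist, one workaround is to collapse the chunks into a single contiguous Array before invoking the kernel. A sketch at the Python level, not the planned C++ implementation; note it copies the data, which the metafunctions would presumably avoid:
{code:python}
import pyarrow as pa

# Concatenate the chunks so a single-array kernel can run on the result.
chunked = pa.chunked_array([[1, 2], [3, 4, 5]])
flat = pa.concat_arrays(chunked.chunks)
print(flat.take(pa.array([0, 4])))  # -> [1, 5]
{code}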
[jira] [Assigned] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-8976: --- Assignee: Wes McKinney > [C++] compute::CallFunction can't Filter/Take with ChunkedArray > --- > > Key: ARROW-8976 > URL: https://issues.apache.org/jira/browse/ARROW-8976 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > Followup to ARROW-8938 > {{Invalid: Kernel does not support chunked array arguments}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8980) [Python] Metadata grows exponentially when using schema from disk
[ https://issues.apache.org/jira/browse/ARROW-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Glasson updated ARROW-8980: - Description: When overwriting parquet files, we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (that I won't go into). Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed. Note: My solution was to remove `b'ARROW:schema'` data from the `schema.metadata`; this seems to stop the file size growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that the ARROW:schema is just the metadata serialised or something. I should also note that once the metadata gets too big, this leads to a buffer overflow in another part of the code ('thrift'), which was referenced here: https://issues.apache.org/jira/browse/PARQUET-1345
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pathlib
import sys


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 10})
    df.to_parquet(fname)

    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")
        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()
        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)
        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
{code}
{code:java}
(sdm) ➜ ~ python growing_metadata.py
python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Wrote test frame to test.pq
Size of test.pq: 1643
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 3637
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 8327
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 19301
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 44944
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 104815
{code}
was: When overwriting parquet files, we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (that I won't go into). Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed. Note: My solution was to remove `b'ARROW:schema'` data from the `schema.metadata`; this seems to stop the file size growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that the ARROW:schema is just the metadata serialised or something.
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pathlib
import sys


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 10})
    df.to_parquet(fname)

    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")
        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()
        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)
        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
{code}
{code:java}
(sdm) ➜ ~ python growing_metadata.py
[jira] [Created] (ARROW-8980) [Python] Metadata grows exponentially when using schema from disk
Kevin Glasson created ARROW-8980: Summary: [Python] Metadata grows exponentially when using schema from disk Key: ARROW-8980 URL: https://issues.apache.org/jira/browse/ARROW-8980 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Environment: python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57) [Clang 9.0.0 (tags/RELEASE_900/final)] pa version: 0.16.0 pd version: 0.25.2 Reporter: Kevin Glasson Attachments: growing_metadata.py, test.pq When overwriting parquet files, we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (that I won't go into). Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed. Note: My solution was to remove `b'ARROW:schema'` data from the `schema.metadata`; this seems to stop the file size growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that the ARROW:schema is just the metadata serialised or something.
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pathlib
import sys


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 10})
    df.to_parquet(fname)

    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")
        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()
        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)
        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
{code}
{code:java}
(sdm) ➜ ~ python growing_metadata.py
python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Wrote test frame to test.pq
Size of test.pq: 1643
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 3637
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 8327
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 19301
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 44944
Number of rows on disk: (10, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 104815
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
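The workaround the report describes (dropping the serialized b'ARROW:schema' entry before constructing the next writer) might look like this; a sketch only, keeping any other metadata such as the pandas block intact:
{code:python}
import pyarrow.parquet as pq

# Strip the serialized Arrow schema from the metadata so it is not
# re-embedded (and re-grown) on every rewrite; everything else is kept.
pf = pq.ParquetFile("test.pq")
schema = pf.schema.to_arrow_schema()
meta = {k: v for k, v in (schema.metadata or {}).items()
        if k != b"ARROW:schema"}
schema = schema.with_metadata(meta)
writer = pq.ParquetWriter("test.pq", schema=schema)
{code}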
[jira] [Comment Edited] (ARROW-8964) Pyarrow: improve reading of partitioned parquet datasets whose schema changed
[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119343#comment-17119343 ] Ira Saktor edited comment on ARROW-8964 at 5/29/20, 7:20 AM: - Awesome! Thank you, that's what I was looking for. Looking forward to ARROW-8221 being resolved :). Should I close this task, as it's pretty much a duplicate? was (Author: 1ira): Awesome! Thank you, that's what I was looking for. Looking forward to ARROW-8221 being resolved :) > Pyarrow: improve reading of partitioned parquet datasets whose schema changed > - > > Key: ARROW-8964 > URL: https://issues.apache.org/jira/browse/ARROW-8964 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow > 0.17.1 >Reporter: Ira Saktor >Priority: Major > > Hi there, I'm encountering the following issue when reading from HDFS: > > *My situation:* > I have a partitioned parquet dataset in HDFS, whose recent partitions contain > parquet files with more columns than the older ones. When I try to read data > using pyarrow.dataset.dataset and filter on recent data, I still get only the > columns that are also contained in the old parquet files. I'd like to somehow > merge the schema or use the schema from parquet files from which data ends up > being loaded. > *when using:* > `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', > filters = my_filter_expression).to_table().to_pandas()` > Is there a way to handle schema changes so that the read data > would contain all columns? > Everything works fine when I copy the needed parquet files into a separate > folder; however, it is a very inconvenient way of working. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8964) Pyarrow: improve reading of partitioned parquet datasets whose schema changed
[ https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119343#comment-17119343 ] Ira Saktor commented on ARROW-8964: --- Awesome! Thank you, that's what I was looking for. Looking forward to ARROW-8221 being resolved :) > Pyarrow: improve reading of partitioned parquet datasets whose schema changed > - > > Key: ARROW-8964 > URL: https://issues.apache.org/jira/browse/ARROW-8964 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.17.1 > Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow > 0.17.1 >Reporter: Ira Saktor >Priority: Major > > Hi there, I'm encountering the following issue when reading from HDFS: > > *My situation:* > I have a partitioned parquet dataset in HDFS, whose recent partitions contain > parquet files with more columns than the older ones. When I try to read data > using pyarrow.dataset.dataset and filter on recent data, I still get only the > columns that are also contained in the old parquet files. I'd like to somehow > merge the schema or use the schema from parquet files from which data ends up > being loaded. > *when using:* > `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', > filters = my_filter_expression).to_table().to_pandas()` > Is there a way to handle schema changes so that the read data > would contain all columns? > Everything works fine when I copy the needed parquet files into a separate > folder; however, it is a very inconvenient way of working. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
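Until ARROW-8221 lands, one interim route is to pass the unified schema explicitly to the dataset factory; columns absent from older files should then come back as nulls. A sketch with hand-written placeholder field names, reusing the path from the question:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder schema: list every column expected across old and new
# partitions; older files missing "new_col" should yield nulls for it.
schema = pa.schema([
    ("id", pa.int64()),
    ("value", pa.float64()),
    ("new_col", pa.string()),
])
dataset = ds.dataset("path_to_hdfs_directory", partitioning="hive",
                     schema=schema)
df = dataset.to_table().to_pandas()
{code}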