[jira] [Updated] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments
[ https://issues.apache.org/jira/browse/ARROW-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6155:
----------------------------------
    Labels: pull-request-available  (was: )

> [Java] Extract a super interface for vectors whose elements reside in continuous memory segments
>
>         Key: ARROW-6155
>         URL: https://issues.apache.org/jira/browse/ARROW-6155
>     Project: Apache Arrow
>  Issue Type: New Feature
>    Reporter: Liya Fan
>    Assignee: Liya Fan
>    Priority: Major
>      Labels: pull-request-available
>
> For vectors whose data elements reside in continuous memory segments, they should implement a common super interface. This will avoid unnecessary code branches.
> For now, such vectors include fixed-width vectors and variable-width vectors. In the future, there can be more vectors included.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Created] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments
Liya Fan created ARROW-6155:
-------------------------------
    Summary: [Java] Extract a super interface for vectors whose elements reside in continuous memory segments
        Key: ARROW-6155
        URL: https://issues.apache.org/jira/browse/ARROW-6155
    Project: Apache Arrow
 Issue Type: New Feature
   Reporter: Liya Fan
   Assignee: Liya Fan

For vectors whose data elements reside in continuous memory segments, they should implement a common super interface. This will avoid unnecessary code branches.

For now, such vectors include fixed-width vectors and variable-width vectors. In the future, there can be more vectors included.
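The proposal is a Java interface, but the branching it removes can be sketched in Python for illustration (the names here, such as ContiguousVector, are hypothetical; the eventual Arrow interface may differ): callers address an element by a start/end offset pair regardless of whether the vector is fixed-width or variable-width.

```python
from abc import ABC, abstractmethod

class ContiguousVector(ABC):
    """Hypothetical common supertype for vectors whose elements
    live in one contiguous data buffer."""

    @abstractmethod
    def element_start(self, index: int) -> int: ...

    @abstractmethod
    def element_end(self, index: int) -> int: ...

class FixedWidthVector(ContiguousVector):
    def __init__(self, width: int):
        self.width = width

    def element_start(self, index):
        return index * self.width

    def element_end(self, index):
        return (index + 1) * self.width

class VariableWidthVector(ContiguousVector):
    def __init__(self, offsets):
        self.offsets = offsets  # offsets buffer, len == n_elements + 1

    def element_start(self, index):
        return self.offsets[index]

    def element_end(self, index):
        return self.offsets[index + 1]

def element_slice(vec: ContiguousVector, data: bytes, i: int) -> bytes:
    # Callers no longer branch on the concrete vector type.
    return data[vec.element_start(i):vec.element_end(i)]
```

With the shared supertype, code like element_slice works for both vector families without instanceof-style branches, which is the stated goal of the issue.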
[jira] [Updated] (ARROW-6154) [Rust] Too many open files (os error 24)
[ https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield updated ARROW-6154:
-----------------------------------
    Summary: [Rust] Too many open files (os error 24)  (was: Too many open files (os error 24))

> [Rust] Too many open files (os error 24)
>
>         Key: ARROW-6154
>         URL: https://issues.apache.org/jira/browse/ARROW-6154
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>    Reporter: Yesh
>    Priority: Major
>
> Used the Rust {{parquet-read}} binary to read a deeply nested parquet file and saw the stack trace below. Unfortunately I won't be able to upload the file.
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource::new
>    8: as parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9: as parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16: core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main
> {code}
[jira] [Resolved] (ARROW-5772) [GLib][Plasma][CUDA] Plasma::Client#refer_object test is failed
[ https://issues.apache.org/jira/browse/ARROW-5772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yosuke Shiro resolved ARROW-5772.
---------------------------------
       Resolution: Fixed
    Fix Version/s: (was: 1.0.0)
                   0.15.0

Issue resolved by pull request 5004
[https://github.com/apache/arrow/pull/5004]

> [GLib][Plasma][CUDA] Plasma::Client#refer_object test is failed
>
>              Key: ARROW-5772
>              URL: https://issues.apache.org/jira/browse/ARROW-5772
>          Project: Apache Arrow
>       Issue Type: Bug
>       Components: GLib
> Affects Versions: 0.14.0
>         Reporter: Sutou Kouhei
>         Assignee: Sutou Kouhei
>         Priority: Major
>           Labels: pull-request-available
>          Fix For: 0.15.0
>
>       Time Spent: 1h
>       Remaining Estimate: 0h
>
> {noformat}
> /home/kou/work/cpp/arrow.kou/c_glib/test/plasma/test-plasma-client.rb:75:in `block (2 levels) in '
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:533:in `block in define_method'
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:616:in `invoke'
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:616:in `invoke'
> Error: test: options: GPU device(TestPlasmaClient::#create): Arrow::Error::Io: [plasma][client][refer-object]: IOError: Cuda Driver API call in ../src/arrow/gpu/cuda_context.cc at line 156 failed with code 208: cuIpcOpenMemHandle(, *handle, CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS)
> In ../src/arrow/gpu/cuda_context.cc, line 341, code: impl_->OpenIpcBuffer(ipc_handle, )
> In ../src/plasma/client.cc, line 586, code: context->OpenIpcBuffer(*object->ipc_handle, _handle->ptr)
> {noformat}
[jira] [Created] (ARROW-6154) Too many open files (os error 24)
Yesh created ARROW-6154:
---------------------------
    Summary: Too many open files (os error 24)
        Key: ARROW-6154
        URL: https://issues.apache.org/jira/browse/ARROW-6154
    Project: Apache Arrow
 Issue Type: Bug
 Components: Rust
   Reporter: Yesh

Used the Rust {{parquet-read}} binary to read a deeply nested parquet file and saw the stack trace below. Unfortunately I won't be able to upload the file.

{code:java}
stack backtrace:
   0: std::panicking::default_hook::{{closure}}
   1: std::panicking::default_hook
   2: std::panicking::rust_panic_with_hook
   3: std::panicking::continue_panic_fmt
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: core::result::unwrap_failed
   7: parquet::util::io::FileSource::new
   8: as parquet::file::reader::RowGroupReader>::get_column_page_reader
   9: as parquet::file::reader::RowGroupReader>::get_column_reader
  10: parquet::record::reader::TreeBuilder::reader_tree
  11: parquet::record::reader::TreeBuilder::reader_tree
  12: parquet::record::reader::TreeBuilder::reader_tree
  13: parquet::record::reader::TreeBuilder::reader_tree
  14: parquet::record::reader::TreeBuilder::reader_tree
  15: parquet::record::reader::TreeBuilder::build
  16: ::next
  17: parquet_read::main
  18: std::rt::lang_start::{{closure}}
  19: std::panicking::try::do_call
  20: __rust_maybe_catch_panic
  21: std::rt::lang_start_internal
  22: main
{code}
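The frame at FileSource::new suggests a file handle is opened per column chunk while walking the nested schema, which exhausts the per-process descriptor limit. Until the reader holds fewer handles open at once, raising the process limit is a plausible workaround. A sketch of inspecting and raising the soft limit from Python (POSIX only; the failing job itself is Rust, so this only illustrates the knob involved):

```python
import resource

# Soft limit is what "os error 24" trips on; it can be raised up to
# the hard limit without privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Target: the hard limit, or an arbitrary 4096 if the hard limit is unlimited.
target = hard if hard != resource.RLIM_INFINITY else 4096
if soft < target:
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass  # some sandboxes forbid changing limits; keep the old value
```

The shell equivalent is `ulimit -n`. This only masks the symptom; the underlying fix belongs in the Rust reader.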
[jira] [Updated] (ARROW-6142) [R] Install instructions on linux could be clearer
[ https://issues.apache.org/jira/browse/ARROW-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6142:
----------------------------------
    Labels: documentation pull-request-available  (was: documentation)

> [R] Install instructions on linux could be clearer
>
>              Key: ARROW-6142
>              URL: https://issues.apache.org/jira/browse/ARROW-6142
>          Project: Apache Arrow
>       Issue Type: Wish
>       Components: R
> Affects Versions: 0.14.1
>      Environment: Ubuntu 19.04
>         Reporter: Karl Dunkle Werner
>         Assignee: Neal Richardson
>         Priority: Minor
>           Labels: documentation, pull-request-available
>          Fix For: 0.15.0
>
> Installing R packages on Linux is almost always from source, which means Arrow needs some system dependencies. The existing help message (from arrow::install_arrow()) is very helpful in pointing that out, but it's still a heavy lift for users who install R packages from source but don't plan to develop Arrow itself.
> Here are a couple of things that could make things slightly smoother:
> # I would be very grateful if the install_arrow() message or installation page told me which libraries were essential to make the R package work.
> # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" instead of just "PPA" would have caused me less confusion. (Others may differ)
> # A snap package would be easier than installing a new apt address, but I understand that building for snap would be more packaging work and only benefits Ubuntu users.
>
> Thanks for making R bindings, and congratulations on the CRAN release!
[jira] [Created] (ARROW-6153) [R] Address parquet deprecation warning
Neal Richardson created ARROW-6153:
--------------------------------------
    Summary: [R] Address parquet deprecation warning
        Key: ARROW-6153
        URL: https://issues.apache.org/jira/browse/ARROW-6153
    Project: Apache Arrow
 Issue Type: Improvement
 Components: R
   Reporter: Neal Richardson
   Assignee: Romain François

[~wesmckinn] has been refactoring the Parquet C++ library, and there's now this deprecation warning appearing when I build the R package locally:

{code:java}
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -DNDEBUG -I/usr/local/include -DARROW_R_WITH_ARROW -I"/Users/enpiar/R/Rcpp/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -fPIC -Wall -g -O2 -c parquet.cpp -o parquet.o
parquet.cpp:66:23: warning: 'OpenFile' is deprecated: Deprecated since 0.15.0. Use FileReaderBuilder [-Wdeprecated-declarations]
  parquet::arrow::OpenFile(file, arrow::default_memory_pool(), *props, ));
                        ^
{code}
[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet
[ https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901508#comment-16901508 ]

Wes McKinney commented on ARROW-3246:
-------------------------------------
I created ARROW-6152 to cover the initial feature-preserving refactoring. I estimate about a day of effort for that; will report in once I make a little progress.

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
>
>         Key: ARROW-3246
>         URL: https://issues.apache.org/jira/browse/ARROW-3246
>     Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>    Reporter: Martin Durant
>    Assignee: Wes McKinney
>    Priority: Minor
>      Labels: parquet
>     Fix For: 1.0.0
>
> Parquet supports "dictionary encoding" of column data in a manner very similar to the concept of Categoricals in pandas. It is natural to use this encoding for a column which originated as a categorical. Conversely, when loading, if the file metadata says that a given column came from a pandas (or arrow) categorical, then we can trust that the whole of the column is dictionary-encoded and load the data directly into a categorical column, rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot hold, and we cannot assume either that the whole column is dictionary encoded or that the labels are the same throughout. In this case, the current behaviour is fine.
> (please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at [https://github.com/dask/fastparquet/issues/374] as a feature that is useful in fastparquet)
[jira] [Created] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
Wes McKinney created ARROW-6152:
-----------------------------------
    Summary: [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
        Key: ARROW-6152
        URL: https://issues.apache.org/jira/browse/ARROW-6152
    Project: Apache Arrow
 Issue Type: Bug
 Components: C++
   Reporter: Wes McKinney
    Fix For: 0.15.0

This is an initial refactoring task to enable the Arrow write layer to access some of the internal implementation details of {{parquet::TypedColumnWriter}}. See discussion in ARROW-3246.
[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet
[ https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901506#comment-16901506 ]

Wes McKinney commented on ARROW-3246:
-------------------------------------
I've been looking at what's required to write {{arrow::DictionaryArray}} directly into the appropriate lower-level ColumnWriter class. The trouble with the way the software is layered right now is that there is a "Chinese wall" between {{TypedColumnWriter}} and the Arrow write layer. We can only communicate with this class using the Parquet C types such as {{ByteArray}} and {{FixedLenByteArray}}. This is also a performance issue, since we cannot write directly into the writer from {{arrow::BinaryArray}} or similar cases where it might make sense.

I think the only way to fix the current situation is to add a {{TypedColumnWriter::WriteArrow(const ::arrow::Array&)}} method and "push down" a lot of the logic that's currently in parquet/arrow/writer.cc into the {{TypedColumnWriter}} implementation. This will enable us to do various write performance optimizations and also address the direct dictionary write issue.

This is not a small project, but I would say that it's overdue and will put us on a better footing going forward.

cc [~xhochy] [~hatem] for any thoughts

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
>
>         Key: ARROW-3246
>         URL: https://issues.apache.org/jira/browse/ARROW-3246
>     Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>    Reporter: Martin Durant
>    Assignee: Wes McKinney
>    Priority: Minor
>      Labels: parquet
>     Fix For: 1.0.0
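The correspondence the issue relies on can be sketched in plain Python: dictionary encoding stores a small dictionary plus integer codes, which is essentially the (categories, codes) representation of a pandas Categorical, so a fully dictionary-encoded column can be kept in that form instead of taking the decode-then-recategorise round trip.

```python
def dictionary_encode(values):
    """Encode a sequence as (dictionary, codes), the idea shared by
    Parquet dictionary encoding and pandas Categoricals."""
    dictionary, codes, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        codes.append(seen[v])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    # The "expand the labels upon load" step the issue wants to skip.
    return [dictionary[c] for c in codes]
```

When the file metadata guarantees that the whole column is dictionary-encoded with one consistent dictionary, the (dictionary, codes) pair can be handed to the categorical type directly; decode followed by re-encode would only rebuild the same two structures.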
[jira] [Commented] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation
[ https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901471#comment-16901471 ]

Wes McKinney commented on ARROW-6131:
-------------------------------------
In principle this seems OK to me. We can discuss further in a PR.

> [C++] Optimize the Arrow UTF-8-string-validation
>
>         Key: ARROW-6131
>         URL: https://issues.apache.org/jira/browse/ARROW-6131
>     Project: Apache Arrow
>  Issue Type: Improvement
>    Reporter: Yuqi Gu
>    Assignee: Yuqi Gu
>    Priority: Major
>
> The new algorithm comes from: https://github.com/cyb70289/utf8 (MIT LICENSE)
> Range-based algorithm:
> 1. Map each byte of the input string to a range table.
> 2. Leverage the Neon 'tbl' instruction to look up the table.
> 3. Find the pattern and set the correct table index for each input byte.
> 4. Validate the input string.
> The algorithm would give a ~1.6x speedup for the LargeNonAscii and SmallNonAscii cases, but it would slow down the All-Ascii cases (where the input data is entirely ASCII).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, data that is all ASCII is unusual on the internet. Could you please tell me what the use-case scenario is for Apache Arrow? Is the Arrow data that needs to be validated all-ASCII strings?
> If not, I'd like to submit the patch to accelerate the non-ASCII validation. As for all-ASCII validation, I would like to propose another optimization with SIMD in another JIRA.
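For reference, here is a scalar Python version of the byte-range classification that the patch vectorizes (ranges per the UTF-8 definition; the SIMD version replaces this branching with table lookups, e.g. via the Neon {{tbl}} instruction):

```python
def validate_utf8(data: bytes) -> bool:
    """Scalar UTF-8 validation using first-byte range classes."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                       # ASCII fast path
            i += 1
            continue
        # Classify the lead byte: number of continuation bytes and
        # the allowed range of the FIRST continuation byte.
        if 0xC2 <= b <= 0xDF:   need, first = 1, (0x80, 0xBF)
        elif b == 0xE0:         need, first = 2, (0xA0, 0xBF)  # no overlongs
        elif b == 0xED:         need, first = 2, (0x80, 0x9F)  # no surrogates
        elif 0xE1 <= b <= 0xEF: need, first = 2, (0x80, 0xBF)
        elif b == 0xF0:         need, first = 3, (0x90, 0xBF)
        elif 0xF1 <= b <= 0xF3: need, first = 3, (0x80, 0xBF)
        elif b == 0xF4:         need, first = 3, (0x80, 0x8F)  # <= U+10FFFF
        else:
            return False                   # 0x80..0xC1, 0xF5..0xFF
        if i + need >= n:
            return False                   # truncated sequence
        lo, hi = first
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + 1 + need):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += 1 + need
    return True
```

An all-ASCII input never leaves the first branch, which is why a vectorized range check can regress that case relative to a dedicated ASCII scan.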
[jira] [Commented] (ARROW-6151) [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information
[ https://issues.apache.org/jira/browse/ARROW-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901434#comment-16901434 ]

Neal Richardson commented on ARROW-6151:
----------------------------------------
Me too. This was discussed here: [https://github.com/apache/arrow/pull/4908#discussion_r306586276]

I first proposed a shim NOTICE file in the R package that contained a link to the official NOTICE. That would keep a single source of truth for the NOTICE and still satisfy the CRAN request. When you requested the full NOTICE be included, I copied it there, and then added to the {{r/Makefile}} [https://github.com/apache/arrow/blob/master/r/Makefile#L33], which copies the NOTICE file every time {{make build}} is run, which happens when checking locally.

Given the constraints, this seemed like the best option: it ensures that the full NOTICE file is included in the R package without relying on human discipline to manually copy it in every time, while also providing a mechanism by which it gets synced with the official version.

> [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information
>
>         Key: ARROW-6151
>         URL: https://issues.apache.org/jira/browse/ARROW-6151
>     Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>    Reporter: Wes McKinney
>    Priority: Major
>
> I noticed this file -- I am concerned about its maintainability.
[jira] [Created] (ARROW-6151) [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information
Wes McKinney created ARROW-6151:
-----------------------------------
    Summary: [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information
        Key: ARROW-6151
        URL: https://issues.apache.org/jira/browse/ARROW-6151
    Project: Apache Arrow
 Issue Type: Improvement
 Components: R
   Reporter: Wes McKinney

I noticed this file -- I am concerned about its maintainability.
[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901399#comment-16901399 ]

Wes McKinney commented on ARROW-6150:
-------------------------------------
The RPC port depends on your Hadoop configuration. You can generally find it in the HDFS web UI. {{pyarrow.hdfs.connect}} will also try to use the defaults in {{core-site.xml}} (I think) to connect.

> [Python] Intermittent HDFS error
>
>         Key: ARROW-6150
>         URL: https://issues.apache.org/jira/browse/ARROW-6150
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Affects Versions: 0.14.1
>    Reporter: Saurabh Bajaj
>    Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in the traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below, not every run, only sometimes. I'm unable to determine the root cause of this issue.
> {code}
> File "/extractor.py", line 87, in __call__
>   json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
> File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
> File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
> {code}
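The {{fs.defaultFS}} property in {{core-site.xml}} is where the namenode host and RPC port usually live. A stdlib sketch of pulling them out; the sample XML below is made up for illustration:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical core-site.xml contents; on a real cluster, read the
# file from the Hadoop configuration directory instead.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""

def default_hdfs_endpoint(xml_text: str):
    """Return (host, port) from the fs.defaultFS property, if present."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.defaultFS":
            url = urlparse(prop.findtext("value"))
            return url.hostname, url.port
    return None, None
```

Comparing the result against what is being passed to the connect call is a quick way to rule out the wrong-RPC-port explanation the error message suggests.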
[jira] [Comment Edited] (ARROW-6150) [Python] Intermittent HDFS error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901397#comment-16901397 ]

Saurabh Bajaj edited comment on ARROW-6150 at 8/6/19 7:12 PM:
--------------------------------------------------------------
I tried setting port=8020 in pa.hdfs.connect(), but same intermittent errors.

was (Author: sbajaj):
I tried setting `port=8020` in `pa.hdfs.connect()`, but same intermittent errors.

> [Python] Intermittent HDFS error
>
>         Key: ARROW-6150
>         URL: https://issues.apache.org/jira/browse/ARROW-6150
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Affects Versions: 0.14.1
>    Reporter: Saurabh Bajaj
>    Priority: Minor
[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901397#comment-16901397 ]

Saurabh Bajaj commented on ARROW-6150:
--------------------------------------
I tried setting `port=8020` in `pa.hdfs.connect()`, but same intermittent errors.

> [Python] Intermittent HDFS error
>
>         Key: ARROW-6150
>         URL: https://issues.apache.org/jira/browse/ARROW-6150
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Affects Versions: 0.14.1
>    Reporter: Saurabh Bajaj
>    Priority: Minor
[jira] [Updated] (ARROW-6150) [Python] Intermittent HDFS error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-6150:
--------------------------------
    Summary: [Python] Intermittent HDFS error  (was: Intermittent Pyarrow HDFS IO error)

> [Python] Intermittent HDFS error
>
>         Key: ARROW-6150
>         URL: https://issues.apache.org/jira/browse/ARROW-6150
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Affects Versions: 0.14.1
>    Reporter: Saurabh Bajaj
>    Priority: Minor
[jira] [Commented] (ARROW-6150) Intermittent Pyarrow HDFS IO error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901394#comment-16901394 ]

Saurabh Bajaj commented on ARROW-6150:
--------------------------------------
[~wesmckinn] Thanks for your response! I found https://issues.apache.org/jira/browse/ARROW-3957 and the PR that addresses it: [https://github.com/apache/arrow/commit/758bd557584107cb336cbc3422744dacd93978af]. Seems like the cause of the issue is an incorrect port? The default to {{pa.hdfs.connect()}} is {{port=0}}. What would be the correct port to use?

> Intermittent Pyarrow HDFS IO error
>
>         Key: ARROW-6150
>         URL: https://issues.apache.org/jira/browse/ARROW-6150
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Affects Versions: 0.14.1
>    Reporter: Saurabh Bajaj
>    Priority: Minor
[jira] [Commented] (ARROW-6150) Intermittent Pyarrow HDFS IO error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901386#comment-16901386 ]

Wes McKinney commented on ARROW-6150:
-------------------------------------
Our use of libhdfs is pretty straightforward, so the issue seems unlikely to be caused by a bug in the Arrow implementation. I've seen other reports of the errno 255; they might give some clue about what could be wrong with the job. If you find anything out (or a way to reliably reproduce), let us know.

> Intermittent Pyarrow HDFS IO error
>
>         Key: ARROW-6150
>         URL: https://issues.apache.org/jira/browse/ARROW-6150
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Affects Versions: 0.14.1
>    Reporter: Saurabh Bajaj
>    Priority: Minor
[jira] [Resolved] (ARROW-6084) [Python] Support LargeList
[ https://issues.apache.org/jira/browse/ARROW-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-6084.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 0.15.0

Issue resolved by pull request 4979
[https://github.com/apache/arrow/pull/4979]

> [Python] Support LargeList
>
>         Key: ARROW-6084
>         URL: https://issues.apache.org/jira/browse/ARROW-6084
>     Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>    Reporter: Antoine Pitrou
>    Assignee: Antoine Pitrou
>    Priority: Major
>      Labels: pull-request-available
>     Fix For: 0.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
[jira] [Created] (ARROW-6150) Intermittent Pyarrow HDFS IO error
Saurabh Bajaj created ARROW-6150:
------------------------------------
    Summary: Intermittent Pyarrow HDFS IO error
        Key: ARROW-6150
        URL: https://issues.apache.org/jira/browse/ARROW-6150
    Project: Apache Arrow
 Issue Type: Bug
 Components: Python
Affects Versions: 0.14.1
   Reporter: Saurabh Bajaj

I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in the traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below, not every run, only sometimes. I'm unable to determine the root cause of this issue.

{code}
File "/extractor.py", line 87, in __call__
  json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
{code}
[jira] [Closed] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API
[ https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saurabh Bajaj closed ARROW-5922.
--------------------------------
    Resolution: Works for Me

> [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API
>
>         Key: ARROW-5922
>         URL: https://issues.apache.org/jira/browse/ARROW-5922
>     Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Affects Versions: 0.14.0
>      Environment: Unix
>    Reporter: Saurabh Bajaj
>    Priority: Major
>     Fix For: 0.14.0
>
> Here's what I'm trying:
> {code}
> import pyarrow as pa
> conf = {"hadoop.security.authentication": "kerberos"}
> fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> {code}
> However, when I submit this job to the cluster using {{Dask-YARN}}, I get the following error:
> {code}
> File "test/run.py", line 3
>   fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
> File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
> File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
> File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS connection failed
> {code}
> I also tried setting {{host}} (to a name node) and {{port}} (=8020), however I run into the same error. Since the error is not descriptive, I'm not sure which setting needs to be altered. Any clues anyone?
[jira] [Resolved] (ARROW-6088) [Rust] [DataFusion] Implement parallel execution for projection
[ https://issues.apache.org/jira/browse/ARROW-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove resolved ARROW-6088.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 0.15.0

Issue resolved by pull request 4988
[https://github.com/apache/arrow/pull/4988]

> [Rust] [DataFusion] Implement parallel execution for projection
>
>         Key: ARROW-6088
>         URL: https://issues.apache.org/jira/browse/ARROW-6088
>     Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>    Reporter: Andy Grove
>    Assignee: Andy Grove
>    Priority: Major
>      Labels: pull-request-available
>     Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
[jira] [Updated] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct
[ https://issues.apache.org/jira/browse/ARROW-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Felton updated ARROW-6149: - Affects Version/s: 0.14.1 > [Parquet] Decimal comparisons used for min/max statistics are not correct > - > > Key: ARROW-6149 > URL: https://issues.apache.org/jira/browse/ARROW-6149 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Philip Felton >Priority: Major > > The [Parquet Format > specifications|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] > says > bq. If the column uses int32 or int64 physical types, then signed comparison > of the integer values produces the correct ordering. If the physical type is > fixed, then the correct ordering can be produced by flipping the > most-significant bit in the first byte and then using unsigned byte-wise > comparison. > However this isn't followed in the C++ Parquet code. 16-byte decimal > comparison is implemented using a lexicographical comparison of signed chars. > This appears to be because the function > [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183] > just goes off the sort_order (signed) and physical_type > (FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5977: -- Labels: csv pull-request-available (was: csv) > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Assignee: Antoine Pitrou >Priority: Major > Labels: csv, pull-request-available > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6055) [C++] Refactor arrow/io/hdfs.h to use common FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901210#comment-16901210 ] Benjamin Kietzman commented on ARROW-6055: -- In addition to removing io::FileSystem and io::FileStatistics, should HdfsPathInfo be replaced with fs::FileStats? It carries more information than fs::FileStats: last access time in addition to last modified time (though in seconds rather than ns since the epoch), block size, replication, and permissions > [C++] Refactor arrow/io/hdfs.h to use common FileSystem API > --- > > Key: ARROW-6055 > URL: https://issues.apache.org/jira/browse/ARROW-6055 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Benjamin Kietzman >Priority: Major > Fix For: 1.0.0 > > > As part of this refactor, the FileSystem-related classes in > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.h#L51 > should be removed. The files should probably be moved also to arrow/filesystem -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-5977: - Assignee: Antoine Pitrou > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Assignee: Antoine Pitrou >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct
[ https://issues.apache.org/jira/browse/ARROW-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Felton updated ARROW-6149: - Description: The [Parquet Format specifications|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] says bq. If the column uses int32 or int64 physical types, then signed comparison of the integer values produces the correct ordering. If the physical type is fixed, then the correct ordering can be produced by flipping the most-significant bit in the first byte and then using unsigned byte-wise comparison. However this isn't followed in the C++ Parquet code. 16-byte decimal comparison is implemented using a lexicographical comparison of signed chars. This appears to be because the function [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183] just goes off the sort_order (signed) and physical_type (FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal. was: The [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md|Parquet Format specifications] says bq. If the column uses int32 or int64 physical types, then signed comparison of the integer values produces the correct ordering. If the physical type is fixed, then the correct ordering can be produced by flipping the most-significant bit in the first byte and then using unsigned byte-wise comparison. However this isn't followed in the C++ Parquet code. 16-byte decimal comparison is implemented using a lexicographical comparison of signed chars. This appears to be because the function [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183] just goes off the sort_order (signed) and physical_type (FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal. 
> [Parquet] Decimal comparisons used for min/max statistics are not correct > - > > Key: ARROW-6149 > URL: https://issues.apache.org/jira/browse/ARROW-6149 > Project: Apache Arrow > Issue Type: Bug >Reporter: Philip Felton >Priority: Major > > The [Parquet Format > specifications|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] > says > bq. If the column uses int32 or int64 physical types, then signed comparison > of the integer values produces the correct ordering. If the physical type is > fixed, then the correct ordering can be produced by flipping the > most-significant bit in the first byte and then using unsigned byte-wise > comparison. > However this isn't followed in the C++ Parquet code. 16-byte decimal > comparison is implemented using a lexicographical comparison of signed chars. > This appears to be because the function > [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183] > just goes off the sort_order (signed) and physical_type > (FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct
Philip Felton created ARROW-6149: Summary: [Parquet] Decimal comparisons used for min/max statistics are not correct Key: ARROW-6149 URL: https://issues.apache.org/jira/browse/ARROW-6149 Project: Apache Arrow Issue Type: Bug Reporter: Philip Felton The [Parquet Format specifications|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] says bq. If the column uses int32 or int64 physical types, then signed comparison of the integer values produces the correct ordering. If the physical type is fixed, then the correct ordering can be produced by flipping the most-significant bit in the first byte and then using unsigned byte-wise comparison. However this isn't followed in the C++ Parquet code. 16-byte decimal comparison is implemented using a lexicographical comparison of signed chars. This appears to be because the function [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183] just goes off the sort_order (signed) and physical_type (FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
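The comparison rule the spec quote above mandates for fixed-length decimals (flip the most-significant bit of the first byte, then compare the bytes unsigned) can be sketched in Python. This is an illustration of the rule only, not the Arrow C++ implementation; the helper names are invented for the example:

```python
def decimal_sort_key(raw: bytes) -> bytes:
    # Spec rule: flip the sign bit of the first byte; the remaining bytes
    # already order correctly under unsigned byte-wise comparison.
    return bytes([raw[0] ^ 0x80]) + raw[1:]

def to_fixed(value: int, width: int = 16) -> bytes:
    # Two's-complement big-endian encoding, as Parquet stores decimals
    # in a 16-byte FIXED_LEN_BYTE_ARRAY.
    return value.to_bytes(width, byteorder="big", signed=True)

values = [-10**20, -1, 0, 1, 10**20]
encoded = [to_fixed(v) for v in values]

# Raw byte-wise comparison of the two's-complement buffers misorders the
# sign: negative values (first byte >= 0x80) sort after positive ones.
assert sorted(encoded) != encoded

# After the sign-bit flip, byte-wise order matches numeric order.
assert sorted(encoded, key=decimal_sort_key) == encoded
```

The flip works because two's complement already orders values within each sign byte-wise; inverting the sign bit just moves negatives below non-negatives.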
[jira] [Assigned] (ARROW-6055) [C++] Refactor arrow/io/hdfs.h to use common FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Kietzman reassigned ARROW-6055: Assignee: Benjamin Kietzman > [C++] Refactor arrow/io/hdfs.h to use common FileSystem API > --- > > Key: ARROW-6055 > URL: https://issues.apache.org/jira/browse/ARROW-6055 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Benjamin Kietzman >Priority: Major > Fix For: 1.0.0 > > > As part of this refactor, the FileSystem-related classes in > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.h#L51 > should be removed. The files should probably be moved also to arrow/filesystem -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6055) [C++] Refactor arrow/io/hdfs.h to use common FileSystem API
[ https://issues.apache.org/jira/browse/ARROW-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901168#comment-16901168 ] Benjamin Kietzman commented on ARROW-6055: -- [~wesmckinn] should io::FileSystem be deprecated or just deleted? > [C++] Refactor arrow/io/hdfs.h to use common FileSystem API > --- > > Key: ARROW-6055 > URL: https://issues.apache.org/jira/browse/ARROW-6055 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > As part of this refactor, the FileSystem-related classes in > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.h#L51 > should be removed. The files should probably be moved also to arrow/filesystem -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901145#comment-16901145 ] Neal Richardson commented on ARROW-5977: Right, that was the other option that I said that all of the R readers support. We could support both ways, but the benefits of a separate vector of column names argument would be (1) you don't have to specify types for all other columns; (2) you wouldn't have to know the names of the other columns; and (3) you can't specify desired column order in the column_types because it is a map. > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901142#comment-16901142 ] Francois Saint-Jacques commented on ARROW-5977: --- Can we re-use the `ConverterOptions.column_types` for the same purpose? The field would: # Select columns # Possibly give a type hint. > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901141#comment-16901141 ] Neal Richardson commented on ARROW-5977: Yeah I agree that just an {{include_columns}} argument is fine. Any other sugar we want to have for selecting or deselecting columns can be handled at the Dataset layer (and thus be available regardless of the underlying storage type). Null column probably makes sense if the column does not exist in the CSV. > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5766) [Python] Unpin jpype1 version
[ https://issues.apache.org/jira/browse/ARROW-5766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901139#comment-16901139 ] Antoine Pitrou commented on ARROW-5766: --- [~xhochy] > [Python] Unpin jpype1 version > - > > Key: ARROW-5766 > URL: https://issues.apache.org/jira/browse/ARROW-5766 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > According to the discussion in > https://github.com/conda-forge/jpype1-feedstock/issues/8 there are some > changes that we must make to our code to stay on the released version of > jpype1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901137#comment-16901137 ] Antoine Pitrou commented on ARROW-5977: --- So, to make things clear, a column in {{include_columns}} but not in the CSV file should produce a null column rather than emit an error? > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6142) [R] Install instructions on linux could be clearer
[ https://issues.apache.org/jira/browse/ARROW-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901135#comment-16901135 ] Neal Richardson commented on ARROW-6142: I've revised the README along these lines in [https://github.com/apache/arrow/pull/4948] but didn't touch {{install_arrow}} so I'll take a look here. > [R] Install instructions on linux could be clearer > -- > > Key: ARROW-6142 > URL: https://issues.apache.org/jira/browse/ARROW-6142 > Project: Apache Arrow > Issue Type: Wish > Components: R >Affects Versions: 0.14.1 > Environment: Ubuntu 19.04 >Reporter: Karl Dunkle Werner >Assignee: Neal Richardson >Priority: Minor > Labels: documentation > Fix For: 0.15.0 > > > Installing R packages on Linux is almost always from source, which means > Arrow needs some system dependencies. The existing help message (from > arrow::install_arrow()) is very helpful in pointing that out, but it's still > a heavy lift for users who install R packages from source but don't plan to > develop Arrow itself. > Here are a couple of things that could make things slightly smoother: > # I would be very grateful if the install_arrow() message or installation > page told me which libraries were essential to make the R package work. > # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on > launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" > instead of just "PPA" would have caused me less confusion. (Others may differ) > # A snap package would be easier than installing a new apt address, but I > understand that building for snap would be more packaging work and only > benefits Ubuntu users. > > Thanks for making R bindings, and congratulations on the CRAN release! -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901133#comment-16901133 ] Wes McKinney commented on ARROW-5977: - I think just include is okay. It might make sense to co-develop this in conjunction with the Datasets interface to CSV files (since this needs to be able to select columns as well as insert missing fields -- which become all null -- this can happen as a post-scan operation though) > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-6142) [R] Install instructions on linux could be clearer
[ https://issues.apache.org/jira/browse/ARROW-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6142: -- Assignee: Neal Richardson > [R] Install instructions on linux could be clearer > -- > > Key: ARROW-6142 > URL: https://issues.apache.org/jira/browse/ARROW-6142 > Project: Apache Arrow > Issue Type: Wish > Components: R >Affects Versions: 0.14.1 > Environment: Ubuntu 19.04 >Reporter: Karl Dunkle Werner >Assignee: Neal Richardson >Priority: Minor > Labels: documentation > Fix For: 0.15.0 > > > Installing R packages on Linux is almost always from source, which means > Arrow needs some system dependencies. The existing help message (from > arrow::install_arrow()) is very helpful in pointing that out, but it's still > a heavy lift for users who install R packages from source but don't plan to > develop Arrow itself. > Here are a couple of things that could make things slightly smoother: > # I would be very grateful if the install_arrow() message or installation > page told me which libraries were essential to make the R package work. > # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on > launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" > instead of just "PPA" would have caused me less confusion. (Others may differ) > # A snap package would be easier than installing a new apt address, but I > understand that building for snap would be more packaging work and only > benefits Ubuntu users. > > Thanks for making R bindings, and congratulations on the CRAN release! -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6039) [GLib] Add garrow_array_filter()
[ https://issues.apache.org/jira/browse/ARROW-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6039: -- Labels: pull-request-available (was: ) > [GLib] Add garrow_array_filter() > > > Key: ARROW-6039 > URL: https://issues.apache.org/jira/browse/ARROW-6039 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > Add bindings of a boolean selection filter. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901126#comment-16901126 ] Antoine Pitrou commented on ARROW-5977: --- Ok, so what kind of ergonomics would you favour? Simply a {{include_columns}} vector of strings? And/or a {{exclude_columns}} vector as well? > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901124#comment-16901124 ] Neal Richardson commented on ARROW-5977: All of R's main CSV readers support this. One way they all expose this is by allowing you to provide a null type for some columns when you specify their types explicitly. A couple of the readers allow you to specify columns by name or position to keep or drop. I think this is a good idea not just in the context of reading a CSV itself but also for the Datasets framework, where we are lazily reading chunks of data as needed and trying to be efficient with memory usage. > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6039) [GLib] Add garrow_array_filter()
[ https://issues.apache.org/jira/browse/ARROW-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yosuke Shiro updated ARROW-6039: Fix Version/s: (was: 1.0.0) 0.15.0 > [GLib] Add garrow_array_filter() > > > Key: ARROW-6039 > URL: https://issues.apache.org/jira/browse/ARROW-6039 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Fix For: 0.15.0 > > > Add bindings of a boolean selection filter. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?
[ https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901074#comment-16901074 ] Antoine Pitrou commented on ARROW-5977: --- [~npr] Ping. > [C++] [Python] Method for read_csv to limit which columns are read? > --- > > Key: ARROW-5977 > URL: https://issues.apache.org/jira/browse/ARROW-5977 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.14.0 >Reporter: Jordan Samuels >Priority: Major > Labels: csv > > In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this > in pyarrow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
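The semantics the thread above converges on (an {{include_columns}} list of names, where a requested column absent from the file becomes an all-null column rather than an error) can be sketched with a toy stdlib reader. The function name and return shape here are illustrative only, not pyarrow's actual API:

```python
import csv
import io

def read_csv_include(data: str, include_columns):
    # Toy sketch of the proposed option: keep only the requested columns,
    # in the requested order; a column missing from the file materializes
    # as an all-null column instead of raising an error.
    rows = list(csv.reader(io.StringIO(data)))
    header, body = rows[0], rows[1:]
    result = {}
    for name in include_columns:
        if name in header:
            idx = header.index(name)
            result[name] = [row[idx] for row in body]
        else:
            result[name] = [None] * len(body)
    return result

table = read_csv_include("a,b,c\n1,2,3\n4,5,6\n", ["c", "a", "missing"])
assert list(table) == ["c", "a", "missing"]   # requested order preserved
assert table["c"] == ["3", "6"]
assert table["missing"] == [None, None]       # null column, no error
```

Taking the column order from the list rather than from a type map addresses Neal's point (3) that a map like column_types cannot express a desired output order.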
[jira] [Assigned] (ARROW-6148) Missing debian build dependencies
[ https://issues.apache.org/jira/browse/ARROW-6148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcin Juszkiewicz reassigned ARROW-6148: - Assignee: Marcin Juszkiewicz > Missing debian build dependencies > - > > Key: ARROW-6148 > URL: https://issues.apache.org/jira/browse/ARROW-6148 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging >Reporter: Francois Saint-Jacques >Assignee: Marcin Juszkiewicz >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6148) Missing debian build dependencies
[ https://issues.apache.org/jira/browse/ARROW-6148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6148: -- Labels: pull-request-available (was: ) > Missing debian build dependencies > - > > Key: ARROW-6148 > URL: https://issues.apache.org/jira/browse/ARROW-6148 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging >Reporter: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6148) Missing debian build dependencies
Francois Saint-Jacques created ARROW-6148: - Summary: Missing debian build dependencies Key: ARROW-6148 URL: https://issues.apache.org/jira/browse/ARROW-6148 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Francois Saint-Jacques -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6147) [Go] implement a Flight client
Sebastien Binet created ARROW-6147: -- Summary: [Go] implement a Flight client Key: ARROW-6147 URL: https://issues.apache.org/jira/browse/ARROW-6147 Project: Apache Arrow Issue Type: New Feature Reporter: Sebastien Binet -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6107) [Go] ipc.Writer Option to skip appending data buffers
[ https://issues.apache.org/jira/browse/ARROW-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900964#comment-16900964 ] Sebastien Binet commented on ARROW-6107: ok. (just nit-picking but to really assess the CGo overhead, one should directly call C, not C++-via-python :P. that said, it's a nice PoC.) SGTM. > [Go] ipc.Writer Option to skip appending data buffers > - > > Key: ARROW-6107 > URL: https://issues.apache.org/jira/browse/ARROW-6107 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Nick Poorman >Priority: Minor > > For cases where we have a known shared memory region, it would be great if > the ipc.Writer (and by extension ipc.Reader?) had the ability to write out > everything but the actual buffers holding the data. That way we can still > utilize the ipc mechanisms to communicate without having to serialize all the > underlying data across the wire. > > This seems like it should be possible since the `RecordBatch` flatbuffers > only contain the metadata and the underlying data buffers are appended later. > We just need to skip appending the underlying data buffers. > > [~sbinet] thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6146) [Go] implement a Plasma client
Sebastien Binet created ARROW-6146: -- Summary: [Go] implement a Plasma client Key: ARROW-6146 URL: https://issues.apache.org/jira/browse/ARROW-6146 Project: Apache Arrow Issue Type: New Feature Reporter: Sebastien Binet -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly
[ https://issues.apache.org/jira/browse/ARROW-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6145: -- Labels: pull-request-available (was: ) > [Java] UnionVector created by MinorType#getNewVector could not keep field > type info properly > > > Key: ARROW-6145 > URL: https://issues.apache.org/jira/browse/ARROW-6145 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > > While working on other items, I found {{UnionVector}} created by > {{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} could > not keep field type info properly. For example, if we set metadata in > {{Field}} in schema, we could not get it back by {{UnionVector#getField}}. > This is mainly because {{MinorType.Union.getNewVector}} did not pass > {{FieldType}} to the vector and {{UnionVector#getField}} creates a new {{Field}}, > which causes an inconsistency. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly
Ji Liu created ARROW-6145: - Summary: [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly Key: ARROW-6145 URL: https://issues.apache.org/jira/browse/ARROW-6145 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Ji Liu Assignee: Ji Liu While working on other items, I found {{UnionVector}} created by {{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} could not keep field type info properly. For example, if we set metadata in {{Field}} in schema, we could not get it back by {{UnionVector#getField}}. This is mainly because {{MinorType.Union.getNewVector}} did not pass {{FieldType}} to the vector and {{UnionVector#getField}} creates a new {{Field}}, which causes an inconsistency. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6144) Implement random function in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6144: -- Labels: pull-request-available (was: ) > Implement random function in Gandiva > > > Key: ARROW-6144 > URL: https://issues.apache.org/jira/browse/ARROW-6144 > Project: Apache Arrow > Issue Type: Task > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Minor > Labels: pull-request-available > > Implement random(), random(int seed) functions -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6038: -- Labels: pull-request-available windows (was: windows) > [Python] pyarrow.Table.from_batches produces corrupted table if any of the > batches were empty > - > > Key: ARROW-6038 > URL: https://issues.apache.org/jira/browse/ARROW-6038 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.13.0, 0.14.0, 0.14.1 >Reporter: Piotr Bajger >Priority: Minor > Labels: pull-request-available, windows > Attachments: segfault_ex.py > > > When creating a Table from a list/iterator of batches which contains an > "empty" RecordBatch a Table is produced but attempts to run any pyarrow > built-in functions (such as unique()) occasionally result in a Segfault. > The MWE is attached: [^segfault_ex.py] > # The segfaults happen randomly, around 30% of the time. > # Commenting out line 10 in the MWE results in no segfaults. > # The segfault is triggered using the unique() function, but I doubt the > behaviour is specific to that function, from what I gather the problem lies > in Table creation. > I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip > (problem also occurs with 0.13.0 from conda-forge). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6144) Implement random function in Gandiva
Prudhvi Porandla created ARROW-6144: --- Summary: Implement random function in Gandiva Key: ARROW-6144 URL: https://issues.apache.org/jira/browse/ARROW-6144 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla Implement random(), random(int seed) functions -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-1562) [C++] Numeric kernel implementations for add (+)
[ https://issues.apache.org/jira/browse/ARROW-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1562: -- Labels: Analytics pull-request-available (was: Analytics) > [C++] Numeric kernel implementations for add (+) > > > Key: ARROW-1562 > URL: https://issues.apache.org/jira/browse/ARROW-1562 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics, pull-request-available > > This function should apply consistent type promotions between types of > different sizes and between signed and unsigned integers. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
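One consistent promotion scheme of the kind the issue asks for can be sketched as a lookup: widen to the larger bit width, and widen a signed/unsigned mix to the next signed type that can hold both ranges. This is a hypothetical illustration of the principle, not necessarily the exact rule Arrow's C++ kernels adopted.

```python
# Hypothetical promotion table for an add (+) kernel.
SIGNED = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}
UNSIGNED = {"uint8": 8, "uint16": 16, "uint32": 32, "uint64": 64}

def promote_add(lhs, rhs):
    """Result type for lhs + rhs under this sketch's rules."""
    if lhs in SIGNED and rhs in SIGNED:
        return f"int{max(SIGNED[lhs], SIGNED[rhs])}"
    if lhs in UNSIGNED and rhs in UNSIGNED:
        return f"uint{max(UNSIGNED[lhs], UNSIGNED[rhs])}"
    # Mixed signedness: uintN needs int(2N) to represent its full range,
    # capped at 64 bits (so int64 + uint64 is lossy in this sketch).
    s, u = (lhs, rhs) if lhs in SIGNED else (rhs, lhs)
    return f"int{min(max(SIGNED[s], UNSIGNED[u] * 2), 64)}"

assert promote_add("int32", "int64") == "int64"
assert promote_add("int32", "uint32") == "int64"
```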
[jira] [Commented] (ARROW-5953) Thrift download ERRORS with apache-arrow-0.14.0
[ https://issues.apache.org/jira/browse/ARROW-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900831#comment-16900831 ] Marcin Juszkiewicz commented on ARROW-5953: --- On Debian 'buster' I see a similar failure with Arrow 0.14.1 when trying to rebuild the package for the arm64 architecture: {{-- Checking for module 'thrift'}} {{-- No package 'thrift' found}} {{-- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR THRIFT_COMPILER)}} {{Building Apache Thrift from source}} {{Downloading Apache Thrift from Traceback (most recent call last):}} {{ File "/usr/lib/python3.7/urllib/request.py", line 1317, in do_open}} {{ encode_chunked=req.has_header('Transfer-encoding'))}} {{ File "/usr/lib/python3.7/http/client.py", line 1229, in request}} {{ self._send_request(method, url, body, headers, encode_chunked)}} {{ File "/usr/lib/python3.7/http/client.py", line 1275, in _send_request}} {{ self.endheaders(body, encode_chunked=encode_chunked)}} {{ File "/usr/lib/python3.7/http/client.py", line 1224, in endheaders}} {{ self._send_output(message_body, encode_chunked=encode_chunked)}} {{ File "/usr/lib/python3.7/http/client.py", line 1016, in _send_output}} {{ self.send(msg)}} {{ File "/usr/lib/python3.7/http/client.py", line 956, in send}} {{ self.connect()}} {{ File "/usr/lib/python3.7/http/client.py", line 1392, in connect}} {{ server_hostname=server_hostname)}} {{ File "/usr/lib/python3.7/ssl.py", line 412, in wrap_socket}} {{ session=session}} {{ File "/usr/lib/python3.7/ssl.py", line 853, in _create}} {{ self.do_handshake()}} {{ File "/usr/lib/python3.7/ssl.py", line 1117, in do_handshake}} {{ self._sslobj.do_handshake()}} {{ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)}} > Thrift download ERRORS with apache-arrow-0.14.0 > --- > > Key: ARROW-5953 > URL: https://issues.apache.org/jira/browse/ARROW-5953 > Project: Apache Arrow > Issue Type: Bug > 
Components: C++ >Affects Versions: 0.13.0, 0.14.0 > Environment: RHEL 6.7 >Reporter: Brian >Priority: Major > > cmake returns: > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either > of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > during the check for the thrift download location. > This occurs with a freshly inflated arrow source release tree > where cmake is running for the first time. > Reproducible with the release levels of apache-arrow-0.14.0 > and 0.13.0. I tried this 3-5x on 15Jul2019 and see it consistently each > time. > Here's the full context from cmake output: > {quote}-- Checking for module 'thrift' > -- No package 'thrift' found > -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR > THRIFT_COMPILER) > Building Apache Thrift from source > Downloading Apache Thrift from Traceback (most recent call last): > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line > 38, in > suggested_mirror = get_url('[https://www.apache.org/dyn/]' > File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line > 27, in get_url > return requests.get(url).content > File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get > return request('get', url, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request > response = session.request(method=method, url=url, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in > request > resp = self.send(prep, **send_kwargs) > File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in > send > r = adapter.send(request, **kwargs) > File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in > send > raise SSLError(e, request=request) > requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either > of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz > {quote} > > Per Wes' suggestion I ran the following > directly: > python cpp/build-support/get_apache_mirror.py > [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/] > > with this output: > [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/] > > > *NOTE:* here are the cmake thrift log lines from a build of an apache-arrow git > clone on 06Jul2019, where cmake/make ran fine. > > {quote}-- Checking for module 'thrift' >
[jira] [Closed] (ARROW-6129) Row_groups duplicate Rows
[ https://issues.apache.org/jira/browse/ARROW-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] albertoramon closed ARROW-6129. --- Resolution: Not A Problem This is the expected behavior > Row_groups duplicate Rows > - > > Key: ARROW-6129 > URL: https://issues.apache.org/jira/browse/ARROW-6129 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.14.1 >Reporter: albertoramon >Priority: Major > Labels: parquetWriter > Attachments: tes_output.png, test01.py, top10.csv > > > Using row_groups to write Parquet duplicates rows: > Input: CSV with 10 rows > Row_Groups=1 --> Output: 10 rows > Row_Groups=2 --> Output: 20 rows > !tes_output.png! > Is this the expected behavior? > Code snippet and CSV attached. -- This message was sent by Atlassian JIRA (v7.6.14#76016)