[jira] [Updated] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6155:
--
Labels: pull-request-available  (was: )

> [Java] Extract a super interface for vectors whose elements reside in 
> continuous memory segments
> 
>
> Key: ARROW-6155
> URL: https://issues.apache.org/jira/browse/ARROW-6155
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> Vectors whose data elements reside in continuous memory segments should 
> implement a common super interface. This will avoid unnecessary code 
> branches.
> For now, such vectors include fixed-width and variable-width vectors; more 
> vector types may be included in the future.
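
A minimal Java sketch of the idea, assuming a hypothetical interface name and
accessors (the names here are illustrative, not the merged API):

{code:java}
/**
 * Hypothetical super interface for vectors whose elements reside in
 * continuous memory segments (fixed-width and variable-width vectors).
 * Name and methods are illustrative only.
 */
public interface ContinuousVector {

  /** Memory address where the vector's data region starts. */
  long getDataBufferAddress();

  /** Byte offset of the element at the given index within the data region. */
  long getStartOffset(int index);

  /** Length in bytes of the element at the given index. */
  int getElementLength(int index);
}
{code}

A fixed-width vector would compute {{getStartOffset(i)}} as {{i * typeWidth}},
while a variable-width vector would read it from its offsets buffer, so callers
can handle both through one code path instead of branching.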



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments

2019-08-06 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6155:
---

 Summary: [Java] Extract a super interface for vectors whose 
elements reside in continuous memory segments
 Key: ARROW-6155
 URL: https://issues.apache.org/jira/browse/ARROW-6155
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Liya Fan
Assignee: Liya Fan


Vectors whose data elements reside in continuous memory segments should 
implement a common super interface. This will avoid unnecessary code branches.

For now, such vectors include fixed-width and variable-width vectors; more 
vector types may be included in the future.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6154) [Rust] Too many open files (os error 24)

2019-08-06 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6154:
---
Summary: [Rust] Too many open files (os error 24)  (was: Too many open 
files (os error 24))

> [Rust] Too many open files (os error 24)
> 
>
> Key: ARROW-6154
> URL: https://issues.apache.org/jira/browse/ARROW-6154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Yesh
>Priority: Major
>
> Used the Rust {{parquet-read}} binary to read a deeply nested Parquet file and 
> saw the stack trace below. Unfortunately I won't be able to upload the file.
> {code:java}
> stack backtrace:
>    0: std::panicking::default_hook::{{closure}}
>    1: std::panicking::default_hook
>    2: std::panicking::rust_panic_with_hook
>    3: std::panicking::continue_panic_fmt
>    4: rust_begin_unwind
>    5: core::panicking::panic_fmt
>    6: core::result::unwrap_failed
>    7: parquet::util::io::FileSource::new
>    8:  as 
> parquet::file::reader::RowGroupReader>::get_column_page_reader
>    9:  as 
> parquet::file::reader::RowGroupReader>::get_column_reader
>   10: parquet::record::reader::TreeBuilder::reader_tree
>   11: parquet::record::reader::TreeBuilder::reader_tree
>   12: parquet::record::reader::TreeBuilder::reader_tree
>   13: parquet::record::reader::TreeBuilder::reader_tree
>   14: parquet::record::reader::TreeBuilder::reader_tree
>   15: parquet::record::reader::TreeBuilder::build
>   16:  core::iter::traits::iterator::Iterator>::next
>   17: parquet_read::main
>   18: std::rt::lang_start::{{closure}}
>   19: std::panicking::try::do_call
>   20: __rust_maybe_catch_panic
>   21: std::rt::lang_start_internal
>   22: main{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5772) [GLib][Plasma][CUDA] Plasma::Client#refer_object test is failed

2019-08-06 Thread Yosuke Shiro (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro resolved ARROW-5772.
-
   Resolution: Fixed
Fix Version/s: (was: 1.0.0)
   0.15.0

Issue resolved by pull request 5004
[https://github.com/apache/arrow/pull/5004]

> [GLib][Plasma][CUDA] Plasma::Client#refer_object test is failed
> ---
>
> Key: ARROW-5772
> URL: https://issues.apache.org/jira/browse/ARROW-5772
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Affects Versions: 0.14.0
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {noformat}
> /home/kou/work/cpp/arrow.kou/c_glib/test/plasma/test-plasma-client.rb:75:in 
> `block (2 levels) in '
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:533:in
>  `block in define_method'
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:616:in
>  `invoke'
> /tmp/local/lib/ruby/gems/2.7.0/gems/gobject-introspection-3.3.6/lib/gobject-introspection/loader.rb:616:in
>  `invoke'
> Error: test: options: GPU device(TestPlasmaClient::#create):
>   Arrow::Error::Io: [plasma][client][refer-object]: IOError: Cuda Driver API 
> call in ../src/arrow/gpu/cuda_context.cc at line 156 failed with code 208: 
> cuIpcOpenMemHandle(, *handle, CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS)
>   In ../src/arrow/gpu/cuda_context.cc, line 341, code: 
> impl_->OpenIpcBuffer(ipc_handle, )
>   In ../src/plasma/client.cc, line 586, code: 
> context->OpenIpcBuffer(*object->ipc_handle, _handle->ptr)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6154) Too many open files (os error 24)

2019-08-06 Thread Yesh (JIRA)
Yesh created ARROW-6154:
---

 Summary: Too many open files (os error 24)
 Key: ARROW-6154
 URL: https://issues.apache.org/jira/browse/ARROW-6154
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Yesh


Used the Rust {{parquet-read}} binary to read a deeply nested Parquet file and 
saw the stack trace below. Unfortunately I won't be able to upload the file.
{code:java}
stack backtrace:
   0: std::panicking::default_hook::{{closure}}
   1: std::panicking::default_hook
   2: std::panicking::rust_panic_with_hook
   3: std::panicking::continue_panic_fmt
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: core::result::unwrap_failed
   7: parquet::util::io::FileSource::new
   8:  as parquet::file::reader::RowGroupReader>::get_column_page_reader
   9:  as parquet::file::reader::RowGroupReader>::get_column_reader
  10: parquet::record::reader::TreeBuilder::reader_tree
  11: parquet::record::reader::TreeBuilder::reader_tree
  12: parquet::record::reader::TreeBuilder::reader_tree
  13: parquet::record::reader::TreeBuilder::reader_tree
  14: parquet::record::reader::TreeBuilder::reader_tree
  15: parquet::record::reader::TreeBuilder::build
  16:  core::iter::traits::iterator::Iterator>::next
  17: parquet_read::main
  18: std::rt::lang_start::{{closure}}
  19: std::panicking::try::do_call
  20: __rust_maybe_catch_panic
  21: std::rt::lang_start_internal
  22: main{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6142) [R] Install instructions on linux could be clearer

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6142:
--
Labels: documentation pull-request-available  (was: documentation)

> [R] Install instructions on linux could be clearer
> --
>
> Key: ARROW-6142
> URL: https://issues.apache.org/jira/browse/ARROW-6142
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Affects Versions: 0.14.1
> Environment: Ubuntu 19.04
>Reporter: Karl Dunkle Werner
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: documentation, pull-request-available
> Fix For: 0.15.0
>
>
> Installing R packages on Linux is almost always from source, which means 
> Arrow needs some system dependencies. The existing help message (from 
> arrow::install_arrow()) is very helpful in pointing that out, but it's still 
> a heavy lift for users who install R packages from source but don't plan to 
> develop Arrow itself.
> Here are a couple of things that could make things slightly smoother:
>  # I would be very grateful if the install_arrow() message or installation 
> page told me which libraries were essential to make the R package work.
>  # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on 
> launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" 
> instead of just "PPA" would have caused me less confusion. (Others may differ)
>  # A snap package would be easier than installing a new apt address, but I 
> understand that building for snap would be more packaging work and only 
> benefits Ubuntu users.
>  
> Thanks for making R bindings, and congratulations on the CRAN release!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6153) [R] Address parquet deprecation warning

2019-08-06 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6153:
--

 Summary: [R] Address parquet deprecation warning
 Key: ARROW-6153
 URL: https://issues.apache.org/jira/browse/ARROW-6153
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Romain François


[~wesmckinn] has been refactoring the Parquet C++ library and there's now this 
deprecation warning appearing when I build the R package locally: 
{code:java}
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" 
-DNDEBUG -DNDEBUG -I/usr/local/include -DARROW_R_WITH_ARROW 
-I"/Users/enpiar/R/Rcpp/include" -isysroot 
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -fPIC 
-Wall -g -O2 -c parquet.cpp -o parquet.o
parquet.cpp:66:23: warning: 'OpenFile' is deprecated: Deprecated since 0.15.0. 
Use FileReaderBuilder [-Wdeprecated-declarations]
      parquet::arrow::OpenFile(file, arrow::default_memory_pool(), *props, ));
                       ^
{code}
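
For reference, a hedged sketch of the replacement path the warning points at;
the builder methods shown are from the 0.15-era headers and should be checked
against the installed version:

{code:cpp}
#include <memory>
#include <utility>

#include "arrow/io/interfaces.h"
#include "arrow/memory_pool.h"
#include "arrow/status.h"
#include "parquet/arrow/reader.h"

// Open a Parquet file via FileReaderBuilder instead of the deprecated
// parquet::arrow::OpenFile(). "infile" is any arrow::io::RandomAccessFile.
arrow::Status OpenWithBuilder(
    std::shared_ptr<arrow::io::RandomAccessFile> infile,
    std::unique_ptr<parquet::arrow::FileReader>* reader) {
  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(std::move(infile)));
  return builder.memory_pool(arrow::default_memory_pool())->Build(reader);
}
{code}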



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-08-06 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901508#comment-16901508
 ] 

Wes McKinney commented on ARROW-3246:
-

I created ARROW-6152 to cover the initial feature-preserving refactoring. I 
estimate about a day of effort for that, and will report in once I make a 
little progress.

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)
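
To make the requested round trip concrete, a minimal pyarrow sketch (the file
name is a placeholder; the ask is that the read path below return a categorical
directly from the dictionary encoding rather than expanding and
re-categorising):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A categorical column; Parquet will dictionary-encode it on write.
df = pd.DataFrame({"c": pd.Categorical(["a", "b", "a", "c"])})
pq.write_table(pa.Table.from_pandas(df), "categoricals.parquet")

# Requested fast path: load straight into a categorical column using the
# file's dictionary pages, instead of expanding labels and re-categorising.
roundtripped = pq.read_table("categoricals.parquet").to_pandas()
print(roundtripped["c"].dtype)  # expected: category
{code}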



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter

2019-08-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6152:
---

 Summary: [C++][Parquet] Write arrow::Array directly into 
parquet::TypedColumnWriter
 Key: ARROW-6152
 URL: https://issues.apache.org/jira/browse/ARROW-6152
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


This is an initial refactoring task to enable the Arrow write layer to access 
some of the internal implementation details of 
{{parquet::TypedColumnWriter}}. See discussion in ARROW-3246



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-08-06 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901506#comment-16901506
 ] 

Wes McKinney commented on ARROW-3246:
-

I've been looking at what's required to write {{arrow::DictionaryArray}} 
directly into the appropriate lower-level ColumnWriter class. The trouble with 
the way the software is layered right now is that there is a "Chinese wall" 
between {{TypedColumnWriter}} and the Arrow write layer: we can only 
communicate with this class using the Parquet C types such as {{ByteArray}} and 
{{FixedLenByteArray}}. This is also a performance issue, since we cannot write 
directly into the writer from {{arrow::BinaryArray}} or in similar cases where 
it might make sense. 

I think the only way to fix the current situation is to add a 
{{TypedColumnWriter::WriteArrow(const ::arrow::Array&)}} method and "push 
down" a lot of the logic that's currently in parquet/arrow/writer.cc into the 
{{TypedColumnWriter}} implementation. This will enable various write 
performance optimizations and also address the direct dictionary write issue. 
This is not a small project, but I would say it's overdue and will put us on a 
better footing going forward.

cc [~xhochy] [~hatem] for any thoughts

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation

2019-08-06 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901471#comment-16901471
 ] 

Wes McKinney commented on ARROW-6131:
-

In principle this seems OK to me. We can discuss further in a PR

> [C++]  Optimize the Arrow UTF-8-string-validation
> -
>
> Key: ARROW-6131
> URL: https://issues.apache.org/jira/browse/ARROW-6131
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Major
>
> The new algorithm comes from: https://github.com/cyb70289/utf8 (MIT LICENSE)
> Range-based algorithm:
>   1. Map each byte of the input string to a range table.
>   2. Leverage the NEON 'tbl' instruction to look up the table.
>   3. Find the pattern and set the correct table index for each input byte.
>   4. Validate the input string.
> The algorithm improves UTF-8 validation by roughly 1.6x for the LargeNonAscii 
> and SmallNonAscii benchmarks, but it slows down the All-Ascii cases (input 
> data that is entirely ASCII strings).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, all-ASCII data is unusual on the internet.
> Could you tell me what the typical use-case scenario is for Apache Arrow? Is 
> the data Arrow needs to validate usually all-ASCII?
> If not, I'd like to submit a patch to accelerate non-ASCII validation.
> As for all-ASCII validation, I would like to propose another SIMD 
> optimization in a separate JIRA.
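
For orientation, a simplified scalar sketch of the structural check that the
range-based algorithm vectorizes (the real range tables also reject overlong
encodings and surrogates, which this sketch omits):

{code:cpp}
#include <cstddef>
#include <cstdint>

// Simplified scalar UTF-8 structural validation: classify the lead byte,
// then require the right number of 10xxxxxx continuation bytes.
bool ValidateUtf8Scalar(const uint8_t* data, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t lead = data[i];
    size_t follow;
    if (lead < 0x80) {                  // ASCII fast case
      ++i;
      continue;
    } else if (lead >= 0xC2 && lead <= 0xDF) {
      follow = 1;                       // 2-byte sequence
    } else if (lead >= 0xE0 && lead <= 0xEF) {
      follow = 2;                       // 3-byte sequence
    } else if (lead >= 0xF0 && lead <= 0xF4) {
      follow = 3;                       // 4-byte sequence
    } else {
      return false;                     // 0x80..0xC1 and 0xF5..0xFF are invalid leads
    }
    if (i + follow >= len) return false;  // truncated sequence
    for (size_t j = 1; j <= follow; ++j) {
      if ((data[i + j] & 0xC0) != 0x80) return false;
    }
    i += follow + 1;
  }
  return true;
}
{code}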



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6151) [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information

2019-08-06 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901434#comment-16901434
 ] 

Neal Richardson commented on ARROW-6151:


Me too. This was discussed here: 
[https://github.com/apache/arrow/pull/4908#discussion_r306586276]

I first proposed a shim NOTICE file in the R package that contained a link to 
the official NOTICE. That would keep a single source of truth for the NOTICE 
and still satisfy the CRAN request. When you requested that the full NOTICE be 
included, I copied it there and added a rule to the {{r/Makefile}} 
([https://github.com/apache/arrow/blob/master/r/Makefile#L33]) that copies the 
NOTICE file every time {{make build}} is run, which happens when checking 
locally. Given the constraints, this seemed like the best option: it ensures 
that the full NOTICE file is included in the R package without relying on human 
discipline to copy it in manually, while also providing a mechanism by which it 
stays synced with the official version. 

> [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate 
> information
> ---
>
> Key: ARROW-6151
> URL: https://issues.apache.org/jira/browse/ARROW-6151
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
>
> I noticed this file -- I am concerned about its maintainability. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6151) [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information

2019-08-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6151:
---

 Summary: [R] See if possible to generate r/inst/NOTICE.txt rather 
than duplicate information
 Key: ARROW-6151
 URL: https://issues.apache.org/jira/browse/ARROW-6151
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney


I noticed this file -- I am concerned about its maintainability. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-06 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901399#comment-16901399
 ] 

Wes McKinney commented on ARROW-6150:
-

The RPC port depends on your Hadoop configuration. You can generally find it in 
the HDFS web ui. {{pyarrow.hdfs.connect}} will also try to use the defaults in 
{{core-site.xml}} (I think) to connect
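
For illustration, a hedged example (the host name and port below are
placeholders, not values from this issue):

{code:python}
import pyarrow as pa

# Explicit namenode host/port, e.g. taken from core-site.xml or the HDFS web UI.
fs = pa.hdfs.connect(host="namenode.example.com", port=8020)

# host="default" with port=0 lets libhdfs fall back to the defaults
# configured in core-site.xml.
fs_default = pa.hdfs.connect(host="default", port=0)
{code}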

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in traceback below) using PyArrow's HDFS IO library. However, the job 
> intermittently runs into the error shown below, not every run, only 
> sometimes. I'm unable to determine the root cause of this issue.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-06 Thread Saurabh Bajaj (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901397#comment-16901397
 ] 

Saurabh Bajaj edited comment on ARROW-6150 at 8/6/19 7:12 PM:
--

I tried setting port=8020 in pa.hdfs.connect(), but same intermittent errors. 


was (Author: sbajaj):
I tried setting `port=8020` in `pa.hdfs.connect()`, but same intermittent 
errors. 

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in traceback below) using PyArrow's HDFS IO library. However, the job 
> intermittently runs into the error shown below, not every run, only 
> sometimes. I'm unable to determine the root cause of this issue.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-06 Thread Saurabh Bajaj (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901397#comment-16901397
 ] 

Saurabh Bajaj commented on ARROW-6150:
--

I tried setting `port=8020` in `pa.hdfs.connect()`, but same intermittent 
errors. 

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in traceback below) using PyArrow's HDFS IO library. However, the job 
> intermittently runs into the error shown below, not every run, only 
> sometimes. I'm unable to determine the root cause of this issue.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-06 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6150:

Summary: [Python] Intermittent HDFS error  (was: Intermittent Pyarrow HDFS 
IO error)

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in traceback below) using PyArrow's HDFS IO library. However, the job 
> intermittently runs into the error shown below, not every run, only 
> sometimes. I'm unable to determine the root cause of this issue.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6150) Intermittent Pyarrow HDFS IO error

2019-08-06 Thread Saurabh Bajaj (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901394#comment-16901394
 ] 

Saurabh Bajaj commented on ARROW-6150:
--

[~wesmckinn] Thanks for your response! 

I found https://issues.apache.org/jira/browse/ARROW-3957 and the PR that 
addresses it: 
[https://github.com/apache/arrow/commit/758bd557584107cb336cbc3422744dacd93978af].
 

Seems like the cause of the issue is an incorrect port? The default to 
{{pa.hdfs.connect()}} is {{port=0}}. What would be the correct port to use?

> Intermittent Pyarrow HDFS IO error
> --
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in traceback below) using PyArrow's HDFS IO library. However, the job 
> intermittently runs into the error shown below, not every run, only 
> sometimes. I'm unable to determine the root cause of this issue.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6150) Intermittent Pyarrow HDFS IO error

2019-08-06 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901386#comment-16901386
 ] 

Wes McKinney commented on ARROW-6150:
-

Our use of libhdfs is pretty straightforward, so the issue seems unlikely to be 
caused by a bug in the Arrow implementation. I've seen other reports of errno 
255; they might give some clue about what could be wrong with the job. If you 
find anything out (or a way to reliably reproduce it), let us know.

> Intermittent Pyarrow HDFS IO error
> --
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in traceback below) using PyArrow's HDFS IO library. However, the job 
> intermittently runs into the error shown below, not every run, only 
> sometimes. I'm unable to determine the root cause of this issue.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6084) [Python] Support LargeList

2019-08-06 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6084.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4979
[https://github.com/apache/arrow/pull/4979]

> [Python] Support LargeList
> --
>
> Key: ARROW-6084
> URL: https://issues.apache.org/jira/browse/ARROW-6084
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6150) Intermittent Pyarrow HDFS IO error

2019-08-06 Thread Saurabh Bajaj (JIRA)
Saurabh Bajaj created ARROW-6150:


 Summary: Intermittent Pyarrow HDFS IO error
 Key: ARROW-6150
 URL: https://issues.apache.org/jira/browse/ARROW-6150
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: Saurabh Bajaj


I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
shown in the traceback below) using PyArrow's HDFS IO library. However, the job 
intermittently runs into the error shown below; not every run, only sometimes. 
I'm unable to determine the root cause of this issue.

{noformat}
File "/extractor.py", line 87, in __call__
  json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255)
Please check that you are connecting to the correct HDFS RPC port
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API

2019-08-06 Thread Saurabh Bajaj (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Bajaj closed ARROW-5922.

Resolution: Works for Me

> [Python] Unable to connect to HDFS from a worker/data node on a Kerberized 
> cluster using pyarrow' hdfs API
> --
>
> Key: ARROW-5922
> URL: https://issues.apache.org/jira/browse/ARROW-5922
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Unix
>Reporter: Saurabh Bajaj
>Priority: Major
> Fix For: 0.14.0
>
>
> Here's what I'm trying:
> {code:python}
> import pyarrow as pa
> conf = {"hadoop.security.authentication": "kerberos"}
> fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> {code}
> However, when I submit this job to the cluster using {{Dask-YARN}}, I get the 
> following error:
> {noformat}
> File "test/run.py", line 3
>   fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> File 
> "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  line 211, in connect
> File 
> "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  line 38, in __init__
> File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
> File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS connection failed
> {noformat}
> I also tried setting {{host}} (to a name node) and {{port}} (=8020), but I ran 
> into the same error. Since the error is not descriptive, I'm not sure which 
> setting needs to be altered. Any clues, anyone?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6088) [Rust] [DataFusion] Implement parallel execution for projection

2019-08-06 Thread Andy Grove (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-6088.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4988
[https://github.com/apache/arrow/pull/4988]

> [Rust] [DataFusion] Implement parallel execution for projection
> ---
>
> Key: ARROW-6088
> URL: https://issues.apache.org/jira/browse/ARROW-6088
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct

2019-08-06 Thread Philip Felton (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Felton updated ARROW-6149:
-
Affects Version/s: 0.14.1

> [Parquet] Decimal comparisons used for min/max statistics are not correct
> -
>
> Key: ARROW-6149
> URL: https://issues.apache.org/jira/browse/ARROW-6149
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Philip Felton
>Priority: Major
>
> The [Parquet Format 
> specifications|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
>  says
> bq. If the column uses int32 or int64 physical types, then signed comparison 
> of the integer values produces the correct ordering. If the physical type is 
> fixed, then the correct ordering can be produced by flipping the 
> most-significant bit in the first byte and then using unsigned byte-wise 
> comparison.
> However this isn't followed in the C++ Parquet code. 16-byte decimal 
> comparison is implemented using a lexicographical comparison of signed chars.
> This appears to be because the function 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
>  just goes off the sort_order (signed) and physical_type 
> (FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5977:
--
Labels: csv pull-request-available  (was: csv)

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv, pull-request-available
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6055) [C++] Refactor arrow/io/hdfs.h to use common FileSystem API

2019-08-06 Thread Benjamin Kietzman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901210#comment-16901210
 ] 

Benjamin Kietzman commented on ARROW-6055:
--

In addition to removing io::FileSystem and io::FileStatistics, should 
HdfsPathInfo be replaced with fs::FileStats? It carries more information than 
fs::FileStats: last access time in addition to last modified time (though in 
seconds rather than ns since the epoch), block size, replication, and 
permissions

> [C++] Refactor arrow/io/hdfs.h to use common FileSystem API
> ---
>
> Key: ARROW-6055
> URL: https://issues.apache.org/jira/browse/ARROW-6055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> As part of this refactor, the FileSystem-related classes in 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.h#L51 
> should be removed. The files should probably be moved also to arrow/filesystem



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-5977:
-

Assignee: Antoine Pitrou

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct

2019-08-06 Thread Philip Felton (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Felton updated ARROW-6149:
-
Description: 
The [Parquet Format 
specifications|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
 says

bq. If the column uses int32 or int64 physical types, then signed comparison of 
the integer values produces the correct ordering. If the physical type is 
fixed, then the correct ordering can be produced by flipping the 
most-significant bit in the first byte and then using unsigned byte-wise 
comparison.

However this isn't followed in the C++ Parquet code. 16-byte decimal comparison 
is implemented using a lexicographical comparison of signed chars.

This appears to be because the function 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
 just goes off the sort_order (signed) and physical_type 
(FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal.

  was:
The 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md|Parquet 
Format specifications] says

bq. If the column uses int32 or int64 physical types, then signed comparison of 
the integer values produces the correct ordering. If the physical type is 
fixed, then the correct ordering can be produced by flipping the 
most-significant bit in the first byte and then using unsigned byte-wise 
comparison.

However this isn't followed in the C++ Parquet code. 16-byte decimal comparison 
is implemented using a lexicographical comparison of signed chars.

This appears to be because the function 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
 just goes off the sort_order (signed) and physical_type 
(FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal.


> [Parquet] Decimal comparisons used for min/max statistics are not correct
> -
>
> Key: ARROW-6149
> URL: https://issues.apache.org/jira/browse/ARROW-6149
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philip Felton
>Priority: Major
>
> The [Parquet Format 
> specifications|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
>  says
> bq. If the column uses int32 or int64 physical types, then signed comparison 
> of the integer values produces the correct ordering. If the physical type is 
> fixed, then the correct ordering can be produced by flipping the 
> most-significant bit in the first byte and then using unsigned byte-wise 
> comparison.
> However this isn't followed in the C++ Parquet code. 16-byte decimal 
> comparison is implemented using a lexicographical comparison of signed chars.
> This appears to be because the function 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
>  just goes off the sort_order (signed) and physical_type 
> (FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct

2019-08-06 Thread Philip Felton (JIRA)
Philip Felton created ARROW-6149:


 Summary: [Parquet] Decimal comparisons used for min/max statistics 
are not correct
 Key: ARROW-6149
 URL: https://issues.apache.org/jira/browse/ARROW-6149
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philip Felton


The 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md|Parquet 
Format specifications] says

bq. If the column uses int32 or int64 physical types, then signed comparison of 
the integer values produces the correct ordering. If the physical type is 
fixed, then the correct ordering can be produced by flipping the 
most-significant bit in the first byte and then using unsigned byte-wise 
comparison.

However this isn't followed in the C++ Parquet code. 16-byte decimal comparison 
is implemented using a lexicographical comparison of signed chars.

This appears to be because the function 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
 just goes off the sort_order (signed) and physical_type 
(FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal.
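
To make the spec's rule concrete, a hedged sketch of the mandated ordering for
big-endian two's-complement FIXED_LEN_BYTE_ARRAY decimals (an illustrative
helper, not the statistics.cc API):

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <cstring>

// Spec rule: flip the most-significant bit of the first byte, then compare
// unsigned byte-wise. For big-endian two's-complement values of equal width
// this is equivalent to a signed numeric comparison. Assumes len >= 1.
bool FixedLenDecimalLess(const uint8_t* a, const uint8_t* b, size_t len) {
  uint8_t a0 = static_cast<uint8_t>(a[0] ^ 0x80);  // flip sign bit
  uint8_t b0 = static_cast<uint8_t>(b[0] ^ 0x80);
  if (a0 != b0) return a0 < b0;
  // memcmp compares bytes as unsigned char, matching the required ordering.
  return std::memcmp(a + 1, b + 1, len - 1) < 0;
}
{code}

A lexicographical comparison of signed chars, by contrast, orders negative
decimals after positive ones, which is how the incorrect min/max statistics
arise.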



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-6055) [C++] Refactor arrow/io/hdfs.h to use common FileSystem API

2019-08-06 Thread Benjamin Kietzman (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman reassigned ARROW-6055:


Assignee: Benjamin Kietzman

> [C++] Refactor arrow/io/hdfs.h to use common FileSystem API
> ---
>
> Key: ARROW-6055
> URL: https://issues.apache.org/jira/browse/ARROW-6055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> As part of this refactor, the FileSystem-related classes in 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.h#L51 
> should be removed. The files should probably be moved also to arrow/filesystem



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6055) [C++] Refactor arrow/io/hdfs.h to use common FileSystem API

2019-08-06 Thread Benjamin Kietzman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901168#comment-16901168
 ] 

Benjamin Kietzman commented on ARROW-6055:
--

[~wesmckinn] should io::FileSystem be deprecated or just deleted?

> [C++] Refactor arrow/io/hdfs.h to use common FileSystem API
> ---
>
> Key: ARROW-6055
> URL: https://issues.apache.org/jira/browse/ARROW-6055
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> As part of this refactor, the FileSystem-related classes in 
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.h#L51 
> should be removed. The files should probably be moved also to arrow/filesystem



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901145#comment-16901145
 ] 

Neal Richardson commented on ARROW-5977:


Right, that's the other option I mentioned that all of the R readers support. 
We could support both, but a separate vector-of-column-names argument has these 
benefits: (1) you don't have to specify types for all the other columns; (2) 
you don't have to know the names of the other columns; and (3) you can't 
express a desired column order in column_types because it is a map.

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901142#comment-16901142
 ] 

Francois Saint-Jacques commented on ARROW-5977:
---

Can we reuse {{ConverterOptions.column_types}} for the same purpose? The 
field would:
# select the columns, and
# possibly give a type hint.

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901141#comment-16901141
 ] 

Neal Richardson commented on ARROW-5977:


Yeah I agree that just an {{include_columns}} argument is fine. Any other sugar 
we want to have for selecting or deselecting columns can be handled at the 
Dataset layer (and thus be available regardless of the underlying storage type).

Null column probably makes sense if the column does not exist in the CSV.

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5766) [Python] Unpin jpype1 version

2019-08-06 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901139#comment-16901139
 ] 

Antoine Pitrou commented on ARROW-5766:
---

[~xhochy]

> [Python] Unpin jpype1 version
> -
>
> Key: ARROW-5766
> URL: https://issues.apache.org/jira/browse/ARROW-5766
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> According to the discussion in 
> https://github.com/conda-forge/jpype1-feedstock/issues/8 there are some 
> changes that we must make to our code to stay on the released version of 
> jpype1



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901137#comment-16901137
 ] 

Antoine Pitrou commented on ARROW-5977:
---

So, to make things clear, a column in {{include_columns}} but not in the CSV 
file should produce a null column rather than emit an error?

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6142) [R] Install instructions on linux could be clearer

2019-08-06 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901135#comment-16901135
 ] 

Neal Richardson commented on ARROW-6142:


I've revised the README along these lines in 
[https://github.com/apache/arrow/pull/4948] but didn't touch {{install_arrow}} 
so I'll take a look here.

> [R] Install instructions on linux could be clearer
> --
>
> Key: ARROW-6142
> URL: https://issues.apache.org/jira/browse/ARROW-6142
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Affects Versions: 0.14.1
> Environment: Ubuntu 19.04
>Reporter: Karl Dunkle Werner
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: documentation
> Fix For: 0.15.0
>
>
> Installing R packages on Linux is almost always from source, which means 
> Arrow needs some system dependencies. The existing help message (from 
> arrow::install_arrow()) is very helpful in pointing that out, but it's still 
> a heavy lift for users who install R packages from source but don't plan to 
> develop Arrow itself.
> Here are a couple of things that could make things slightly smoother:
>  # I would be very grateful if the install_arrow() message or installation 
> page told me which libraries were essential to make the R package work.
>  # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on 
> launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" 
> instead of just "PPA" would have caused me less confusion. (Others may differ)
>  # A snap package would be easier than installing a new apt address, but I 
> understand that building for snap would be more packaging work and only 
> benefits Ubuntu users.
>  
> Thanks for making R bindings, and congratulations on the CRAN release!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901133#comment-16901133
 ] 

Wes McKinney commented on ARROW-5977:
-

I think just include is okay. It might make sense to co-develop this in 
conjunction with the Datasets interface to CSV files, since that needs to be 
able to select columns as well as insert missing fields (which become all 
null); the latter can happen as a post-scan operation, though.

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-6142) [R] Install instructions on linux could be clearer

2019-08-06 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6142:
--

Assignee: Neal Richardson

> [R] Install instructions on linux could be clearer
> --
>
> Key: ARROW-6142
> URL: https://issues.apache.org/jira/browse/ARROW-6142
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Affects Versions: 0.14.1
> Environment: Ubuntu 19.04
>Reporter: Karl Dunkle Werner
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: documentation
> Fix For: 0.15.0
>
>
> Installing R packages on Linux is almost always from source, which means 
> Arrow needs some system dependencies. The existing help message (from 
> arrow::install_arrow()) is very helpful in pointing that out, but it's still 
> a heavy lift for users who install R packages from source but don't plan to 
> develop Arrow itself.
> Here are a couple of things that could make things slightly smoother:
>  # I would be very grateful if the install_arrow() message or installation 
> page told me which libraries were essential to make the R package work.
>  # install_arrow() refers to a PPA. Previously I've only seen PPAs hosted on 
> launchpad.net, so the bintray URL threw me. Changing it to "bintray.com PPA" 
> instead of just "PPA" would have caused me less confusion. (Others may differ)
>  # A snap package would be easier than installing a new apt address, but I 
> understand that building for snap would be more packaging work and only 
> benefits Ubuntu users.
>  
> Thanks for making R bindings, and congratulations on the CRAN release!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6039) [GLib] Add garrow_array_filter()

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6039:
--
Labels: pull-request-available  (was: )

> [GLib] Add garrow_array_filter()
> 
>
> Key: ARROW-6039
> URL: https://issues.apache.org/jira/browse/ARROW-6039
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> Add bindings of a boolean selection filter.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901126#comment-16901126
 ] 

Antoine Pitrou commented on ARROW-5977:
---

Ok, so what kind of ergonomics would you favour? Simply an {{include_columns}} 
vector of strings? And/or an {{exclude_columns}} vector as well?
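
For example, a hedged sketch of those ergonomics from Python (the option name
follows the proposal under discussion, not a shipped API; the file name is a
placeholder):

{code:python}
from pyarrow import csv

# Read only columns "a" and "b"; other columns in the file are skipped.
opts = csv.ConvertOptions(include_columns=["a", "b"])
table = csv.read_csv("data.csv", convert_options=opts)
{code}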

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901124#comment-16901124
 ] 

Neal Richardson commented on ARROW-5977:


All of R's main CSV readers support this. One way they all expose it is by 
letting you provide a null type for some columns when you specify column types 
explicitly. A couple of the readers also let you specify columns by name or 
position to keep or drop. 

I think this is a good idea not just in the context of reading a CSV itself but 
also for the Datasets framework, where we are lazily reading chunks of data as 
needed and trying to be efficient with memory usage. 

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6039) [GLib] Add garrow_array_filter()

2019-08-06 Thread Yosuke Shiro (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro updated ARROW-6039:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [GLib] Add garrow_array_filter()
> 
>
> Key: ARROW-6039
> URL: https://issues.apache.org/jira/browse/ARROW-6039
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
> Fix For: 0.15.0
>
>
> Add bindings for a boolean selection filter.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5977) [C++] [Python] Method for read_csv to limit which columns are read?

2019-08-06 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901074#comment-16901074
 ] 

Antoine Pitrou commented on ARROW-5977:
---

[~npr] Ping.

> [C++] [Python] Method for read_csv to limit which columns are read?
> ---
>
> Key: ARROW-5977
> URL: https://issues.apache.org/jira/browse/ARROW-5977
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.0
>Reporter: Jordan Samuels
>Priority: Major
>  Labels: csv
>
> In pandas there is pd.read_csv(usecols=...) but I can't see a way to do this 
> in pyarrow. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-6148) Missing debian build dependencies

2019-08-06 Thread Marcin Juszkiewicz (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Juszkiewicz reassigned ARROW-6148:
-

Assignee: Marcin Juszkiewicz

> Missing debian build dependencies
> -
>
> Key: ARROW-6148
> URL: https://issues.apache.org/jira/browse/ARROW-6148
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Francois Saint-Jacques
>Assignee: Marcin Juszkiewicz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6148) Missing debian build dependencies

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6148:
--
Labels: pull-request-available  (was: )

> Missing debian build dependencies
> -
>
> Key: ARROW-6148
> URL: https://issues.apache.org/jira/browse/ARROW-6148
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6148) Missing debian build dependencies

2019-08-06 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6148:
-

 Summary: Missing debian build dependencies
 Key: ARROW-6148
 URL: https://issues.apache.org/jira/browse/ARROW-6148
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6147) [Go] implement a Flight client

2019-08-06 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-6147:
--

 Summary: [Go] implement a Flight client
 Key: ARROW-6147
 URL: https://issues.apache.org/jira/browse/ARROW-6147
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Sebastien Binet






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6107) [Go] ipc.Writer Option to skip appending data buffers

2019-08-06 Thread Sebastien Binet (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900964#comment-16900964
 ] 

Sebastien Binet commented on ARROW-6107:


OK.

(Just nit-picking, but to really assess the CGo overhead one should call C 
directly, not C++ via Python :P. That said, it's a nice PoC.)

SGTM.

 

> [Go] ipc.Writer Option to skip appending data buffers
> -
>
> Key: ARROW-6107
> URL: https://issues.apache.org/jira/browse/ARROW-6107
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Nick Poorman
>Priority: Minor
>
> For cases where we have a known shared memory region, it would be great if 
> the ipc.Writer (and by extension ipc.Reader?) had the ability to write out 
> everything but the actual buffers holding the data. That way we can still 
> utilize the ipc mechanisms to communicate without having to serialize all the 
> underlying data across the wire.
>  
> This seems like it should be possible since the `RecordBatch` flatbuffers 
> only contain the metadata and the underlying data buffers are appended later. 
> We just need to skip appending the underlying data buffers.
>  
> [~sbinet] thoughts?
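
As a rough illustration of that metadata/body split (shown in Python via
pyarrow rather than Go, using current pyarrow names):

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=['x'])

# Serialize one batch through the IPC stream writer.
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# An IPC stream is a sequence of messages; each carries a flatbuffer
# metadata section plus an optional body holding the actual data buffers.
reader = pa.BufferReader(sink.getvalue())
schema_msg = pa.ipc.read_message(reader)  # schema message: metadata only
batch_msg = pa.ipc.read_message(reader)   # record batch: metadata + body
print(schema_msg.type, batch_msg.type)
print(batch_msg.metadata.size, batch_msg.body.size)
{code}

Skipping the body while still emitting the metadata message is essentially
the option being requested here.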



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6146) [Go] implement a Plasma client

2019-08-06 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-6146:
--

 Summary: [Go] implement a Plasma client
 Key: ARROW-6146
 URL: https://issues.apache.org/jira/browse/ARROW-6146
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Sebastien Binet






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6145:
--
Labels: pull-request-available  (was: )

> [Java] UnionVector created by MinorType#getNewVector could not keep field 
> type info properly
> 
>
> Key: ARROW-6145
> URL: https://issues.apache.org/jira/browse/ARROW-6145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>
> When I worked on other items, I found that a {{UnionVector}} created by 
> {{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} could 
> not keep its field type info properly. For example, if we set metadata on a 
> {{Field}} in the schema, we could not get it back via {{UnionVector#getField}}.
> This is mainly because {{MinorType.Union.getNewVector}} did not pass the 
> {{FieldType}} to the vector, and {{UnionVector#getField}} creates a new 
> {{Field}}, which causes an inconsistency.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly

2019-08-06 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6145:
-

 Summary: [Java] UnionVector created by MinorType#getNewVector 
could not keep field type info properly
 Key: ARROW-6145
 URL: https://issues.apache.org/jira/browse/ARROW-6145
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


While working on other items, I found that a {{UnionVector}} created by 
{{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} could not 
keep its field type info properly. For example, if we set metadata on a 
{{Field}} in the schema, we could not get it back via {{UnionVector#getField}}.

This is mainly because {{MinorType.Union.getNewVector}} did not pass the 
{{FieldType}} to the vector, and {{UnionVector#getField}} creates a new 
{{Field}}, which causes an inconsistency.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6144) Implement random function in Gandiva

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6144:
--
Labels: pull-request-available  (was: )

> Implement random function in Gandiva
> 
>
> Key: ARROW-6144
> URL: https://issues.apache.org/jira/browse/ARROW-6144
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
>
> Implement random(), random(int seed) functions



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6038:
--
Labels: pull-request-available windows  (was: windows)

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the 
> batches were empty
> -
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0, 0.14.1
>Reporter: Piotr Bajger
>Priority: Minor
>  Labels: pull-request-available, windows
> Attachments: segfault_ex.py
>
>
> When creating a Table from a list/iterator of batches that contains an 
> "empty" RecordBatch, a Table is produced, but attempting to run pyarrow 
> built-in functions (such as unique()) on it occasionally results in a 
> segfault.
> The MWE is attached: [^segfault_ex.py]
>  # The segfaults happen randomly, around 30% of the time.
>  # Commenting out line 10 in the MWE results in no segfaults.
>  # The segfault is triggered using the unique() function, but I doubt the 
> behaviour is specific to that function; from what I gather, the problem lies 
> in Table creation.
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip 
> (the problem also occurs with 0.13.0 from conda-forge).
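
A minimal sketch of the kind of reproduction described (the authoritative MWE
is the attached [^segfault_ex.py]; this uses current pyarrow spellings):

{code:python}
import pyarrow as pa

# One normal batch and one zero-row ("empty") batch with the same schema.
full = pa.RecordBatch.from_arrays([pa.array([1, 2, 3], type=pa.int64())],
                                  names=['x'])
empty = pa.RecordBatch.from_arrays([pa.array([], type=pa.int64())],
                                   names=['x'])

# Including the empty batch is what reportedly corrupts the resulting Table.
table = pa.Table.from_batches([full, empty])

# A built-in that walks the chunks, e.g. unique(), then crashes
# intermittently according to the report.
print(table.column('x').unique())
{code}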



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6144) Implement random function in Gandiva

2019-08-06 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-6144:
---

 Summary: Implement random function in Gandiva
 Key: ARROW-6144
 URL: https://issues.apache.org/jira/browse/ARROW-6144
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla


Implement random(), random(int seed) functions



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-1562) [C++] Numeric kernel implementations for add (+)

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1562:
--
Labels: Analytics pull-request-available  (was: Analytics)

> [C++] Numeric kernel implementations for add (+)
> 
>
> Key: ARROW-1562
> URL: https://issues.apache.org/jira/browse/ARROW-1562
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics, pull-request-available
>
> This function should apply consistent type promotions across types of 
> different sizes and between signed and unsigned integers.
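
A sketch of the promotion behaviour being asked for, expressed through the
Python {{pyarrow.compute}} bindings (these names postdate this ticket; the
exact promotion rules are the proposal here, not settled behaviour):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# Different widths: the result widens to the larger type.
a = pa.array([1, 2], type=pa.int32())
b = pa.array([3, 4], type=pa.int64())
print(pc.add(a, b).type)  # int64

# Signed + unsigned: promote to a signed type wide enough for both.
u = pa.array([1, 2], type=pa.uint8())
s = pa.array([-1, -2], type=pa.int8())
print(pc.add(u, s).type)  # int16
{code}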



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5953) Thrift download ERRORS with apache-arrow-0.14.0

2019-08-06 Thread Marcin Juszkiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900831#comment-16900831
 ] 

Marcin Juszkiewicz commented on ARROW-5953:
---

On Debian 'buster' I see a similar failure with Arrow 0.14.1 when trying to 
rebuild the package for the arm64 architecture:
 

{{-- Checking for module 'thrift'}}
{{--   No package 'thrift' found}}
{{-- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR 
THRIFT_COMPILER)}}
{{Building Apache Thrift from source}}
{{Downloading Apache Thrift from Traceback (most recent call last):}}
{{  File "/usr/lib/python3.7/urllib/request.py", line 1317, in do_open}}
{{    encode_chunked=req.has_header('Transfer-encoding'))}}
{{  File "/usr/lib/python3.7/http/client.py", line 1229, in request}}
{{    self._send_request(method, url, body, headers, encode_chunked)}}
{{  File "/usr/lib/python3.7/http/client.py", line 1275, in _send_request}}
{{    self.endheaders(body, encode_chunked=encode_chunked)}}
{{  File "/usr/lib/python3.7/http/client.py", line 1224, in endheaders}}
{{    self._send_output(message_body, encode_chunked=encode_chunked)}}
{{  File "/usr/lib/python3.7/http/client.py", line 1016, in _send_output}}
{{    self.send(msg)}}
{{  File "/usr/lib/python3.7/http/client.py", line 956, in send}}
{{    self.connect()}}
{{  File "/usr/lib/python3.7/http/client.py", line 1392, in connect}}
{{    server_hostname=server_hostname)}}
{{  File "/usr/lib/python3.7/ssl.py", line 412, in wrap_socket}}
{{    session=session}}
{{  File "/usr/lib/python3.7/ssl.py", line 853, in _create}}
{{    self.do_handshake()}}
{{  File "/usr/lib/python3.7/ssl.py", line 1117, in do_handshake}}
{{    self._sslobj.do_handshake()}}
{{ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate 
verify failed: unable to get local issuer certificate (_ssl.c:1056)}}

> Thrift download ERRORS with apache-arrow-0.14.0
> ---
>
> Key: ARROW-5953
> URL: https://issues.apache.org/jira/browse/ARROW-5953
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.13.0, 0.14.0
> Environment: RHEL 6.7
>Reporter: Brian
>Priority: Major
>
> cmake returns:
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either 
> of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> during the check for the Thrift download location.
> This occurs with a freshly inflated Arrow source release tree where cmake is 
> running for the first time.
> Reproducible with the release levels of apache-arrow-0.14.0 and 0.13.0. I 
> tried this 3-5 times on 15Jul2019 and saw it consistently each time.
> Here's the full context from the cmake output:
> {quote}-- Checking for module 'thrift'
> --   No package 'thrift' found
> -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR 
> THRIFT_COMPILER)
> Building Apache Thrift from source
> Downloading Apache Thrift from Traceback (most recent call last):
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 
> 38, in 
>     suggested_mirror = get_url('[https://www.apache.org/dyn/]'
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 
> 27, in get_url
>     return requests.get(url).content
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get
>     return request('get', url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request
>     response = session.request(method=method, url=url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in 
> request
>     resp = self.send(prep, **send_kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in 
> send
>     r = adapter.send(request, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in 
> send
>     raise SSLError(e, request=request)
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either 
> of '*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
> {quote}
> Per Wes' suggestion I ran the following directly:
> python cpp/build-support/get_apache_mirror.py 
> [https://www-eu.apache.org/dist/] [http://us.mirrors.quenda.co/apache/]
> with this output:
> [https://www-eu.apache.org/dist/]  [http://us.mirrors.quenda.co/apache/]
>  
>  
> *NOTE:* here are the cmake thrift log lines from a build of an apache-arrow 
> git clone on 06Jul2019 where cmake/make ran fine.
>  
> {quote}-- Checking for module 'thrift'
> 

[jira] [Closed] (ARROW-6129) Row_groups duplicate Rows

2019-08-06 Thread albertoramon (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

albertoramon closed ARROW-6129.
---
Resolution: Not A Problem

This is the expected behavior

> Row_groups duplicate Rows
> -
>
> Key: ARROW-6129
> URL: https://issues.apache.org/jira/browse/ARROW-6129
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.1
>Reporter: albertoramon
>Priority: Major
>  Labels: parquetWriter
> Attachments: tes_output.png, test01.py, top10.csv
>
>
> Writing Parquet with row groups duplicates rows:
>     Input: CSV, 10 rows
>     Row_Groups=1 --> output 10 rows 
>     Row_Groups=2 --> output 20 rows
>   !tes_output.png!
> Is this expected?
> Code snippet and CSV attached.
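
For reference, a sketch of producing multiple row groups without writing the
rows twice ({{row_group_size}} is a real {{pyarrow.parquet.write_table}}
option; file names are illustrative):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'x': list(range(10))})

# One write call, split into two row groups of 5 rows each: still 10 rows.
pq.write_table(table, 'out.parquet', row_group_size=5)

meta = pq.ParquetFile('out.parquet').metadata
print(meta.num_row_groups, meta.num_rows)  # 2 10

# By contrast, calling ParquetWriter.write_table(table) once per desired
# row group appends the whole table again each time, doubling the rows.
{code}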



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)