[jira] [Updated] (ARROW-4512) [R] Stream reader/writer API that takes socket stream

2020-06-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated ARROW-4512:

Affects Version/s: 1.0.0

> [R] Stream reader/writer API that takes socket stream
> -
>
> Key: ARROW-4512
> URL: https://issues.apache.org/jira/browse/ARROW-4512
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.12.0, 0.14.1, 1.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> I have been working on Spark integration with Arrow.
> I realised that there is no way to use a socket as the input or output for the 
> Arrow stream format. For instance, I want to do something like:
> {code}
> connStream <- socketConnection(port = , blocking = TRUE, open = "wb")
> rdf_slices <- # a list of data frames.
> stream_writer <- NULL
> tryCatch({
>   for (rdf_slice in rdf_slices) {
>     batch <- record_batch(rdf_slice)
>     if (is.null(stream_writer)) {
>       # Here, looks there's no way to use a socket.
>       stream_writer <- RecordBatchStreamWriter(connStream, batch$schema)
>     }
>     stream_writer$write_batch(batch)
>   }
> },
> finally = {
>   if (!is.null(stream_writer)) {
>     stream_writer$close()
>   }
> })
> {code}
> Likewise, I cannot find a way to iterate the stream batch by batch:
> {code}
> RecordBatchStreamReader(connStream)$batches()  # Here, looks there's no way to use a socket.
> {code}
> This looks easily possible on the Python side but appears to be missing from the R APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway

2020-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4633:
--
Labels: dataset-parquet-read newbie parquet pull-request-available  (was: 
dataset-parquet-read newbie parquet)

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> --
>
> Key: ARROW-4633
> URL: https://issues.apache.org/jira/browse/ARROW-4633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
> Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>Reporter: Taylor Johnson
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: dataset-parquet-read, newbie, parquet, 
> pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The following code seems to suggest that ParquetFile.read(use_threads=False) 
> still creates a ThreadPool.  This is observed in 
> ParquetFile.read_row_group(use_threads=False) as well. 
> This does not appear to be a problem in 
> pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the error.  Starting in python/pyarrow/parquet.py, both 
> ParquetReader.read_all() and ParquetReader.read_row_group() pass the 
> use_threads input along to self.reader which is a ParquetReader imported from 
> _parquet.pyx
> Following the calls into python/pyarrow/_parquet.pyx, we see that 
> ParquetReader.read_all() and ParquetReader.read_row_group() have the 
> following code which seems a bit suspicious
> {code}
> if use_threads:
>     self.set_use_threads(use_threads)
> {code}
> Why not just always call self.set_use_threads(use_threads)?
> The ParquetReader.set_use_threads simply calls 
> self.reader.get().set_use_threads(use_threads).  This self.reader is assigned 
> as unique_ptr[FileReader].  I think this points to 
> cpp/src/parquet/arrow/reader.cc, but I'm not sure about that.  The 
> FileReader::Impl::ReadRowGroup logic looks ok, as a call to 
> ::arrow::internal::GetCpuThreadPool() is only called if use_threads is True.  
> The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> --
> {code:python}
> import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
> use_threads=False
> p=psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame({'x':[0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> {code}
> ---
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads
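The conditional the reporter flags has a subtle consequence: if the underlying reader starts out with threading enabled, a caller passing use_threads=False never reaches the setter, so the default wins. A toy sketch of that logic bug (illustration only, not Arrow's actual internals; the threaded-by-default assumption is mine):

```python
# Toy model of the suspicious "if use_threads: set(...)" pattern: skipping the
# setter when the flag is falsy means False can never override a True default.
class ToyReader:
    def __init__(self):
        self.use_threads = True  # assume the native reader defaults to threaded

    def set_use_threads(self, flag):
        self.use_threads = flag

def read_all_buggy(reader, use_threads):
    if use_threads:                      # setter skipped when False...
        reader.set_use_threads(use_threads)
    return reader.use_threads            # ...so the True default survives

def read_all_fixed(reader, use_threads):
    reader.set_use_threads(use_threads)  # always forward the caller's choice
    return reader.use_threads

print(read_all_buggy(ToyReader(), False))  # -> True (threads still enabled)
print(read_all_fixed(ToyReader(), False))  # -> False
```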



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4633:
---

Assignee: Wes McKinney

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> --
>
> Key: ARROW-4633
> URL: https://issues.apache.org/jira/browse/ARROW-4633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
> Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>Reporter: Taylor Johnson
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: dataset-parquet-read, newbie, parquet
> Fix For: 1.0.0
>
>
> The following code seems to suggest that ParquetFile.read(use_threads=False) 
> still creates a ThreadPool.  This is observed in 
> ParquetFile.read_row_group(use_threads=False) as well. 
> This does not appear to be a problem in 
> pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the error.  Starting in python/pyarrow/parquet.py, both 
> ParquetReader.read_all() and ParquetReader.read_row_group() pass the 
> use_threads input along to self.reader which is a ParquetReader imported from 
> _parquet.pyx
> Following the calls into python/pyarrow/_parquet.pyx, we see that 
> ParquetReader.read_all() and ParquetReader.read_row_group() have the 
> following code which seems a bit suspicious
> {code}
> if use_threads:
>     self.set_use_threads(use_threads)
> {code}
> Why not just always call self.set_use_threads(use_threads)?
> The ParquetReader.set_use_threads simply calls 
> self.reader.get().set_use_threads(use_threads).  This self.reader is assigned 
> as unique_ptr[FileReader].  I think this points to 
> cpp/src/parquet/arrow/reader.cc, but I'm not sure about that.  The 
> FileReader::Impl::ReadRowGroup logic looks ok, as a call to 
> ::arrow::internal::GetCpuThreadPool() is only called if use_threads is True.  
> The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> --
> {code:python}
> import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
> use_threads=False
> p=psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame({'x':[0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> {code}
> ---
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4067) [C++] RFC: standardize ArrayBuilder subclasses

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124509#comment-17124509
 ] 

Wes McKinney commented on ARROW-4067:
-

I removed this from any milestone since it does not seem to be doing much harm 
to us at the moment. 

> [C++] RFC: standardize ArrayBuilder subclasses
> --
>
> Key: ARROW-4067
> URL: https://issues.apache.org/jira/browse/ARROW-4067
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
>  Labels: usability
>
> Each builder supports different and frequently differently named methods for 
> appending. It should be possible to establish a more consistent convention, 
> which would alleviate dev confusion and simplify generics.
> For example, let all Builders be required to define at minimum:
>  * {{Reserve(int64_t)}}
>  * a nested type named {{Scalar}}, which is the canonical scalar appended to 
> this builder. Appending other types may be supported for convenience.
>  * {{UnsafeAppend(Scalar)}}
>  * {{UnsafeAppendNull()}}
> The other methods described below can be overridden if an optimization is 
> available or left defaulted (a CRTP helper can contain the default 
> implementations; for example, {{Append(Scalar)}} would simply be a call to 
> Reserve then UnsafeAppend).
> In addition to their unsafe equivalents, {{Append(Scalar)}} and 
> {{AppendNull()}} should be available for appending without manual capacity 
> maintenance.
> It is not necessary for the rest of this RFC, but it would simplify builders 
> further if scalar append methods always had a single argument. For example, 
> this would mean abolishing {{BinaryBuilder::Append(const uint8_t*, int32_t)}} 
> in favor of {{BinaryBuilder::Append(basic_string_view)}}. There's no 
> runtime overhead involved in this change, and developers who have a pointer 
> and a length instead of a view can just construct one without boilerplate 
> using brace initialization: {code}b->Append({pointer, length});{code}
> Unsafe and safe methods should be provided for appending multiple values as 
> well. The default implementation will be a trivial loop but if optimizations 
> are available then this could be overridden (for example instead of copying 
> bits one by one into a BooleanBuilder, bytes could be memcpy'd). Append 
> methods for multiple values should accept two arguments, the first of which 
> contains values and the second of which defines validity. The canonical 
> multiple append method has signature {{Status(array_view<Scalar> values, 
> const uint8_t* valid_bytes)}}, but other overloads and helpers could be 
> provided as well:
> {code}
> b->Append({{1, 3, 4}}, all_valid); // append values with no nulls
> b->Append({{1, 3, 4}}, bool_vector); // use the elements of a vector for validity
> b->Append({{1, 3, 4}}, bits(ptr)); // interpret ptr as a buffer of valid bits, rather than valid bytes
> {code}
> Builders of nested types currently require developers to write boilerplate 
> wrangling the child builders. This could be alleviated by letting nested 
> builders' append methods return a helper as an output argument:
> {code}
> ListBuilder::List lst;
> RETURN_NOT_OK(list_builder.Append()); // ListBuilder::Scalar == ListBuilder::ListBase*
> RETURN_NOT_OK(lst->Append(3));
> RETURN_NOT_OK(lst->Append(4));
> StructBuilder::Struct strct;
> RETURN_NOT_OK(struct_builder.Append());
> RETURN_NOT_OK(strct.Set(0, "uuid"));
> RETURN_NOT_OK(strct.Set(2, 47));
> RETURN_NOT_OK(strct->Finish()); // appends null to unspecified fields
> {code}
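Sketching the proposed contract outside C++ may make it concrete: concrete builders supply Reserve/UnsafeAppend/UnsafeAppendNull plus the Scalar type, while a shared base class (standing in for the CRTP helper) provides the defaulted safe methods. Python is used for brevity; all names are illustrative, not Arrow's:

```python
# Sketch of the RFC's builder contract: the base class supplies the defaulted
# safe methods in terms of the required unsafe primitives.
class BuilderBase:
    def append(self, value):
        # default implementation: Reserve then UnsafeAppend
        self.reserve(1)
        self.unsafe_append(value)

    def append_null(self):
        self.reserve(1)
        self.unsafe_append_null()

class Int64Builder(BuilderBase):
    Scalar = int  # the canonical scalar appended to this builder

    def __init__(self):
        self.values, self.valid, self._capacity = [], [], 0

    def reserve(self, n):
        self._capacity = max(self._capacity, len(self.values) + n)

    def unsafe_append(self, v):
        self.values.append(v)
        self.valid.append(True)

    def unsafe_append_null(self):
        self.values.append(0)   # placeholder slot; validity bit marks it null
        self.valid.append(False)

b = Int64Builder()
b.append(1)
b.append_null()
b.append(3)
print(b.values, b.valid)  # -> [1, 0, 3] [True, False, True]
```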



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4067) [C++] RFC: standardize ArrayBuilder subclasses

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4067:

Fix Version/s: (was: 1.0.0)

> [C++] RFC: standardize ArrayBuilder subclasses
> --
>
> Key: ARROW-4067
> URL: https://issues.apache.org/jira/browse/ARROW-4067
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
>  Labels: usability
>
> Each builder supports different and frequently differently named methods for 
> appending. It should be possible to establish a more consistent convention, 
> which would alleviate dev confusion and simplify generics.
> For example, let all Builders be required to define at minimum:
>  * {{Reserve(int64_t)}}
>  * a nested type named {{Scalar}}, which is the canonical scalar appended to 
> this builder. Appending other types may be supported for convenience.
>  * {{UnsafeAppend(Scalar)}}
>  * {{UnsafeAppendNull()}}
> The other methods described below can be overridden if an optimization is 
> available or left defaulted (a CRTP helper can contain the default 
> implementations; for example, {{Append(Scalar)}} would simply be a call to 
> Reserve then UnsafeAppend).
> In addition to their unsafe equivalents, {{Append(Scalar)}} and 
> {{AppendNull()}} should be available for appending without manual capacity 
> maintenance.
> It is not necessary for the rest of this RFC, but it would simplify builders 
> further if scalar append methods always had a single argument. For example, 
> this would mean abolishing {{BinaryBuilder::Append(const uint8_t*, int32_t)}} 
> in favor of {{BinaryBuilder::Append(basic_string_view)}}. There's no 
> runtime overhead involved in this change, and developers who have a pointer 
> and a length instead of a view can just construct one without boilerplate 
> using brace initialization: {code}b->Append({pointer, length});{code}
> Unsafe and safe methods should be provided for appending multiple values as 
> well. The default implementation will be a trivial loop but if optimizations 
> are available then this could be overridden (for example instead of copying 
> bits one by one into a BooleanBuilder, bytes could be memcpy'd). Append 
> methods for multiple values should accept two arguments, the first of which 
> contains values and the second of which defines validity. The canonical 
> multiple append method has signature {{Status(array_view<Scalar> values, 
> const uint8_t* valid_bytes)}}, but other overloads and helpers could be 
> provided as well:
> {code}
> b->Append({{1, 3, 4}}, all_valid); // append values with no nulls
> b->Append({{1, 3, 4}}, bool_vector); // use the elements of a vector for validity
> b->Append({{1, 3, 4}}, bits(ptr)); // interpret ptr as a buffer of valid bits, rather than valid bytes
> {code}
> Builders of nested types currently require developers to write boilerplate 
> wrangling the child builders. This could be alleviated by letting nested 
> builders' append methods return a helper as an output argument:
> {code}
> ListBuilder::List lst;
> RETURN_NOT_OK(list_builder.Append()); // ListBuilder::Scalar == ListBuilder::ListBase*
> RETURN_NOT_OK(lst->Append(3));
> RETURN_NOT_OK(lst->Append(4));
> StructBuilder::Struct strct;
> RETURN_NOT_OK(struct_builder.Append());
> RETURN_NOT_OK(strct.Set(0, "uuid"));
> RETURN_NOT_OK(strct.Set(2, 47));
> RETURN_NOT_OK(strct->Finish()); // appends null to unspecified fields
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-2702) [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc to see if we are using the right error type in each instance

2020-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2702:
--
Labels: pull-request-available  (was: )

> [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc 
> to see if we are using the right error type in each instance
> -
>
> Key: ARROW-2702
> URL: https://issues.apache.org/jira/browse/ARROW-2702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See discussion in [https://github.com/apache/arrow/pull/2075]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6974) [C++] Refactor temporal casts to work with Scalar inputs

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124498#comment-17124498
 ] 

Wes McKinney commented on ARROW-6974:
-

I edited the title to reframe the issue in the context of the new framework

> [C++] Refactor temporal casts to work with Scalar inputs
> 
>
> Key: ARROW-6974
> URL: https://issues.apache.org/jira/browse/ARROW-6974
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> Currently, the casting for time-like data is done with the {{ShiftTime}} 
> function. It _might_ be possible to simplify this with ArrayDataVisitor (to 
> avoid looping / checking the bitmap).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6974) [C++] Refactor temporal casts to work with Scalar inputs

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6974:

Summary: [C++] Refactor temporal casts to work with Scalar inputs  (was: 
[C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern)

> [C++] Refactor temporal casts to work with Scalar inputs
> 
>
> Key: ARROW-6974
> URL: https://issues.apache.org/jira/browse/ARROW-6974
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> Currently, the casting for time-like data is done with the {{ShiftTime}} 
> function. It _might_ be possible to simplify this with ArrayDataVisitor (to 
> avoid looping / checking the bitmap).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-2702) [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc to see if we are using the right error type in each instance

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2702:
---

Assignee: Wes McKinney

> [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc 
> to see if we are using the right error type in each instance
> -
>
> Key: ARROW-2702
> URL: https://issues.apache.org/jira/browse/ARROW-2702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See discussion in [https://github.com/apache/arrow/pull/2075]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8951) [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8951.
-
Resolution: Fixed

Issue resolved by pull request 7330
[https://github.com/apache/arrow/pull/7330]

> [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc
> -
>
> Key: ARROW-8951
> URL: https://issues.apache.org/jira/browse/ARROW-8951
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The kernel functor can return an uninitialized value on errors
> {code}
> ../src/arrow/compute/kernels/scalar_cast_temporal.cc: In member function ‘OUT arrow::compute::internal::ParseTimestamp::Call(arrow::compute::KernelContext*, ARG0) const [with OUT = long int; ARG0 = nonstd::sv_lite::basic_string_view]’:
> ../src/arrow/compute/kernels/scalar_cast_temporal.cc:267:12: warning: ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>  return result;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8976.
-
  Assignee: Wes McKinney
Resolution: Fixed

This is addressed in 
https://github.com/apache/arrow/commit/199d089e6343df1a1aa95f75b9e99e05a2702257

> [C++] compute::CallFunction can't Filter/Take with ChunkedArray
> ---
>
> Key: ARROW-8976
> URL: https://issues.apache.org/jira/browse/ARROW-8976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Followup to ARROW-8938
> {{Invalid: Kernel does not support chunked array arguments}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9018) [C++] Remove APIs that were deprecated in 0.17.x and prior

2020-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9018:
--
Labels: pull-request-available  (was: )

> [C++] Remove APIs that were deprecated in 0.17.x and prior
> --
>
> Key: ARROW-9018
> URL: https://issues.apache.org/jira/browse/ARROW-9018
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9018) [C++] Remove APIs that were deprecated in 0.17.x and prior

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-9018:
---

Assignee: Wes McKinney

> [C++] Remove APIs that were deprecated in 0.17.x and prior
> --
>
> Key: ARROW-9018
> URL: https://issues.apache.org/jira/browse/ARROW-9018
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7009) [C++] Refactor filter/take kernels to use Datum instead of overloads

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7009.
-
Resolution: Fixed

This was done in 
https://github.com/apache/arrow/commit/199d089e6343df1a1aa95f75b9e99e05a2702257

> [C++] Refactor filter/take kernels to use Datum instead of overloads
> 
>
> Key: ARROW-7009
> URL: https://issues.apache.org/jira/browse/ARROW-7009
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 1.0.0
>
>
> Follow-up to ARROW-6784. See the discussion on 
> [https://github.com/apache/arrow/pull/5686], as well as ARROW-6959.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8917) [C++][Compute] Formalize "metafunction" concept

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8917.
-
Resolution: Fixed

Issue resolved by pull request 7318
[https://github.com/apache/arrow/pull/7318]

> [C++][Compute] Formalize "metafunction" concept
> ---
>
> Key: ARROW-8917
> URL: https://issues.apache.org/jira/browse/ARROW-8917
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> A metafunction is a function that provides the {{Execute}} API but does not 
> contain any kernels. Such functions can also handle non-Array/Scalar inputs 
> like RecordBatch or Table. 
> This will enable bindings to invoke such functions (like take, filter) like
> {code}
> call_function('take', [table, indices])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7605) [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7605:
---

Assignee: (was: Wes McKinney)

> [C++] Merge jemalloc and other BUNDLED dependencies into libarrow.a
> ---
>
> Key: ARROW-7605
> URL: https://issues.apache.org/jira/browse/ARROW-7605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> If ARROW_JEMALLOC=ON, then currently libarrow.a cannot be used for static 
> linking without also obtaining libjemalloc_pic.a



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7017:
---

Assignee: (was: Wes McKinney)

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction and 
> Multiplication.
>  * Should have an overflow/underflow detection mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8976:
---

Assignee: (was: Wes McKinney)

> [C++] compute::CallFunction can't Filter/Take with ChunkedArray
> ---
>
> Key: ARROW-8976
> URL: https://issues.apache.org/jira/browse/ARROW-8976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Followup to ARROW-8938
> {{Invalid: Kernel does not support chunked array arguments}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9020) read_json won't respect explicit_schema in parse_options

2020-06-02 Thread Felipe Santos (Jira)
Felipe Santos created ARROW-9020:


 Summary: read_json won't respect explicit_schema in parse_options
 Key: ARROW-9020
 URL: https://issues.apache.org/jira/browse/ARROW-9020
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
 Environment: CPython 3.8.2, MacOS Mojave 10.14.6
Reporter: Felipe Santos
 Fix For: 0.17.1


I am trying to read a json file using an explicit schema, but it looks like the 
schema is ignored. Moreover, if my schema contains a field not present in 
the json file, then the output table contains all the fields of the json file 
plus the fields of my schema not found in the file.

A minimal example:
{code:python}
import pyarrow as pa
from pyarrow import json

# allowing for type inference
print(json.read_json('tmp.json'))
# prints:
# pyarrow.Table
# foo: string
# baz: string

# using an explicit schema that would read only "foo"
schema = pa.schema([('foo', pa.string())])
print(json.read_json('tmp.json', 
parse_options=json.ParseOptions(explicit_schema=schema)))
# prints:
# pyarrow.Table
# foo: string
# baz: string

# using an explicit schema that would read only "not_a_field",
# which is not present in the json file
schema = pa.schema([('not_a_field', pa.string())])
print(json.read_json('tmp.json', 
parse_options=json.ParseOptions(explicit_schema=schema)))
# prints:
# pyarrow.Table
# not_a_field: string
# foo: string
# baz: string
{code}

And the tmp.json file looks like:
{code:json}
{"foo": "bar", "baz": "1"}

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-1292) [C++/Python] Expand libhdfs feature coverage

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1292:
---

Assignee: (was: Wes McKinney)

> [C++/Python] Expand libhdfs feature coverage
> 
>
> Key: ARROW-1292
> URL: https://issues.apache.org/jira/browse/ARROW-1292
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
>
> Umbrella JIRA. Will create child issues for more granular tasks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-5381) [C++] Crash at arrow::internal::CountSetBits

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5381:
---

Assignee: (was: Wes McKinney)

> [C++] Crash at arrow::internal::CountSetBits
> 
>
> Key: ARROW-5381
> URL: https://issues.apache.org/jira/browse/ARROW-5381
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Operating System: Windows 7 Professional 64-bit (6.1, 
> Build 7601) Service Pack 1(7601.win7sp1_ldr_escrow.181110-1429)
> Language: English (Regional Setting: English)
> System Manufacturer: SAMSUNG ELECTRONICS CO., LTD.
> System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520
> BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ
> Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz
> Memory: 2048MB RAM
> Available OS Memory: 1962MB RAM
>   Page File: 1517MB used, 2405MB available
> Windows Dir: C:\Windows
> DirectX Version: DirectX 11
>Reporter: Tham
>Priority: Major
>  Labels: pull-request-available
> Attachments: bit-util.asm, iMac-late2009.png, popcnt_support.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I've got a lot of crash dumps from a customer's Windows machine. The 
> stack trace shows that it crashed in arrow::internal::CountSetBits.
>  
> {code:java}
> STACK_TEXT:  
> 00c9`5354a4c0 7ff7`2f2830fd : 00c9`544841c0 ` 
> `1e00 ` : 
> CortexService!arrow::internal::CountSetBits+0x16d
> 00c9`5354a550 7ff7`2f2834b7 : 00c9`5337c930 ` 
> ` ` : 
> CortexService!arrow::ArrayData::GetNullCount+0x8d
> 00c9`5354a580 7ff7`2f13df55 : 00c9`54476080 00c9`5354a5d8 
> ` ` : 
> CortexService!arrow::Array::null_count+0x37
> 00c9`5354a5b0 7ff7`2f13fb68 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::Visit >+0xa5
> 00c9`5354a640 7ff7`2f12fa34 : 00c9`5354a6f8 00c9`54476080 
> 00c9`5354ab40 ` : 
> CortexService!arrow::VisitArrayInline namespace'::LevelBuilder>+0x298
> 00c9`5354a680 7ff7`2f14bf03 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::VisitInline+0x44
> 00c9`5354a6c0 7ff7`2f12fe2a : 00c9`5354ab40 00c9`5354ae18 
> 00c9`54476080 00c9`5354b208 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::GenerateLevels+0x93
> 00c9`5354aa00 7ff7`2f14de56 : 00c9`5354b1f8 00c9`5354afc8 
> 00c9`54476080 `1e00 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x25a
> 00c9`5354af20 7ff7`2f14e66b : 00c9`5354b1f8 00c9`5354b238 
> 00c9`54445c20 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x2a6
> 00c9`5354b040 7ff7`2f12f137 : 00c9`544041f0 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b
> 00c9`5354b400 7ff7`2f14b4d5 : 00c9`54431180 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67
> 00c9`5354b450 7ff7`2f12eef1 : 00c9`5354b5d8 00c9`5354b648 
> ` `1e00 : 
> CortexService!::operator()+0x195
> 00c9`5354b530 7ff7`2eb8e31e : 00c9`54431180 00c9`5354b760 
> 00c9`54442fb0 `1e00 : 
> CortexService!parquet::arrow::FileWriter::WriteTable+0x521
> 00c9`5354b730 7ff7`2eb58ac5 : 00c9`5307bd88 00c9`54442fb0 
> ` ` : 
> CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe
> 00c9`5354b860 7ff7`2eafdce6 : 00c9`5307bd80 00c9`5354ba08 
> 00c9`5354b9e0 00c9`5354b9d8 : 
> CortexService!Cortex::Storage::ParquetFileWriter::writeRowGroup+0x545
> 00c9`5354b9a0 7ff7`2eaf8bae : 00c9`53275600 00c9`53077220 
> `fffe ` : 
> CortexService!Cortex::Storage::DataStreamWriteWorker::onNewData+0x1a6
> {code}
> {code:java}
> FAILED_INSTRUCTION_ADDRESS: 
> CortexService!arrow::internal::CountSetBits+16d 
> [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc
>  @ 99]
> 7ff7`2f3a4e4d f3480fb800  popcnt  rax,qword ptr [rax]
> FOLLOWUP_IP: 
> CortexService!arrow::internal::CountSetBits+16d 
> 

[jira] [Assigned] (ARROW-6940) [C++] Expose Message-level IPC metadata in both read and write interfaces

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6940:
---

Assignee: (was: Wes McKinney)

> [C++] Expose Message-level IPC metadata in both read and write interfaces
> -
>
> Key: ARROW-6940
> URL: https://issues.apache.org/jira/browse/ARROW-6940
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> the Message flatbuffer type has {{custom_metadata}} but there is no API 
> support for reading and writing values to this field. 





[jira] [Assigned] (ARROW-6979) [R] Enable jemalloc in autobrew formula

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6979:
---

Assignee: (was: Wes McKinney)

> [R] Enable jemalloc in autobrew formula
> ---
>
> Key: ARROW-6979
> URL: https://issues.apache.org/jira/browse/ARROW-6979
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> See 
> https://github.com/apache/arrow/blob/59a6788c76330cf055bdbcbc7bdae7b0106c6656/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb#L47





[jira] [Assigned] (ARROW-3077) [Website] Add page summarizing project contributions since project inception

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3077:
---

Assignee: (was: Wes McKinney)

> [Website] Add page summarizing project contributions since project inception
> 
>
> Key: ARROW-3077
> URL: https://issues.apache.org/jira/browse/ARROW-3077
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
>
> We have already been doing this on a per-release basis, e.g.
> http://arrow.apache.org/release/0.10.0.html





[jira] [Assigned] (ARROW-1596) [Python] Expand serialization test suite for NumPy arrays

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1596:
---

Assignee: (was: Wes McKinney)

> [Python] Expand serialization test suite for NumPy arrays
> -
>
> Key: ARROW-1596
> URL: https://issues.apache.org/jira/browse/ARROW-1596
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> see 
> https://github.com/dask/distributed/blob/master/distributed/protocol/tests/test_numpy.py#L30-L65
>  for inspiration





[jira] [Assigned] (ARROW-5158) [Packaging][Wheel] Symlink libraries in wheels

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5158:
---

Assignee: Wes McKinney

> [Packaging][Wheel] Symlink libraries in wheels
> --
>
> Key: ARROW-5158
> URL: https://issues.apache.org/jira/browse/ARROW-5158
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Reporter: Krisztian Szucs
>Assignee: Wes McKinney
>Priority: Major
>  Labels: wheel
>
> Libraries are copied instead of symlinked in the Linux and macOS wheels, 
> which results in quite big binaries:
>  
> This is what the wheel contains before running auditwheel:
>  
> {code}
> -rwxr-xr-x  1 root root 128K Apr  3 09:02 libarrow_boost_filesystem.so
> -rwxr-xr-x  1 root root 128K Apr  3 09:02 libarrow_boost_filesystem.so.1.66.0
> -rwxr-xr-x  1 root root 1.2M Apr  3 09:02 libarrow_boost_regex.so
> -rwxr-xr-x  1 root root 1.2M Apr  3 09:02 libarrow_boost_regex.so.1.66.0
> -rwxr-xr-x  1 root root  30K Apr  3 09:02 libarrow_boost_system.so
> -rwxr-xr-x  1 root root  30K Apr  3 09:02 libarrow_boost_system.so.1.66.0
> -rwxr-xr-x  1 root root 1.4M Apr  3 09:02 libarrow_python.so
> -rwxr-xr-x  1 root root 1.4M Apr  3 09:02 libarrow_python.so.14
> -rwxr-xr-x  1 root root  12M Apr  3 09:02 libarrow.so
> -rwxr-xr-x  1 root root  12M Apr  3 09:02 libarrow.so.14
> -rw-r--r--  1 root root 6.1M Apr  3 09:02 lib.cpp
> -rwxr-xr-x  1 root root 2.4M Apr  3 09:02 
> lib.cpython-36m-x86_64-linux-gnu.so
> -rwxr-xr-x  1 root root  55M Apr  3 09:02 libgandiva.so
> -rwxr-xr-x  1 root root  55M Apr  3 09:02 libgandiva.so.14
> -rwxr-xr-x  1 root root 2.9M Apr  3 09:02 libparquet.so
> -rwxr-xr-x  1 root root 2.9M Apr  3 09:02 libparquet.so.14
> -rwxr-xr-x  1 root root 309K Apr  3 09:02 libplasma.so
> -rwxr-xr-x  1 root root 309K Apr  3 09:02 libplasma.so.14
>  {code}
> After running auditwheel, the repaired wheel contains:
>  
> {code}
> -rwxr-xr-x  1 root root 128K Apr  3 09:02 libarrow_boost_filesystem.so
> -rwxr-xr-x  1 root root 128K Apr  3 09:02 libarrow_boost_filesystem.so.1.66.0
> -rwxr-xr-x  1 root root 1.2M Apr  3 09:02 libarrow_boost_regex.so
> -rwxr-xr-x  1 root root 1.2M Apr  3 09:02 libarrow_boost_regex.so.1.66.0
> -rwxr-xr-x  1 root root  30K Apr  3 09:02 libarrow_boost_system.so
> -rwxr-xr-x  1 root root  30K Apr  3 09:02 libarrow_boost_system.so.1.66.0
> -rwxr-xr-x  1 root root 1.6M Apr  3 09:55 libarrow_python.so
> -rwxr-xr-x  1 root root 1.4M Apr  3 09:02 libarrow_python.so.14
> -rwxr-xr-x  1 root root  12M Apr  3 09:55 libarrow.so
> -rwxr-xr-x  1 root root  12M Apr  3 09:02 libarrow.so.14
> -rw-r--r--  1 root root 6.1M Apr  3 09:02 lib.cpp
> -rwxr-xr-x  1 root root 2.5M Apr  3 09:55 
> lib.cpython-36m-x86_64-linux-gnu.so
> -rwxr-xr-x  1 root root  59M Apr  3 09:55 libgandiva.so
> -rwxr-xr-x  1 root root  55M Apr  3 09:02 libgandiva.so.14
> -rwxr-xr-x  1 root root 3.5M Apr  3 09:55 libparquet.so
> -rwxr-xr-x  1 root root 2.9M Apr  3 09:02 libparquet.so.14
> -rwxr-xr-x  1 root root 345K Apr  3 09:55 libplasma.so
> -rwxr-xr-x  1 root root 309K Apr  3 09:02 libplasma.so.14
> {code}
>  
> Here is the output of auditwheel 
> [https://travis-ci.org/kszucs/crossbow/builds/514605723#L3340]
> They should be symlinks; we have special code for this: 
> https://github.com/apache/arrow/blob/4495305092411e8551c60341e273c8aa3c14b282/python/setup.py#L489-L499
>  This is probably not making it into the wheel because wheels are zip files, 
> which don't support symlinks by default. So we probably need to pass the 
> `--symlinks` parameter to the wheel code.
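To make the symlink idea above concrete: ZIP archives (and therefore wheels) have no native symlink entry type, but packers can emulate one by storing the link target as the member's data and setting the Unix `S_IFLNK` mode bit in the entry's external attributes. Below is a minimal sketch using Python's standard `zipfile` module; the file names and the `add_symlink` helper are illustrative, not the actual wheel-building code:

```python
import os
import stat
import tempfile
import zipfile

def add_symlink(zf, arcname, target):
    # ZIP has no native symlink entry type: packers emulate one by storing
    # the link target as the member's data and setting the Unix S_IFLNK
    # mode bit in the high 16 bits of external_attr.
    info = zipfile.ZipInfo(arcname)
    info.create_system = 3                        # 3 = Unix
    info.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(info, target)

with tempfile.TemporaryDirectory() as tmp:
    wheel_path = os.path.join(tmp, "demo.whl")
    with zipfile.ZipFile(wheel_path, "w") as zf:
        zf.writestr("libarrow.so.14", b"\x7fELF fake shared library")
        add_symlink(zf, "libarrow.so", "libarrow.so.14")
    with zipfile.ZipFile(wheel_path) as zf:
        info = zf.getinfo("libarrow.so")
        is_link = stat.S_ISLNK(info.external_attr >> 16)
        link_target = zf.read("libarrow.so").decode()

print(is_link, link_target)   # → True libarrow.so.14
```

An installer that understands this convention would recreate `libarrow.so -> libarrow.so.14` on extraction instead of unpacking two full copies of the library.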





[jira] [Updated] (ARROW-8157) [C++][Gandiva] Support building with LLVM 9

2020-06-02 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-8157:

Summary: [C++][Gandiva] Support building with LLVM 9  (was: [C++] Support 
LLVM 9)

> [C++][Gandiva] Support building with LLVM 9
> ---
>
> Key: ARROW-8157
> URL: https://issues.apache.org/jira/browse/ARROW-8157
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jun NAITOH
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LLVM 9 has already been released: the LLVM 10 branch has been created on 
> https://apt.llvm.org/ and the LLVM 9 branch has been promoted to the 
> old-stable branch.





[jira] [Updated] (ARROW-9004) [C++][Gandiva] Support building with LLVM 10

2020-06-02 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-9004:

Summary: [C++][Gandiva] Support building with LLVM 10  (was: [C++][Gandiva] 
Upgrade to LLVM 10)

> [C++][Gandiva] Support building with LLVM 10
> 
>
> Key: ARROW-9004
> URL: https://issues.apache.org/jira/browse/ARROW-9004
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-8988) [Python] After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs don`t work with libdfs jni

2020-06-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8988:
-
Labels: filesystem hdfs hortonworks libhdfs  (was: beginners hdfs 
hortonworks libhdfs pyarrow python3)

> [Python] After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs don`t work 
> with libdfs jni
> -
>
> Key: ARROW-8988
> URL: https://issues.apache.org/jira/browse/ARROW-8988
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
>Reporter: Pavel Dourugyan
>Priority: Major
>  Labels: filesystem, hdfs, hortonworks, libhdfs
> Attachments: 1.txt, 2.txt
>
>
> h2. Problem
> After upgrading pyarrow from 0.15 to 0.17, I ran into some trouble. I 
> understand that libhdfs3 is no longer supported. However, in my case libhdfs 
> does not work either. See below.
> My experience with the Hadoop ecosystem is limited, so I may have done 
> something wrong. I installed Hortonworks HDP via the Ambari service on a 
> virtual machine running on my PC.
> I tried the following:
> 1. Just connecting:
> %xmode Verbose
> import pyarrow as pa
> hdfs = pa.hdfs.connect(host='hdp.test.com', port=8020, user='hdfs')
> ---
> FileNotFoundError: [Errno 2] No such file or directory: 'hadoop': 'hadoop' 
> ([#1.txt])
> 2. Trying to bypass the driver == 'libhdfs' check:
> %xmode Verbose
> import pyarrow as pa
> hdfs = pa.HadoopFileSystem(host='hdp.test.com', port=8020, user='hdfs', 
> driver=None)
> ---
> OSError: Unable to load libjvm: /usr/java/latest//lib/server/libjvm.so: 
> cannot open shared object file: No such file or directory ([#2.txt])
> 3. With libhdfs3 it works:
> import hdfs3 
> hdfs = hdfs3.HDFileSystem(host='hdp.test.com', port=8020, user='hdfs')
> #ls remote folder
> hdfs.ls('/data/', detail=False)
> ['/data/TimeSheet.2020-04-11', '/data/test', '/data/test.json']
> h2. Environment.
> h4. +Client PC:+
> OS: Debian 10. Dev.: Anaconda3 (python 3.7.6), Jupyter Lab 2, pyarrow 0.17.1 
> (from conda-forge)
> +Hadoop+ (on VM – Oracle VirtualBox):
> OS: Oracle Linux 7.6.  Distr.: Hortonworks HDP 3.1.4
> libhdfs.so:
> [root@hdp /]# find / -name libhdfs.so
>  /usr/lib/ams-hbase/lib/hadoop-native/libhdfs.so
>  /usr/hdp/3.1.4.0-315/usr/lib/libhdfs.so
>  
>  Java path:
> [root@hdp /]# sudo alternatives --config java
>  
> ---
>  *+ 1       java-1.8.0-openjdk.x86_64 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/bin/java)
>  
> libjvm:               
> [root@hdp /]# find / -name libjvm.*
>  
> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/server/libjvm.so
>  /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so
>  
> I tried many settings; the last one is below:
> # etc/profile.
>  ...
> export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))
> export JRE_HOME=$JAVA_HOME/jre
> export 
> JAVA_CLASSPATH=$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
> export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop
> export HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' 
> ' ':')
> export ARROW_LIBHDFS_DIR=/usr/lib/ams-hbase/lib/hadoop-native
> export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
> export CLASSPATH=.:$CLASSPATH:$JAVA_CLASSPATH:$HADOOP_CLASSPATH
> export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JRE_HOME/lib/amd64/server
>  
>  
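As a diagnostic aid for the "Unable to load libjvm" error above: pyarrow's legacy HDFS bridge loads `libjvm.so` from beneath `$JAVA_HOME`, so that failure usually means `JAVA_HOME` resolves to a directory with no `*/server/libjvm.so` under it. A hedged sketch (`find_libjvm` is a hypothetical helper, not pyarrow code) that walks a Java home looking for the library, demonstrated against a faked JDK layout:

```python
import os
import tempfile

def find_libjvm(java_home):
    # Hypothetical helper: return every libjvm.so found beneath java_home.
    # If this returns nothing for your JAVA_HOME, pyarrow's JNI-based HDFS
    # driver will fail the same way as in the report above.
    hits = []
    for root, _dirs, files in os.walk(java_home):
        if "libjvm.so" in files:
            hits.append(os.path.join(root, "libjvm.so"))
    return hits

# Demonstrate against a faked OpenJDK 8 directory layout.
with tempfile.TemporaryDirectory() as fake_java_home:
    server_dir = os.path.join(fake_java_home, "jre", "lib", "amd64", "server")
    os.makedirs(server_dir)
    open(os.path.join(server_dir, "libjvm.so"), "w").close()
    found = find_libjvm(fake_java_home)

print(len(found))   # → 1
```

In the reported environment, pointing `JAVA_HOME` at `/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64` (which contains `jre/lib/amd64/server/libjvm.so`) rather than `/usr/java/latest/` would plausibly resolve the load error.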





[jira] [Updated] (ARROW-9019) pyarrow hdfs fails to connect to for HDFS 3.x cluster

2020-06-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9019:
-
Labels: filesystem hdfs  (was: )

> pyarrow hdfs fails to connect to for HDFS 3.x cluster
> -
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Thomas Graves
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)





[jira] [Commented] (ARROW-8983) [Python] Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0

2020-06-02 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124144#comment-17124144
 ] 

Valentyn Tymofieiev commented on ARROW-8983:


Thank you. Installing setuptools, wheel and cython is fast. I think some 
compilation is happening under the hood. I will ask a question on that PR.

> [Python] Downloading sources of pyarrow and its requirements from pypi takes 
> several minutes starting from 0.16.0
> -
>
> Key: ARROW-8983
> URL: https://issues.apache.org/jira/browse/ARROW-8983
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.16.0, 0.17.0, 0.17.1
>Reporter: Valentyn Tymofieiev
>Priority: Minor
>
> It appears that 
>   python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all:
> takes several minutes to execute. 
> There seems to be an increase in runtime starting from 0.16.0: on Python 2 
>  python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all:
> appears to be somewhat faster, but the same command is still slow on Py3.
> The command is stuck for a while at "Installing build dependencies ...", with 
> increased CPU usage.
> The intent of this command is to download source tarball for a package and 
> its dependencies.
> Some investigation was started on the mailing list: 
> https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E





[jira] [Updated] (ARROW-8904) [Python] Fix usages of deprecated C++ APIs related to child/field

2020-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8904:
--
Labels: pull-request-available  (was: )

> [Python] Fix usages of deprecated C++ APIs related to child/field
> -
>
> Key: ARROW-8904
> URL: https://issues.apache.org/jira/browse/ARROW-8904
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> -- Running cmake --build for pyarrow
> cmake --build . --config debug -- -j16
> [19/20] Building CXX object CMakeFiles/lib.dir/lib.cpp.o
> lib.cpp:20265:85: warning: 'num_children' is deprecated: Use num_fields() 
> [-Wdeprecated-declarations]
>   __pyx_t_1 = __pyx_f_7pyarrow_3lib__normalize_index(__pyx_v_i, 
> __pyx_v_self->type->num_children()); if (unlikely(__pyx_t_1 == 
> ((Py_ssize_t)-1L))) __PYX_ERR(1, 119, __pyx_L1_error)
>   
>   ^
> /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use num_fields()")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:20276:76: warning: 'child' is deprecated: Use field(i) 
> [-Wdeprecated-declarations]
>   __pyx_t_2 = 
> __pyx_f_7pyarrow_3lib_pyarrow_wrap_field(__pyx_v_self->type->child(__pyx_v_index));
>  if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 120, __pyx_L1_error)
>^
> /home/wesm/local/include/arrow/type.h:251:3: note: 'child' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use field(i)")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:20507:56: warning: 'num_children' is deprecated: Use num_fields() 
> [-Wdeprecated-declarations]
>   __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_self->type->num_children()); if 
> (unlikely(!__pyx_t_1)) __PYX_ERR(1, 139, __pyx_L1_error)
>^
> /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use num_fields()")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:23361:44: warning: 'num_children' is deprecated: Use num_fields() 
> [-Wdeprecated-declarations]
>   __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
>^
> /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use num_fields()")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:24039:44: warning: 'num_children' is deprecated: Use num_fields() 
> [-Wdeprecated-declarations]
>   __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
>^
> /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use num_fields()")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:58220:37: warning: 'child' is deprecated: Use field(pos) 
> [-Wdeprecated-declarations]
>   __pyx_v_child = __pyx_v_self->ap->child(__pyx_v_child_id);
> ^
> /home/wesm/local/include/arrow/array.h:1281:3: note: 'child' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use field(pos)")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:58956:74: warning: 'children' is 

[jira] [Updated] (ARROW-8951) [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc

2020-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8951:
--
Labels: pull-request-available  (was: )

> [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc
> -
>
> Key: ARROW-8951
> URL: https://issues.apache.org/jira/browse/ARROW-8951
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The kernel functor can return an uninitialized value on errors
> {code}
> ../src/arrow/compute/kernels/scalar_cast_temporal.cc: In member function ‘OUT 
> arrow::compute::internal::ParseTimestamp::Call(arrow::compute::KernelContext*,
>  ARG0) const [with OUT = long int; ARG0 = 
> nonstd::sv_lite::basic_string_view]’:
> ../src/arrow/compute/kernels/scalar_cast_temporal.cc:267:12: warning: 
> ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>  return result;
> {code}





[jira] [Assigned] (ARROW-8904) [Python] Fix usages of deprecated C++ APIs related to child/field

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8904:
---

Assignee: Wes McKinney

> [Python] Fix usages of deprecated C++ APIs related to child/field
> -
>
> Key: ARROW-8904
> URL: https://issues.apache.org/jira/browse/ARROW-8904
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> {code}
> -- Running cmake --build for pyarrow
> cmake --build . --config debug -- -j16
> [19/20] Building CXX object CMakeFiles/lib.dir/lib.cpp.o
> lib.cpp:20265:85: warning: 'num_children' is deprecated: Use num_fields() 
> [-Wdeprecated-declarations]
>   __pyx_t_1 = __pyx_f_7pyarrow_3lib__normalize_index(__pyx_v_i, 
> __pyx_v_self->type->num_children()); if (unlikely(__pyx_t_1 == 
> ((Py_ssize_t)-1L))) __PYX_ERR(1, 119, __pyx_L1_error)
>   
>   ^
> /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use num_fields()")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:20276:76: warning: 'child' is deprecated: Use field(i) 
> [-Wdeprecated-declarations]
>   __pyx_t_2 = 
> __pyx_f_7pyarrow_3lib_pyarrow_wrap_field(__pyx_v_self->type->child(__pyx_v_index));
>  if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 120, __pyx_L1_error)
>^
> /home/wesm/local/include/arrow/type.h:251:3: note: 'child' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use field(i)")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:20507:56: warning: 'num_children' is deprecated: Use num_fields() 
> [-Wdeprecated-declarations]
>   __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_self->type->num_children()); if 
> (unlikely(!__pyx_t_1)) __PYX_ERR(1, 139, __pyx_L1_error)
>^
> /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use num_fields()")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:23361:44: warning: 'num_children' is deprecated: Use num_fields() 
> [-Wdeprecated-declarations]
>   __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
>^
> /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use num_fields()")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:24039:44: warning: 'num_children' is deprecated: Use num_fields() 
> [-Wdeprecated-declarations]
>   __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
>^
> /home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use num_fields()")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:58220:37: warning: 'child' is deprecated: Use field(pos) 
> [-Wdeprecated-declarations]
>   __pyx_v_child = __pyx_v_self->ap->child(__pyx_v_child_id);
> ^
> /home/wesm/local/include/arrow/array.h:1281:3: note: 'child' has been 
> explicitly marked deprecated here
>   ARROW_DEPRECATED("Use field(pos)")
>   ^
> /home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from 
> macro 'ARROW_DEPRECATED'
> #  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
>^
> lib.cpp:58956:74: warning: 'children' is deprecated: Use fields() 
> [-Wdeprecated-declarations]
>   __pyx_v_child_fields = 
> 

[jira] [Commented] (ARROW-8950) [C++] Make head optional in s3fs

2020-06-02 Thread Remi Dettai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124066#comment-17124066
 ] 

Remi Dettai commented on ARROW-8950:


You need more than this. If you don't have a "read from end" type of function in 
the filesystem API, you will still need to get the size first in order to read 
the end of the file. The primary use case for this is of course Parquet, where 
you need to read the footer first.

A workaround, if we don't want to extend the generic filesystem API, would be to 
provide the file size manually when opening the file; that way you could use 
file sizes obtained in batches from list commands or from some kind of catalog.
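That workaround can be sketched as follows. This is a hypothetical design, not Arrow's actual API: `FakeBlobStore` stands in for an object store that can serve suffix byte ranges (HTTP `Range: bytes=-N`), which lets a Parquet-style footer be read without a prior HEAD call:

```python
class FakeBlobStore:
    """Stands in for a blob store that serves byte ranges."""
    def __init__(self, data):
        self.data = data

    def get_range(self, start=None, length=None, suffix=None):
        # suffix=N emulates "Range: bytes=-N" (the last N bytes),
        # which needs no up-front knowledge of the object size.
        if suffix is not None:
            return self.data[-suffix:]
        return self.data[start:start + length]

class LazyInputFile:
    """Opens without HEADing the object; tail reads need no size."""
    def __init__(self, store):
        self.store = store

    def read_tail(self, n):
        return self.store.get_range(suffix=n)

# A Parquet-like layout: row groups, footer, 4-byte footer length, magic.
footer = b"file-metadata"
blob = b"row-group-bytes" + footer + len(footer).to_bytes(4, "little") + b"PAR1"

f = LazyInputFile(FakeBlobStore(blob))
tail = f.read_tail(8)                       # footer length + magic, one request
magic = tail[4:]
footer_len = int.from_bytes(tail[:4], "little")
footer_bytes = f.read_tail(footer_len + 8)[:footer_len]
print(magic, footer_len, footer_bytes)      # → b'PAR1' 13 b'file-metadata'
```

The sketch reaches the footer in two GETs and zero HEADs; with Arrow's current s3fs behavior the same sequence costs a HEAD plus the GETs, which is the latency and pricing overhead described above.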

> [C++] Make head optional in s3fs
> 
>
> Key: ARROW-8950
> URL: https://issues.apache.org/jira/browse/ARROW-8950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Remi Dettai
>Assignee: Antoine Pitrou
>Priority: Major
>
> When you open an input file with s3fs, it issues a head request to S3 to 
> check if the file is present/authorized and get the size 
> (https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/cpp/src/arrow/filesystem/s3fs.cc#L407).
> This call comes with a non-negligible cost:
>  * adds latency
>  * priced the same as a GET request by AWS
> I fail to see use cases where this call is really crucial:
>  * if the file is not present/authorized, failing at first read seems to have 
> mostly the same effect as failing on opening. I agree that it is kind of 
> "usual" for an _open_ call to fail eagerly, so to avoid surprises we could 
> add a flag indicating if we don't need to fail when running _OpenInputFile_ 
> on an inaccessible file.
>  * getting the size can be done on the first read, and could be mostly 
> avoided on caller side if the filesystem api provided read-from-end 
> capabilities (compatible with fs reads using _ios::end_ and on http 
> filesystems with _bytes=-xxx_). Worst case scenario the call to _head_ could 
> be done lazily when calling _getSize()._
> I agree that it makes things a bit more complex, and I understand that you 
> would not want to complicate the generic fs API because of blob storage 
> behavior. But obviously there are workloads where this has a significant 
> impact.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8951) [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8951:
---

Assignee: Wes McKinney

> [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc
> -
>
> Key: ARROW-8951
> URL: https://issues.apache.org/jira/browse/ARROW-8951
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> The kernel functor can return an uninitialized value on errors
> {code}
> ../src/arrow/compute/kernels/scalar_cast_temporal.cc: In member function ‘OUT 
> arrow::compute::internal::ParseTimestamp::Call(arrow::compute::KernelContext*,
>  ARG0) const [with OUT = long int; ARG0 = 
> nonstd::sv_lite::basic_string_view]’:
> ../src/arrow/compute/kernels/scalar_cast_temporal.cc:267:12: warning: 
> ‘result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>  return result;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9017) [Python] Refactor the Scalar classes

2020-06-02 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124048#comment-17124048
 ] 

Krisztian Szucs commented on ARROW-9017:


I started to factor out the element-wise conversion code required to convert 
single Python objects to an intermediate C representation. I hit a couple of 
roadblocks in that conversion code, and there were also missing utilities like 
the GetScalar function Ben has implemented recently. 

We also have an outstanding issue with auto-chunking during conversion: in 
the case of nested types, a binary/string field gets chunked if the size 
limit is reached, but the rest of the fields keep a single chunk, resulting 
in a corrupted nested array.

> [Python] Refactor the Scalar classes
> 
>
> Key: ARROW-9017
> URL: https://issues.apache.org/jira/browse/ARROW-9017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> The situation regarding scalars in Python is currently not optimal.
> We have two different "types" of scalars:
> - {{ArrayValue(Scalar)}} (and subclasses of that for all types):  this is 
> used when you access a single element of an array (eg {{arr[0]}})
> - {{ScalarValue(Scalar)}} (and subclasses of that for _some_ types): this is 
> used when wrapping a C++ scalar into a python scalar, eg when you get back a 
> scalar from a reduction like {{arr.sum()}}.
> And while we have two versions of scalars, neither of them can easily be 
> used as a scalar, since neither can be constructed from a Python scalar 
> (there is no {{scalar(1)}} function to use when calling a kernel, for 
> example).
> I think we should try to unify those scalar classes? (which probably means 
> getting rid of the ArrayValue scalar)
> In addition, there is an issue of trying to re-use Python scalar <-> Arrow 
> conversion code, as there is also logic for this in the 
> {{python_to_arrow.cc}} code. But this is probably a bigger change. cc [~kszucs] 
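What a unified design could look like, as a minimal pure-Python sketch (assumed names throughout; this is not pyarrow's actual class hierarchy):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Scalar:
    """One scalar class for both code paths, instead of ArrayValue/ScalarValue."""
    value: Any
    type: str

def scalar(value):
    """The missing scalar(1)-style constructor: build a Scalar from Python."""
    inferred = {int: "int64", float: "double", str: "string"}[type(value)]
    return Scalar(value, inferred)

class Array:
    def __init__(self, values, type):
        self.values, self.type = list(values), type
    def __getitem__(self, i):   # arr[0] boxes into the same Scalar...
        return Scalar(self.values[i], self.type)
    def sum(self):              # ...and so do reductions like arr.sum()
        return Scalar(sum(self.values), self.type)

arr = Array([1, 2, 3], "int64")
assert arr[0] == scalar(1)
assert arr.sum() == scalar(6)
```

The point of the sketch: indexing and reductions both return the single `Scalar` class, and `scalar()` makes it constructible directly when calling a kernel.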



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9019) pyarrow hdfs fails to connect to HDFS 3.x cluster

2020-06-02 Thread Thomas Graves (Jira)
Thomas Graves created ARROW-9019:


 Summary: pyarrow hdfs fails to connect to HDFS 3.x cluster
 Key: ARROW-9019
 URL: https://issues.apache.org/jira/browse/ARROW-9019
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Thomas Graves


I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
error that looks like a protobuf or jar mismatch problem with Hadoop. The same 
code works on a Hadoop 2.9 cluster.
 
I'm wondering if there is something special I need to do or if pyarrow doesn't 
support Hadoop 3.x yet?
Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
 
    import pyarrow as pa
    hdfs_kwargs = dict(host="namenodehost",
                      port=9000,
                      user="tgraves",
                      driver='libhdfs',
                      kerb_ticket=None,
                      extra_conf=None)
    fs = pa.hdfs.connect(**hdfs_kwargs)
    res = fs.exists("/user/tgraves")
 
Error that I get on Hadoop 3.x is:
 
dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
ClassCastException: 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
 cannot be cast to 
org.apache.hadoop.shaded.com.google.protobuf.Message
java.lang.ClassCastException: 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
 cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6459) [C++] Remove "python" from conda_env_cpp.yml

2020-06-02 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124037#comment-17124037
 ] 

Krisztian Szucs edited comment on ARROW-6459 at 6/2/20, 4:45 PM:
-

Yep. I excluded it originally, but to build/test the libarrow_python (perhaps 
on travis) we added it.


was (Author: kszucs):
Yep.

> [C++] Remove "python" from conda_env_cpp.yml
> 
>
> Key: ARROW-6459
> URL: https://issues.apache.org/jira/browse/ARROW-6459
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Minor
> Fix For: 1.0.0
>
>
> I'm not sure why "python" is in this dependency file -- if it is used to 
> maintain a toolchain external to a particular Python environment then it 
> confuses CMake like
> {code}
> CMake Warning at cmake_modules/BuildUtils.cmake:529 (add_executable):
>   Cannot generate a safe runtime search path for target arrow-python-test
>   because there is a cycle in the constraint graph:
> dir 0 is [/home/wesm/code/arrow/cpp/build/debug]
> dir 1 is [/home/wesm/miniconda/envs/arrow-3.7/lib]
>   dir 2 must precede it due to runtime library [libcrypto.so.1.1]
> dir 2 is [/home/wesm/cpp-toolchain/lib]
>   dir 1 must precede it due to runtime library [libpython3.7m.so.1.0]
>   Some of these libraries may not be found correctly.
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:52 (add_test_case)
>   src/arrow/python/CMakeLists.txt:139 (add_arrow_test)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8937) [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8937.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7312
[https://github.com/apache/arrow/pull/7312]

> [C++] Add "parse_strptime" function for string to timestamp conversions using 
> the kernels framework
> ---
>
> Key: ARROW-8937
> URL: https://issues.apache.org/jira/browse/ARROW-8937
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This should be relatively straightforward to implement using the new kernels 
> framework
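Conceptually, such a function performs an element-wise strptime cast from strings to epoch timestamps. A stdlib Python sketch of the computation (the real kernel runs in C++ over Arrow arrays, and the signature here is illustrative only):

```python
from datetime import datetime, timezone

def parse_strptime(values, fmt="%Y-%m-%d %H:%M:%S", unit="us"):
    """Element-wise string -> timestamp cast, returning epoch integers.

    unit selects the timestamp resolution (seconds, milliseconds,
    or microseconds).
    """
    scale = {"s": 1, "ms": 10**3, "us": 10**6}[unit]
    out = []
    for v in values:
        # Parse as naive, then pin to UTC before taking the epoch value.
        dt = datetime.strptime(v, fmt).replace(tzinfo=timezone.utc)
        out.append(int(dt.timestamp() * scale))
    return out

assert parse_strptime(["1970-01-01 00:00:01"]) == [1_000_000]
assert parse_strptime(["1970-01-01 00:00:01"], unit="ms") == [1_000]
```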



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6459) [C++] Remove "python" from conda_env_cpp.yml

2020-06-02 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124037#comment-17124037
 ] 

Krisztian Szucs commented on ARROW-6459:


Yep.

> [C++] Remove "python" from conda_env_cpp.yml
> 
>
> Key: ARROW-6459
> URL: https://issues.apache.org/jira/browse/ARROW-6459
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: 1.0.0
>
>
> I'm not sure why "python" is in this dependency file -- if it is used to 
> maintain a toolchain external to a particular Python environment then it 
> confuses CMake like
> {code}
> CMake Warning at cmake_modules/BuildUtils.cmake:529 (add_executable):
>   Cannot generate a safe runtime search path for target arrow-python-test
>   because there is a cycle in the constraint graph:
> dir 0 is [/home/wesm/code/arrow/cpp/build/debug]
> dir 1 is [/home/wesm/miniconda/envs/arrow-3.7/lib]
>   dir 2 must precede it due to runtime library [libcrypto.so.1.1]
> dir 2 is [/home/wesm/cpp-toolchain/lib]
>   dir 1 must precede it due to runtime library [libpython3.7m.so.1.0]
>   Some of these libraries may not be found correctly.
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:52 (add_test_case)
>   src/arrow/python/CMakeLists.txt:139 (add_arrow_test)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6459) [C++] Remove "python" from conda_env_cpp.yml

2020-06-02 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-6459:
--

Assignee: Krisztian Szucs

> [C++] Remove "python" from conda_env_cpp.yml
> 
>
> Key: ARROW-6459
> URL: https://issues.apache.org/jira/browse/ARROW-6459
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Minor
> Fix For: 1.0.0
>
>
> I'm not sure why "python" is in this dependency file -- if it is used to 
> maintain a toolchain external to a particular Python environment then it 
> confuses CMake like
> {code}
> CMake Warning at cmake_modules/BuildUtils.cmake:529 (add_executable):
>   Cannot generate a safe runtime search path for target arrow-python-test
>   because there is a cycle in the constraint graph:
> dir 0 is [/home/wesm/code/arrow/cpp/build/debug]
> dir 1 is [/home/wesm/miniconda/envs/arrow-3.7/lib]
>   dir 2 must precede it due to runtime library [libcrypto.so.1.1]
> dir 2 is [/home/wesm/cpp-toolchain/lib]
>   dir 1 must precede it due to runtime library [libpython3.7m.so.1.0]
>   Some of these libraries may not be found correctly.
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:52 (add_test_case)
>   src/arrow/python/CMakeLists.txt:139 (add_arrow_test)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7313) [C++] Add function for retrieving a scalar from an array slot

2020-06-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124036#comment-17124036
 ] 

Neal Richardson commented on ARROW-7313:


Yes, I believe it is called {{Array::GetScalar}}. I used this in the R bindings.

> [C++] Add function for retrieving a scalar from an array slot
> -
>
> Key: ARROW-7313
> URL: https://issues.apache.org/jira/browse/ARROW-7313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> It'd be useful to construct scalar values given an array and an index.
> {code}
> /* static */ std::shared_ptr<Scalar> Scalar::FromArray(const Array&, int64_t);
> {code}
> Since this is much less efficient than unboxing the entire array and 
> accessing its buffers directly, it should not be used in hot loops.
> [~kszucs] [~fsaintjacques]
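The null-slot boxing semantics can be sketched in pure Python (hypothetical names; the real {{Array::GetScalar}} returns a typed null scalar rather than a tuple):

```python
def scalar_from_array(values, validity, i):
    """Box array slot i into a (value, is_valid) pair.

    Null slots yield an invalid scalar rather than raising. As the issue
    notes, per-slot boxing like this is far slower than direct buffer
    access, so it belongs outside hot loops.
    """
    if not validity[i]:
        return (None, False)
    return (values[i], True)

vals, valid = [10, 0, 30], [True, False, True]
assert scalar_from_array(vals, valid, 0) == (10, True)
assert scalar_from_array(vals, valid, 1) == (None, False)
```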



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8934) [C++] Add timestamp subtract kernel aliased to int64 subtract implementation

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124032#comment-17124032
 ] 

Wes McKinney edited comment on ARROW-8934 at 6/2/20, 4:41 PM:
--

We must be careful not to instantiate unnecessary templates when doing this


was (Author: wesmckinn):
We must be careful not to instantitate unnecessary templates when doing this

> [C++] Add timestamp subtract kernel aliased to int64 subtract implementation
> 
>
> Key: ARROW-8934
> URL: https://issues.apache.org/jira/browse/ARROW-8934
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We can use the same scalar exec function for int64 subtraction as well as 
> {{(array[TIMESTAMP], array[TIMESTAMP]) -> duration}}. 
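The aliasing idea can be illustrated with a small Python sketch (not the actual C++ kernel machinery): one exec function serves both signatures, and only the declared output type differs.

```python
def subtract_int64(a, b):
    """The single scalar exec function: element-wise int64 subtraction."""
    return [x - y for x, y in zip(a, b)]

# Timestamps are int64 epoch values under the hood, so the same function
# can serve (timestamp, timestamp) -> duration; only the declared output
# type changes, which is what lets the kernel be aliased without
# instantiating additional templates.
t1 = [1_000_000, 2_000_000]  # epoch microseconds
t2 = [400_000, 1_500_000]
assert subtract_int64(t1, t2) == [600_000, 500_000]  # durations in us
```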



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8934) [C++] Add timestamp subtract kernel aliased to int64 subtract implementation

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124032#comment-17124032
 ] 

Wes McKinney commented on ARROW-8934:
-

We must be careful not to instantitate unnecessary templates when doing this

> [C++] Add timestamp subtract kernel aliased to int64 subtract implementation
> 
>
> Key: ARROW-8934
> URL: https://issues.apache.org/jira/browse/ARROW-8934
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We can use the same scalar exec function for int64 subtraction as well as 
> {{(array[TIMESTAMP], array[TIMESTAMP]) -> duration}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8801) [Python] Memory leak on read from parquet file with UTC timestamps using pandas

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8801:

Priority: Blocker  (was: Critical)

> [Python] Memory leak on read from parquet file with UTC timestamps using 
> pandas
> ---
>
> Key: ARROW-8801
> URL: https://issues.apache.org/jira/browse/ARROW-8801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0, 0.17.0
> Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, python 3.7.5, 
> mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, 
> ubuntu 20.04 (linux).
>Reporter: Rauli Ruohonen
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Given dump.py script 
>  
> {code:java}
> import pandas as pd
> import numpy as np
> x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', 
> utc=True)
> pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', 
> compression=None)
> {code}
> and load.py script
>  
> {code:java}
> import sys
> import pandas as pd
> def foo(engine):
>     for _ in range(2**9):
>         pd.read_parquet('data.parquet', engine=engine)
>     print('Done')
>     input()
> foo(sys.argv[1])
> {code}
> running first "python dump.py" and then "python load.py pyarrow", on my 
> machine python memory usage stays at 4+ GB while it waits for input. If using 
> "python load.py fastparquet" instead, it is about 100 MB, so it should be a 
> pyarrow issue instead of a pandas issue. The leak disappears if "utc=True" is 
> removed from dump.py, in which case the timestamp is timezone-unaware.
>  
>  
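For reports like this one, retained-memory growth across repeated reads can also be measured from inside Python with the stdlib, rather than watching the process in top. A sketch (the `leaky` function is a deliberate stand-in for the `pd.read_parquet` call from load.py):

```python
import tracemalloc

def check_growth(fn, iterations=50):
    """Run fn repeatedly and return retained-memory growth in bytes."""
    tracemalloc.start()
    fn()                              # warm-up allocation
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        fn()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

leaked = []
def leaky():
    leaked.append(bytearray(10_000))  # deliberately retains memory each call

growth = check_growth(leaky)
assert growth > 10_000 * 40           # ~50 retained 10 kB buffers show up
```

A leak-free function would show growth near zero here, since memory allocated and freed within each iteration does not accumulate in the traced total.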



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8994) [C++] Disable include-what-you-use cpplint lint checks

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8994.
-
Resolution: Fixed

I did this in 
https://github.com/apache/arrow/commit/94a5026edb652d060110cac170380edf3d856f05

> [C++] Disable include-what-you-use cpplint lint checks
> --
>
> Key: ARROW-8994
> URL: https://issues.apache.org/jira/browse/ARROW-8994
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> If we want to be serious about IWYU, it would be better to use IWYU directly. 
> The minimal checks that IWYU does can be a nuisance rather than addressing 
> the problem holistically



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8683) [C++] Add option for user-defined version identifier for Arrow libraries

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8683:

Fix Version/s: (was: 1.0.0)

> [C++] Add option for user-defined version identifier for Arrow libraries
> 
>
> Key: ARROW-8683
> URL: https://issues.apache.org/jira/browse/ARROW-8683
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> It would be useful to be able to "watermark" shared libraries with e.g. the 
> git hash to determine the exact origin of a particular build of the project. 
> The version identifier could default to the current git revision but be 
> overridden in the CMake invocation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8713) [CI] Try to reduce the env boilerplate used on github actions

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8713:

Fix Version/s: (was: 1.0.0)

> [CI] Try to reduce the env boilerplate used on github actions
> -
>
> Key: ARROW-8713
> URL: https://issues.apache.org/jira/browse/ARROW-8713
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Priority: Major
>
> Since the 
> [removal|https://github.com/apache/arrow/pull/7081/files#diff-4e5e90c6228fd48698d074241c2ba760L58]
>  of docker-compose named volumes the configuration gets more tolerant to 
> undefined environment variables, so we can remove the matrix value 
> propagations to the env variables in its current 
> [form|https://github.com/apache/arrow/blob/master/.github/workflows/python_cron.yml#L93-L103]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7957) [Python] ParquetDataset cannot take HadoopFileSystem as filesystem

2020-06-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-7957:


Assignee: Joris Van den Bossche

> [Python] ParquetDataset cannot take HadoopFileSystem as filesystem
> --
>
> Key: ARROW-7957
> URL: https://issues.apache.org/jira/browse/ARROW-7957
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Catherine
>Assignee: Joris Van den Bossche
>Priority: Critical
> Fix For: 1.0.0
>
>
> {{from pyarrow.fs import HadoopFileSystem}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
>  {{hdfs, path = HadoopFileSystem.from_uri(file_name)}}
>  {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}}
>  
> has error:
>  {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}}
>  
> When I tried using the deprecated {{HadoopFileSystem}}:
> {{import pyarrow}}
>  {{import pyarrow.parquet as pq}}
>  
> {{file_name = "hdfs://localhost:9000/test/file_name.pq"}}
> {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}}
> {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}}
> {{pa_schema = dataset.schema.to_arrow_schema()}}
> {{pieces = dataset.pieces}}
> {{for piece in pieces: }}
> {{    print(piece.path)}}
>  
> {{piece.path}} lose the {{hdfs://localhost:9000}} prefix.
>  
> I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as 
> filesystem?
> And {{piece.path}} should have the prefix?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8578.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Cannot Reproduce

Haven't seen this in a while, so closing

> [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on 
> compiling system"
> 
>
> Key: ARROW-8578
> URL: https://issues.apache.org/jira/browse/ARROW-8578
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Wes McKinney
>Priority: Major
>
> Tried compiling and running this today  (with grpc 1.28.1)
> {code}
> $ release/arrow-flight-benchmark 
> Using standalone server: false
> Server running with pid 22385
> Testing method: DoGet
> Server host: localhost
> Server port: 31337
> E0423 21:54:15.174285695   22385 socket_utils_common_posix.cc:222] check for 
> SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT 
> unavailable on compiling 
> system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190}
> Server host: localhost
> {code}
> my Linux kernel
> {code}
> $ uname -a
> Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 
> x86_64 x86_64 GNU/Linux
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8510) [C++] arrow/dataset/file_base.cc fails to compile with internal compiler error with "Visual Studio 15 2017 Win64" generator

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8510:
---

Assignee: Wes McKinney  (was: Francois Saint-Jacques)

> [C++] arrow/dataset/file_base.cc fails to compile with internal compiler 
> error with "Visual Studio 15 2017 Win64" generator
> ---
>
> Key: ARROW-8510
> URL: https://issues.apache.org/jira/browse/ARROW-8510
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I discovered this while running the release verification on Windows. There 
> was an obscuring issue: if the build fails, the verification script 
> continues anyway. I will fix that



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8523) [C++] Optimize BitmapReader

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8523:

Fix Version/s: (was: 1.0.0)

> [C++] Optimize BitmapReader
> ---
>
> Key: ARROW-8523
> URL: https://issues.apache.org/jira/browse/ARROW-8523
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8518) [Python] Create tools to enable optional components (like Gandiva, Flight) to be built and deployed as separate Python packages

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8518:

Fix Version/s: (was: 1.0.0)
   2.0.0

> [Python] Create tools to enable optional components (like Gandiva, Flight) to 
> be built and deployed as separate Python packages
> ---
>
> Key: ARROW-8518
> URL: https://issues.apache.org/jira/browse/ARROW-8518
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> Our current monolithic approach to Python packaging isn't likely to be 
> sustainable long-term.
> At a high level, I would propose a structure like this:
> {code}
> pip install pyarrow  # core package containing libarrow, libarrow_python, and 
> any other common bundled C++ library dependencies
> pip install pyarrow-flight  # installs pyarrow, pyarrow_flight
> pip install pyarrow-gandiva # installs pyarrow, pyarrow_gandiva
> {code}
> We can maintain the semantic appearance of a single {{pyarrow}} package by 
> having thin API modules that would look like
> {code}
> CONTENTS OF pyarrow/flight.py
> from pyarrow_flight import *
> {code}
> Obviously, this is more difficult to build and package:
> * CMake and setup.py files must be refactored a bit so that we can reuse code 
> between the parent and child packages
> * Separate conda and wheel packages must be produced. With conda this seems 
> more straightforward but since the child wheels depend on the parent core 
> wheel, the build process seems more complicated
> In any case, I don't think these challenges are insurmountable. This will 
> have several benefits:
> * Smaller installation footprint for simple use cases (though note we are 
> STILL duplicating shared libraries in the wheels, which is quite bad)
> * Less developer anxiety about expanding the scope of what Python code is 
> shipped from apache/arrow. If in 5 years we are shipping 5 different Python 
> wheels with each Apache Arrow release, that sounds completely fine to me. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8500) [C++] Use selection vectors in Filter implementation for record batches, tables

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8500:
---

Assignee: Wes McKinney

> [C++] Use selection vectors in Filter implementation for record batches, 
> tables
> ---
>
> Key: ARROW-8500
> URL: https://issues.apache.org/jira/browse/ARROW-8500
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> The current implementation of {{Filter}} on RecordBatch, Table does redundant 
> analysis of the filter array. It would be more efficient in most cases (i.e. 
> whenever there are multiple columns) to convert the boolean array into a 
> selection vector and then use {{Take}}
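The proposed two-step evaluation can be sketched in plain Python (hypothetical helper names; Arrow's {{Filter}}/{{Take}} operate on Arrow arrays, not lists):

```python
def mask_to_selection(mask):
    """Analyze the boolean filter once, producing a selection (index) vector."""
    return [i for i, keep in enumerate(mask) if keep]

def take(column, indices):
    """Gather the selected rows from one column."""
    return [column[i] for i in indices]

mask = [True, False, True, True, False]
sel = mask_to_selection(mask)          # one pass over the mask...
table = {"a": [1, 2, 3, 4, 5], "b": list("vwxyz")}
# ...then every column reuses it, instead of re-analyzing the mask per column.
filtered = {name: take(col, sel) for name, col in table.items()}
assert filtered == {"a": [1, 3, 4], "b": ["v", "x", "y"]}
```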



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8359) [C++/Python] Enable aarch64/ppc64le build in conda recipes

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8359:

Fix Version/s: (was: 1.0.0)

> [C++/Python] Enable aarch64/ppc64le build in conda recipes
> --
>
> Key: ARROW-8359
> URL: https://issues.apache.org/jira/browse/ARROW-8359
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Packaging, Python
>Reporter: Uwe Korn
>Priority: Major
>
> These two new arches were added in the conda recipes, we should also build 
> them as nightlies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8296) [C++][Dataset] IpcFileFormat should support writing files with compressed buffers

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8296:

Fix Version/s: (was: 1.0.0)

> [C++][Dataset] IpcFileFormat should support writing files with compressed 
> buffers
> -
>
> Key: ARROW-8296
> URL: https://issues.apache.org/jira/browse/ARROW-8296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8328) [C++] MSVC is not respecting warning-disable flags

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8328:

Fix Version/s: (was: 1.0.0)

> [C++] MSVC is not respecting warning-disable flags
> --
>
> Key: ARROW-8328
> URL: https://issues.apache.org/jira/browse/ARROW-8328
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> We provide [warning-disabling flags to 
> MSVC|https://github.com/apache/arrow/blob/72433c6/cpp/cmake_modules/SetupCxxFlags.cmake#L151-L153]
>  including one which should disable all conversion warnings. However, this is 
> not completely effective, and AppVeyor will still emit conversion warnings 
> (which are then treated as errors), requiring insertion of otherwise 
> unnecessary explicit casts or {{#pragma}}s (for example 
> https://github.com/apache/arrow/pull/6820 ).
> Perhaps flag ordering is significant? In any case, as we have conversion 
> warnings disabled for other compilers we should ensure they are completely 
> disabled for MSVC as well.





[jira] [Updated] (ARROW-8250) [C++] Add "random access" / slice read API to RecordBatchFileReader

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8250:

Fix Version/s: (was: 1.0.0)

> [C++] Add "random access" / slice read API to RecordBatchFileReader
> ---
>
> Key: ARROW-8250
> URL: https://issues.apache.org/jira/browse/ARROW-8250
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> If you want to read a small section of a file, there is no easy way to 
> determine the relevant record batches that need "rehydrating".
> I would propose the following:
> * A way to cheaply read (and cache, so this doesn't have to be done multiple 
> times) all the RecordBatch metadata without deserializing the record batch 
> data structures themselves
> * Based on the metadata you can then determine the range of batches that need 
> to be rehydrated and then sliced accordingly to produce the Table of interest
> This functionality can be lifted into the Feather read APIs also
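The two proposed steps — cheap cached batch metadata, then rehydrating only the overlapping batches — can be sketched in plain Python. This is a hypothetical helper, not an Arrow API: given the per-batch row counts recovered from the metadata, it computes which batches cover a requested row range and how to slice each one.

```python
from bisect import bisect_right
from itertools import accumulate

def batches_for_slice(batch_lengths, start, length):
    """Return (batch_index, offset, count) triples covering rows
    [start, start + length) without rehydrating unneeded batches."""
    # Cumulative starting row of each batch, e.g. [10, 10] -> [0, 10, 20]
    offsets = [0] + list(accumulate(batch_lengths))
    end = start + length
    # First batch whose row range contains `start`
    first = bisect_right(offsets, start) - 1
    result = []
    for i in range(first, len(batch_lengths)):
        if offsets[i] >= end:
            break
        lo = max(start, offsets[i]) - offsets[i]
        hi = min(end, offsets[i + 1]) - offsets[i]
        result.append((i, lo, hi - lo))
    return result
```

With three 10-row batches, rows [5, 15) need the last 5 rows of batch 0 and the first 5 of batch 1; all other batches stay untouched.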





[jira] [Commented] (ARROW-8950) [C++] Make head optional in s3fs

2020-06-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124012#comment-17124012
 ] 

Antoine Pitrou commented on ARROW-8950:
---

Would it be ok to be able to disable it in {{S3Options}}?

> [C++] Make head optional in s3fs
> 
>
> Key: ARROW-8950
> URL: https://issues.apache.org/jira/browse/ARROW-8950
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Remi Dettai
>Assignee: Antoine Pitrou
>Priority: Major
>
> When you open an input file with the f3fs, it issues a head request to S3 to 
> check if the file is present/authorized and get the size 
> (https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/cpp/src/arrow/filesystem/s3fs.cc#L407).
> This call comes with a non-neglictable cost:
>  * adds latency
>  * priced the same as a GET request by AWS
> I fail to see use cases where this call is really crucial:
>  * if the file is not present/authorized, failing at first read has mostly 
> the same effect as failing on opening. I agree that it is "usual" for an 
> _open_ call to fail eagerly, so to avoid surprises we could add a flag 
> indicating whether _OpenInputFile_ should fail on an inaccessible file.
>  * getting the size can be done on the first read, and could mostly be 
> avoided on the caller side if the filesystem API provided read-from-end 
> capabilities (compatible with fs reads using _ios::end_ and, on HTTP 
> filesystems, with _bytes=-xxx_). Worst case, the call to _head_ could be 
> done lazily when calling _getSize()_.
> I agree that it makes things a bit more complex, and I understand that you 
> would not want to complicate the generic fs API because of blob storage 
> behavior. But obviously there are workloads where this has a significant 
> impact.
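The lazy-HEAD idea can be sketched in a few lines of Python. All names here are hypothetical (this is not the Arrow S3 filesystem API): the HEAD request is deferred until the size is actually needed, so merely opening a file costs no round-trip.

```python
class LazyInputFile:
    """Sketch: defer the HEAD request until size() is first called."""

    def __init__(self, path, head_fn):
        self._path = path
        self._head_fn = head_fn   # stand-in for issuing the S3 HEAD request
        self._size = None

    def size(self):
        if self._size is None:    # HEAD is issued at most once, lazily
            self._size = self._head_fn(self._path)
        return self._size

# Toy HEAD that records how often it is called
calls = []
def fake_head(path):
    calls.append(path)
    return 1024

f = LazyInputFile("bucket/key", fake_head)
```

Opening `f` issues no request; the first `f.size()` triggers exactly one HEAD, and subsequent calls reuse the cached value.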





[jira] [Commented] (ARROW-8766) [Python] A FileSystem implementation based on Python callbacks

2020-06-02 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124011#comment-17124011
 ] 

Joris Van den Bossche commented on ARROW-8766:
--

cc [~apitrou]

> [Python] A FileSystem implementation based on Python callbacks
> --
>
> Key: ARROW-8766
> URL: https://issues.apache.org/jira/browse/ARROW-8766
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset-dask-integration, filesystem
>
> The new {{pyarrow.fs}} filesystems are now actual C++ objects, and no longer 
> "just" a python interface. So they can't easily be expanded from the Python 
> side, and the existing integration with {{fsspec}} filesystems is therefore 
> also not working anymore. 
> One possible solution is to have a C++ filesystem that calls back into a 
> Python object for each of its methods (possibly similar to how a Flight 
> server can be implemented in Python). 
> Such a FileSystem implementation would make it possible to build a 
> {{pyarrow.fs}} wrapper for {{fsspec}} filesystems, and thus allow such 
> filesystems to be used in pyarrow wherever new filesystems are expected.
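The callback pattern being proposed can be illustrated in pure Python with made-up names (the real implementation would be a C++ filesystem holding a Python handler): a thin facade forwards each filesystem method to a user-supplied handler object.

```python
import io

class CallbackFileSystem:
    """Sketch: forwards every call to a user-supplied handler object,
    the way a C++ filesystem could call back into Python."""

    def __init__(self, handler):
        self._handler = handler   # any object with the expected methods

    def get_file_info(self, path):
        return self._handler.get_file_info(path)

    def open_input_stream(self, path):
        return self._handler.open_input_stream(path)

class DictHandler:
    """Toy handler backed by an in-memory dict of bytes."""

    def __init__(self, files):
        self.files = files

    def get_file_info(self, path):
        return {"path": path, "size": len(self.files[path])}

    def open_input_stream(self, path):
        return io.BytesIO(self.files[path])
```

An fsspec wrapper would follow the same shape, with the handler translating between the two method vocabularies.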





[jira] [Assigned] (ARROW-8050) [Python][Packaging] Do not include generated Cython source files in wheel packages

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8050:
---

Assignee: Wes McKinney

> [Python][Packaging] Do not include generated Cython source files in wheel 
> packages
> --
>
> Key: ARROW-8050
> URL: https://issues.apache.org/jira/browse/ARROW-8050
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> As originally reported in https://github.com/apache/arrow/issues/6560, the 
> generated .cpp files from Cython seem to be included in the wheel archives





[jira] [Updated] (ARROW-8113) [C++] Implement a lighter-weight variant

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8113:

Fix Version/s: (was: 1.0.0)

> [C++] Implement a lighter-weight variant
> 
>
> Key: ARROW-8113
> URL: https://issues.apache.org/jira/browse/ARROW-8113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> {{util::variant}} is an extremely useful structure but its header slows 
> compilation significantly, so using it in public headers is questionable 
> https://github.com/apache/arrow/pull/6545#discussion_r388406246
> I'll try writing a lighter-weight version.





[jira] [Updated] (ARROW-8157) [C++] Support LLVM 9

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8157:

Summary: [C++] Support LLVM 9  (was: [C++] Upgrade to LLVM 9)

> [C++] Support LLVM 9
> 
>
> Key: ARROW-8157
> URL: https://issues.apache.org/jira/browse/ARROW-8157
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jun NAITOH
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LLVM 9 has already been released, and LLVM branch 10 has been created on 
> https://apt.llvm.org/
> LLVM branch 9 has already been promoted to the old-stable branch.





[jira] [Updated] (ARROW-8034) [JavaScript][Integration] Enable custom_metadata integration test

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8034:

Fix Version/s: (was: 1.0.0)

> [JavaScript][Integration] Enable custom_metadata integration test
> --
>
> Key: ARROW-8034
> URL: https://issues.apache.org/jira/browse/ARROW-8034
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration, JavaScript
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>
> https://github.com/apache/arrow/pull/6556 adds an integration test including 
> custom metadata but JavaScript is skipped.





[jira] [Updated] (ARROW-8033) [Go][Integration] Enable custom_metadata integration test

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8033:

Fix Version/s: (was: 1.0.0)

> [Go][Integration] Enable custom_metadata integration test
> --
>
> Key: ARROW-8033
> URL: https://issues.apache.org/jira/browse/ARROW-8033
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go, Integration
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>
> https://github.com/apache/arrow/pull/6556 adds an integration test including 
> custom metadata but Go is skipped.





[jira] [Updated] (ARROW-7938) [C++] Add tests for DayTimeIntervalBuilder

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7938:

Fix Version/s: (was: 1.0.0)

> [C++] Add tests for DayTimeIntervalBuilder
> --
>
> Key: ARROW-7938
> URL: https://issues.apache.org/jira/browse/ARROW-7938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Micah Kornfield
>Priority: Minor
>






[jira] [Updated] (ARROW-7964) [C++] Add short representation string to common classes

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7964:

Fix Version/s: (was: 1.0.0)

> [C++] Add short representation string to common classes
> ---
>
> Key: ARROW-7964
> URL: https://issues.apache.org/jira/browse/ARROW-7964
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> This should apply primarily to DataType, Field, and Schema. It should not 
> try to print things like metadata and nullability. This is not meant for 
> serialization, but for a quick glance.
>  
> {code:java}
> i32
> list<i32>
> dict<i32, utf8>
> struct<a: i32, b: struct<c: list<utf8>>>
> schema<a: i32, b: f64>{code}
> Once we have that, we can add a small print diagnostic to arrays/tables 
> like R's.
>  
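A recursive short representation of this kind is easy to prototype. The sketch below uses a made-up miniature type model (tuples), not Arrow's actual type classes, and deliberately skips metadata and nullability as proposed.

```python
def short_repr(t):
    """Short, glance-friendly repr for a toy type model:
    primitives are ("i32",), nested types carry their parameters."""
    kind, *params = t
    if kind in ("list", "dict"):
        return kind + "<" + ", ".join(short_repr(p) for p in params) + ">"
    if kind == "struct":
        return "struct<" + ", ".join(
            name + ": " + short_repr(child) for name, child in params) + ">"
    return kind  # primitive type name, e.g. "i32"

# e.g. struct<a: i32, b: list<i32>>
t = ("struct", ("a", ("i32",)), ("b", ("list", ("i32",))))
```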





[jira] [Updated] (ARROW-7900) [Integration][JavaScript] Add null type integration test

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7900:

Fix Version/s: (was: 1.0.0)

> [Integration][JavaScript] Add null type integration test
> 
>
> Key: ARROW-7900
> URL: https://issues.apache.org/jira/browse/ARROW-7900
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration, JavaScript
>Reporter: Neal Richardson
>Priority: Critical
>






[jira] [Updated] (ARROW-7878) [C++] Implement LogicalPlan and LogicalPlanBuilder

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7878:

Fix Version/s: (was: 1.0.0)
   2.0.0

> [C++] Implement LogicalPlan and LogicalPlanBuilder
> --
>
> Key: ARROW-7878
> URL: https://issues.apache.org/jira/browse/ARROW-7878
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 18h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-7767) [C++] Add a facility to create a Bitmap buffer from a data pointer with a specified sentinel

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7767:

Fix Version/s: (was: 1.0.0)

> [C++] Add a facility to create a Bitmap buffer from a data pointer with a 
> specified sentinel
> -
>
> Key: ARROW-7767
> URL: https://issues.apache.org/jira/browse/ARROW-7767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> This is a special case for R and other bindings where the null value is 
> represented by a sentinel. This would read the data pointer and return a 
> null bitmap buffer where a bit is set for every row where the value is not 
> the sentinel value. If no sentinel is encountered, return nullptr. 
> {code:c++}
> template <typename CType>
> Result<std::shared_ptr<Buffer>> NullBitmapFromSentinelData(
>     MemoryPool* pool, const CType* data, size_t n_values, CType sentinel_value);
> {code}
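The proposed behavior can be sketched in pure Python (the real facility would be the C++ template above). The bitmap is built LSB-first, following the Arrow validity-bitmap convention, and `None` mirrors the "return nullptr when no sentinel is found" case.

```python
def null_bitmap_from_sentinel(values, sentinel):
    """Build a validity bitmap with a bit set for every non-sentinel
    value; return None if no sentinel occurs (all values valid)."""
    if sentinel not in values:
        return None                          # all valid: no bitmap needed
    n = len(values)
    bitmap = bytearray((n + 7) // 8)
    for i, v in enumerate(values):
        if v != sentinel:
            bitmap[i // 8] |= 1 << (i % 8)   # LSB-first bit order
    return bytes(bitmap)
```

For `[1, -1, 3]` with sentinel `-1`, bits 0 and 2 are set, giving the single byte `0b00000101`.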





[jira] [Updated] (ARROW-7594) [C++] Implement HTTP and FTP file systems

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7594:

Fix Version/s: (was: 1.0.0)

> [C++] Implement HTTP and FTP file systems
> -
>
> Key: ARROW-7594
> URL: https://issues.apache.org/jira/browse/ARROW-7594
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
>
> It'd be handy to have a (probably read-only) generic filesystem 
> implementation which wraps {{any cURLable base url}}:
> {code}
> ARROW_ASSIGN_OR_RAISE(auto fs, 
> HttpFileSystem::Make("https://some.site/json-api/v3"));
> ASSERT_OK_AND_ASSIGN(auto json_stream, fs->OpenInputStream("slug"));
> // ...
> {code}





[jira] [Updated] (ARROW-7607) [C++] Add to cpp/examples minimal examples of using Arrow as a dependency of another CMake project

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7607:

Fix Version/s: (was: 1.0.0)

> [C++] Add to cpp/examples minimal examples of using Arrow as a dependency of 
> another CMake project
> --
>
> Key: ARROW-7607
> URL: https://issues.apache.org/jira/browse/ARROW-7607
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> It would be helpful to third party developers to have a working example of 
> how we are expecting CMake users to build and use Arrow as an external 
> project in their CMake C++ projects





[jira] [Commented] (ARROW-7313) [C++] Add function for retrieving a scalar from an array slot

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124005#comment-17124005
 ] 

Wes McKinney commented on ARROW-7313:
-

Is this done?

> [C++] Add function for retrieving a scalar from an array slot
> -
>
> Key: ARROW-7313
> URL: https://issues.apache.org/jira/browse/ARROW-7313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> It'd be useful to construct scalar values given an array and an index.
> {code}
> /* static */ std::shared_ptr<Scalar> Scalar::FromArray(const Array&, int64_t);
> {code}
> Since this is much less efficient than unboxing the entire array and 
> accessing its buffers directly, it should not be used in hot loops.
> [~kszucs] [~fsaintjacques]





[jira] [Updated] (ARROW-7102) [Python] Make filesystems compatible with fsspec

2020-06-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7102:
-
Labels: FileSystem dataset-dask-integration  (was: FileSystem)

> [Python] Make filesystems compatible with fsspec
> 
>
> Key: ARROW-7102
> URL: https://issues.apache.org/jira/browse/ARROW-7102
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Tom Augspurger
>Priority: Major
>  Labels: FileSystem, dataset-dask-integration
>
> Update: regarding compatibility with {{fsspec}}, there are two directions of 
> wrapping possible:
> * Make a {{fsspec}} wrapper for {{pyarrow.fs}} (-> tracked in ARROW-8780, 
> this can ensure {{pyarrow.fs}} filesystems can be used where {{fsspec}} 
> filesytems are expected )
> * Make a {{pyarrow.fs}} wrapper for {{fsspec}} (-> tracked in ARROW-8766 this 
> can ensure {{fsspec}} filesystems can be used where {{pyarrow.fs}} filesytems 
> are expected )
> 
> [fsspec|https://filesystem-spec.readthedocs.io/en/latest] defines a common 
> API for a variety of filesystem implementations. I'm proposing an 
> FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec 
> implementation.
>  
> Right now, pyarrow has a pyarrow.filesystems.S3FSWrapper, which is specific 
> to s3fs. 
> [https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
>  This implementation could be removed entirely once an FSSpecWrapper is 
> done, or kept as an alias if it's part of the public API.
>  
> This is related to ARROW-3717, which requested a GCSFSWrapper for working 
> with Google Cloud Storage.





[jira] [Updated] (ARROW-8766) [Python] A FileSystem implementation based on Python callbacks

2020-06-02 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8766:
-
Labels: dataset-dask-integration filesystem  (was: filesystem)

> [Python] A FileSystem implementation based on Python callbacks
> --
>
> Key: ARROW-8766
> URL: https://issues.apache.org/jira/browse/ARROW-8766
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset-dask-integration, filesystem
>
> The new {{pyarrow.fs}} filesystems are now actual C++ objects, and no longer 
> "just" a python interface. So they can't easily be expanded from the Python 
> side, and the existing integration with {{fsspec}} filesystems is therefore 
> also not working anymore. 
> One possible solution is to have a C++ filesystem that calls back into a 
> Python object for each of its methods (possibly similar to how a Flight 
> server can be implemented in Python). 
> Such a FileSystem implementation would make it possible to build a 
> {{pyarrow.fs}} wrapper for {{fsspec}} filesystems, and thus allow such 
> filesystems to be used in pyarrow wherever new filesystems are expected.





[jira] [Updated] (ARROW-8866) [C++] Split Type::UNION into Type::SPARSE_UNION and Type::DENSE_UNION

2020-06-02 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-8866:
--
Priority: Blocker  (was: Major)

> [C++] Split Type::UNION into Type::SPARSE_UNION and Type::DENSE_UNION
> -
>
> Key: ARROW-8866
> URL: https://issues.apache.org/jira/browse/ARROW-8866
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Similar to the recent {{Type::INTERVAL}} split, having these two array types 
> which have different memory layouts under the same {{Type::type}} value makes 
> function dispatch somewhat more complicated. This issue is less critical 
> than the INTERVAL one, so it may not be urgent, but it seems like a good 
> pre-1.0 change.





[jira] [Updated] (ARROW-7078) [Developer] Add Windows utility script to use Dependencies.exe to dump DLL dependencies for diagnostic purposes

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7078:

Fix Version/s: (was: 1.0.0)

> [Developer] Add Windows utility script to use Dependencies.exe to dump DLL 
> dependencies for diagnostic purposes
> ---
>
> Key: ARROW-7078
> URL: https://issues.apache.org/jira/browse/ARROW-7078
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Reporter: Wes McKinney
>Priority: Major
>
> See
> https://lucasg.github.io/2018/04/29/Dependencies-command-line/
> This would help us diagnose DLL load issues





[jira] [Updated] (ARROW-9016) [Java] Remove direct references to Netty/Unsafe Allocators

2020-06-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9016:
--
Labels: pull-request-available  (was: )

> [Java] Remove direct references to Netty/Unsafe Allocators
> --
>
> Key: ARROW-9016
> URL: https://issues.apache.org/jira/browse/ARROW-9016
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Ryan Murray
>Assignee: Ryan Murray
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As part of ARROW-8230 this removes direct references to Netty and Unsafe 
> Allocation managers in the `DefaultAllocationManagerOption`





[jira] [Updated] (ARROW-6941) [C++] Unpin gtest in build environment

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6941:

Fix Version/s: (was: 1.0.0)

> [C++] Unpin gtest in build environment
> --
>
> Key: ARROW-6941
> URL: https://issues.apache.org/jira/browse/ARROW-6941
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Follow up to failure triaged in ARROW-6834





[jira] [Updated] (ARROW-6818) [Doc] Format docs confusing

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6818:

Fix Version/s: (was: 1.0.0)

> [Doc] Format docs confusing
> ---
>
> Key: ARROW-6818
> URL: https://issues.apache.org/jira/browse/ARROW-6818
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Format
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> I find there are several issues in the format docs.
> 1) there is a claimed distinction between "logical types" and "physical 
> types", but the "physical types" actually lists logical types such as Map
> 2) the "logical types" document doesn't actually list logical types, it just 
> sends to the flatbuffers file. One shouldn't have to read a flatbuffers file 
> to understand the Arrow format.
> 3) some terminology seems unusual, such as "relative type"
> 4) why is there a link to the Apache Drill docs?





[jira] [Updated] (ARROW-6858) [C++] Create Python script to handle transitive component dependencies

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6858:

Fix Version/s: (was: 1.0.0)

> [C++] Create Python script to handle transitive component dependencies
> --
>
> Key: ARROW-6858
> URL: https://issues.apache.org/jira/browse/ARROW-6858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> In the C++ build system, we are handling relationships between optional 
> components in an ad hoc fashion
> https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L266
> This doesn't seem ideal. 
> As discussed on the mailing list, I suggest declaring dependencies in a 
> Python data structure and then generating and checking in a .cmake file that 
> can be {{include}}d. This will be a bit easier than maintaining them on an 
> ad hoc basis.
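The core of such a generator is the transitive closure of the component graph. This sketch uses hypothetical component names, not Arrow's actual CMake components; a script would walk the dict and emit the resulting closure into a checked-in .cmake file.

```python
def transitive_deps(deps, component):
    """Return the set of all components reachable from `component`
    via the dependency dict (the component itself excluded)."""
    seen, stack = set(), [component]
    while stack:
        cur = stack.pop()
        for d in deps.get(cur, ()):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

# Hypothetical component graph, declared once in Python
DEPS = {
    "dataset": ["parquet", "filesystem"],
    "parquet": ["compute"],
    "filesystem": [],
    "compute": [],
}
```

Enabling "dataset" would then automatically turn on "parquet", "filesystem", and (transitively) "compute".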





[jira] [Resolved] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6856.
-
Resolution: Fixed

I did this in 
https://github.com/apache/arrow/commit/94a5026edb652d060110cac170380edf3d856f05

> [C++] Use ArrayData instead of Array for ArrayData::dictionary
> --
>
> Key: ARROW-6856
> URL: https://issues.apache.org/jira/browse/ARROW-6856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This would be helpful for consistency. {{DictionaryArray}} may want to cache 
> a "boxed" version of this to return from {{DictionaryArray::dictionary}}





[jira] [Updated] (ARROW-6436) [C++] vendor a half precision floating point library

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6436:

Fix Version/s: (was: 1.0.0)

> [C++] vendor a half precision floating point library
> 
>
> Key: ARROW-6436
> URL: https://issues.apache.org/jira/browse/ARROW-6436
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
>
> Clang and GCC provide _Float16, and there are numerous polyfills that can 
> emulate a 16-bit float on other platforms. This would fill a hole in the 
> kernels and other code which don't currently support HALF_FLOAT.





[jira] [Updated] (ARROW-6404) [C++] CMake build of arrow libraries fails on Windows

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6404:

Fix Version/s: (was: 1.0.0)

> [C++] CMake build of arrow libraries fails on Windows
> -
>
> Key: ARROW-6404
> URL: https://issues.apache.org/jira/browse/ARROW-6404
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: ARF
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: build, pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> I am trying to build the python pyarrow extension on Windows 10 using Visual 
> Studio 2015 Build Tools and the current stable CMake.
> Following [the 
> instructions|https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst]
>  to the letter, CMake fails with the error:
> {{CMake Error at cmake_modules/SetupCxxFlags.cmake:42 (string):}}
> {{ string no output variable specified}}
> {{Call Stack (most recent call first):}}
> {{ CMakeLists.txt:357 (include)}}
> 
> Complete output:
> {{(pyarrow-dev) Z:\devel\arrow\cpp\build>cmake -G "Visual Studio 14 2015 
> Win64" ^}}
> {{More? -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^}}
> {{More? -DARROW_CXXFLAGS="/WX /MP" ^}}
> {{More? -DARROW_GANDIVA=on ^}}
> {{More? -DARROW_PARQUET=on ^}}
> {{More? -DARROW_PYTHON=on ..}}
> {{-- Building using CMake version: 3.15.2}}
> {{CMake Error at CMakeLists.txt:30 (string):}}
> {{ string no output variable specified}}
> {{-- Selecting Windows SDK version to target Windows 10.0.17763.}}
> {{-- The C compiler identification is MSVC 19.0.24210.0}}
> {{-- The CXX compiler identification is MSVC 19.0.24210.0}}
> {{-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio 14.0/VC/bin/x86_amd64/cl.exe}}
> {{-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio 14.0/VC/bin/x86_amd64/cl.exe -- works}}
> {{-- Detecting C compiler ABI info}}
> {{-- Detecting C compiler ABI info - done}}
> {{-- Detecting C compile features}}
> {{-- Detecting C compile features - done}}
> {{-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio 14.0/VC/bin/x86_amd64/cl.exe}}
> {{-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual 
> Studio 14.0/VC/bin/x86_amd64/cl.exe -- works}}
> {{-- Detecting CXX compiler ABI info}}
> {{-- Detecting CXX compiler ABI info - done}}
> {{-- Detecting CXX compile features}}
> {{-- Detecting CXX compile features - done}}
> {{-- Arrow version: 0.15.0 (full: '0.15.0-SNAPSHOT')}}
> {{-- Arrow SO version: 15 (full: 15.0.0)}}
> {{-- Found PkgConfig: 
> Z:/Systemdateien/Miniconda3/envs/pyarrow-dev/Library/bin/pkg-config.exe 
> (found version "0.29.2")}}
> {{-- clang-tidy not found}}
> {{-- clang-format not found}}
> {{-- infer not found}}
> {{-- Found PythonInterp: 
> Z:/Systemdateien/Miniconda3/envs/pyarrow-dev/python.exe (found version 
> "3.7.3")}}
> {{-- Found cpplint executable at Z:/devel/arrow/cpp/build-support/cpplint.py}}
> {{-- Compiler command: C:/Program Files (x86)/Microsoft Visual Studio 
> 14.0/VC/bin/x86_amd64/cl.exe}}
> {{-- Compiler version:}}
> {{-- Compiler id: MSVC}}
> {{Selected compiler msvc}}
> {{-- Performing Test CXX_SUPPORTS_SSE4_2}}
> {{-- Performing Test CXX_SUPPORTS_SSE4_2 - Failed}}
> {{-- Performing Test CXX_SUPPORTS_ALTIVEC}}
> {{-- Performing Test CXX_SUPPORTS_ALTIVEC - Failed}}
> {{-- Performing Test CXX_SUPPORTS_ARMCRC}}
> {{-- Performing Test CXX_SUPPORTS_ARMCRC - Failed}}
> {{-- Performing Test CXX_SUPPORTS_ARMV8_CRC_CRYPTO}}
> {{-- Performing Test CXX_SUPPORTS_ARMV8_CRC_CRYPTO - Failed}}
> {{CMake Error at cmake_modules/SetupCxxFlags.cmake:42 (string):}}
> {{ string no output variable specified}}
> {{Call Stack (most recent call first):}}
> {{ CMakeLists.txt:357 (include)}}
> {{-- Arrow build warning level: CHECKIN}}
> {{Configured for build (set with cmake 
> -DCMAKE_BUILD_TYPE=\{release,debug,...})}}
> {{CMake Error at cmake_modules/SetupCxxFlags.cmake:438 (message):}}
> {{ Unknown build type:}}
> {{Call Stack (most recent call first):}}
> {{ CMakeLists.txt:357 (include)}}
> {{-- Configuring incomplete, errors occurred!}}
> {{See also "Z:/devel/arrow/cpp/build/CMakeFiles/CMakeOutput.log".}}
> {{See also "Z:/devel/arrow/cpp/build/CMakeFiles/CMakeError.log".}}
> {{(pyarrow-dev) Z:\devel\arrow\cpp\build>}}





[jira] [Commented] (ARROW-6459) [C++] Remove "python" from conda_env_cpp.yml

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123995#comment-17123995
 ] 

Wes McKinney commented on ARROW-6459:
-

I still find this to be a nuisance. [~kszucs], could you take a look at this?

> [C++] Remove "python" from conda_env_cpp.yml
> 
>
> Key: ARROW-6459
> URL: https://issues.apache.org/jira/browse/ARROW-6459
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: 1.0.0
>
>
> I'm not sure why "python" is in this dependency file -- if it is used to 
> maintain a toolchain external to a particular Python environment, then it 
> confuses CMake, for example:
> {code}
> CMake Warning at cmake_modules/BuildUtils.cmake:529 (add_executable):
>   Cannot generate a safe runtime search path for target arrow-python-test
>   because there is a cycle in the constraint graph:
> dir 0 is [/home/wesm/code/arrow/cpp/build/debug]
> dir 1 is [/home/wesm/miniconda/envs/arrow-3.7/lib]
>   dir 2 must precede it due to runtime library [libcrypto.so.1.1]
> dir 2 is [/home/wesm/cpp-toolchain/lib]
>   dir 1 must precede it due to runtime library [libpython3.7m.so.1.0]
>   Some of these libraries may not be found correctly.
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:52 (add_test_case)
>   src/arrow/python/CMakeLists.txt:139 (add_arrow_test)
> {code}





[jira] [Commented] (ARROW-6043) [Python] Array equals returns incorrectly if NaNs are in arrays

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123991#comment-17123991
 ] 

Wes McKinney commented on ARROW-6043:
-

I still think we need to clearly distinguish "data structure equality" from 
"semantic equality". For example, "semantic equality" is probably better 
addressed by kernels ({{(a == b).all()}})
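The distinction can be illustrated with plain Python floats (a sketch of the two notions of equality, not pyarrow itself):

```python
import struct

nan = float("nan")

a = [0.0, 1.0, nan]
b = [0.0, 1.0, nan]

# Semantic equality, element-wise: IEEE 754 defines NaN != NaN, so a
# comparison kernel followed by all() reports False here.
semantic = all(x == y for x, y in zip(a, b))

# Data-structure equality: compare raw byte representations instead;
# identical NaN payloads compare equal at the byte level.
structural = struct.pack("3d", *a) == struct.pack("3d", *b)

print(semantic, structural)  # False True
```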

> [Python] Array equals returns incorrectly if NaNs are in arrays
> ---
>
> Key: ARROW-6043
> URL: https://issues.apache.org/jira/browse/ARROW-6043
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Keith Kraus
>Priority: Major
> Fix For: 1.0.0
>
>
> {code:python}
> import numpy as np
> import pyarrow as pa
> data = [0, 1, np.nan, None, 4]
> arr1 = pa.array(data)
> arr2 = pa.array(data)
> pa.Array.equals(arr1, arr2)
> {code}
> Unsure if this is expected behavior, but in Arrow 0.12.1 this returned `True` 
> as compared to `False` in 0.14.1.





[jira] [Updated] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6179:

Fix Version/s: (was: 1.0.0)

> [C++] ExtensionType subclass for "unknown" types?
> -
>
> Key: ARROW-6179
> URL: https://issues.apache.org/jira/browse/ARROW-6179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++, when receiving IPC with extension type metadata for a type that is 
> unknown (the name is not registered), we currently fall back to returning the 
> "raw" storage array. The custom metadata (extension name and metadata) is 
> still available in the Field metadata.
> Alternatively, we could also have a generic {{ExtensionType}} class that can 
> hold such an "unknown" extension type (e.g. {{UnknownExtensionType}} or 
> {{GenericExtensionType}}), keeping the extension name and metadata in the 
> Array's type. 
> This could be a single class where several instances can be created given a 
> storage type, an extension name, and optionally extension metadata. It would 
> be a way to have an unregistered extension type.
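A minimal sketch of what such a generic holder could look like (hypothetical names; this is not the actual Arrow API):

```python
# Hypothetical sketch of a generic type for unregistered extensions.
# Names (UnknownExtensionType, storage_type, ...) are illustrative only.
class UnknownExtensionType:
    def __init__(self, storage_type, extension_name, extension_metadata=b""):
        self.storage_type = storage_type          # e.g. "binary"
        self.extension_name = extension_name      # e.g. "myorg.uuid"
        self.extension_metadata = extension_metadata

    def __eq__(self, other):
        # Two unknown extension types are equal when storage type,
        # extension name and metadata all match.
        return (isinstance(other, UnknownExtensionType)
                and self.storage_type == other.storage_type
                and self.extension_name == other.extension_name
                and self.extension_metadata == other.extension_metadata)

t1 = UnknownExtensionType("binary", "myorg.uuid")
t2 = UnknownExtensionType("binary", "myorg.uuid")
print(t1 == t2)  # True
```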





[jira] [Updated] (ARROW-5679) [Python] Drop Python 3.5 from support matrix

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5679:

Fix Version/s: (was: 1.0.0)

> [Python] Drop Python 3.5 from support matrix
> 
>
> Key: ARROW-5679
> URL: https://issues.apache.org/jira/browse/ARROW-5679
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>
> We probably need to maintain Python 3.5 on Linux and macOS for the time 
> being, but we may want to drop it for Windows since conda-forge isn't 
> supporting Python 3.5 anymore, so maintaining wheels for Python 3.5 will come 
> with extra cost





[jira] [Updated] (ARROW-5345) [C++] Relax Field hashing in DictionaryMemo

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5345:

Fix Version/s: (was: 1.0.0)

> [C++] Relax Field hashing in DictionaryMemo
> ---
>
> Key: ARROW-5345
> URL: https://issues.apache.org/jira/browse/ARROW-5345
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Follow up to ARROW-3144
> Currently we associate dictionaries with a hash table mapping a Field's 
> memory address to a dictionary id. This poses an issue if two RecordBatches 
> are equal (equal field names, equal types) but were instantiated separately. 
> We don't have a hash function for Field in C++, so we should consider 
> implementing one and using it instead (if it is not too expensive), so that 
> "same but different" fields (distinct C++ objects with equal contents) won't 
> blow up in the user's face with an unintuitive error. This did in fact occur 
> once in the Python test suite; it is not clear why it wasn't a problem 
> before -- it likely worked "by accident".
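A small sketch of the idea (plain Python, standing in for the C++ memo): key the dictionary memo by field *value* rather than by the object's memory address, so equal-but-distinct objects resolve to the same dictionary id.

```python
# `field` is a (name, type) tuple standing in for a C++ Field;
# hashing by value avoids the "same but different object" failure.
memo = {}

def dictionary_id(field):
    if field not in memo:
        memo[field] = len(memo)
    return memo[field]

a = ("labels", "dictionary<int8, string>")
b = ("labels", "dictionary<int8, string>")  # constructed separately
print(dictionary_id(a) == dictionary_id(b))  # True
```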





[jira] [Commented] (ARROW-5679) [Python] Drop Python 3.5 from support matrix

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123987#comment-17123987
 ] 

Wes McKinney commented on ARROW-5679:
-

It seems like there's no urgency on this

> [Python] Drop Python 3.5 from support matrix
> 
>
> Key: ARROW-5679
> URL: https://issues.apache.org/jira/browse/ARROW-5679
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> We probably need to maintain Python 3.5 on Linux and macOS for the time 
> being, but we may want to drop it for Windows since conda-forge isn't 
> supporting Python 3.5 anymore, so maintaining wheels for Python 3.5 will come 
> with extra cost





[jira] [Assigned] (ARROW-5679) [Python] Drop Python 3.5 from support matrix

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5679:
---

Assignee: (was: Krisztian Szucs)

> [Python] Drop Python 3.5 from support matrix
> 
>
> Key: ARROW-5679
> URL: https://issues.apache.org/jira/browse/ARROW-5679
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> We probably need to maintain Python 3.5 on Linux and macOS for the time 
> being, but we may want to drop it for Windows since conda-forge isn't 
> supporting Python 3.5 anymore, so maintaining wheels for Python 3.5 will come 
> with extra cost





[jira] [Assigned] (ARROW-5082) [Python][Packaging] Reduce size of macOS and manylinux1 wheels

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5082:
---

Assignee: Wes McKinney

> [Python][Packaging] Reduce size of macOS and manylinux1 wheels
> --
>
> Key: ARROW-5082
> URL: https://issues.apache.org/jira/browse/ARROW-5082
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available, wheel
> Fix For: 1.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> The wheels more than tripled in size from 0.12.0 to 0.13.0. I think this is 
> mostly because of LLVM but we should take a closer look to see if the size 
> can be reduced





[jira] [Commented] (ARROW-5082) [Python][Packaging] Reduce size of macOS and manylinux1 wheels

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123984#comment-17123984
 ] 

Wes McKinney commented on ARROW-5082:
-

I'm going to work on this

> [Python][Packaging] Reduce size of macOS and manylinux1 wheels
> --
>
> Key: ARROW-5082
> URL: https://issues.apache.org/jira/browse/ARROW-5082
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available, wheel
> Fix For: 1.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> The wheels more than tripled in size from 0.12.0 to 0.13.0. I think this is 
> mostly because of LLVM but we should take a closer look to see if the size 
> can be reduced





[jira] [Created] (ARROW-9018) [C++] Remove APIs that were deprecated in 0.17.x and prior

2020-06-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9018:
---

 Summary: [C++] Remove APIs that were deprecated in 0.17.x and prior
 Key: ARROW-9018
 URL: https://issues.apache.org/jira/browse/ARROW-9018
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0








[jira] [Updated] (ARROW-4096) [C++] Preserve "ordered" metadata in some special cases in dictionary unification

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4096:

Fix Version/s: (was: 1.0.0)

> [C++] Preserve "ordered" metadata in some special cases in dictionary 
> unification
> -
>
> Key: ARROW-4096
> URL: https://issues.apache.org/jira/browse/ARROW-4096
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
>
> In the event that all dictionaries are prefixes of a common dictionary, and 
> all have ordered=true (note: this is not the same thing as being sorted), the 
> resulting unified dictionary can also have ordered=true
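The prefix condition can be checked directly; a small sketch in plain Python (not the C++ unifier):

```python
# Sketch: ordered=true can be preserved when every input dictionary is a
# prefix of a common (longest) dictionary, since existing indices keep
# their relative order in the unified dictionary.
def can_stay_ordered(dictionaries):
    longest = max(dictionaries, key=len)
    return all(longest[:len(d)] == d for d in dictionaries)

print(can_stay_ordered([["a"], ["a", "b"], ["a", "b", "c"]]))  # True
print(can_stay_ordered([["b"], ["a", "b"]]))                   # False
```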





[jira] [Closed] (ARROW-3739) [C++] Add option to convert a particular column to timestamps or dates using a passed strptime-compatible string

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3739.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Fixed

duplicate of ARROW-8711


> [C++] Add option to convert a particular column to timestamps or dates using 
> a passed strptime-compatible string
> 
>
> Key: ARROW-3739
> URL: https://issues.apache.org/jira/browse/ARROW-3739
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, dataset
>
> Probably will need something like
> {code}
> ...
> types={'date_col': csv.convert_date('%Y%m%d')}
> {code}
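The strptime-compatible format string in the proposal maps directly onto Python's standard date parsing, e.g.:

```python
from datetime import datetime

# '%Y%m%d' parses compact dates such as those the proposed
# csv.convert_date('%Y%m%d') option would handle.
d = datetime.strptime("20181107", "%Y%m%d").date()
print(d)  # 2018-11-07
```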





[jira] [Updated] (ARROW-6388) [C++] Consider implementing BufferOutputStream using BufferBuilder internally

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6388:

Fix Version/s: (was: 1.0.0)

> [C++] Consider implementing BufferOutputStream using BufferBuilder internally
> 
>
> Key: ARROW-6388
> URL: https://issues.apache.org/jira/browse/ARROW-6388
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See discussion in ARROW-6381 https://github.com/apache/arrow/pull/5222
> We should be careful that this doesn't introduce any performance regression.





[jira] [Commented] (ARROW-9017) [Python] Refactor the Scalar classes

2020-06-02 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123957#comment-17123957
 ] 

Antoine Pitrou commented on ARROW-9017:
---

We should definitely unify those around a Cython wrapper to C++ 
{{arrow::Scalar}}, IMHO.
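A rough sketch of the unified shape under discussion (hypothetical, not the eventual pyarrow API): a single Scalar class hierarchy plus a scalar() factory usable when calling kernels.

```python
# Hypothetical sketch only; the real unification would wrap the C++
# arrow::Scalar in a Cython class rather than pure Python.
class Scalar:
    def __init__(self, value, type_name):
        self.value = value
        self.type = type_name

    def as_py(self):
        # Convert back to a plain Python object.
        return self.value

def scalar(value):
    # Infer an Arrow-style type name from the Python value, mirroring
    # the kind of inference pa.array() performs.
    inferred = {bool: "bool", int: "int64", float: "double", str: "string"}
    return Scalar(value, inferred[type(value)])

s = scalar(1)
print(s.type, s.as_py())  # int64 1
```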

> [Python] Refactor the Scalar classes
> 
>
> Key: ARROW-9017
> URL: https://issues.apache.org/jira/browse/ARROW-9017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> The situation regarding scalars in Python is currently not optimal.
> We have two different "types" of scalars:
> - {{ArrayValue(Scalar)}} (and subclasses of that for all types):  this is 
> used when you access a single element of an array (eg {{arr[0]}})
> - {{ScalarValue(Scalar)}} (and subclasses of that for _some_ types): this is 
> used when wrapping a C++ scalar into a python scalar, eg when you get back a 
> scalar from a reduction like {{arr.sum()}}.
> And while we have two versions of scalars, neither of them can easily be 
> used as a scalar, since neither can be constructed from a Python scalar 
> (there is no {{scalar(1)}} function to use when calling a kernel, for 
> example).
> I think we should try to unify those scalar classes (which probably means 
> getting rid of the ArrayValue scalar).
> In addition, there is the issue of trying to re-use Python scalar <-> Arrow 
> conversion code, as there is also logic for this in {{python_to_arrow.cc}}. 
> But this is probably a bigger change. cc [~kszucs] 





[jira] [Commented] (ARROW-9017) [Python] Refactor the Scalar classes

2020-06-02 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123950#comment-17123950
 ] 

Joris Van den Bossche commented on ARROW-9017:
--

And a comment from Ben: a relevant recent addition is

{code}
 Result<std::shared_ptr<Scalar>> Array::GetScalar(int64_t i) const;
{code}

> [Python] Refactor the Scalar classes
> 
>
> Key: ARROW-9017
> URL: https://issues.apache.org/jira/browse/ARROW-9017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> The situation regarding scalars in Python is currently not optimal.
> We have two different "types" of scalars:
> - {{ArrayValue(Scalar)}} (and subclasses of that for all types):  this is 
> used when you access a single element of an array (eg {{arr[0]}})
> - {{ScalarValue(Scalar)}} (and subclasses of that for _some_ types): this is 
> used when wrapping a C++ scalar into a python scalar, eg when you get back a 
> scalar from a reduction like {{arr.sum()}}.
> And while we have two versions of scalars, neither of them can easily be 
> used as a scalar, since neither can be constructed from a Python scalar 
> (there is no {{scalar(1)}} function to use when calling a kernel, for 
> example).
> I think we should try to unify those scalar classes (which probably means 
> getting rid of the ArrayValue scalar).
> In addition, there is the issue of trying to re-use Python scalar <-> Arrow 
> conversion code, as there is also logic for this in {{python_to_arrow.cc}}. 
> But this is probably a bigger change. cc [~kszucs] 





  1   2   >