[jira] [Assigned] (ARROW-6841) [C++] Upgrade to LLVM 8

2020-03-16 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-6841:
---

Assignee: Jun NAITOH

> [C++] Upgrade to LLVM 8
> ---
>
> Key: ARROW-6841
> URL: https://issues.apache.org/jira/browse/ARROW-6841
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Jun NAITOH
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Now that LLVM 9 has been released, LLVM 8 has been promoted to stable 
> according to 
> http://apt.llvm.org/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8129) [C++][Compute] Refine compare sorting kernel

2020-03-16 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060602#comment-17060602
 ] 

Yibo Cai commented on ARROW-8129:
-

Pierre,

Another 10% performance improvement is observed after applying your change. The 
only problem is that binary/string types don't have a raw_values() member function.

The disassembled code shows the difference (it may not be very accurate, as it's hard 
to locate the original code after compiler optimization):

{code:c}
*(values_begin + left) < *(values_begin + right);

7b6d29:   49 8b 7c fd 00  mov    0x0(%r13,%rdi,8),%rdi
7b6d2e:   4b 39 7c cd 00  cmp    %rdi,0x0(%r13,%r9,8)
{code}

{code:c}
values.GetView(left) < values.GetView(right);

7ba98f:   4c 8b 49 08     mov    0x8(%rcx),%r9
7ba993:   4c 8b 51 20     mov    0x20(%rcx),%r10
7ba997:   4d 8b 49 20     mov    0x20(%r9),%r9
7ba99b:   49 8d 1c da     lea    (%r10,%rbx,8),%rbx
7ba99f:   49 8d 3c fa     lea    (%r10,%rdi,8),%rdi
7ba9a3:   4a 8b 3c cf     mov    (%rdi,%r9,8),%rdi
7ba9a7:   4a 39 3c cb     cmp    %rdi,(%rbx,%r9,8)
{code}

> [C++][Compute] Refine compare sorting kernel
> 
>
> Key: ARROW-8129
> URL: https://issues.apache.org/jira/browse/ARROW-8129
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>
> Sorting kernel implements two comparison functions, 
> [CompareValues|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L67]
>  uses array.Value() for numeric data and 
> [CompareViews|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L72]
>  uses array.GetView() for non-numeric ones. It can be simplified by using 
> GetView() only as all data types support GetView().
> To my surprise, benchmark shows about 40% performance improvement after the 
> change.
> After some digging, I find in current code, the [comparison 
> callback|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L94]
>  is not inlined (see the disassembled code), which leads to a function call. That is 
> very bad for this hot loop. Using only GetView() fixes the issue: the code 
> is inlined.





[jira] [Resolved] (ARROW-8087) [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema

2020-03-16 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8087.
---
Resolution: Fixed

Issue resolved by pull request 6594
[https://github.com/apache/arrow/pull/6594]

> [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema
> --
>
> Key: ARROW-8087
> URL: https://issues.apache.org/jira/browse/ARROW-8087
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, when reading a partitioned dataset with hive partitioning, it 
> seems that the partition columns get sorted alphabetically when appending 
> them to the schema (while the old ParquetDataset implementation keeps the 
> order as it is present in the paths).  
> For a regular partitioning this order is consistent for all fragments.
> So for example for the typical NYC Taxi data example, with datasets, the 
> schema ends with columns "month, year", while the ParquetDataset appends them 
> as "year, month".
> Python example:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds  # used in the session below
> import pyarrow.parquet as pq
>
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
>     'foo': np.array(foo_keys, dtype='i4').repeat(15),
>     'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
>     'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}
> {code}
> >>> pq.read_table("test_order").schema
> values: double
> foo: dictionary
> bar: dictionary
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
> values: double
> bar: string
> foo: int32
> {code}
> so "foo, bar" vs "bar, foo" (the fact that they are dictionaries is a 
> separate matter)
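
The ordering difference described above can be illustrated without Arrow. The sketch below is plain Python and uses a hypothetical helper, partition_keys, that is not part of pyarrow. It extracts the key=value segments of a hive-style path in the order they appear and compares that with an alphabetically sorted list. The file name is made up, and the sorted list is only an assumption about how the "bar, foo" order might arise.

```python
from pathlib import PurePosixPath


def partition_keys(path):
    """Return hive-style partition keys in the order they appear in `path`."""
    keys = []
    for segment in PurePosixPath(path).parts:
        key, sep, _value = segment.partition("=")
        if sep and key:
            keys.append(key)
    return keys


path = "test_order/foo=0/bar=a/part-0.parquet"  # hypothetical file name
path_order = partition_keys(path)
sorted_order = sorted(path_order)

print(path_order)    # ['foo', 'bar']
print(sorted_order)  # ['bar', 'foo']
```

Keeping the keys in a list rather than a sorted mapping is what preserves the order found in the paths.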





[jira] [Resolved] (ARROW-7616) [Java] Support comparing value ranges for dense union vector

2020-03-16 Thread Brian Hulette (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette resolved ARROW-7616.
--
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6355
[https://github.com/apache/arrow/pull/6355]

> [Java] Support comparing value ranges for dense union vector
> 
>
> Key: ARROW-7616
> URL: https://issues.apache.org/jira/browse/ARROW-7616
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> After we support dense union vectors, we should support range value 
> comparisons for them.





[jira] [Updated] (ARROW-8126) [C++][Compute] Add Top-K kernel benchmark

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8126:
--
Labels: pull-request-available  (was: )

> [C++][Compute] Add Top-K kernel benchmark
> -
>
> Key: ARROW-8126
> URL: https://issues.apache.org/jira/browse/ARROW-8126
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Compute
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Resolved] (ARROW-8027) [Developer][Integration] Add integration tests for duplicate field names

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8027.
-
Fix Version/s: (was: 1.0.0)
   0.17.0
   Resolution: Fixed

Issue resolved by pull request 6636
[https://github.com/apache/arrow/pull/6636]

> [Developer][Integration] Add integration tests for duplicate field names
> 
>
> Key: ARROW-8027
> URL: https://issues.apache.org/jira/browse/ARROW-8027
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Schemas and nested types whose fields' names are not unique are permitted, so 
> the integration tests should include a case which exercises these.





[jira] [Assigned] (ARROW-8027) [Developer][Integration] Add integration tests for duplicate field names

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8027:
---

Assignee: Ben Kietzman

> [Developer][Integration] Add integration tests for duplicate field names
> 
>
> Key: ARROW-8027
> URL: https://issues.apache.org/jira/browse/ARROW-8027
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Schemas and nested types whose fields' names are not unique are permitted, so 
> the integration tests should include a case which exercises these.





[jira] [Commented] (ARROW-8129) [C++][Compute] Refine compare sorting kernel

2020-03-16 Thread Pierre Belzile (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060566#comment-17060566
 ] 

Pierre Belzile commented on ARROW-8129:
---

Yibo,

Given that you are looking at this, I wonder if the following wouldn't be 
faster for CompareValues:
{code:cpp}
auto values_begin = values.raw_values(); // this also exists everywhere
std::stable_sort(indices_begin, nulls_begin,
                 [values_begin](uint64_t left, uint64_t right) {
                   return *(values_begin + left) < *(values_begin + right);
                 });
{code}
It avoids dereferencing the shared_ptr data_ and adding the offset twice for 
each comparison.
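
The idea above, sorting indices by comparing the underlying values directly, can be sketched in Python. This is only an illustration of a stable index sort over a plain list, not the Arrow kernel. The name sort_to_indices_sketch is hypothetical, nulls are modelled as None, and placing them last is an assumption about the role of nulls_begin.

```python
def sort_to_indices_sketch(values):
    """Return indices that order `values` ascending, with None entries last.

    Python's sorted() is stable, like std::stable_sort.
    """
    indices = range(len(values))
    non_null = [i for i in indices if values[i] is not None]
    nulls = [i for i in indices if values[i] is None]
    # Compare the raw values directly through the index, as the lambda does.
    non_null = sorted(non_null, key=lambda i: values[i])
    return non_null + nulls


values = [3, None, 1, 3, 2]
print(sort_to_indices_sketch(values))  # [2, 4, 0, 3, 1]
```

The two equal values (indices 0 and 3) keep their original relative order, which is the property a stable sort guarantees.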

> [C++][Compute] Refine compare sorting kernel
> 
>
> Key: ARROW-8129
> URL: https://issues.apache.org/jira/browse/ARROW-8129
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>
> Sorting kernel implements two comparison functions, 
> [CompareValues|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L67]
>  uses array.Value() for numeric data and 
> [CompareViews|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L72]
>  uses array.GetView() for non-numeric ones. It can be simplified by using 
> GetView() only as all data types support GetView().
> To my surprise, benchmark shows about 40% performance improvement after the 
> change.
> After some digging, I find in current code, the [comparison 
> callback|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L94]
>  is not inlined (see the disassembled code), which leads to a function call. That is 
> very bad for this hot loop. Using only GetView() fixes the issue: the code 
> is inlined.





[jira] [Updated] (ARROW-7979) [C++] Implement experimental buffer compression in IPC messages

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7979:
--
Labels: pull-request-available  (was: )

> [C++] Implement experimental buffer compression in IPC messages
> ---
>
> Key: ARROW-7979
> URL: https://issues.apache.org/jira/browse/ARROW-7979
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>
> The idea is that this can be used for experiments and bespoke applications 
> (e.g. in the context of ARROW-5510). If this is adopted formally into the IPC 
> format then the experimental implementation can be altered to match the 
> specification





[jira] [Resolved] (ARROW-7812) [Packaging][Python] Upgrade LLVM in manylinux1 docker image

2020-03-16 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-7812.
-
Resolution: Duplicate

> [Packaging][Python] Upgrade LLVM in manylinux1 docker image
> ---
>
> Key: ARROW-7812
> URL: https://issues.apache.org/jira/browse/ARROW-7812
> Project: Apache Arrow
>  Issue Type: Task
>  Components: CI
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-8095) [CI][Crossbow] Nightly turbodbc job fails

2020-03-16 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060538#comment-17060538
 ] 

Kouhei Sutou commented on ARROW-8095:
-

[~apitrou] [~uwe] The build error can be fixed by the following Apache Arrow 
C++ change:

{noformat}
diff --git a/cpp/src/arrow/array/builder_dict.h b/cpp/src/arrow/array/builder_dict.h
index 98829010d..57674a1bb 100644
--- a/cpp/src/arrow/array/builder_dict.h
+++ b/cpp/src/arrow/array/builder_dict.h
@@ -175,6 +175,12 @@ class DictionaryBuilderBase : public ArrayBuilder {
     return Append(util::string_view(value, length));
   }
 
+  /// \brief Append a string (only for string types)
+  template <typename T1 = T>
+  enable_if_string_like<T1, Status> Append(const char* value, int32_t length) {
+    return Append(util::string_view(value, length));
+  }
+
   /// \brief Append a scalar null value
   Status AppendNull() final {
     length_ += 1;
{noformat}

Which should we change: Apache Arrow C++ or Turbodbc?

> [CI][Crossbow] Nightly turbodbc job fails
> -
>
> Key: ARROW-8095
> URL: https://issues.apache.org/jira/browse/ARROW-8095
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Neal Richardson
>Priority: Blocker
> Fix For: 0.17.0
>
>
> Turbodbc fails to compile (both "master" and "latest" versions with this 
> error):
> {code}
> FAILED: 
> cpp/turbodbc_arrow/Library/CMakeFiles/turbodbc_arrow_support.dir/src/arrow_result_set.cpp.o
>  
> /opt/conda/envs/arrow/bin/x86_64-conda_cos6-linux-gnu-c++  
> -Dturbodbc_arrow_support_EXPORTS -I/turbodbc/cpp/turbodbc_arrow/Library 
> -I/turbodbc/cpp/turbodbc_arrow/../cpp_odbc/Library 
> -I/turbodbc/cpp/turbodbc_arrow/../turbodbc/Library 
> -I/turbodbc/pybind11/include -isystem /opt/conda/envs/arrow/include -isystem 
> /opt/conda/envs/arrow/include/python3.7m -isystem 
> /opt/conda/envs/arrow/lib/python3.7/site-packages/numpy/core/include 
> -fvisibility-inlines-hidden -Wall -Wextra -g -O0 -pedantic -fPIC 
> -fvisibility=hidden   -std=c++11 -std=c++14 -MD -MT 
> cpp/turbodbc_arrow/Library/CMakeFiles/turbodbc_arrow_support.dir/src/arrow_result_set.cpp.o
>  -MF 
> cpp/turbodbc_arrow/Library/CMakeFiles/turbodbc_arrow_support.dir/src/arrow_result_set.cpp.o.d
>  -o 
> cpp/turbodbc_arrow/Library/CMakeFiles/turbodbc_arrow_support.dir/src/arrow_result_set.cpp.o
>  -c /turbodbc/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp
> /turbodbc/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp: In member 
> function 'arrow::Status 
> turbodbc_arrow::{anonymous}::StringDictionaryBuilderProxy::AppendProxy(const 
> char*, int32_t)':
> /turbodbc/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp:67:36: error: 
> no matching function for call to 
> 'turbodbc_arrow::{anonymous}::StringDictionaryBuilderProxy::Append(const 
> char*&, int32_t&)'
>  return Append(value, length);
> ^
> In file included from /opt/conda/envs/arrow/include/arrow/builder.h:26:0,
>  from /opt/conda/envs/arrow/include/arrow/api.h:26,
>  from 
> /turbodbc/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp:6:
> /opt/conda/envs/arrow/include/arrow/array/builder_dict.h:143:10: note: 
> candidate: arrow::Status arrow::internal::DictionaryBuilderBase T>::Append(const Scalar&) [with BuilderType = arrow::AdaptiveIntBuilder; T = 
> arrow::StringType; arrow::internal::DictionaryBuilderBase T>::Scalar = nonstd::sv_lite::basic_string_view]
>Status Append(const Scalar& value) {
>   ^~
> /opt/conda/envs/arrow/include/arrow/array/builder_dict.h:143:10: note:   
> candidate expects 1 argument, 2 provided
> /opt/conda/envs/arrow/include/arrow/array/builder_dict.h:156:43: note: 
> candidate: template arrow::enable_if_fixed_size_binary arrow::Status> arrow::internal::DictionaryBuilderBase T>::Append(const uint8_t*) [with T1 = T1; BuilderType = 
> arrow::AdaptiveIntBuilder; T = arrow::StringType]
>enable_if_fixed_size_binary Append(const uint8_t* value) {
>^~
> /opt/conda/envs/arrow/include/arrow/array/builder_dict.h:156:43: note:   
> template argument deduction/substitution failed:
> /turbodbc/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp:67:36: note:   
> candidate expects 1 argument, 2 provided
>  return Append(value, length);
> ^
> In file included from /opt/conda/envs/arrow/include/arrow/builder.h:26:0,
>  from /opt/conda/envs/arrow/include/arrow/api.h:26,
>  from 
> /turbodbc/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp:6:
> /opt/conda/envs/arrow/include/arrow/array/builder_dict.h:162:43: note: 
> candidate: template arrow::enable_if_fixed_size_binary arrow::Status> arrow::internal::DictionaryBuilderBase 

[jira] [Assigned] (ARROW-7233) [C++] Add Result APIs to IPC module

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7233:
---

Assignee: Wes McKinney

> [C++] Add Result APIs to IPC module
> --
>
> Key: ARROW-7233
> URL: https://issues.apache.org/jira/browse/ARROW-7233
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Major
>
> src/arrow/ipc





[jira] [Resolved] (ARROW-8133) [CI] Github Actions sometimes fail to checkout Arrow

2020-03-16 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-8133.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6634
[https://github.com/apache/arrow/pull/6634]

> [CI] Github Actions sometimes fail to checkout Arrow
> 
>
> Key: ARROW-8133
> URL: https://issues.apache.org/jira/browse/ARROW-8133
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Failing build 
> https://github.com/apache/arrow/pull/6632/checks?check_run_id=511663097





[jira] [Resolved] (ARROW-7807) [R] Installation on RHEL 7 Cannot call io___MemoryMappedFile__Open()

2020-03-16 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7807.

Fix Version/s: 0.17.0
 Assignee: Neal Richardson
   Resolution: Fixed

Closing; please reopen if this is still an issue

> [R] Installation on RHEL 7 Cannot call io___MemoryMappedFile__Open()
> 
>
> Key: ARROW-7807
> URL: https://issues.apache.org/jira/browse/ARROW-7807
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.16.0
> Environment: RHEL 7.6
> Custom R build in a non-default location
>Reporter: Omar Yassin
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 0.17.0
>
>
> Hey Team,
> I've been using Arrow successfully in python through conda and have been able 
> to write and read parquet files successfully. Now I'm trying to have R users 
> consume some of the parquet files I've produced. They run a shared R build in 
> a custom location, so can't use conda with their setup. We tried installing 
> the C++ libraries system-wide and then the R library in a user's directory, 
> but kept getting {{Cannot call io___MemoryMappedFile__Open()}} errors on 
> {{read_parquet()}}. I'm not sure if we've 
> missed a step, or where to continue debugging. Does the R package have any 
> known issues on RHEL 7? Below are some details:
>  
> Environment:
>  * RHEL 7.6
>  * Custom local R environment in a non-default location
> Steps taken:
>  # Installed the C++ libraries first (now live in /usr/lib64) as described 
> (v.0.16.0) in [https://arrow.apache.org/install/]
>  # Ran {{install.packages('arrow')}} in an interactive R session
>  # It couldn't find the C++ libraries and said {{No C++ binaries found for rhel-7}}
>  # Couldn't find 
> [https://dl.bintray.com/ursalabs/arrow-r/libarrow/bin/rhel-7/arrow-0.16.0.zip]
>  when it tried to pull the binaries
>  # Source download didn't work due to internal github firewall rules
>  # Installed without errors, but threw {{Cannot call io___MemoryMappedFile__Open()}}
>  error on {{read_parquet()}}
>  # Removed the rlib/arrow directory and tried a different route
>  # Set {{LIBARROW_BINARY_DISTRO='centos-7'}}
>  # Set {{PKG_CONFIG=/usr/lib64}}
>  # Ran {{install.packages('arrow')}} in an interactive R session
>  # Binaries and package seemed to install correctly without complaints
>  # Still threw {{Cannot call io___MemoryMappedFile__Open()}} on {{read_parquet()}}





[jira] [Updated] (ARROW-8127) [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8127:
--
Labels: pull-request-available  (was: )

> [C++] [Parquet] Incorrect column chunk metadata for multipage batch writes
> --
>
> Key: ARROW-8127
> URL: https://issues.apache.org/jira/browse/ARROW-8127
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: TP Boudreau
>Assignee: TP Boudreau
>Priority: Minor
>  Labels: pull-request-available
> Attachments: multipage-batch-write.cc
>
>
> When writing to a buffered column writer using PLAIN encoding, if the size of 
> the batch supplied for writing exceeds the page size for the writer, the 
> resulting file has an incorrect data_page_offset set in its column chunk 
> metadata.  This causes an exception to be thrown when reading the file (file 
> appears to be too short to the reader).
> For example, the attached code, which attempts to write a batch of 262145 
> Int32's (= 1048576 + 4 bytes) using the default page size of 1048576 bytes 
> (with buffered writer, PLAIN encoding), fails on reading, throwing the error: 
> "Tried reading 1048678 bytes starting at position 1048633 from file but only 
> got 333".
> The error is caused by the second page write tripping the conditional here 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302,
>  in the serialized in-memory writer wrapped by the buffered writer.
> The fix builds the metadata with offsets from the terminal sink rather than 
> the in memory buffered sink.  A PR is coming.
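
The sizes quoted in the description can be checked with a few lines of arithmetic. The sketch assumes 4 bytes per Int32 value and ignores page headers and any encoding overhead. The variable names are illustrative only.

```python
INT32_BYTES = 4
page_size = 1048576   # default page size quoted in the description
num_values = 262145   # batch size from the attached example

batch_bytes = num_values * INT32_BYTES
overflow = batch_bytes - page_size
pages_needed = -(-batch_bytes // page_size)  # ceiling division

print(batch_bytes)   # 1048580
print(overflow)      # 4
print(pages_needed)  # 2
```

Under these assumptions the batch exceeds one page by a single Int32 value, so the write spills into a second page, which is the case the description says triggers the incorrect offset.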





[jira] [Updated] (ARROW-8027) [Developer][Integration] Add integration tests for duplicate field names

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8027:
--
Labels: pull-request-available  (was: )

> [Developer][Integration] Add integration tests for duplicate field names
> 
>
> Key: ARROW-8027
> URL: https://issues.apache.org/jira/browse/ARROW-8027
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Schemas and nested types whose fields' names are not unique are permitted, so 
> the integration tests should include a case which exercises these.





[jira] [Updated] (ARROW-8118) [R] dim method for FileSystemDataset

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8118:
--
Labels: features pull-request-available  (was: features)

> [R] dim method for FileSystemDataset
> 
>
> Key: ARROW-8118
> URL: https://issues.apache.org/jira/browse/ARROW-8118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Sam Albers
>Priority: Minor
>  Labels: features, pull-request-available
>
> I have been using this function enough that I wonder a) whether it would be useful in the 
> package and b) whether this is something you think is worth working on. The 
> basic problem is that if you have a hierarchical file structure that 
> accommodates using open_dataset, it is definitely useful to know the amount 
> of data you are dealing with. My idea is that 'FileSystemDataset' would have 
> dim, nrow and ncol methods. Here is how I've been using it:
> {code:r}
> library(arrow)
> x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))
> dim_arrow <- function(x) {
>   rows <- sum(purrr::map_dbl(x$files,
>     ~ParquetFileReader$create(.x)$ReadTable()$num_rows))
>   cols <- x$schema$num_fields
>
>   c(rows, cols)
> }
> dim_arrow(x)
> #> [1] 426929 7
> {code}
>  
> Ideally this would work on arrow_dplyr_query objects as well but I haven't 
> quite figured out how that filters based on the partitioning variables.





[jira] [Resolved] (ARROW-8132) [C++] arrow-s3fs-test failing on master

2020-03-16 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-8132.

Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6632
[https://github.com/apache/arrow/pull/6632]

> [C++] arrow-s3fs-test failing on master
> ---
>
> Key: ARROW-8132
> URL: https://issues.apache.org/jira/browse/ARROW-8132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Hatem Helal
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Log:
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917]





[jira] [Created] (ARROW-8134) [C++][CI] Revisit the flaky S3 tests caused by recent minio

2020-03-16 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8134:
--

 Summary: [C++][CI] Revisit the flaky S3 tests caused by recent 
minio 
 Key: ARROW-8134
 URL: https://issues.apache.org/jira/browse/ARROW-8134
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Continuous Integration
Reporter: Krisztian Szucs


See change in 
https://github.com/apache/arrow/pull/6632/files#diff-01448dc0a0e3217fd1579b81a915d593R843





[jira] [Updated] (ARROW-8133) [CI] Github Actions sometimes fail to checkout Arrow

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8133:
--
Labels: pull-request-available  (was: )

> [CI] Github Actions sometimes fail to checkout Arrow
> 
>
> Key: ARROW-8133
> URL: https://issues.apache.org/jira/browse/ARROW-8133
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> Failing build 
> https://github.com/apache/arrow/pull/6632/checks?check_run_id=511663097





[jira] [Created] (ARROW-8133) [CI] Github Actions sometimes fail to checkout Arrow

2020-03-16 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8133:
--

 Summary: [CI] Github Actions sometimes fail to checkout Arrow
 Key: ARROW-8133
 URL: https://issues.apache.org/jira/browse/ARROW-8133
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


Failing build 
https://github.com/apache/arrow/pull/6632/checks?check_run_id=511663097





[jira] [Assigned] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7365:
---

Assignee: Wes McKinney

> [Python] Support FixedSizeList type in conversion to numpy/pandas
> -
>
> Key: ARROW-7365
> URL: https://issues.apache.org/jira/browse/ARROW-7365
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.17.0
>
>
> Follow-up on ARROW-7261, still need to add support for FixedSizeListType in 
> the arrow -> python conversion (arrow_to_pandas.cc)





[jira] [Closed] (ARROW-2703) [C++] Always use statically-linked Boost with private namespace

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2703.
---
Resolution: Won't Fix

> [C++] Always use statically-linked Boost with private namespace
> ---
>
> Key: ARROW-2703
> URL: https://issues.apache.org/jira/browse/ARROW-2703
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We have recently added tooling to ship Python wheels with a bundled, private 
> Boost (using the bcp tool). We might consider statically-linking a private 
> Boost exclusively in libarrow (i.e. built via our thirdparty toolchain) to 
> avoid any conflicts with other libraries that may use a different version of 
> Boost





[jira] [Updated] (ARROW-2702) [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc to see if we are using the right error type in each instance

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2702:

Fix Version/s: 1.0.0

> [Python] Examine usages of Invalid and TypeError errors in numpy_to_arrow.cc 
> to see if we are using the right error type in each instance
> -
>
> Key: ARROW-2702
> URL: https://issues.apache.org/jira/browse/ARROW-2702
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See discussion in [https://github.com/apache/arrow/pull/2075]





[jira] [Updated] (ARROW-7390) [C++][Dataset] Concurrency race in Projector::Project

2020-03-16 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7390:
--
Fix Version/s: 0.17.0

> [C++][Dataset] Concurrency race in Projector::Project 
> --
>
> Key: ARROW-7390
> URL: https://issues.apache.org/jira/browse/ARROW-7390
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.17.0
>
>
> When a DataFragment is invoked by 2 scan tasks of the same DataFragment, 
> there's a race to invoke SetInputSchema. Note that ResizeMissingColumns also 
> suffers from this race. The ideal goal is to make Project a const method.
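
A minimal sketch of the "make Project a const method" goal, written in Python rather than C++ and not taken from the Arrow code base. `StatelessProjector` and its fields are hypothetical stand-ins. It assumes the race comes from a SetInputSchema-style setter caching per-input state on a shared object, and shows the alternative of computing that mapping on every call so that concurrent scan tasks share no mutable state.

```python
from concurrent.futures import ThreadPoolExecutor


class StatelessProjector:
    """Hypothetical projector whose project() never mutates the instance."""

    def __init__(self, output_fields):
        # Fixed at construction time and never modified afterwards.
        self._output_fields = tuple(output_fields)

    def project(self, input_fields, row):
        # The input-to-output mapping is built per call instead of being
        # stored by a setter, so two callers cannot overwrite each other.
        index = {name: i for i, name in enumerate(input_fields)}
        return tuple(
            row[index[name]] if name in index else None  # missing column -> None
            for name in self._output_fields
        )


def run_two_scan_tasks():
    """Project two differently shaped inputs concurrently with one projector."""
    projector = StatelessProjector(["a", "b", "c"])
    tasks = [(("a", "b"), (1, 2)), (("c", "a"), (30, 10))]
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(lambda task: projector.project(*task), tasks))
```

The per-call mapping costs a little extra work, but the method is then safe to call from any number of scan tasks without locking.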





[jira] [Assigned] (ARROW-7783) [C++] ARROW_DATASET should enable ARROW_COMPUTE

2020-03-16 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-7783:
-

Assignee: Francois Saint-Jacques

> [C++] ARROW_DATASET should enable ARROW_COMPUTE
> ---
>
> Key: ARROW-7783
> URL: https://issues.apache.org/jira/browse/ARROW-7783
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.17.0
>
>
> Currently, passing {{-DARROW_DATASET=ON}} to CMake doesn't enable 
> ARROW_COMPUTE, which leads to linker errors.





[jira] [Closed] (ARROW-2621) [Python/CI] Use pep8speaks for Python PRs

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2621.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Won't Fix

The current lint GHA task provides pretty quick feedback. If we find that the 
feedback isn't fast enough in the future we can look for a way to bubble lint 
failures into PR comments

> [Python/CI] Use pep8speaks for Python PRs
> -
>
> Key: ARROW-2621
> URL: https://issues.apache.org/jira/browse/ARROW-2621
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration, Python
>Reporter: Uwe Korn
>Priority: Minor
>  Labels: beginner
>
> It would be nice if we got automated comments from 
> [https://pep8speaks.com/] on the Python PRs. This should be much more 
> readable than the current `flake8` output in the Travis logs. This issue is 
> split up into two tasks:
>  * Create an issue with INFRA kindly asking them to activate pep8speaks 
> for Arrow
>  * Set up {{.pep8speaks.yml}} to align with our {{flake8}} config. For 
> reference, see Pandas' config: 
> [https://github.com/pandas-dev/pandas/blob/master/.pep8speaks.yml] 





[jira] [Updated] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2628:

Labels: dataset parquet  (was: parquet)

> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> --
>
> Key: ARROW-2628
> URL: https://issues.apache.org/jira/browse/ARROW-2628
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should 
> consider strategies for writing very large tables to a partitioned directory 
> scheme. 
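
One possible strategy, sketched with the Python standard library only: stream the table in batches and append each batch to its partition directory, so that memory use is bounded by the batch size rather than the table size. This is an illustration under assumptions, not what `parquet.write_to_dataset` does; `write_partitioned` and `demo` are hypothetical names, CSV stands in for Parquet, and the partition column is kept in the files for simplicity.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_partitioned(batches, root, partition_key):
    """Append each batch of dict rows to root/<key>=<value>/part.csv."""
    for batch in batches:
        # Group only the current batch; earlier batches are already on disk.
        groups = defaultdict(list)
        for row in batch:
            groups[row[partition_key]].append(row)
        for value, rows in groups.items():
            directory = os.path.join(root, f"{partition_key}={value}")
            os.makedirs(directory, exist_ok=True)
            path = os.path.join(directory, "part.csv")
            is_new = not os.path.exists(path)
            with open(path, "a", newline="") as handle:
                writer = csv.DictWriter(handle, fieldnames=sorted(rows[0]))
                if is_new:
                    writer.writeheader()  # header only once per partition file
                writer.writerows(rows)


def demo():
    """Write two small batches and return the lines of the year=2020 partition."""
    batches = [
        [{"year": 2019, "v": 1}, {"year": 2020, "v": 2}],
        [{"year": 2020, "v": 3}],
    ]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(batches, root, "year")
        with open(os.path.join(root, "year=2020", "part.csv")) as handle:
            return handle.read().splitlines()
```

Appending to one file per partition keeps the sketch short; a real writer would more likely emit one new file per batch and partition.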





[jira] [Commented] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

2020-03-16 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060363#comment-17060363
 ] 

Wes McKinney commented on ARROW-2628:
-

cc [~jorisvandenbossche] -- this will be an important dataset use case

> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> --
>
> Key: ARROW-2628
> URL: https://issues.apache.org/jira/browse/ARROW-2628
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should 
> consider strategies for writing very large tables to a partitioned directory 
> scheme. 





[jira] [Updated] (ARROW-5336) [C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal dictionaries

2020-03-16 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-5336:

Fix Version/s: (was: 0.17.0)
   1.0.0

> [C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal 
> dictionaries
> --
>
> Key: ARROW-5336
> URL: https://issues.apache.org/jira/browse/ARROW-5336
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently (as of ARROW-3144) if any dictionary is different, an error is 
> returned





[jira] [Commented] (ARROW-5336) [C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal dictionaries

2020-03-16 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060349#comment-17060349
 ] 

Ben Kietzman commented on ARROW-5336:
-

[~wesm] would it be acceptable to require that if two dictionaries differ then 
one must be a prefix of the other?

> [C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal 
> dictionaries
> --
>
> Key: ARROW-5336
> URL: https://issues.apache.org/jira/browse/ARROW-5336
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 0.17.0
>
>
> Currently (as of ARROW-3144) if any dictionary is different, an error is 
> returned
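
To make the two options concrete, here is a small Python sketch (plain lists, not the Arrow C++ API; both function names are hypothetical). `is_prefix` expresses the restricted rule asked about in the comment, while `concatenate_dictionary_encoded` shows the general alternative of unifying the dictionaries and remapping every index.

```python
def is_prefix(shorter, longer):
    """True if `shorter` equals the leading entries of `longer`."""
    return len(shorter) <= len(longer) and list(longer[:len(shorter)]) == list(shorter)


def concatenate_dictionary_encoded(chunks):
    """Concatenate (dictionary, indices) pairs into one unified pair.

    A None index stands for a null slot and is carried over unchanged.
    """
    unified = []        # merged dictionary, in first-seen order
    position = {}       # value -> index in `unified`
    out_indices = []
    for dictionary, indices in chunks:
        # Map this chunk's local dictionary positions to unified positions.
        remap = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            remap.append(position[value])
        out_indices.extend(None if i is None else remap[i] for i in indices)
    return unified, out_indices
```

Under the prefix rule no remapping is needed at all, since the indices of the shorter dictionary are already valid in the longer one; the general version pays for a lookup per dictionary entry and a rewrite of every index.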





[jira] [Commented] (ARROW-8070) [Python] Array.cast segfaults on unsupported cast from list to utf8

2020-03-16 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060317#comment-17060317
 ] 

Neal Richardson commented on ARROW-8070:


[~kszucs] see also https://issues.apache.org/jira/browse/ARROW-8025

> [Python] Array.cast segfaults on unsupported cast from list to utf8
> ---
>
> Key: ARROW-8070
> URL: https://issues.apache.org/jira/browse/ARROW-8070
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Daniel Nugent
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.17.0
>
>
> Was messing around with some nested arrays and found a pretty easy to 
> reproduce segfault:
> {code:java}
> Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48)
> [GCC 7.3.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np, pyarrow as pa
> >>> pa.__version__
> '0.16.0'
> >>> np.__version__
> '1.18.1'
> >>> x=[np.array([b'a',b'b'])]
> >>> a = pa.array(x,pa.list_(pa.binary()))
> >>> a
> 
> [
>   [
> 61,
> 62
>   ]
> ]
> >>> a.cast(pa.string())
> Segmentation fault
> {code}
> I don't know if that cast makes sense, but I left the checks on, so I would 
> not expect a segfault from it.





[jira] [Assigned] (ARROW-8070) [Python] Array.cast segfaults on unsupported cast from list to utf8

2020-03-16 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-8070:
--

Assignee: Krisztian Szucs

> [Python] Array.cast segfaults on unsupported cast from list to utf8
> ---
>
> Key: ARROW-8070
> URL: https://issues.apache.org/jira/browse/ARROW-8070
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Daniel Nugent
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 0.17.0
>
>
> Was messing around with some nested arrays and found a pretty easy to 
> reproduce segfault:
> {code:java}
> Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48)
> [GCC 7.3.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np, pyarrow as pa
> >>> pa.__version__
> '0.16.0'
> >>> np.__version__
> '1.18.1'
> >>> x=[np.array([b'a',b'b'])]
> >>> a = pa.array(x,pa.list_(pa.binary()))
> >>> a
> 
> [
>   [
> 61,
> 62
>   ]
> ]
> >>> a.cast(pa.string())
> Segmentation fault
> {code}
> I don't know if that cast makes sense, but I left the checks on, so I would 
> not expect a segfault from it.





[jira] [Assigned] (ARROW-8092) [CI][Crossbow] OSX wheels fail on bundled bzip2

2020-03-16 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-8092:
--

Assignee: Krisztian Szucs

> [CI][Crossbow] OSX wheels fail on bundled bzip2
> ---
>
> Key: ARROW-8092
> URL: https://issues.apache.org/jira/browse/ARROW-8092
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Packaging, Python
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> See e.g. 
> [https://travis-ci.org/github/ursa-labs/crossbow/builds/661245916#L6104]
> [https://travis-ci.org/github/ursa-labs/crossbow/builds/661246751#L6103]
>  





[jira] [Updated] (ARROW-8124) [Rust] Update library dependencies

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8124:

Summary: [Rust] Update library dependencies  (was: Update library 
dependencies)

> [Rust] Update library dependencies
> --
>
> Key: ARROW-8124
> URL: https://issues.apache.org/jira/browse/ARROW-8124
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Bryant Biggs
>Assignee: Bryant Biggs
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Update rust library dependencies to the latest - except for thrift and 
> sqlparser which require additional work





[jira] [Resolved] (ARROW-8124) Update library dependencies

2020-03-16 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-8124.
---
Resolution: Fixed

Issue resolved by pull request 6626
[https://github.com/apache/arrow/pull/6626]

> Update library dependencies
> ---
>
> Key: ARROW-8124
> URL: https://issues.apache.org/jira/browse/ARROW-8124
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Bryant Biggs
>Assignee: Bryant Biggs
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Update rust library dependencies to the latest - except for thrift and 
> sqlparser which require additional work





[jira] [Assigned] (ARROW-8124) Update library dependencies

2020-03-16 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-8124:
-

Assignee: Bryant Biggs

> Update library dependencies
> ---
>
> Key: ARROW-8124
> URL: https://issues.apache.org/jira/browse/ARROW-8124
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Rust
>Reporter: Bryant Biggs
>Assignee: Bryant Biggs
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Update rust library dependencies to the latest - except for thrift and 
> sqlparser which require additional work





[jira] [Updated] (ARROW-8131) [Python] Add dynamic attributes to PyArrow ExtensionArray

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8131:
--
Labels: Python3 pull-request-available  (was: Python3)

> [Python] Add dynamic attributes to PyArrow ExtensionArray
> -
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
>Reporter: Paul Balanca
>Priority: Major
>  Labels: Python3, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the present implementation, the interface of the class `ExtensionArray` is 
> not extendable by users. One cannot easily inherit from it, as the 
> constructor __init__ cannot be called directly, nor does it allow adding 
> attributes dynamically.
> Keeping the current design with build methods `from_*`, I believe it could 
> then make sense to allow dynamic attributes in `ExtensionArray` (see 
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
>  The runtime & size cost of the Python objects would be fairly minimal, 
> compared to increased flexibility it would allow.
> A typical use case where it could be useful would be dynamic mixins (added by 
> custom Factory), allowing projects based on PyArrow to extend (! :)) the 
> interface with specific business logic. 
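
The difference being requested can be illustrated in pure Python. This is an analogy under assumptions, not PyArrow or Cython code: a class with `__slots__` and no `__dict__` behaves like an extension type without dynamic attributes, and adding `"__dict__"` to the slots plays the role of the `__dict__` declaration described on the linked Cython page. All class and function names below are hypothetical.

```python
class FixedLayoutArray:
    """No per-instance __dict__: unknown attributes cannot be attached."""

    __slots__ = ("storage",)

    def __init__(self, storage):
        self.storage = list(storage)


class DynamicArray:
    """Listing "__dict__" in __slots__ gives each instance an attribute dict."""

    __slots__ = ("storage", "__dict__")

    def __init__(self, storage):
        self.storage = list(storage)


def try_attach(array, name, value):
    """Return True if the attribute could be attached at runtime."""
    try:
        setattr(array, name, value)
    except AttributeError:
        return False
    return True
```

The cost is one extra dictionary slot per object, which matches the issue's expectation that the runtime and size overhead would be small.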





[jira] [Updated] (ARROW-7858) [C++][Python] Support casting an Extension type to its storage type

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7858:
--
Labels: pull-request-available  (was: )

> [C++][Python] Support casting an Extension type to its storage type
> ---
>
> Key: ARROW-7858
> URL: https://issues.apache.org/jira/browse/ARROW-7858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> Currently, casting an extension type will always fail: "No cast implemented 
> from extension to ...".
> However, for casting, we could fall back to the storage array's casting rules?
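
A rough sketch of the proposed fallback, in plain Python with hypothetical names (`ExtensionColumn`, `STORAGE_CASTS`, `cast_extension`); it does not use or describe the real Arrow cast kernels. The assumption is that an extension column is a thin wrapper around storage values, so a cast can be delegated to whatever rules exist for the storage type.

```python
class ExtensionColumn:
    """Hypothetical extension column: a named wrapper around storage values."""

    def __init__(self, extension_name, storage_type, storage):
        self.extension_name = extension_name
        self.storage_type = storage_type
        self.storage = list(storage)


# Hypothetical storage-level casting rules, keyed by (from_type, to_type).
STORAGE_CASTS = {
    ("int32", "int64"): int,
    ("int32", "string"): str,
}


def cast_storage(values, from_type, to_type):
    """Cast plain storage values, raising TypeError for unknown type pairs."""
    if from_type == to_type:
        return list(values)
    try:
        convert = STORAGE_CASTS[(from_type, to_type)]
    except KeyError:
        raise TypeError(f"No cast implemented from {from_type} to {to_type}") from None
    return [None if v is None else convert(v) for v in values]


def cast_extension(column, to_type):
    """Delegate to the storage type's rules instead of rejecting the cast."""
    return cast_storage(column.storage, column.storage_type, to_type)
```

Casting to the storage type itself is then just the identity case, and anything the storage type cannot be cast to still fails with an explicit error.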





[jira] [Assigned] (ARROW-7858) [C++][Python] Support casting an Extension type to its storage type

2020-03-16 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-7858:
--

Assignee: Krisztian Szucs

> [C++][Python] Support casting an Extension type to its storage type
> ---
>
> Key: ARROW-7858
> URL: https://issues.apache.org/jira/browse/ARROW-7858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Krisztian Szucs
>Priority: Major
>
> Currently, casting an extension type will always fail: "No cast implemented 
> from extension to ...".
> However, for casting, we could fall back to the storage array's casting rules?





[jira] [Updated] (ARROW-8131) [Python] Add dynamic attributes to PyArrow ExtensionArray

2020-03-16 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8131:

Summary: [Python] Add dynamic attributes to PyArrow ExtensionArray  (was: 
Add dynamic attributes to PyArrow ExtensionArray)

> [Python] Add dynamic attributes to PyArrow ExtensionArray
> -
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
>Reporter: Paul Balanca
>Priority: Major
>  Labels: Python3
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the present implementation, the interface of the class `ExtensionArray` is 
> not extendable by users. One cannot easily inherit from it, as the 
> constructor __init__ cannot be called directly, nor does it allow adding 
> attributes dynamically.
> Keeping the current design with build methods `from_*`, I believe it could 
> then make sense to allow dynamic attributes in `ExtensionArray` (see 
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
>  The runtime & size cost of the Python objects would be fairly minimal, 
> compared to increased flexibility it would allow.
> A typical use case where it could be useful would be dynamic mixins (added by 
> custom Factory), allowing projects based on PyArrow to extend (! :)) the 
> interface with specific business logic. 





[jira] [Commented] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2020-03-16 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060240#comment-17060240
 ] 

Wes McKinney commented on ARROW-7365:
-

[~balancap] what you're describing may be a bit outside the scope of this 
particular issue. 

In https://issues.apache.org/jira/browse/ARROW-1614 and elsewhere we have 
discussed allowing ndarray values to be embedded in Arrow array cells -- the 
extension type facility would be the ideal way to get this functionality 
bootstrapped. Let's discuss more there

> [Python] Support FixedSizeList type in conversion to numpy/pandas
> -
>
> Key: ARROW-7365
> URL: https://issues.apache.org/jira/browse/ARROW-7365
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 0.17.0
>
>
> Follow-up on ARROW-7261, still need to add support for FixedSizeListType in 
> the arrow -> python conversion (arrow_to_pandas.cc)





[jira] [Updated] (ARROW-8132) [C++] arrow-s3fs-test failing on master

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8132:
--
Labels: pull-request-available  (was: )

> [C++] arrow-s3fs-test failing on master
> ---
>
> Key: ARROW-8132
> URL: https://issues.apache.org/jira/browse/ARROW-8132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Hatem Helal
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Log:
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917]





[jira] [Assigned] (ARROW-8132) [C++] arrow-s3fs-test failing on master

2020-03-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-8132:
-

Assignee: Antoine Pitrou

> [C++] arrow-s3fs-test failing on master
> ---
>
> Key: ARROW-8132
> URL: https://issues.apache.org/jira/browse/ARROW-8132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Hatem Helal
>Assignee: Antoine Pitrou
>Priority: Major
>
> Log:
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917]





[jira] [Commented] (ARROW-8132) [C++] arrow-s3fs-test failing on master

2020-03-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060167#comment-17060167
 ] 

Antoine Pitrou commented on ARROW-8132:
---

Yes, saw this. It looks like this is a behaviour change in a recent 
[Minio|https://min.io/] version.

> [C++] arrow-s3fs-test failing on master
> ---
>
> Key: ARROW-8132
> URL: https://issues.apache.org/jira/browse/ARROW-8132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Hatem Helal
>Priority: Major
>
> Log:
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917]





[jira] [Updated] (ARROW-8132) [C++] arrow-s3fs-test failing on master

2020-03-16 Thread Hatem Helal (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hatem Helal updated ARROW-8132:
---
Issue Type: Bug  (was: Improvement)

> [C++] arrow-s3fs-test failing on master
> ---
>
> Key: ARROW-8132
> URL: https://issues.apache.org/jira/browse/ARROW-8132
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Hatem Helal
>Priority: Major
>
> Log:
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917]





[jira] [Updated] (ARROW-8132) [C++] arrow-s3fs-test failing on master

2020-03-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-8132:
--
Component/s: C++

> [C++] arrow-s3fs-test failing on master
> ---
>
> Key: ARROW-8132
> URL: https://issues.apache.org/jira/browse/ARROW-8132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Hatem Helal
>Priority: Major
>
> Log:
> [https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917]





[jira] [Created] (ARROW-8132) [C++] arrow-s3fs-test failing on master

2020-03-16 Thread Hatem Helal (Jira)
Hatem Helal created ARROW-8132:
--

 Summary: [C++] arrow-s3fs-test failing on master
 Key: ARROW-8132
 URL: https://issues.apache.org/jira/browse/ARROW-8132
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Hatem Helal


Log:

[https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/branch/master/job/9dgr7xl635yuwh7y#L1917]





[jira] [Updated] (ARROW-8111) [C++][CSV] Support MM/DD/YYYY date format

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8111:
--
Labels: pull-request-available  (was: )

> [C++][CSV] Support MM/DD/YYYY date format
> -
>
> Key: ARROW-8111
> URL: https://issues.apache.org/jira/browse/ARROW-8111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Assignee: Artem Alekseev
>Priority: Major
>  Labels: pull-request-available
>
> Currently, date parser supports only YYYY-MM-DD format. For our workload we 
> need MM/DD/YYYY format. It is obvious that CSV parser should support 
> different date formats, so we may start from implementing MM/DD/YYYY format. 
> Also, we may use some date parsing library, which would solve the problem for 
> us.
> Also, we may need to somehow specify a format for every column in CSV parser. 
> If you have any implementation ideas in mind, please share, so that I can 
> implement it. 
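
As one possible shape for the per-column format idea, here is a short Python sketch. It uses the standard library's `datetime.strptime` as a stand-in for "some date parsing library"; `parse_date`, `parse_row`, and the `column_formats` mapping are hypothetical names and say nothing about how the Arrow CSV reader is or will be implemented.

```python
from datetime import date, datetime

ISO_FORMAT = "%Y-%m-%d"  # YYYY-MM-DD
US_FORMAT = "%m/%d/%Y"   # MM/DD/YYYY
EU_FORMAT = "%d/%m/%Y"   # DD/MM/YYYY


def parse_date(text, fmt=ISO_FORMAT):
    """Parse one cell with an explicit format; raises ValueError on mismatch."""
    return datetime.strptime(text, fmt).date()


def parse_row(row, column_formats):
    """Parse only the columns listed in column_formats (column index -> format)."""
    return [
        parse_date(cell, column_formats[i]) if i in column_formats else cell
        for i, cell in enumerate(row)
    ]
```

A value such as "03/04/2020" is valid in both slash formats but means 4 March in the US form and 3 April in the EU form, which is why an explicit format per column seems safer than trying to guess it from the data.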





[jira] [Updated] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray

2020-03-16 Thread Paul Balanca (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Balanca updated ARROW-8131:

Labels: Python3  (was: pull-request-available)

> Add dynamic attributes to PyArrow ExtensionArray
> 
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
>Reporter: Paul Balanca
>Priority: Major
>  Labels: Python3
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the present implementation, the interface of the class `ExtensionArray` is 
> not extendable by users. One cannot easily inherit from it, as the 
> constructor __init__ cannot be called directly, nor does it allow adding 
> attributes dynamically.
> Keeping the current design with build methods `from_*`, I believe it could 
> then make sense to allow dynamic attributes in `ExtensionArray` (see 
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
>  The runtime & size cost of the Python objects would be fairly minimal, 
> compared to increased flexibility it would allow.
> A typical use case where it could be useful would be dynamic mixins (added by 
> custom Factory), allowing projects based on PyArrow to extend (! :)) the 
> interface with specific business logic. 





[jira] [Updated] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8131:
--
Labels: pull-request-available  (was: )

> Add dynamic attributes to PyArrow ExtensionArray
> 
>
> Key: ARROW-8131
> URL: https://issues.apache.org/jira/browse/ARROW-8131
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + Python 3.7
>Reporter: Paul Balanca
>Priority: Major
>  Labels: pull-request-available
>
> In the present implementation, the interface of the class `ExtensionArray` is 
> not extendable by users. One cannot easily inherit from it, as the 
> constructor __init__ cannot be called directly, nor does it allow adding 
> attributes dynamically.
> Keeping the current design with build methods `from_*`, I believe it could 
> then make sense to allow dynamic attributes in `ExtensionArray` (see 
> [https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
>  The runtime & size cost of the Python objects would be fairly minimal, 
> compared to increased flexibility it would allow.
> A typical use case where it could be useful would be dynamic mixins (added by 
> custom Factory), allowing projects based on PyArrow to extend (! :)) the 
> interface with specific business logic. 





[jira] [Comment Edited] (ARROW-8111) [C++][CSV] Support MM/DD/YYYY date format

2020-03-16 Thread Artem Alekseev (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060082#comment-17060082
 ] 

Artem Alekseev edited comment on ARROW-8111 at 3/16/20, 10:01 AM:
--

Oh, I found that we actually needed US MM/DD/YYYY format, so I will rename the 
issue :) 
Also, to disambiguate US and EU formats, we can add an explicit locale param to 
the parser.


was (Author: fexolm):
Oh, I found that we actually needed US MM/DD/YYYY format, so I will rename the 
issue :) 

> [C++][CSV] Support MM/DD/YYYY date format
> -
>
> Key: ARROW-8111
> URL: https://issues.apache.org/jira/browse/ARROW-8111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Assignee: Artem Alekseev
>Priority: Major
>
> Currently, date parser supports only YYYY-MM-DD format. For our workload we 
> need MM/DD/YYYY format. It is obvious that CSV parser should support 
> different date formats, so we may start from implementing MM/DD/YYYY format. 
> Also, we may use some date parsing library, which would solve the problem for 
> us.
> Also, we may need to somehow specify a format for every column in CSV parser. 
> If you have any implementation ideas in mind, please share, so that I can 
> implement it. 





[jira] [Updated] (ARROW-8111) [C++][CSV] Support MM/DD/YYYY date format

2020-03-16 Thread Artem Alekseev (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Alekseev updated ARROW-8111:
--
Description: 
Currently, date parser supports only YYYY-MM-DD format. For our workload we 
need MM/DD/YYYY format. It is obvious that CSV parser should support different 
date formats, so we may start from implementing MM/DD/YYYY format. 

Also, we may use some date parsing library, which would solve the problem for 
us.

Also, we may need to somehow specify a format for every column in CSV parser. 

If you have any implementation ideas in mind, please share, so that I can 
implement it. 

  was:
Currently, date parser supports only YYYY-MM-DD format. For our workload we 
need DD/MM/YYYY format. It is obvious that CSV parser should support different 
date formats, so we may start from implementing DD/MM/YYYY format. 

Also, we may use some date parsing library, which would solve the problem for 
us.

Also, we may need to somehow specify a format for every column in CSV parser. 

If you have any implementation ideas in mind, please share, so that I can 
implement it. 


> [C++][CSV] Support MM/DD/YYYY date format
> -
>
> Key: ARROW-8111
> URL: https://issues.apache.org/jira/browse/ARROW-8111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Assignee: Artem Alekseev
>Priority: Major
>
> Currently, date parser supports only YYYY-MM-DD format. For our workload we 
> need MM/DD/YYYY format. It is obvious that CSV parser should support 
> different date formats, so we may start from implementing MM/DD/YYYY format. 
> Also, we may use some date parsing library, which would solve the problem for 
> us.
> Also, we may need to somehow specify a format for every column in CSV parser. 
> If you have any implementation ideas in mind, please share, so that I can 
> implement it. 





[jira] [Updated] (ARROW-8111) [C++][CSV] Support MM/DD/YYYY date format

2020-03-16 Thread Artem Alekseev (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Alekseev updated ARROW-8111:
--
Summary: [C++][CSV] Support MM/DD/YYYY date format  (was: [C++][CSV] 
Support DD/MM/YYYY date format)

> [C++][CSV] Support MM/DD/YYYY date format
> -
>
> Key: ARROW-8111
> URL: https://issues.apache.org/jira/browse/ARROW-8111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Priority: Major
>
> Currently, date parser supports only YYYY-MM-DD format. For our workload we 
> need DD/MM/YYYY format. It is obvious that CSV parser should support 
> different date formats, so we may start from implementing DD/MM/YYYY format. 
> Also, we may use some date parsing library, which would solve the problem for 
> us.
> Also, we may need to somehow specify a format for every column in CSV parser. 
> If you have any implementation ideas in mind, please share, so that I can 
> implement it. 





[jira] [Created] (ARROW-8131) Add dynamic attributes to PyArrow ExtensionArray

2020-03-16 Thread Paul Balanca (Jira)
Paul Balanca created ARROW-8131:
---

 Summary: Add dynamic attributes to PyArrow ExtensionArray
 Key: ARROW-8131
 URL: https://issues.apache.org/jira/browse/ARROW-8131
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.16.0
 Environment: Ubuntu 19.10 + Python 3.7
Reporter: Paul Balanca


In the present implementation, the interface of the class `ExtensionArray` 
cannot be extended by users. One cannot easily inherit from it, as the 
constructor `__init__` cannot be called directly, and it does not allow adding 
attributes dynamically.

Keeping the current design with build methods `from_*`, I believe it would 
make sense to allow dynamic attributes in `ExtensionArray` (see 
[https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#dynamic-attributes]).
 The runtime & size cost for the Python objects would be fairly minimal 
compared to the increased flexibility it would allow.

A typical use case where it could be useful would be dynamic mixins (added by 
custom Factory), allowing projects based on PyArrow to extend (! :)) the 
interface with specific business logic. 
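
To illustrate what dynamic attributes mean in practice, here is a minimal 
pure-Python analogy; it is not PyArrow code. A class that declares `__slots__` 
has no per-instance `__dict__`, which is similar to how a Cython extension type 
behaves by default. The class names `FixedArray` and `DynamicArray` are 
hypothetical and exist only for this sketch.

```python
class FixedArray:
    """No per-instance __dict__, so new attributes cannot be attached."""
    __slots__ = ("values",)

    def __init__(self, values):
        self.values = values


class DynamicArray(FixedArray):
    """Subclass without __slots__: instances get a __dict__ again."""


fixed = FixedArray([1, 2, 3])
try:
    fixed.unit = "m"          # rejected: 'unit' is not a declared slot
    fixed_accepts_attr = True
except AttributeError:
    fixed_accepts_attr = False

dynamic = DynamicArray([1, 2, 3])
dynamic.unit = "m"            # accepted: stored in the instance __dict__
```

The second class accepts arbitrary attributes at the cost of one dictionary per 
instance, which is the trade-off the issue describes as fairly minimal.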





[jira] [Assigned] (ARROW-8111) [C++][CSV] Support MM/DD/YYYY date format

2020-03-16 Thread Artem Alekseev (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Alekseev reassigned ARROW-8111:
-

Assignee: Artem Alekseev

> [C++][CSV] Support MM/DD/YYYY date format
> -
>
> Key: ARROW-8111
> URL: https://issues.apache.org/jira/browse/ARROW-8111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Assignee: Artem Alekseev
>Priority: Major
>
> Currently, date parser supports only YYYY-MM-DD format. For our workload we 
> need DD/MM/YYYY format. It is obvious that CSV parser should support 
> different date formats, so we may start from implementing DD/MM/YYYY format. 
> Also, we may use some date parsing library, which would solve the problem for 
> us.
> Also, we may need to somehow specify a format for every column in CSV parser. 
> If you have any implementation ideas in mind, please share, so that I can 
> implement it. 





[jira] [Commented] (ARROW-8111) [C++][CSV] Support DD/MM/YYYY date format

2020-03-16 Thread Artem Alekseev (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060082#comment-17060082
 ] 

Artem Alekseev commented on ARROW-8111:
---

Oh, I found that we actually need the US MM/DD/YYYY format, so I will rename 
the issue :) 
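
As a rough illustration of the per-column format idea from the issue 
description, here is a minimal Python sketch; it is not Arrow's C++ CSV reader. 
The `COLUMN_DATE_FORMATS` table and the `parse_date` helper are hypothetical 
names made up for this example. Only `datetime.strptime` from the Python 
standard library is a real API.

```python
from datetime import date, datetime

# Hypothetical per-column format table: column name -> strptime format.
COLUMN_DATE_FORMATS = {
    "iso_date": "%Y-%m-%d",  # YYYY-MM-DD, the format supported today
    "us_date": "%m/%d/%Y",   # MM/DD/YYYY, the format requested in this issue
}


def parse_date(column: str, text: str) -> date:
    """Parse one CSV cell using the format registered for its column."""
    fmt = COLUMN_DATE_FORMATS[column]
    return datetime.strptime(text, fmt).date()


row = {"iso_date": "2020-03-16", "us_date": "03/16/2020"}
parsed = {name: parse_date(name, value) for name, value in row.items()}
```

An explicit per-column format matters because a value such as 03/04/2020 is 
March 4 under MM/DD/YYYY but April 3 under DD/MM/YYYY, so the format cannot be 
inferred reliably from the data.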

> [C++][CSV] Support DD/MM/YYYY date format
> -
>
> Key: ARROW-8111
> URL: https://issues.apache.org/jira/browse/ARROW-8111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Priority: Major
>
> Currently, date parser supports only YYYY-MM-DD format. For our workload we 
> need DD/MM/YYYY format. It is obvious that CSV parser should support 
> different date formats, so we may start from implementing DD/MM/YYYY format. 
> Also, we may use some date parsing library, which would solve the problem for 
> us.
> Also, we may need to somehow specify a format for every column in CSV parser. 
> If you have any implementation ideas in mind, please share, so that I can 
> implement it. 





[jira] [Commented] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2020-03-16 Thread Paul Balanca (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060071#comment-17060071
 ] 

Paul Balanca commented on ARROW-7365:
-

If I may, I would like to continue the discussion raised in ARROW-8010.

I believe there is a use case for FixedSizeList arrays to be convertible to 
two-dimensional Numpy arrays (or even multi-dimensional ones). There are many 
applications where one wants to store small vectors/matrices with known static 
dimensions (e.g. a 3D vector or a 3D affine transform). The fixed-size Arrow 
column format is ideal for that purpose, and it allows writing 
high-performance code on top of this kind of storage.

To write this kind of high-performance pipeline based on PyArrow, one needs to 
be able to extract the full 2D Numpy array from the PyArrow object. 
Technically, it is possible, as shown by the small example in ARROW-8010, but 
it would probably be valuable for this to be part of the official API.

Is `to_numpy` the right place to implement it? I am not sure; I probably 
don't have enough insight into this project to have a good opinion. But I 
believe there are numerous pure Numpy computation pipelines based on PyArrow 
in-memory storage which would benefit from a "closer to metal" Numpy API, 
independent of the Pandas-like series representation.
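
The row-major mapping behind this request can be sketched with the Python 
standard library alone. This is only an analogy for the 
flatten-then-reshape workaround from ARROW-8010: it uses `array` and 
`memoryview` in place of PyArrow and Numpy, and the variable names are 
assumptions made for the example.

```python
from array import array

list_size = 2
# Stand-in for the flattened child values of a fixed-size list column (float32).
flat = array("f", [1, 2, 3, 4, 5, 6])

# Reinterpret the same buffer as a (rows, list_size) row-major matrix.
# memoryview.cast needs a byte view as an intermediate step.
rows = len(flat) // list_size
matrix = memoryview(flat).cast("B").cast("f", shape=[rows, list_size])

as_lists = matrix.tolist()
```

No values are copied when the view is created: element `(i, j)` of the matrix 
is simply element `i * list_size + j` of the flat buffer, which is why a 
fixed-size list column maps naturally onto a 2D array.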

> [Python] Support FixedSizeList type in conversion to numpy/pandas
> -
>
> Key: ARROW-7365
> URL: https://issues.apache.org/jira/browse/ARROW-7365
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 0.17.0
>
>
> Follow-up on ARROW-7261, still need to add support for FixedSizeListType in 
> the arrow -> python conversion (arrow_to_pandas.cc)





[jira] [Commented] (ARROW-8010) [Python] Fixed size list not convertible to Numpy Array / pandas Series

2020-03-16 Thread Paul Balanca (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060064#comment-17060064
 ] 

Paul Balanca commented on ARROW-8010:
-

Thanks for the quick answer. Sorry, I did not notice at first that this issue 
already existed.

> [Python] Fixed size list not convertible to Numpy Array / pandas Series
> ---
>
> Key: ARROW-8010
> URL: https://issues.apache.org/jira/browse/ARROW-8010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.16.0
> Environment: Ubuntu 19.10 + python 3.7
>Reporter: Paul Balanca
>Priority: Major
>
> Fixed size lists of base types (e.g. int, float, ...) are not convertible to 
> a Numpy array.
> The following code:
> {code:java}
> import pyarrow as pa
> t = pa.list_(pa.float32(), 2)
> arr = pa.array([[1, 2], [3, 4], [5, 6]], type=t)
> arr.to_numpy(){code}
> raises a not implemented Arrow error as there is no Pandas block equivalent.
> It sounds reasonable that the conversion to Pandas fails, but I would expect 
> a natural conversion to Numpy Array, as according to the Fixed Size List 
> Layout ([https://arrow.apache.org/docs/format/Columnar.html#]), the former 
> could be mapped to a 2-dimensional row major matrix (e.g. 3x2 in the previous 
> example).
> Note that we can get the expected result with a workaround using flatten:
> {code:java}
> arr.flatten().to_numpy().reshape((-1, t.list_size)){code}
> This form of memory representation is quite natural if one wants to use 
> Apache Arrow for in-memory collection of 2D/3D points, where we wish to have 
> coordinates contiguous in memory.





[jira] [Updated] (ARROW-8130) [C++][Gandiva] Fix Dex visitor in llvm_generator to handle interval type

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8130:
--
Labels: pull-request-available  (was: )

> [C++][Gandiva] Fix Dex visitor in llvm_generator to handle interval type
> 
>
> Key: ARROW-8130
> URL: https://issues.apache.org/jira/browse/ARROW-8130
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (ARROW-8130) [C++][Gandiva] Fix Dex visitor in llvm_generator to handle interval type

2020-03-16 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-8130:

Summary: [C++][Gandiva] Fix Dex visitor in llvm_generator to handle 
interval type  (was: [C++][Gandiva] Fix dexVisitor)

> [C++][Gandiva] Fix Dex visitor in llvm_generator to handle interval type
> 
>
> Key: ARROW-8130
> URL: https://issues.apache.org/jira/browse/ARROW-8130
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Prudhvi Porandla
>Priority: Major
>






[jira] [Assigned] (ARROW-8130) [C++][Gandiva] Fix Dex visitor in llvm_generator to handle interval type

2020-03-16 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla reassigned ARROW-8130:
---

Assignee: Prudhvi Porandla

> [C++][Gandiva] Fix Dex visitor in llvm_generator to handle interval type
> 
>
> Key: ARROW-8130
> URL: https://issues.apache.org/jira/browse/ARROW-8130
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Major
>






[jira] [Created] (ARROW-8130) [C++][Gandiva] Fix dexVisitor

2020-03-16 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-8130:
---

 Summary: [C++][Gandiva] Fix dexVisitor
 Key: ARROW-8130
 URL: https://issues.apache.org/jira/browse/ARROW-8130
 Project: Apache Arrow
  Issue Type: Task
Reporter: Prudhvi Porandla








[jira] [Commented] (ARROW-8111) [C++][CSV] Support DD/MM/YYYY date format

2020-03-16 Thread Artem Alekseev (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060042#comment-17060042
 ] 

Artem Alekseev commented on ARROW-8111:
---

Ok, thanks, folks! I'll create a draft patch soon to discuss more in detail.

> [C++][CSV] Support DD/MM/YYYY date format
> -
>
> Key: ARROW-8111
> URL: https://issues.apache.org/jira/browse/ARROW-8111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Priority: Major
>
> Currently, date parser supports only YYYY-MM-DD format. For our workload we 
> need DD/MM/YYYY format. It is obvious that CSV parser should support 
> different date formats, so we may start from implementing DD/MM/YYYY format. 
> Also, we may use some date parsing library, which would solve the problem for 
> us.
> Also, we may need to somehow specify a format for every column in CSV parser. 
> If you have any implementation ideas in mind, please share, so that I can 
> implement it. 





[jira] [Created] (ARROW-8129) [C++][Compute] Refine compare sorting kernel

2020-03-16 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8129:
---

 Summary: [C++][Compute] Refine compare sorting kernel
 Key: ARROW-8129
 URL: https://issues.apache.org/jira/browse/ARROW-8129
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Yibo Cai
Assignee: Yibo Cai


The sorting kernel implements two comparison functions: 
[CompareValues|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L67]
 uses array.Value() for numeric data and 
[CompareViews|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L72]
 uses array.GetView() for non-numeric data. This can be simplified by using 
GetView() only, as all data types support GetView().

To my surprise, the benchmark shows about a 40% performance improvement after 
the change.

After some digging, I found that in the current code the [comparison 
callback|https://github.com/apache/arrow/blob/ab21f0ee429c2a2c82e4dbc5d216ab1da74221a2/cpp/src/arrow/compute/kernels/sort_to_indices.cc#L94]
 is not inlined (as the disassembled code shows), which leads to a function 
call on every comparison. This is very bad for such a hot loop. Using only 
GetView() fixes the issue: the comparison is inlined.
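
For readers unfamiliar with the kernel, the following Python sketch shows what 
a sort-to-indices operation computes and where the comparison callback sits. 
It is only a conceptual model with hypothetical names (`sort_to_indices`, 
`compare`), not the Arrow C++ implementation, and it cannot reproduce the 
inlining effect, which is specific to compiled code.

```python
from functools import cmp_to_key


def sort_to_indices(values):
    """Return the indices that would sort `values` in ascending order."""

    def compare(left, right):
        # Plays the role of the comparison callback: it compares the values
        # stored at two indices and is called many times by the sort.
        return (values[left] > values[right]) - (values[left] < values[right])

    return sorted(range(len(values)), key=cmp_to_key(compare))


values = [30, 10, 20, 10]
indices = sort_to_indices(values)
sorted_values = [values[i] for i in indices]
```

Because the callback runs once per comparison, any per-call overhead is 
multiplied by the number of comparisons, which is why a missed inlining shows 
up so clearly in the benchmark.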





[jira] [Resolved] (ARROW-8128) [C#] NestedType children serialized on wrong length

2020-03-16 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-8128.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6628
[https://github.com/apache/arrow/pull/6628]

> [C#] NestedType children serialized on wrong length
> ---
>
> Key: ARROW-8128
> URL: https://issues.apache.org/jira/browse/ARROW-8128
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Reporter: Takashi Hashida
>Assignee: Takashi Hashida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Each child node of a NestedType is serialized with the previous node's Length 
> and NullCount.
> This causes incorrect data access in ListArray.GetValueOffset and so on.
>  
> [https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs#L219]
>  
> {code:java}
> Flatbuf.FieldNode childFieldNode = recordBatchEnumerator.CurrentNode;
> recordBatchEnumerator.MoveNextNode();
> {code}
> At these lines, MoveNextNode should be executed before assigning CurrentNode.
> This can be reproduced by changing TestData.ArrayCreator.Visit(ListType type) 
> as shown below and executing ArrowFileReaderTests.
> {code:java}
> public void Visit(ListType type)
> {
>     var builder = new ListArray.Builder(type.ValueField).Reserve(Length);
>     // TODO: support various types
>     var valueBuilder = (Int64Array.Builder)builder.ValueBuilder.Reserve(Length);
>     for (var i = 0; i < Length; i++)
>     {
>         builder.Append();
>         valueBuilder.Append(i);
>     }
>     // Add a value to check if Values length can exceed ListArray length
>     valueBuilder.Append(0);
>     Array = builder.Build();
> }
> {code}
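
The ordering problem described above can be modelled in a few lines of Python. 
This is a hypothetical model, not the Arrow C# reader: `NodeCursor`, 
`read_child_buggy` and `read_child_fixed` are names invented for the sketch, 
and it assumes the cursor still points at the parent node when the child is 
read.

```python
class NodeCursor:
    """Hypothetical stand-in for the record batch enumerator."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._pos = 0  # starts on the parent (list) node

    @property
    def current(self):
        return self._nodes[self._pos]

    def move_next(self):
        self._pos += 1


def read_child_buggy(cursor):
    node = cursor.current  # still the parent's node
    cursor.move_next()
    return node


def read_child_fixed(cursor):
    cursor.move_next()     # advance first, then read the child's node
    return cursor.current


# (name, length): the values child is one element longer than the list.
nodes = [("list", 3), ("int64", 4)]
buggy = read_child_buggy(NodeCursor(nodes))
fixed = read_child_fixed(NodeCursor(nodes))
```

In the buggy order the child is given the parent's length (3 instead of 4), 
which goes unnoticed as long as parent and child happen to have the same 
length. That is why the reproduction appends one extra value.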





[jira] [Assigned] (ARROW-8128) [C#] NestedType children serialized on wrong length

2020-03-16 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-8128:
---

Assignee: Takashi Hashida

> [C#] NestedType children serialized on wrong length
> ---
>
> Key: ARROW-8128
> URL: https://issues.apache.org/jira/browse/ARROW-8128
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Reporter: Takashi Hashida
>Assignee: Takashi Hashida
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Each child node of a NestedType is serialized with the previous node's Length 
> and NullCount.
> This causes incorrect data access in ListArray.GetValueOffset and so on.
>  
> [https://github.com/apache/arrow/blob/master/csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs#L219]
>  
> {code:java}
> Flatbuf.FieldNode childFieldNode = recordBatchEnumerator.CurrentNode;
> recordBatchEnumerator.MoveNextNode();
> {code}
> At these lines, MoveNextNode should be executed before assigning CurrentNode.
> This can be reproduced by changing TestData.ArrayCreator.Visit(ListType type) 
> as shown below and executing ArrowFileReaderTests.
> {code:java}
> public void Visit(ListType type)
> {
>     var builder = new ListArray.Builder(type.ValueField).Reserve(Length);
>     // TODO: support various types
>     var valueBuilder = (Int64Array.Builder)builder.ValueBuilder.Reserve(Length);
>     for (var i = 0; i < Length; i++)
>     {
>         builder.Append();
>         valueBuilder.Append(i);
>     }
>     // Add a value to check if Values length can exceed ListArray length
>     valueBuilder.Append(0);
>     Array = builder.Build();
> }
> {code}


