[jira] [Updated] (ARROW-8906) [Rust] Support reading multiple CSV files for schema inference

2020-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8906:
--
Labels: pull-request-available  (was: )

> [Rust] Support reading multiple CSV files for schema inference
> --
>
> Key: ARROW-8906
> URL: https://issues.apache.org/jira/browse/ARROW-8906
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: QP Hou
>Assignee: QP Hou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8906) [Rust] Support reading multiple CSV files for schema inference

2020-05-22 Thread QP Hou (Jira)
QP Hou created ARROW-8906:
-

 Summary: [Rust] Support reading multiple CSV files for schema 
inference
 Key: ARROW-8906
 URL: https://issues.apache.org/jira/browse/ARROW-8906
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: QP Hou
Assignee: QP Hou








[jira] [Commented] (ARROW-8901) [C++] Reduce number of take kernels

2020-05-22 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114518#comment-17114518
 ] 

Wes McKinney commented on ARROW-8901:
-

We probably need at least int8 through int64 (so we can use take to unpack 
dictionaries). A different code path will probably be used for running "take" 
in a selection vector context (per ARROW-8903)

> [C++] Reduce number of take kernels
> ---
>
> Key: ARROW-8901
> URL: https://issues.apache.org/jira/browse/ARROW-8901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> After ARROW-8792 we can observe that we are generating 312 take kernels
> {code}
> In [1]: import pyarrow.compute as pc  
> 
> In [2]: reg = pc.function_registry()  
> 
> In [3]: reg.get_function('take')  
> 
> Out[3]: 
> arrow.compute.Function
> kind: vector
> num_kernels: 312
> {code}
> You can see them all here: 
> https://gist.github.com/wesm/c3085bf40fa2ee5e555204f8c65b4ad5
> It's probably going to be sufficient to only support int16, int32, and int64 
> index types for almost all types and insert implicit casts (once we implement 
> implicit-cast-insertion into the execution code) for other index types. If we 
> determine that there is some performance hot path where we need to specialize 
> for other index types, then we can always do that.
> Additionally, we should be able to collapse the date/time kernels since we're 
> just moving memory.





[jira] [Updated] (ARROW-8905) [C++] Collapse Take APIs from 8 to 1 or 2

2020-05-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8905:

Description: There are currently 8 {{arrow::compute::Take}} functions with 
different function signatures. Fewer functions would make life easier for 
binding developers  (was: There are currently 8 {{Take}} functions with 
different function signatures. Fewer functions would make life easier for 
binding developers)

> [C++] Collapse Take APIs from 8 to 1 or 2
> -
>
> Key: ARROW-8905
> URL: https://issues.apache.org/jira/browse/ARROW-8905
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There are currently 8 {{arrow::compute::Take}} functions with different 
> function signatures. Fewer functions would make life easier for binding 
> developers





[jira] [Created] (ARROW-8905) [C++] Collapse Take APIs from 8 to 1 or 2

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8905:
---

 Summary: [C++] Collapse Take APIs from 8 to 1 or 2
 Key: ARROW-8905
 URL: https://issues.apache.org/jira/browse/ARROW-8905
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


There are currently 8 {{Take}} functions with different function signatures. 
Fewer functions would make life easier for binding developers
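A sketch of the collapsed surface (in Python, with hypothetical names rather than the actual C++ signatures): a single generic entry point can dispatch on whether the input is chunked, instead of exposing one overload per container type.

```python
def take(values, indices):
    """One generic take() for both flat and chunked inputs.

    `values` is either a flat list or a list of chunks (a list of lists);
    this stands in for the several container-specific C++ overloads.
    """
    if values and isinstance(values[0], list):
        # Chunked input: resolve each logical index against the chunks.
        flat = [x for chunk in values for x in chunk]
    else:
        flat = values
    return [flat[i] for i in indices]

print(take(["a", "b", "c"], [2, 0]))      # ['c', 'a']
print(take([["a", "b"], ["c"]], [2, 0]))  # ['c', 'a']
```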





[jira] [Commented] (ARROW-6775) [C++] [Python] Proposal for several Array utility functions

2020-05-22 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114510#comment-17114510
 ] 

Wes McKinney commented on ARROW-6775:
-

I think these can all be implemented as kernels with the new compute framework 
after ARROW-8792. I have linked the issue

> [C++] [Python] Proposal for several Array utility functions
> ---
>
> Key: ARROW-6775
> URL: https://issues.apache.org/jira/browse/ARROW-6775
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Zhuo Peng
>Priority: Minor
>
> Hi,
> We developed several utilities that compute / access certain properties of 
> Arrays and wonder if it makes sense to get them into upstream (into both 
> the C++ API and pyarrow) and, if so, where the best place to put them is.
> Maybe I have overlooked existing APIs that already do the same... in that 
> case, please point them out.
>  
> 1/ ListLengthFromListArray(ListArray&)
> Returns the lengths of the lists in a ListArray, as an Int32Array (or Int64Array for 
> large lists). For example:
> [[1, 2, 3], [], None] => [3, 0, 0] (or [3, 0, None], but we hope the returned 
> array can be converted to numpy)
>  
> 2/ GetBinaryArrayTotalByteSize(BinaryArray&)
> Returns the total byte size of a BinaryArray (basically offset[len - 1] - 
> offset[0]).
> Alternatively, a BinaryArray::Flatten() -> Uint8Array would work.
>  
> 3/ GetArrayNullBitmapAsByteArray(Array&)
> Returns the array's null bitmap as a UInt8Array (which can be efficiently 
> converted to a bool numpy array)
>  
> 4/ GetFlattenedArrayParentIndices(ListArray&)
> Makes an int32 array of the same length as the flattened ListArray. 
> returned_array[i] == j means the i-th element in the flattened ListArray came 
> from the j-th list in the ListArray.
> For example [[1,2,3], [], None, [4,5]] => [0, 0, 0, 3, 3]
>  
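Both list utilities (1 and 4) fall out of the list array's offsets buffer; here is a pure-Python sketch over offsets and a validity mask (not the proposed C++ signatures):

```python
def list_lengths(offsets, valid):
    """Utility 1: per-list lengths; null lists report 0."""
    return [offsets[i + 1] - offsets[i] if valid[i] else 0
            for i in range(len(offsets) - 1)]

def parent_indices(offsets):
    """Utility 4: for each flattened element, the index of its parent list."""
    out = []
    for j in range(len(offsets) - 1):
        out.extend([j] * (offsets[j + 1] - offsets[j]))
    return out

# [[1, 2, 3], [], None, [4, 5]] has offsets [0, 3, 3, 3, 5]
offsets = [0, 3, 3, 3, 5]
valid = [True, True, False, True]
print(list_lengths(offsets, valid))  # [3, 0, 0, 2]
print(parent_indices(offsets))       # [0, 0, 0, 3, 3]
```

Note that a null list contributes no flattened elements, which is why index 2 never appears in the parent-indices output, matching the example above.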





[jira] [Commented] (ARROW-3520) [C++] Implement List Flatten kernel

2020-05-22 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114509#comment-17114509
 ] 

Wes McKinney commented on ARROW-3520:
-

This would be fine as a {{VectorFunction}}

> [C++] Implement List Flatten kernel
> ---
>
> Key: ARROW-3520
> URL: https://issues.apache.org/jira/browse/ARROW-3520
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> see also ARROW-45





[jira] [Created] (ARROW-8904) [Python] Fix usages of deprecated C++ APIs related to child/field

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8904:
---

 Summary: [Python] Fix usages of deprecated C++ APIs related to 
child/field
 Key: ARROW-8904
 URL: https://issues.apache.org/jira/browse/ARROW-8904
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


{code}
-- Running cmake --build for pyarrow
cmake --build . --config debug -- -j16
[19/20] Building CXX object CMakeFiles/lib.dir/lib.cpp.o
lib.cpp:20265:85: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_t_1 = __pyx_f_7pyarrow_3lib__normalize_index(__pyx_v_i, 
__pyx_v_self->type->num_children()); if (unlikely(__pyx_t_1 == 
((Py_ssize_t)-1L))) __PYX_ERR(1, 119, __pyx_L1_error)

^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:20276:76: warning: 'child' is deprecated: Use field(i) 
[-Wdeprecated-declarations]
  __pyx_t_2 = 
__pyx_f_7pyarrow_3lib_pyarrow_wrap_field(__pyx_v_self->type->child(__pyx_v_index));
 if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 120, __pyx_L1_error)
   ^
/home/wesm/local/include/arrow/type.h:251:3: note: 'child' has been explicitly 
marked deprecated here
  ARROW_DEPRECATED("Use field(i)")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:20507:56: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_self->type->num_children()); if 
(unlikely(!__pyx_t_1)) __PYX_ERR(1, 139, __pyx_L1_error)
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:23361:44: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:24039:44: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:58220:37: warning: 'child' is deprecated: Use field(pos) 
[-Wdeprecated-declarations]
  __pyx_v_child = __pyx_v_self->ap->child(__pyx_v_child_id);
^
/home/wesm/local/include/arrow/array.h:1281:3: note: 'child' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use field(pos)")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:58956:74: warning: 'children' is deprecated: Use fields() 
[-Wdeprecated-declarations]
  __pyx_v_child_fields = 
__pyx_v_self->__pyx_base.__pyx_base.type->type->children();
 ^
/home/wesm/local/include/arrow/type.h:257:3: note: 'children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
{code}

[jira] [Created] (ARROW-8903) [C++] Implement optimized "unsafe take" for use with selection vectors for kernel execution

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8903:
---

 Summary: [C++] Implement optimized "unsafe take" for use with 
selection vectors for kernel execution
 Key: ARROW-8903
 URL: https://issues.apache.org/jira/browse/ARROW-8903
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Selection vectors constructed from filters do not need to be subjected to 
bounds checking and other such safety checks as are present in a usual 
invocation of {{take}}. So based on the type width of a selection vector 
(uint16?) we should implement highly streamlined take implementations that 
additionally take into consideration that selection vectors are monotonic by 
construction.
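The contrast can be sketched as follows (Python; in C++ the win is skipping the per-element bounds branches, which Python's list indexing does not actually let you skip):

```python
def safe_take(values, indices):
    """Normal take: validate every index before use."""
    n = len(values)
    for i in indices:
        if not 0 <= i < n:
            raise IndexError(f"take index {i} out of bounds for length {n}")
    return [values[i] for i in indices]

def unsafe_take(values, selection):
    """'Unsafe take' for selection vectors: indices are in bounds and
    monotonically increasing by construction (they came from a filter),
    so per-element validation is skipped entirely."""
    return [values[i] for i in selection]

values = list(range(10, 20))
selection = [0, 3, 4, 9]  # monotonic, as produced by a filter
print(unsafe_take(values, selection))  # [10, 13, 14, 19]
```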





[jira] [Resolved] (ARROW-8815) [Dev][Release] Binary upload script should retry on unexpected bintray request error

2020-05-22 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-8815.
-
Resolution: Fixed

Issue resolved by pull request 7192
[https://github.com/apache/arrow/pull/7192]

> [Dev][Release] Binary upload script should retry on unexpected bintray 
> request error
> 
>
> Key: ARROW-8815
> URL: https://issues.apache.org/jira/browse/ARROW-8815
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> While uploading the binaries to bintray, the script exited multiple times 
> because of unhandled HTTP errors.





[jira] [Created] (ARROW-8902) [rust][datafusion] optimize count(*) queries on parquet sources

2020-05-22 Thread Alex Gaynor (Jira)
Alex Gaynor created ARROW-8902:
--

 Summary: [rust][datafusion] optimize count(*) queries on parquet 
sources
 Key: ARROW-8902
 URL: https://issues.apache.org/jira/browse/ARROW-8902
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Alex Gaynor


Currently, as far as I can tell, when you run a `select count(*) from 
dataset` in DataFusion against a Parquet dataset, it is implemented 
by scanning column 0 and counting up all of the rows (specifically, I 
think it counts the number of rows in each batch).

 

However, for the specific case of counting _everything_ in a Parquet file, 
you can just read the row count from the footer metadata, so it's O(1) instead 
of O(n).
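The O(1) plan amounts to summing per-row-group row counts from the footer; a sketch, where the metadata shape is a hypothetical stand-in for the Rust parquet reader's footer metadata:

```python
def count_star_scan(row_batches):
    # Current behavior: scan a column and add up batch lengths -- O(n).
    return sum(len(batch) for batch in row_batches)

def count_star_footer(footer):
    # Proposed behavior: sum num_rows over row groups from the footer --
    # O(1) in the data size, since only the metadata is read.
    return sum(rg["num_rows"] for rg in footer["row_groups"])

footer = {"row_groups": [{"num_rows": 1000}, {"num_rows": 250}]}
print(count_star_footer(footer))  # 1250
```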





[jira] [Created] (ARROW-8901) [C++] Reduce number of take kernels

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8901:
---

 Summary: [C++] Reduce number of take kernels
 Key: ARROW-8901
 URL: https://issues.apache.org/jira/browse/ARROW-8901
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


After ARROW-8792 we can observe that we are generating 312 take kernels

{code}
In [1]: import pyarrow.compute as pc
  

In [2]: reg = pc.function_registry()
  

In [3]: reg.get_function('take')
  
Out[3]: 
arrow.compute.Function
kind: vector
num_kernels: 312
{code}

You can see them all here: 
https://gist.github.com/wesm/c3085bf40fa2ee5e555204f8c65b4ad5

It's probably going to be sufficient to only support int16, int32, and int64 
index types for almost all types and insert implicit casts (once we implement 
implicit-cast-insertion into the execution code) for other index types. If we 
determine that there is some performance hot path where we need to specialize 
for other index types, then we can always do that.

Additionally, we should be able to collapse the date/time kernels since we're 
just moving memory.
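The implicit-cast idea can be sketched as a dispatch table (Python; kernel and cast names are hypothetical): keep real kernels only for int16/int32/int64 indices and cast every other index type up front.

```python
# Index types that keep dedicated take kernels.
SUPPORTED = {"int16", "int32", "int64"}

# Every other index type is implicitly cast before dispatch.
IMPLICIT_CAST = {
    "int8": "int16", "uint8": "int16",
    "uint16": "int32", "uint32": "int64", "uint64": "int64",
}

def dispatch_take(index_type):
    """Return the index type whose kernel will actually run."""
    if index_type in SUPPORTED:
        return index_type
    return IMPLICIT_CAST[index_type]  # the inserted implicit cast

print(dispatch_take("uint8"))  # int16
print(dispatch_take("int32"))  # int32
```

With 3 index types per value type instead of one kernel per index type, the kernel count shrinks roughly proportionally, at the cost of a cheap cast on the rare paths.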





[jira] [Created] (ARROW-8900) Respect HTTP(S)_PROXY for S3 Filesystems and/or expose proxy options as parameters

2020-05-22 Thread Daniel Nugent (Jira)
Daniel Nugent created ARROW-8900:


 Summary: Respect HTTP(S)_PROXY for S3 Filesystems and/or expose 
proxy options as parameters
 Key: ARROW-8900
 URL: https://issues.apache.org/jira/browse/ARROW-8900
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.17.0
Reporter: Daniel Nugent


HTTP_PROXY and HTTPS_PROXY are not automatically respected by the 
Aws::Client::ClientConfiguration (see: 
https://github.com/aws/aws-sdk-cpp/issues/1049)

Either Arrow should respect them, or proxy settings should be made available 
as parameters when connecting to S3.
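The env-var half could look like this sketch (Python; the field names mirror, but are not literally, what Aws::Client::ClientConfiguration exposes):

```python
import os
from urllib.parse import urlparse

def proxy_options_from_env(environ=os.environ):
    """Translate HTTP(S)_PROXY into proxy fields for an S3 client config."""
    raw = environ.get("HTTPS_PROXY") or environ.get("HTTP_PROXY")
    if not raw:
        return None
    u = urlparse(raw)
    return {"proxy_scheme": u.scheme,
            "proxy_host": u.hostname,
            "proxy_port": u.port}

opts = proxy_options_from_env({"HTTPS_PROXY": "http://proxy.local:3128"})
print(opts)
# {'proxy_scheme': 'http', 'proxy_host': 'proxy.local', 'proxy_port': 3128}
```

Exposing the same three fields as explicit S3 filesystem parameters would cover the "and/or" part of the request.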





[jira] [Commented] (ARROW-4390) [R] Serialize "labeled" metadata in Feather files, IPC messages

2020-05-22 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114428#comment-17114428
 ] 

Neal Richardson commented on ARROW-4390:


After exploring more, I don't think this requires an extension type; we just 
need to collect R attributes and store them as schema metadata.

> [R] Serialize "labeled" metadata in Feather files, IPC messages
> ---
>
> Key: ARROW-4390
> URL: https://issues.apache.org/jira/browse/ARROW-4390
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
>
> see https://github.com/apache/arrow/issues/3480





[jira] [Created] (ARROW-8899) [R] Add R metadata like pandas metadata for round-trip fidelity

2020-05-22 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8899:
--

 Summary: [R] Add R metadata like pandas metadata for round-trip 
fidelity
 Key: ARROW-8899
 URL: https://issues.apache.org/jira/browse/ARROW-8899
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


Arrow Schema and Field objects have custom_metadata fields to store arbitrary 
strings in a key-value store. Pandas stores JSON in a "pandas" key and uses 
that to improve the fidelity of round-tripping data to Arrow/Parquet/Feather 
and back. 
https://pandas.pydata.org/docs/dev/development/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 describes this a bit.

You can see this pandas metadata in the sample Parquet file:

{code:r}
tab <- read_parquet(system.file("v0.7.1.parquet", package="arrow"), 
as_data_frame = FALSE)
tab

# Table
# 10 rows x 11 columns
# $carat 
# $cut 
# $color 
# $clarity 
# $depth 
# $table 
# $price 
# $x 
# $y 
# $z 
# $__index_level_0__ 

tab$metadata

# $pandas
# [1] "{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": 
[{\"name\": null, \"pandas_type\": \"string\", \"numpy_type\": \"object\", 
\"metadata\": null}], \"columns\": [{\"name\": \"carat\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"cut\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", 
\"metadata\": null}, {\"name\": \"color\", \"pandas_type\": \"unicode\", 
\"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"clarity\", 
\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, 
{\"name\": \"depth\", \"pandas_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": null}, {\"name\": \"table\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"price\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": 
null}, {\"name\": \"x\", \"pandas_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": null}, {\"name\": \"y\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"z\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": 
null}, {\"name\": \"__index_level_0__\", \"pandas_type\": \"int64\", 
\"numpy_type\": \"int64\", \"metadata\": null}], \"pandas_version\": 
\"0.20.1\"}"
{code}

We should do something similar in R: store the "attributes" for each column in 
a data.frame when we convert to Arrow, and restore those attributes when we 
read from Arrow. 

Since ARROW-8703, you could naively do this all in R, something like:

{code:r}
tab$metadata$r <- lapply(df, attributes)
{code}

on the conversion to Arrow, and in as.data.frame(), do

{code:r}
if (!is.null(tab$metadata$r)) {
  df[] <- mapply(function(col, meta) {
    attributes(col) <- meta
    col
  }, col = df, meta = tab$metadata$r, SIMPLIFY = FALSE)
}
{code}

However, it's trickier than this because:

* {{tab$metadata$r}} needs to be serialized to string and deserialized on the 
way back. Pandas uses JSON but arrow doesn't currently have a JSON R 
dependency. The C++ build does include rapidjson, maybe we could tap into that? 
Alternatively, we could {{dput()}} to dump the R attributes, which might have 
higher fidelity in addition to zero dependencies, but there are tradeoffs.
* We'll need to do the same for all places where Tables and RecordBatches are 
created/converted
* We'll need to make sure that nested types (structs) get the same coverage
* This metadata is only attached to Schemas, meaning that Arrays/ChunkedArrays 
don't have a place to store extra metadata. So we probably want to attach a 
metadata/attributes field to the R6 (Chunked)Array objects so that if we convert 
an R vector to an array, or if we extract an array out of a record batch, we don't 
lose the attributes.
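The serialization question in the first bullet, done with JSON, amounts to this round trip (Python used as neutral pseudocode; in R it would go through a JSON library or rapidjson via C++):

```python
import json

def store_attributes(attrs_by_column):
    """Serialize per-column attributes into the schema's key-value metadata."""
    return {"r": json.dumps(attrs_by_column)}

def restore_attributes(schema_metadata):
    """Deserialize on the way back; None if no R metadata is present."""
    raw = schema_metadata.get("r")
    return json.loads(raw) if raw is not None else None

attrs = {"x": {"class": "Date"}, "y": {"levels": ["a", "b"]}}
meta = store_attributes(attrs)
assert restore_attributes(meta) == attrs  # lossless for JSON-representable attrs
```

The `dput()` alternative mentioned above would replace `json.dumps`/`json.loads` with R's own serialization, trading dependency count against cross-language readability of the metadata.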

Doing this should resolve ARROW-4390 and make ARROW-8867 trivial as well.

Finally, a note about this custom metadata vs. extension types. Extension types 
can be defined by [adding metadata to a 
Field|https://arrow.apache.org/docs/format/Columnar.html#extension-types] (in a 
Schema). I think this is out of scope here because we're only concerned with R 
roundtrip fidelity. If there were a type that (for example) R and Pandas both 
had that Arrow did not, we could define an extension type so that we could 
share that across the implementations. But unless/until there is value in 
establishing that extension type standard, let's not worry about it. (In other 
words, in R we should ignore pandas metadata; if there's anything that pandas 
wants to share with R, it will define it somewhere else.)





[jira] [Created] (ARROW-8898) [C++] Determine desirable maximum length for ExecBatch in pipelined and parallel execution of kernels

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8898:
---

 Summary: [C++] Determine desirable maximum length for ExecBatch in 
pipelined and parallel execution of kernels
 Key: ARROW-8898
 URL: https://issues.apache.org/jira/browse/ARROW-8898
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Maximum lengths like 16K or 64K seem to be popular, but we should write our own 
benchmarks so that we can justify the choice of default chunksize
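A toy harness of the kind intended (pure Python, so the absolute numbers are meaningless; only the shape of the measurement carries over to a real C++ benchmark):

```python
import timeit

def process_chunked(data, chunk_len):
    """Process `data` in fixed-size chunks, as a pipelined executor would."""
    total = 0
    for i in range(0, len(data), chunk_len):
        total += sum(data[i:i + chunk_len])
    return total

data = list(range(100_000))
for chunk_len in (1 << 14, 1 << 16):  # 16K and 64K, the popular defaults
    t = timeit.timeit(lambda: process_chunked(data, chunk_len), number=10)
    print(f"chunk={chunk_len}: {t:.4f}s")
```

In the real benchmark the interesting axis is where per-chunk overhead stops dominating and cache effects take over, which is what would justify one default over another.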





[jira] [Created] (ARROW-8897) [C++] Determine strategy for propagating failures in initializing built-in function registry in arrow/compute

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8897:
---

 Summary: [C++] Determine strategy for propagating failures in 
initializing built-in function registry in arrow/compute
 Key: ARROW-8897
 URL: https://issues.apache.org/jira/browse/ARROW-8897
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


As discussed on https://github.com/apache/arrow/pull/7240, we are using 
{{DCHECK_OK}} to check statuses when initializing the built-in registry. 

We could propagate failures by changing {{arrow::compute::GetFunctionRegistry}} 
to return Result, but there may be other ways
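The Result-returning alternative, sketched in Python (names hypothetical; `arrow::Result<T>` in C++ carries either a value or a Status):

```python
class Result:
    """Minimal stand-in for arrow::Result<T>: a value or an error, never both."""
    def __init__(self, value=None, error=None):
        assert (value is None) != (error is None)
        self.value, self.error = value, error

    def ok(self):
        return self.error is None

def register_function(registry, name):
    """Real registration could fail and return a status; None means success."""
    registry[name] = object()
    return None

def get_function_registry():
    """Instead of DCHECK-ing statuses during registration, propagate them."""
    registry = {}
    for name in ("take", "filter", "cast"):
        status = register_function(registry, name)
        if status is not None:
            return Result(error=f"initializing '{name}': {status}")
    return Result(value=registry)

res = get_function_registry()
print(res.ok())  # True
```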





[jira] [Created] (ARROW-8896) [C++] Reimplement dictionary unpacking in Cast kernels using Take

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8896:
---

 Summary: [C++] Reimplement dictionary unpacking in Cast kernels 
using Take
 Key: ARROW-8896
 URL: https://issues.apache.org/jira/browse/ARROW-8896
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


As suggested by [~apitrou] this should yield less code to maintain
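Unpacking a dictionary array is exactly a take of the dictionary values at the stored indices, which is the reuse being proposed; a sketch:

```python
def take(values, indices):
    """Generic take kernel; None indices propagate as nulls."""
    return [None if i is None else values[i] for i in indices]

def cast_dictionary_to_dense(dictionary, indices):
    """Dictionary unpacking reduced to take(dictionary, indices):
    no separate unpack implementation left to maintain."""
    return take(dictionary, indices)

print(cast_dictionary_to_dense(["low", "high"], [0, 1, 1, None, 0]))
# ['low', 'high', 'high', None, 'low']
```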





[jira] [Updated] (ARROW-8895) [C++] Add C++ unit tests for filter and take functions on temporal type inputs, including timestamps

2020-05-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8895:

Summary: [C++] Add C++ unit tests for filter and take functions on temporal 
type inputs, including timestamps  (was: [C++] Add C++ unit tests for filter 
function on temporal type inputs, including timestamps)

> [C++] Add C++ unit tests for filter and take functions on temporal type 
> inputs, including timestamps
> 
>
> Key: ARROW-8895
> URL: https://issues.apache.org/jira/browse/ARROW-8895
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> These are used in R but not tested in C++, so I only found out that I had 
> missed adding the kernels to the Filter VectorFunction when running the R 
> test suite





[jira] [Created] (ARROW-8895) [C++] Add C++ unit tests for filter function on temporal type inputs, including timestamps

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8895:
---

 Summary: [C++] Add C++ unit tests for filter function on temporal 
type inputs, including timestamps
 Key: ARROW-8895
 URL: https://issues.apache.org/jira/browse/ARROW-8895
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


These are used in R but not tested in C++, so I only found out that I had 
missed adding the kernels to the Filter VectorFunction when running the R test 
suite





[jira] [Updated] (ARROW-8894) [C++] C++ array kernels framework and execution buildout (umbrella issue)

2020-05-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8894:

Description: 
In the wake of ARROW-8792, this issue is to serve as an umbrella issue for 
follow up work and associated "buildout" which includes things like:

* Implementation of many new function types and adding new kernel cases to 
existing functions
* Adding implicit casting functionality to function execution
* Creation of "bound" physical array expressions and execution thereof
* Pipeline execution (executing multiple kernels while eliminating temporary 
allocation)
* Parallel execution of scalar and aggregate kernels (including parallel 
execution of pipelined kernels)

There are quite a few existing JIRAs in the project that I'll attach to this 
issue, and I'll open plenty more as things occur to me to help organize 
the work. 

  was:
In the wake of ARROW-8792, this issue is to serve as an umbrella issue for 
follow up work and associated "buildout" which includes things like:

* Implementation of many new function types and adding new kernel cases to 
existing functions
* Adding implicit casting functionality to function execution
* Creation of "bound" physical arrays expressions
* Pipeline execution (executing multiple kernels while eliminating temporary 
allocation)
* Parallel execution of scalar and aggregate kernels (including parallel 
execution of pipelined kernels)

There's quite a few existing JIRAs in the project that I'll attach to this 
issue and I'll open plenty more issues as things occur to me to help organize 
the work. 


> [C++] C++ array kernels framework and execution buildout (umbrella issue)
> -
>
> Key: ARROW-8894
> URL: https://issues.apache.org/jira/browse/ARROW-8894
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> In the wake of ARROW-8792, this issue is to serve as an umbrella issue for 
> follow up work and associated "buildout" which includes things like:
> * Implementation of many new function types and adding new kernel cases to 
> existing functions
> * Adding implicit casting functionality to function execution
> * Creation of "bound" physical array expressions and execution thereof
> * Pipeline execution (executing multiple kernels while eliminating temporary 
> allocation)
> * Parallel execution of scalar and aggregate kernels (including parallel 
> execution of pipelined kernels)
> There are quite a few existing JIRAs in the project that I'll attach to this 
> issue, and I'll open plenty more as things occur to me to help organize 
> the work. 





[jira] [Created] (ARROW-8894) [C++] C++ array kernels framework and execution buildout (umbrella issue)

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8894:
---

 Summary: [C++] C++ array kernels framework and execution buildout 
(umbrella issue)
 Key: ARROW-8894
 URL: https://issues.apache.org/jira/browse/ARROW-8894
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


In the wake of ARROW-8792, this issue is to serve as an umbrella issue for 
follow up work and associated "buildout" which includes things like:

* Implementation of many new function types and adding new kernel cases to 
existing functions
* Adding implicit casting functionality to function execution
* Creation of "bound" physical arrays expressions
* Pipeline execution (executing multiple kernels while eliminating temporary 
allocation)
* Parallel execution of scalar and aggregate kernels (including parallel 
execution of pipelined kernels)

There are quite a few existing JIRAs in the project that I'll attach to this 
issue, and I'll open plenty more as things occur to me to help organize 
the work. 





[jira] [Resolved] (ARROW-8890) [R] Fix C++ lint issue

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8890.
---
Resolution: Fixed

Issue resolved by pull request 7251
[https://github.com/apache/arrow/pull/7251]

> [R] Fix C++ lint issue 
> ---
>
> Key: ARROW-8890
> URL: https://issues.apache.org/jira/browse/ARROW-8890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Closed] (ARROW-8893) [R] Fix cpplint issues introduced by ARROW-8885

2020-05-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8893.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Duplicate

dup of ARROW-8890

> [R] Fix cpplint issues introduced by ARROW-8885
> ---
>
> Key: ARROW-8893
> URL: https://issues.apache.org/jira/browse/ARROW-8893
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
>
> {code}
> (arrow-3.7) 12:34 ~/code/arrow/r $ ./lint.sh 
> /home/wesm/code/arrow/r/src/arrow_types.h:20:  Include the directory when 
> naming .h files  [build/include_subdir] [4]
> /home/wesm/code/arrow/r/src/arrow_types.h:66:  Add #include <utility> for 
> forward  [build/include_what_you_use] [4]
> /home/wesm/code/arrow/r/src/arrow_types.h:83:  Add #include <vector> for 
> vector<>  [build/include_what_you_use] [4]
> /home/wesm/code/arrow/r/src/arrow_types.h:95:  Add #include <limits> for 
> numeric_limits<>  [build/include_what_you_use] [4]
> /home/wesm/code/arrow/r/src/arrow_types.h:110:  Add #include <memory> for 
> shared_ptr<>  [build/include_what_you_use] [4]
> /home/wesm/code/arrow/r/src/arrow_exports.h:22:  Include the directory when 
> naming .h files  [build/include_subdir] [4]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8455) [Rust] [Parquet] Arrow column read on partially compatible files

2020-05-22 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved ARROW-8455.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6935
[https://github.com/apache/arrow/pull/6935]

> [Rust] [Parquet] Arrow column read on partially compatible files
> 
>
> Key: ARROW-8455
> URL: https://issues.apache.org/jira/browse/ARROW-8455
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Remi Dettai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Observed behavior: when reading a Parquet file into Arrow with 
> `get_record_reader_by_columns`, it will fail if one of the columns of the file 
> is a list (or any other unsupported type).
> Expected behavior: it should only fail if you are actually reading a column 
> with an unsupported type.
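The lazy check described above can be sketched as follows (plain Python, purely illustrative — the actual fix lives in the Rust parquet crate, and `SCHEMA`, `SUPPORTED`, and `read_columns` are hypothetical names): an unsupported column only raises when it is actually requested.

```python
# Hypothetical file schema; "list" stands in for a type the reader
# cannot yet convert to Arrow.
SCHEMA = {"a": "int64", "b": "list"}
SUPPORTED = {"int64", "utf8"}

def read_columns(requested):
    # Validate only the columns the caller asked for, so a file that
    # merely *contains* an unsupported column can still be read.
    for name in requested:
        if SCHEMA[name] not in SUPPORTED:
            raise TypeError(f"column {name!r} has unsupported type {SCHEMA[name]!r}")
    return {name: [] for name in requested}  # placeholder column data

assert read_columns(["a"]) == {"a": []}  # unsupported column "b" is never touched
```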



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8893) [R] Fix cpplint issues introduced by ARROW-8885

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8893:
---

 Summary: [R] Fix cpplint issues introduced by ARROW-8885
 Key: ARROW-8893
 URL: https://issues.apache.org/jira/browse/ARROW-8893
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


{code}
(arrow-3.7) 12:34 ~/code/arrow/r $ ./lint.sh 
/home/wesm/code/arrow/r/src/arrow_types.h:20:  Include the directory when 
naming .h files  [build/include_subdir] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:66:  Add #include <utility> for 
forward  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:83:  Add #include <vector> for 
vector<>  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:95:  Add #include <limits> for 
numeric_limits<>  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:110:  Add #include <memory> for 
shared_ptr<>  [build/include_what_you_use] [4]

/home/wesm/code/arrow/r/src/arrow_exports.h:22:  Include the directory when 
naming .h files  [build/include_subdir] [4]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8892) [C++][CI] CI builds for MSVC do not build benchmarks

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8892:
---

 Summary: [C++][CI] CI builds for MSVC do not build benchmarks
 Key: ARROW-8892
 URL: https://issues.apache.org/jira/browse/ARROW-8892
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We must ensure that our benchmarks always build on Windows

I'm fixing these errors for example in ARROW-8792

{code}
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(249): error 
C2220: warning treated as error - no 'object' file generated
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(256): note: see 
reference to function template instantiation 'void 
parquet::BM_PlainEncodingSpaced(benchmark::State &)' 
being compiled
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(249): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(292): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(306): note: see 
reference to function template instantiation 'void 
parquet::BM_PlainDecodingSpaced(benchmark::State &)' 
being compiled
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(299): warning 
C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(300): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
[11/67] Linking CXX executable release\arrow-ipc-read-write-benchmark.exe
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

2020-05-22 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114217#comment-17114217
 ] 

Wes McKinney commented on ARROW-555:


Yes, that's the idea. I can try to implement {{str.split}}, which would be 
{{String -> List<String>}} in Arrow types. 
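As a concrete illustration of that String -> List shape, here is a minimal Python sketch (not the proposed C++ kernel; `split_to_list_layout` is a hypothetical name) of how a split result maps onto Arrow's list layout: a flat child string array plus list offsets.

```python
def split_to_list_layout(strings, sep=" "):
    """Sketch of a String -> List<String> split result, expressed as
    Arrow-style list offsets plus a flattened child string array."""
    offsets = [0]   # offsets[i]..offsets[i+1] delimit row i's parts in `child`
    child = []      # flattened child string values
    for s in strings:
        parts = s.split(sep)
        child.extend(parts)
        offsets.append(len(child))
    return offsets, child

offsets, child = split_to_list_layout(["a b", "c d e"])
# offsets == [0, 2, 5]; child == ["a", "b", "c", "d", "e"]
```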

> [C++] String algorithm library for StringArray/BinaryArray
> --
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory 
> arranged in Arrow format. This will include using the re2 C++ regular 
> expression library and other standard string manipulations (such as those 
> found on Python's string objects)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8878) [R] how to install when behind a firewall?

2020-05-22 Thread Olaf (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114186#comment-17114186
 ] 

Olaf commented on ARROW-8878:
-

Hi [~npr], thanks for replying back. Please see below:

 
 * getOption("download.file.method") returns wget
 * sorry for the low-tech question, but can I install manually without cloning? 
That is, simply going to the github page [https://github.com/apache/arrow], 
manually downloading the zip and then installing using the "install from zip" 
utility in Rstudio? Would that work correctly?

 

Thanks!!

> [R] how to install when behind a firewall?
> --
>
> Key: ARROW-8878
> URL: https://issues.apache.org/jira/browse/ARROW-8878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: r
>Reporter: Olaf
>Priority: Major
>
> Hello there and thanks again for this beautiful package!
> I am trying to install {{arrow}} on linux and I got a few problematic 
> warnings during the install. My computer is behind a firewall so not all the 
> connections coming from rstudio are allowed.
>  
> {code:java}
> > sessionInfo()
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-ubuntu18-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.4 LTS
> Matrix products: default
> BLAS/LAPACK: 
> /apps/intel/2019.1/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
> locale:
>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
>  [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 
>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C 
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] MKLthreads_0.1
> loaded via a namespace (and not attached):
> [1] compiler_3.6.1 tools_3.6.1
> {code}
>  
> after running {{install.packages("arrow")}} I get
>  
> {code:java}
>  
> installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Successfully retrieved C++ source
> *** Proceeding without C++ dependencies
> Warning message:
> In unzip(tf1, exdir = src_dir) : error 1 in extracting from zip file
> ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or 
> directory
> - NOTE ---
> After installation, please run arrow::install_arrow()
> for help installing required runtime libraries
> -
> {code}
>  
>  
> However, the installation ends normally.
>  
> {code:java}
>  ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> ** help
> *** installing help indices
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** checking absolute paths in shared objects and dynamic libraries
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (arrow)
> {code}
>  
> So I go ahead and try to run arrow::install_arrow() and get a similar warning.
>  
> {code:java}
> installing *source* package ‘arrow’ ...
> ** package ‘arrow’ successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Successfully retrieved C++ binaries for ubuntu-18.04
> Warning messages:
> 1: In file(file, "rt") :
>  URL 
> 'https://raw.githubusercontent.com/ursa-labs/arrow-r-nightly/master/linux/distro-map.csv':
>  status was 'Couldn't connect to server'
> 2: In unzip(bin_file, exdir = dst_dir) :
>  error 1 in extracting from zip file
> ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or 
> directory
> - NOTE ---
> After installation, please run arrow::install_arrow()
> for help installing required runtime libraries
> {code}
> And unfortunately I cannot read any parquet file.
> {noformat}
> Error in fetch(key) : lazy-load database 
> '/mydata/R/x86_64-ubuntu18-linux-gnu-library/3.6/arrow/help/arrow.rdb' is 
> corrupt{noformat}
>  
> Could you please tell me how to fix this? Can I just copy the zip from github 
> and do a manual install in Rstudio?
>  
> Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

2020-05-22 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114105#comment-17114105
 ] 

Maarten Breddels commented on ARROW-555:


Sounds good. I think it would help me a lot to see str->scalar and str->str 
(and possibly a str->[str, str]) example. They can be trivial, like always 
return ["a", "b"], but with that, I can probably get up to speed very quickly, 
if it's not too much to ask. 

> [C++] String algorithm library for StringArray/BinaryArray
> --
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory 
> arranged in Arrow format. This will include using the re2 C++ regular 
> expression library and other standard string manipulations (such as those 
> found on Python's string objects)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8891) [C++] Split non-cast compute kernels into a separate shared library

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8891:
---

 Summary: [C++] Split non-cast compute kernels into a separate 
shared library
 Key: ARROW-8891
 URL: https://issues.apache.org/jira/browse/ARROW-8891
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Since we are going to implement a lot more precompiled kernels, I am not sure 
it makes sense to require all of them to be compiled unconditionally just to 
get access to {{compute::Cast}}, which is needed in many different contexts.

After ARROW-8792 is merged, I would suggest creating a plugin hook for adding a 
bundle of kernels from a shared library outside of libarrow.so, and then moving 
all the object code other than Cast to something like libarrow_compute.so. Then 
we can change the CMake flags to always compile the Cast kernels (?) and opt 
in to building the additional kernels package separately.
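The registration idea can be sketched in miniature (Python, purely conceptual — Arrow's hook would live in C++, and `KernelRegistry`, `register`, and `dispatch` are not real Arrow APIs): the core library registers Cast unconditionally, while an optional bundle registers the remaining kernels when it is loaded.

```python
class KernelRegistry:
    """Toy registry: each kernel bundle contributes named functions."""

    def __init__(self):
        self._kernels = {}

    def register(self, name, func):
        # A bundle's load hook calls this once per kernel it provides.
        self._kernels[name] = func

    def dispatch(self, name, *args):
        # Lookup-and-call; unknown names raise KeyError.
        return self._kernels[name](*args)

registry = KernelRegistry()
# Core library always registers Cast...
registry.register("cast", lambda x: int(x))
# ...while an optional compute bundle adds the rest on load.
registry.register("add", lambda x, y: x + y)

assert registry.dispatch("add", 2, 3) == 5
```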



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8510) [C++] arrow/dataset/file_base.cc fails to compile with internal compiler error with "Visual Studio 15 2017 Win64" generator

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8510:
-

Assignee: Francois Saint-Jacques

> [C++] arrow/dataset/file_base.cc fails to compile with internal compiler 
> error with "Visual Studio 15 2017 Win64" generator
> ---
>
> Key: ARROW-8510
> URL: https://issues.apache.org/jira/browse/ARROW-8510
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I discovered this while running the release verification on Windows. There 
> was an obscuring issue which is that if the build fails, the verification 
> script continues. I will fix that



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8890) [R] Fix C++ lint issue

2020-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8890:
--
Labels: pull-request-available  (was: )

> [R] Fix C++ lint issue 
> ---
>
> Key: ARROW-8890
> URL: https://issues.apache.org/jira/browse/ARROW-8890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8889:
-

Assignee: David Li

> [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None
> --
>
> Key: ARROW-8889
> URL: https://issues.apache.org/jira/browse/ARROW-8889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1, 0.17.1
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
> It seems to happen even when built from source, but I used the wheels for 
> this reproduction.
> {noformat}
> > uname -a
> Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 + 
> x86_64 GNU/Linux
> > python --version
> Python 3.7.7
> > pip freeze
> numpy==1.18.4
> pyarrow==0.17.1{noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
> batches = table.to_batches()
> batches[0].equals(None)
> {code}
> {noformat}
> #0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch 
> const&, bool) const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #2  0x7fffe084a6e0 in 
> __pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) 
> () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
> #3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
> (method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
> self=0x7fffdefd7110, args=0x7786f5c8, nargs=, 
> kwnames=)
> at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
> #4  0x556c06af in _PyMethodDescr_FastCallKeywords 
> (descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
> #5  0x55724add in call_function (kwnames=0x0, oparg=2, 
> pp_stack=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
> #6  _PyEval_EvalFrameDefault (f=, throwflag=) 
> at /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
> #7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0, 
> globals=, locals=, args=, 
> argcount=, kwnames=0x0, kwargs=0x0, kwcount=, 
> kwstep=2, 
> defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
> #8  0x5566a1c4 in PyEval_EvalCodeEx (_co=, 
> globals=, locals=, args=, 
> argcount=, kws=, kwcount=0, defs=0x0, 
> defcount=0, kwdefs=0x0, 
> closure=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
> #9  0x5566a1ec in PyEval_EvalCode (co=, 
> globals=, locals=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
> #10 0x55780cb4 in run_mod (mod=, filename= out>, globals=0x778d7c30, locals=0x778d7c30, flags=, 
> arena=)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
> #11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0, 
> filename_str=, start=, globals=0x778d7c30, 
> locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
> #12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0, 
> filename=, closeit=1, flags=0x7fffe1b0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
> #13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0, 
> filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
> #14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
> #15 pymain_run_python (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
> #16 pymain_main (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
> #17 0x5578c51c in _Py_UnixMain (argc=, argv= out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
> #18 0x77dcd002 in __libc_start_main () from /usr/lib/libc.so.6
> #19 0x5572fac0 in _start () at ../sysdeps/x86_64/elf/start.S:103
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8889.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7249
[https://github.com/apache/arrow/pull/7249]

> [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None
> --
>
> Key: ARROW-8889
> URL: https://issues.apache.org/jira/browse/ARROW-8889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1, 0.17.1
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
> It seems to happen even when built from source, but I used the wheels for 
> this reproduction.
> {noformat}
> > uname -a
> Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 + 
> x86_64 GNU/Linux
> > python --version
> Python 3.7.7
> > pip freeze
> numpy==1.18.4
> pyarrow==0.17.1{noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
> batches = table.to_batches()
> batches[0].equals(None)
> {code}
> {noformat}
> #0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch 
> const&, bool) const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #2  0x7fffe084a6e0 in 
> __pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) 
> () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
> #3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
> (method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
> self=0x7fffdefd7110, args=0x7786f5c8, nargs=, 
> kwnames=)
> at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
> #4  0x556c06af in _PyMethodDescr_FastCallKeywords 
> (descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
> #5  0x55724add in call_function (kwnames=0x0, oparg=2, 
> pp_stack=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
> #6  _PyEval_EvalFrameDefault (f=, throwflag=) 
> at /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
> #7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0, 
> globals=, locals=, args=, 
> argcount=, kwnames=0x0, kwargs=0x0, kwcount=, 
> kwstep=2, 
> defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
> #8  0x5566a1c4 in PyEval_EvalCodeEx (_co=, 
> globals=, locals=, args=, 
> argcount=, kws=, kwcount=0, defs=0x0, 
> defcount=0, kwdefs=0x0, 
> closure=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
> #9  0x5566a1ec in PyEval_EvalCode (co=, 
> globals=, locals=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
> #10 0x55780cb4 in run_mod (mod=, filename= out>, globals=0x778d7c30, locals=0x778d7c30, flags=, 
> arena=)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
> #11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0, 
> filename_str=, start=, globals=0x778d7c30, 
> locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
> #12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0, 
> filename=, closeit=1, flags=0x7fffe1b0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
> #13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0, 
> filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
> #14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
> #15 pymain_run_python (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
> #16 pymain_main (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
> #17 0x5578c51c in _Py_UnixMain (argc=, argv= out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
> #18 0x77dcd002 in 

[jira] [Created] (ARROW-8890) [R] Fix C++ lint issue

2020-05-22 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8890:
-

 Summary: [R] Fix C++ lint issue 
 Key: ARROW-8890
 URL: https://issues.apache.org/jira/browse/ARROW-8890
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Francois Saint-Jacques
Assignee: Francois Saint-Jacques
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8889:
--
Labels: pull-request-available  (was: )

> [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None
> --
>
> Key: ARROW-8889
> URL: https://issues.apache.org/jira/browse/ARROW-8889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1, 0.17.1
>Reporter: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
> It seems to happen even when built from source, but I used the wheels for 
> this reproduction.
> {noformat}
> > uname -a
> Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 + 
> x86_64 GNU/Linux
> > python --version
> Python 3.7.7
> > pip freeze
> numpy==1.18.4
> pyarrow==0.17.1{noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
> batches = table.to_batches()
> batches[0].equals(None)
> {code}
> {noformat}
> #0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch 
> const&, bool) const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #2  0x7fffe084a6e0 in 
> __pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) 
> () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
> #3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
> (method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
> self=0x7fffdefd7110, args=0x7786f5c8, nargs=, 
> kwnames=)
> at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
> #4  0x556c06af in _PyMethodDescr_FastCallKeywords 
> (descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
> #5  0x55724add in call_function (kwnames=0x0, oparg=2, 
> pp_stack=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
> #6  _PyEval_EvalFrameDefault (f=, throwflag=) 
> at /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
> #7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0, 
> globals=, locals=, args=, 
> argcount=, kwnames=0x0, kwargs=0x0, kwcount=, 
> kwstep=2, 
> defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
> #8  0x5566a1c4 in PyEval_EvalCodeEx (_co=, 
> globals=, locals=, args=, 
> argcount=, kws=, kwcount=0, defs=0x0, 
> defcount=0, kwdefs=0x0, 
> closure=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
> #9  0x5566a1ec in PyEval_EvalCode (co=, 
> globals=, locals=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
> #10 0x55780cb4 in run_mod (mod=, filename= out>, globals=0x778d7c30, locals=0x778d7c30, flags=, 
> arena=)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
> #11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0, 
> filename_str=, start=, globals=0x778d7c30, 
> locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
> #12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0, 
> filename=, closeit=1, flags=0x7fffe1b0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
> #13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0, 
> filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
> #14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
> #15 pymain_run_python (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
> #16 pymain_main (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
> #17 0x5578c51c in _Py_UnixMain (argc=, argv= out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
> #18 0x77dcd002 in __libc_start_main () from /usr/lib/libc.so.6
> #19 0x5572fac0 in _start () at ../sysdeps/x86_64/elf/start.S:103
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113981#comment-17113981
 ] 

David Li commented on ARROW-8889:
-

I tried with a wheel for 0.15.1 and it happens as well. (It doesn't happen with 
0.15.1 built from source.) So it seems this has been around a while.
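The crash boils down to passing None straight into a native Equals call. A minimal guard-pattern sketch (plain Python with a hypothetical `Batch` class, not pyarrow's actual Cython wrapper) of the kind of check that prevents it:

```python
class Batch:
    """Stand-in for a wrapper around a native record batch."""

    def __init__(self, data):
        self.data = data

    def equals(self, other):
        # Rejecting non-Batch operands here (returning False or raising
        # TypeError) is what keeps the native code from dereferencing
        # a null object when the caller passes None.
        if not isinstance(other, Batch):
            return False
        return self.data == other.data

b = Batch([1, 2, 3])
assert b.equals(None) is False
assert b.equals(Batch([1, 2, 3])) is True
```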

> [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None
> --
>
> Key: ARROW-8889
> URL: https://issues.apache.org/jira/browse/ARROW-8889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1, 0.17.1
>Reporter: David Li
>Priority: Major
>
> This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
> It seems to happen even when built from source, but I used the wheels for 
> this reproduction.
> {noformat}
> > uname -a
> Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 + 
> x86_64 GNU/Linux
> > python --version
> Python 3.7.7
> > pip freeze
> numpy==1.18.4
> pyarrow==0.17.1{noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
> batches = table.to_batches()
> batches[0].equals(None)
> {code}
> {noformat}
> #0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch 
> const&, bool) const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #2  0x7fffe084a6e0 in 
> __pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) 
> () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
> #3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
> (method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
> self=0x7fffdefd7110, args=0x7786f5c8, nargs=<optimized out>,
> kwnames=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
> #4  0x556c06af in _PyMethodDescr_FastCallKeywords
> (descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
> #5  0x55724add in call_function (kwnames=0x0, oparg=2,
> pp_stack=<optimized out>) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
> #6  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
> #7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0,
> globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
> argcount=<optimized out>, kwnames=0x0, kwargs=0x0, kwcount=<optimized out>,
> kwstep=2,
> defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
> #8  0x5566a1c4 in PyEval_EvalCodeEx (_co=<optimized out>,
> globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
> argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
> defcount=0, kwdefs=0x0,
> closure=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
> #9  0x5566a1ec in PyEval_EvalCode (co=<optimized out>,
> globals=<optimized out>, locals=<optimized out>) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
> #10 0x55780cb4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x778d7c30, locals=0x778d7c30, flags=<optimized out>,
> arena=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
> #11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0,
> filename_str=<optimized out>, start=<optimized out>, globals=0x778d7c30,
> locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
> #12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0,
> filename=<optimized out>, closeit=1, flags=0x7fffe1b0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
> #13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0,
> filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
> #14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
> #15 pymain_run_python (pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
> #16 pymain_main (pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
> #17 0x5578c51c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
> #18 0x77dcd002 in __libc_start_main () from /usr/lib/libc.so.6
> #19 0x5572fac0 in _start () at ../sysdeps/x86_64/elf/start.S:103
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-8889:

Affects Version/s: 0.15.1

> [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None
> --
>
> Key: ARROW-8889
> URL: https://issues.apache.org/jira/browse/ARROW-8889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1, 0.17.1
>Reporter: David Li
>Priority: Major
>
> This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
> It seems to happen even when built from source, but I used the wheels for 
> this reproduction.
> {noformat}
> > uname -a
> Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 +0000
> x86_64 GNU/Linux
> > python --version
> Python 3.7.7
> > pip freeze
> numpy==1.18.4
> pyarrow==0.17.1{noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
> batches = table.to_batches()
> batches[0].equals(None)
> {code}
> {noformat}
> #0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch 
> const&, bool) const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #2  0x7fffe084a6e0 in 
> __pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) 
> () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
> #3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
> (method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
> self=0x7fffdefd7110, args=0x7786f5c8, nargs=<optimized out>,
> kwnames=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
> #4  0x556c06af in _PyMethodDescr_FastCallKeywords
> (descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
> #5  0x55724add in call_function (kwnames=0x0, oparg=2,
> pp_stack=<optimized out>) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
> #6  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
> #7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0,
> globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
> argcount=<optimized out>, kwnames=0x0, kwargs=0x0, kwcount=<optimized out>,
> kwstep=2,
> defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
> #8  0x5566a1c4 in PyEval_EvalCodeEx (_co=<optimized out>,
> globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
> argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
> defcount=0, kwdefs=0x0,
> closure=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
> #9  0x5566a1ec in PyEval_EvalCode (co=<optimized out>,
> globals=<optimized out>, locals=<optimized out>) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
> #10 0x55780cb4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x778d7c30, locals=0x778d7c30, flags=<optimized out>,
> arena=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
> #11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0,
> filename_str=<optimized out>, start=<optimized out>, globals=0x778d7c30,
> locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
> #12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0,
> filename=<optimized out>, closeit=1, flags=0x7fffe1b0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
> #13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0,
> filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
> #14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
> #15 pymain_run_python (pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
> #16 pymain_main (pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
> #17 0x5578c51c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
> #18 0x77dcd002 in __libc_start_main () from /usr/lib/libc.so.6
> #19 0x5572fac0 in _start () at ../sysdeps/x86_64/elf/start.S:103
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113977#comment-17113977
 ] 

David Li commented on ARROW-8889:
-

I have a core dump but it's too large. Let me upload it somewhere else.

> [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None
> --
>
> Key: ARROW-8889
> URL: https://issues.apache.org/jira/browse/ARROW-8889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
>Reporter: David Li
>Priority: Major
>
> This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
> It seems to happen even when built from source, but I used the wheels for 
> this reproduction.
> {noformat}
> > uname -a
> Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 +0000
> x86_64 GNU/Linux
> > python --version
> Python 3.7.7
> > pip freeze
> numpy==1.18.4
> pyarrow==0.17.1{noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
> batches = table.to_batches()
> batches[0].equals(None)
> {code}
> {noformat}
> #0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch 
> const&, bool) const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #2  0x7fffe084a6e0 in 
> __pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) 
> () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
> #3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
> (method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
> self=0x7fffdefd7110, args=0x7786f5c8, nargs=<optimized out>,
> kwnames=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
> #4  0x556c06af in _PyMethodDescr_FastCallKeywords
> (descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
> #5  0x55724add in call_function (kwnames=0x0, oparg=2,
> pp_stack=<optimized out>) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
> #6  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
> #7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0,
> globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
> argcount=<optimized out>, kwnames=0x0, kwargs=0x0, kwcount=<optimized out>,
> kwstep=2,
> defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
> #8  0x5566a1c4 in PyEval_EvalCodeEx (_co=<optimized out>,
> globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
> argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
> defcount=0, kwdefs=0x0,
> closure=0x0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
> #9  0x5566a1ec in PyEval_EvalCode (co=<optimized out>,
> globals=<optimized out>, locals=<optimized out>) at
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
> #10 0x55780cb4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x778d7c30, locals=0x778d7c30, flags=<optimized out>,
> arena=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
> #11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0,
> filename_str=<optimized out>, start=<optimized out>, globals=0x778d7c30,
> locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
> #12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0,
> filename=<optimized out>, closeit=1, flags=0x7fffe1b0) at
> /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
> #13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0,
> filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
> #14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
> #15 pymain_run_python (pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
> #16 pymain_main (pymain=0x7fffe2c0) at
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
> #17 0x5578c51c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
> #18 0x77dcd002 in __libc_start_main () from /usr/lib/libc.so.6
> #19 0x5572fac0 in _start () at ../sysdeps/x86_64/elf/start.S:103
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread David Li (Jira)
David Li created ARROW-8889:
---

 Summary: [Python] Python 3.7 SIGSEGV when comparing RecordBatch to 
None
 Key: ARROW-8889
 URL: https://issues.apache.org/jira/browse/ARROW-8889
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
Reporter: David Li


This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
It seems to happen even when built from source, but I used the wheels for this 
reproduction.
{noformat}
> uname -a
Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 +0000
x86_64 GNU/Linux
> python --version
Python 3.7.7
> pip freeze
numpy==1.18.4
pyarrow==0.17.1{noformat}
Reproduction:
{code:python}
import pyarrow as pa
table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
batches = table.to_batches()
batches[0].equals(None)
{code}
{noformat}
#0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
/home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
#1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch const&, 
bool) const () from 
/home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
#2  0x7fffe084a6e0 in 
__pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) () 
from 
/home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
#3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
(method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
self=0x7fffdefd7110, args=0x7786f5c8, nargs=<optimized out>,
kwnames=<optimized out>)
at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
#4  0x556c06af in _PyMethodDescr_FastCallKeywords
(descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at
/tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
#5  0x55724add in call_function (kwnames=0x0, oparg=2,
pp_stack=<optimized out>) at
/tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
#6  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at
/tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
#7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0,
globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
argcount=<optimized out>, kwnames=0x0, kwargs=0x0, kwcount=<optimized out>,
kwstep=2,
defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at
/tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
#8  0x5566a1c4 in PyEval_EvalCodeEx (_co=<optimized out>,
globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0,
kwdefs=0x0,
closure=0x0) at
/tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
#9  0x5566a1ec in PyEval_EvalCode (co=<optimized out>,
globals=<optimized out>, locals=<optimized out>) at
/tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
#10 0x55780cb4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x778d7c30, locals=0x778d7c30, flags=<optimized out>,
arena=<optimized out>)
at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
#11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0,
filename_str=<optimized out>, start=<optimized out>, globals=0x778d7c30,
locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
#12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0,
filename=<optimized out>, closeit=1, flags=0x7fffe1b0) at
/tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
#13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0,
filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at
/tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
#14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at
/tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
#15 pymain_run_python (pymain=0x7fffe2c0) at
/tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
#16 pymain_main (pymain=0x7fffe2c0) at
/tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
#17 0x5578c51c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
#18 0x77dcd002 in __libc_start_main () from /usr/lib/libc.so.6
#19 0x5572fac0 in _start () at ../sysdeps/x86_64/elf/start.S:103
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
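For context on the crash mechanics: the backtrace shows `RecordBatch::Equals` calling `num_columns()` on the comparison target, which suggests the Cython `equals` binding forwards `None` into native code as a null reference. A defensive binding checks the argument before crossing the language boundary. Below is a minimal pure-Python sketch of that guard pattern; `RecordBatchLike` and `_native_equals` are hypothetical stand-ins, not pyarrow internals.

```python
# Hypothetical sketch of a None-safe equals() guard, illustrating the fix
# pattern for bindings like RecordBatch.equals (not the actual pyarrow code).

class RecordBatchLike:
    """Minimal stand-in for a native-backed record batch."""

    def __init__(self, columns):
        self._columns = columns  # pretend this wraps a C++ object

    def _native_equals(self, other):
        # Stand-in for the C++ call that would dereference `other`;
        # passing None through here is what crashes in the real binding.
        return self._columns == other._columns

    def equals(self, other):
        # Guard before entering native code: None (or any non-batch)
        # can never be equal, so return False instead of segfaulting.
        if not isinstance(other, RecordBatchLike):
            return False
        return self._native_equals(other)

a = RecordBatchLike([[1, 2, 3]])
b = RecordBatchLike([[1, 2, 3]])
print(a.equals(b))     # True
print(a.equals(None))  # False, no crash
```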


[jira] [Resolved] (ARROW-8885) [R] Don't include everything everywhere

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8885.
---
Resolution: Fixed

Issue resolved by pull request 7245
[https://github.com/apache/arrow/pull/7245]

> [R] Don't include everything everywhere
> ---
>
> Key: ARROW-8885
> URL: https://issues.apache.org/jira/browse/ARROW-8885
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I noticed that we were jamming all of our arrow #includes in one header file 
> in the R bindings and then including that everywhere. Seemed like that was 
> wasteful and probably causing compilation to be slower.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8696) [Java] Convert tests to integration tests

2020-05-22 Thread Ryan Murray (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Murray resolved ARROW-8696.

Resolution: Fixed

> [Java] Convert tests to integration tests
> -
>
> Key: ARROW-8696
> URL: https://issues.apache.org/jira/browse/ARROW-8696
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Ryan Murray
>Assignee: Ryan Murray
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Some tests under arrow-memory and arrow-vector are integration tests but run 
> via main(). We should convert them to proper integration tests under maven 
> failsafe



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8696) [Java] Convert tests to integration tests

2020-05-22 Thread Ryan Murray (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113870#comment-17113870
 ] 

Ryan Murray commented on ARROW-8696:


Closed in https://github.com/apache/arrow/pull/7100 via 93ba086 

> [Java] Convert tests to integration tests
> -
>
> Key: ARROW-8696
> URL: https://issues.apache.org/jira/browse/ARROW-8696
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Ryan Murray
>Assignee: Ryan Murray
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Some tests under arrow-memory and arrow-vector are integration tests but run 
> via main(). We should convert them to proper integration tests under maven 
> failsafe



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8888) [Python] Heuristic in dataframe_to_arrays that decides to multithread convert cause slow conversions

2020-05-22 Thread Kevin Glasson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Glasson updated ARROW-8888:
-
Description: 
When calling pa.Table.from_pandas(), the code path that uses the
ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes
the conversion much slower.

 
 I have a simple example - but the time difference is much worse with a real 
table.

 
{code:java}
Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
 Type 'copyright', 'credits' or 'license' for more information
 IPython 7.13.0 – An enhanced Interactive Python. Type '?' for help.
In [1]: import pyarrow as pa
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] * 1000})
In [4]: %timeit table = pa.Table.from_pandas(df)
 577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
 106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
{code}
 

  was:
When calling pa.Table.from_pandas(), the code path that uses the
ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes
the conversion much slower.

 
 I have a simple example - but the time difference is much worse with a real 
table.

 

 
{code:java}
Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
 Type 'copyright', 'credits' or 'license' for more information
 IPython 7.13.0 – An enhanced Interactive Python. Type '?' for help.
In [1]: import pyarrow as pa
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] * 1000})
In [4]: %timeit table = pa.Table.from_pandas(df)
 577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
 106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
{code}
 

Summary: [Python] Heuristic in dataframe_to_arrays that decides to 
multithread convert cause slow conversions  (was: Heuristic in 
dataframe_to_arrays that decides to multithread convert cause slow conversions)

> [Python] Heuristic in dataframe_to_arrays that decides to multithread convert 
> cause slow conversions
> 
>
> Key: ARROW-8888
> URL: https://issues.apache.org/jira/browse/ARROW-8888
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
> Environment: MacOS: 10.15.4 (Also happening on windows 10)
> Python: 3.7.3
> Pyarrow: 0.16.0
> Pandas: 0.25.3
>Reporter: Kevin Glasson
>Priority: Minor
>
> When calling pa.Table.from_pandas(), the code path that uses the 
> ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes 
> the conversion much slower.
>  
>  I have a simple example - but the time difference is much worse with a real 
> table.
>  
> {code:java}
> Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
>  Type 'copyright', 'credits' or 'license' for more information
>  IPython 7.13.0 – An enhanced Interactive Python. Type '?' for help.
> In [1]: import pyarrow as pa
> In [2]: import pandas as pd
> In [3]: df = pd.DataFrame({"A": [0] * 1000})
> In [4]: %timeit table = pa.Table.from_pandas(df)
>  577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
>  106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8888) Heuristic in dataframe_to_arrays that decides to multithread convert cause slow conversions

2020-05-22 Thread Kevin Glasson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Glasson updated ARROW-8888:
-
Description: 
When calling pa.Table.from_pandas(), the code path that uses the
ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes
the conversion much slower.

 
 I have a simple example - but the time difference is much worse with a real 
table.

 

 
{code:java}
Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
 Type 'copyright', 'credits' or 'license' for more information
 IPython 7.13.0 – An enhanced Interactive Python. Type '?' for help.
In [1]: import pyarrow as pa
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] * 1000})
In [4]: %timeit table = pa.Table.from_pandas(df)
 577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
 106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
{code}
 

  was:
When calling pa.Table.from_pandas(), the code path that uses the
ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes
the conversion much slower.

 
I have a simple example - but the time difference is much worse with a real 
table.


Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"A": [0] * 1000})

In [4]: %timeit table = pa.Table.from_pandas(df)
577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


> Heuristic in dataframe_to_arrays that decides to multithread convert cause 
> slow conversions
> ---
>
> Key: ARROW-8888
> URL: https://issues.apache.org/jira/browse/ARROW-8888
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
> Environment: MacOS: 10.15.4 (Also happening on windows 10)
> Python: 3.7.3
> Pyarrow: 0.16.0
> Pandas: 0.25.3
>Reporter: Kevin Glasson
>Priority: Minor
>
> When calling pa.Table.from_pandas(), the code path that uses the 
> ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes 
> the conversion much slower.
>  
>  I have a simple example - but the time difference is much worse with a real 
> table.
>  
>  
> {code:java}
> Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
>  Type 'copyright', 'credits' or 'license' for more information
>  IPython 7.13.0 – An enhanced Interactive Python. Type '?' for help.
> In [1]: import pyarrow as pa
> In [2]: import pandas as pd
> In [3]: df = pd.DataFrame({"A": [0] * 1000})
> In [4]: %timeit table = pa.Table.from_pandas(df)
>  577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
>  106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8888) Heuristic in dataframe_to_arrays that decides to multithread convert cause slow conversions

2020-05-22 Thread Kevin Glasson (Jira)
Kevin Glasson created ARROW-8888:


 Summary: Heuristic in dataframe_to_arrays that decides to 
multithread convert cause slow conversions
 Key: ARROW-8888
 URL: https://issues.apache.org/jira/browse/ARROW-8888
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
 Environment: MacOS: 10.15.4 (Also happening on windows 10)
Python: 3.7.3
Pyarrow: 0.16.0
Pandas: 0.25.3
Reporter: Kevin Glasson


When calling pa.Table.from_pandas(), the code path that uses the
ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes
the conversion much slower.

 
I have a simple example - but the time difference is much worse with a real 
table.


Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"A": [0] * 1000})

In [4]: %timeit table = pa.Table.from_pandas(df)
577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
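The timings above suggest the multithreaded path only pays off when there is enough work per column; for tiny frames the thread-pool dispatch overhead dominates. One way to fix such a heuristic is to gate the pool behind a data-size cutoff. The sketch below shows the idea; the function names and the `MIN_CELLS_PER_THREAD` threshold are illustrative assumptions, not pyarrow's actual dataframe_to_arrays logic.

```python
# Hypothetical size-based heuristic for deciding when per-column conversion
# is worth dispatching to a thread pool.
from concurrent.futures import ThreadPoolExecutor

MIN_CELLS_PER_THREAD = 100_000  # assumed cutoff; would be tuned empirically

def choose_nthreads(nrows, ncols, max_threads=4):
    """Use one thread unless the frame is large enough to amortize
    the pool's dispatch overhead."""
    if ncols < 2:
        return 1  # nothing to parallelize across columns
    if nrows * ncols < MIN_CELLS_PER_THREAD:
        return 1  # small frame: pool overhead would dominate
    return min(ncols, max_threads)

def convert_columns(columns, convert, max_threads=4):
    """Convert each column, threading only when the heuristic says so."""
    nrows = len(columns[0]) if columns else 0
    nthreads = choose_nthreads(nrows, len(columns), max_threads)
    if nthreads == 1:
        return [convert(c) for c in columns]
    with ThreadPoolExecutor(nthreads) as pool:
        return list(pool.map(convert, columns))

# A 1000-row single-column frame stays on one thread, avoiding the
# slowdown reported above; a large wide frame still fans out.
print(choose_nthreads(1000, 1))       # 1
print(choose_nthreads(1_000_000, 8))  # 4
```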


[jira] [Updated] (ARROW-8402) [Java] Support ValidateFull methods in Java

2020-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8402:
--
Labels: pull-request-available  (was: )

> [Java] Support ValidateFull methods in Java
> ---
>
> Key: ARROW-8402
> URL: https://issues.apache.org/jira/browse/ARROW-8402
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We need to support ValidateFull methods in Java, just like we do in C++. 
> This is required by ARROW-5926.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
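For reference, what distinguishes ValidateFull from cheap metadata validation in C++ is that it makes O(n) passes over the buffers themselves, for example checking that a list vector's offsets are nondecreasing and stay within the values buffer. A small Python sketch of one such check follows; it is a simplified illustration of the kind of pass being proposed, not the Java implementation.

```python
# Illustrative sketch of an O(n) ValidateFull-style check on a list
# vector's offsets buffer (simplified; not the actual Arrow code).

def validate_full_list_offsets(offsets, values_length):
    """Every offset must be nonnegative, nondecreasing, and point
    inside the values buffer."""
    if not offsets:
        raise ValueError("offsets buffer must be non-empty")
    if offsets[0] < 0:
        raise ValueError("first offset is negative")
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            raise ValueError(f"offset {i} decreases")
        if offsets[i] > values_length:
            raise ValueError(f"offset {i} exceeds values length")

validate_full_list_offsets([0, 2, 5], values_length=5)  # valid: no error
try:
    validate_full_list_offsets([0, 3, 2], values_length=5)
except ValueError as e:
    print("invalid:", e)
```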


[jira] [Resolved] (ARROW-8887) [Java] Buffer size for complex vectors increases rapidly in case of clear/write loop

2020-05-22 Thread Pindikura Ravindra (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-8887.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7247
[https://github.com/apache/arrow/pull/7247]

> [Java] Buffer size for complex vectors increases rapidly in case of 
> clear/write loop
> 
>
> Key: ARROW-8887
> URL: https://issues.apache.org/jira/browse/ARROW-8887
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Projjal Chanda
>Assignee: Projjal Chanda
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Similar to https://issues.apache.org/jira/browse/ARROW-5232



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
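Bugs of this shape (compare ARROW-5232) typically come from a vector releasing and re-deriving its allocation on every clear/write cycle instead of reusing the capacity it has already reached. The sketch below shows the reuse pattern in miniature, with illustrative names rather than the actual Java memory internals: retaining capacity across clear() lets repeated cycles settle at a steady-state size instead of reallocating each round.

```python
# Minimal sketch of capacity reuse across clear/write cycles
# (illustrative names; not the Java vector/buffer implementation).

class GrowableBuffer:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.length = 0

    def write(self, n):
        # Double capacity until n more elements fit.
        while self.length + n > self.capacity:
            self.capacity *= 2
        self.length += n

    def clear(self, keep_capacity=True):
        self.length = 0
        if not keep_capacity:
            # Releasing and re-growing every cycle causes the allocation
            # churn that a faulty growth rule can turn into runaway sizes.
            self.capacity = 8

buf = GrowableBuffer()
for _ in range(100):
    buf.write(100)
    buf.clear()
print(buf.capacity)  # 128: reaches a steady state instead of growing forever
```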