[jira] [Updated] (ARROW-1696) [C++] Add codec benchmarks

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1696:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Add codec benchmarks
> --
>
> Key: ARROW-1696
> URL: https://issues.apache.org/jira/browse/ARROW-1696
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> This will also help users validate in release builds that the compression 
> libraries have been built with the appropriate optimization levels, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2018-02-07 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-1950.
--
Resolution: Fixed

Issue resolved by pull request 1571
[https://github.com/apache/arrow/pull/1571]

> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356405#comment-16356405
 ] 

ASF GitHub Bot commented on ARROW-1950:
---

cpcloud commented on issue #1571: ARROW-1950: [Python] pandas_type in pandas 
metadata incorrect for List types
URL: https://github.com/apache/arrow/pull/1571#issuecomment-363983929
 
 
   Sweet.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356401#comment-16356401
 ] 

ASF GitHub Bot commented on ARROW-1950:
---

cpcloud closed pull request #1571: ARROW-1950: [Python] pandas_type in pandas 
metadata incorrect for List types
URL: https://github.com/apache/arrow/pull/1571
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/cpp/src/arrow/python/arrow_to_pandas.cc 
b/cpp/src/arrow/python/arrow_to_pandas.cc
index fcf05f833..a17d14bf6 100644
--- a/cpp/src/arrow/python/arrow_to_pandas.cc
+++ b/cpp/src/arrow/python/arrow_to_pandas.cc
@@ -56,8 +56,8 @@
 namespace arrow {
 namespace py {
 
-using internal::kPandasTimestampNull;
 using internal::kNanosecondsInDay;
+using internal::kPandasTimestampNull;
 
 using compute::Datum;
 
@@ -90,7 +90,6 @@ struct WrapBytes {
 
 static inline bool ListTypeSupported(const DataType& type) {
   switch (type.id()) {
-case Type::NA:
 case Type::UINT8:
 case Type::INT8:
 case Type::UINT16:
@@ -104,6 +103,7 @@ static inline bool ListTypeSupported(const DataType& type) {
 case Type::BINARY:
 case Type::STRING:
 case Type::TIMESTAMP:
+case Type::NA:  // empty list
   // The above types are all supported.
   return true;
 case Type::LIST: {
@@ -696,7 +696,6 @@ class ObjectBlock : public PandasBlock {
 } else if (type == Type::LIST) {
   auto list_type = std::static_pointer_cast(col->type());
   switch (list_type->value_type()->id()) {
-CONVERTLISTSLIKE_CASE(FloatType, NA)
 CONVERTLISTSLIKE_CASE(UInt8Type, UINT8)
 CONVERTLISTSLIKE_CASE(Int8Type, INT8)
 CONVERTLISTSLIKE_CASE(UInt16Type, UINT16)
@@ -711,6 +710,7 @@ class ObjectBlock : public PandasBlock {
 CONVERTLISTSLIKE_CASE(BinaryType, BINARY)
 CONVERTLISTSLIKE_CASE(StringType, STRING)
 CONVERTLISTSLIKE_CASE(ListType, LIST)
+CONVERTLISTSLIKE_CASE(NullType, NA)
 default: {
   std::stringstream ss;
   ss << "Not implemented type for conversion from List to Pandas 
ObjectBlock: "
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index 987bb7555..f5e56a9b2 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -45,7 +45,7 @@ def get_logical_type_map():
 
 if not _logical_type_map:
 _logical_type_map.update({
-pa.lib.Type_NA: 'float64',  # NaNs
+pa.lib.Type_NA: 'empty',
 pa.lib.Type_BOOL: 'bool',
 pa.lib.Type_INT8: 'int8',
 pa.lib.Type_INT16: 'int16',
diff --git a/python/pyarrow/tests/test_array.py 
b/python/pyarrow/tests/test_array.py
index 1d5d30071..efbcef5e1 100644
--- a/python/pyarrow/tests/test_array.py
+++ b/python/pyarrow/tests/test_array.py
@@ -455,7 +455,7 @@ def test_simple_type_construction():
 @pytest.mark.parametrize(
 ('type', 'expected'),
 [
-(pa.null(), 'float64'),
+(pa.null(), 'empty'),
 (pa.bool_(), 'bool'),
 (pa.int8(), 'int8'),
 (pa.int16(), 'int16'),
diff --git a/python/pyarrow/tests/test_convert_pandas.py 
b/python/pyarrow/tests/test_convert_pandas.py
index 4f0a68729..7dbf0d7ed 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -1404,6 +1404,57 @@ def test_empty_list_roundtrip(self):
 
 tm.assert_frame_equal(result, df)
 
+def test_empty_list_metadata(self):
+# Create table with array of empty lists, forced to have type
+# list(string) in pyarrow
+c1 = [["test"], ["a", "b"], None]
+c2 = [[], [], []]
+arrays = OrderedDict([
+('c1', pa.array(c1, type=pa.list_(pa.string()))),
+('c2', pa.array(c2, type=pa.list_(pa.string()))),
+])
+rb = pa.RecordBatch.from_arrays(
+list(arrays.values()),
+list(arrays.keys())
+)
+tbl = pa.Table.from_batches([rb])
+
+# First roundtrip changes schema, because pandas cannot preserve the
+# type of empty lists
+df = tbl.to_pandas()
+tbl2 = pa.Table.from_pandas(df, preserve_index=True)
+md2 = json.loads(tbl2.schema.metadata[b'pandas'].decode('utf8'))
+
+# Second roundtrip
+df2 = tbl2.to_pandas()
+expected = pd.DataFrame(OrderedDict([('c1', c1), ('c2', c2)]))
+
+tm.assert_frame_equal(df2, expected)
+
+assert md2['columns'] == [
+{
+'name': 'c1',
+'field_name': 'c1',
+'metadata': None,
+'numpy_type': 'object',
+'pandas_type': 'list[unicode]',
+},
+ 

[jira] [Updated] (ARROW-590) Add integration tests for Union types

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-590:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> Add integration tests for Union types
> -
>
> Key: ARROW-590
> URL: https://issues.apache.org/jira/browse/ARROW-590
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java - Vectors
>Reporter: Wes McKinney
>Assignee: Li Jin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-772) [C++] Implement take kernel functions

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-772:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Implement take kernel functions
> -
>
> Key: ARROW-772
> URL: https://issues.apache.org/jira/browse/ARROW-772
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
>  Labels: Analytics
> Fix For: 0.10.0
>
>
> Among other things, this can be used to convert from DictionaryArray back to 
> dense array. This is equivalent to {{ndarray.take}} in NumPy
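
For context, a small sketch of the take semantics referenced above, using NumPy's ndarray.take; the dictionary/indices names are illustrative, not pyarrow API:

```python
import numpy as np

# Dictionary-encoded data: a small dictionary plus integer indices.
dictionary = np.array(["apple", "banana", "cherry"])
indices = np.array([2, 0, 0, 1])

# take() gathers dictionary values by index, producing the dense array --
# the same gather operation a take kernel would use to decode a
# DictionaryArray back to a dense representation.
dense = dictionary.take(indices)
print(list(dense))  # ['cherry', 'apple', 'apple', 'banana']
```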



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-352) [Format] Interval(DAY_TIME) has no unit

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-352:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Format] Interval(DAY_TIME) has no unit
> ---
>
> Key: ARROW-352
> URL: https://issues.apache.org/jira/browse/ARROW-352
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Reporter: Julien Le Dem
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Interval(DATE_TIME) assumes milliseconds.
> we should have a time unit like timestamp.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-352) [Format] Interval(DAY_TIME) has no unit

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356296#comment-16356296
 ] 

Wes McKinney commented on ARROW-352:


It doesn't look like we will resolve this for 0.9.0

> [Format] Interval(DAY_TIME) has no unit
> ---
>
> Key: ARROW-352
> URL: https://issues.apache.org/jira/browse/ARROW-352
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Reporter: Julien Le Dem
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Interval(DATE_TIME) assumes milliseconds.
> we should have a time unit like timestamp.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2083) Support skipping builds

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356286#comment-16356286
 ] 

ASF GitHub Bot commented on ARROW-2083:
---

wesm commented on a change in pull request #1568: ARROW-2083: [CI] Detect 
changed components on Travis-CI
URL: https://github.com/apache/arrow/pull/1568#discussion_r166802741
 
 

 ##
 File path: ci/travis_script_site.sh
 ##
 @@ -0,0 +1,29 @@
+#!/usr/bin/env bash
 
 Review comment:
   This should be `travis_script_javadoc.sh` I guess


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support skipping builds
> ---
>
> Key: ARROW-2083
> URL: https://issues.apache.org/jira/browse/ARROW-2083
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> While AppVeyor supports a [skip appveyor] tag, you cannot skip only Travis. 
> What is the feeling about adding e.g. 
> [https://github.com/travis-ci/travis-ci/issues/5032#issuecomment-273626567] 
> to our build? We could also do some simple kind of change detection so that 
> we don't build the C++/Python parts, and only build Java and the integration 
> tests, if the only changes in the PR affect Java.
> I think it might be worthwhile to spend a bit of time on that to take some 
> load off the CI infrastructure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1949) [Python] Add option to Array.from_pandas and pyarrow.array to perform unsafe casts

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1949:

Summary: [Python] Add option to Array.from_pandas and pyarrow.array to 
perform unsafe casts  (was: [C++] Add option to Array.from_pandas and 
pyarrow.array to perform unsafe casts)

> [Python] Add option to Array.from_pandas and pyarrow.array to perform unsafe 
> casts
> --
>
> Key: ARROW-1949
> URL: https://issues.apache.org/jira/browse/ARROW-1949
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Per mailing list thread



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1956:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Support reading specific partitions from a partitioned parquet 
> dataset
> ---
>
> Key: ARROW-1956
> URL: https://issues.apache.org/jira/browse/ARROW-1956
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Affects Versions: 0.8.0
> Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: parquet
> Fix For: 0.10.0
>
> Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset.  This 
> is very useful in case of large datasets.  I have attached a small script 
> that creates a dataset and shows what is expected when reading (quoting 
> salient points below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve the goal by providing a list of 
> files/directories to ParquetDataset, but it didn't work: 
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.options('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
> I also couldn't find a way to easily write partitioned parquet files.  In the 
> end I did it by hand by creating the directory hierarchies, and writing the 
> individual files myself (similar to the implementation in the attached 
> script).  Again, in PySpark I can do 
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.
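
To illustrate what selecting specific partitions involves, here is a stdlib-only sketch that walks a Hive-style key=value directory layout and keeps only the files for the requested partitions. The helper name and layout are illustrative assumptions, not pyarrow API:

```python
import os

def partition_files(root, **wanted):
    """Collect .parquet files under a Hive-style partitioned tree
    (e.g. root/year=2017/month=1/part-0.parquet), keeping only paths
    whose key=value directory components match `wanted`.
    Illustrative helper, not part of pyarrow."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Parse the key=value components on the relative path.
        parts = dict(p.split("=", 1) for p in rel.split(os.sep) if "=" in p)
        if all(str(parts.get(k)) == str(v) for k, v in wanted.items()):
            matches.extend(
                os.path.join(dirpath, f)
                for f in filenames
                if f.endswith(".parquet")
            )
    return sorted(matches)
```

A reader like ParquetDataset could then be pointed at just the matching files rather than the whole dataset root.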



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1638) [Java] IPC roundtrip for null type

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1638:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Java] IPC roundtrip for null type
> --
>
> Key: ARROW-1638
> URL: https://issues.apache.org/jira/browse/ARROW-1638
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Siddharth Teotia
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1632:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Permit categorical conversions in Table.to_pandas on a per-column 
> basis
> 
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently this is all or nothing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1636) Integration tests for null type

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356268#comment-16356268
 ] 

Wes McKinney commented on ARROW-1636:
-

If we can get to this for 0.9.0, that's great too

> Integration tests for null type
> ---
>
> Key: ARROW-1636
> URL: https://issues.apache.org/jira/browse/ARROW-1636
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java - Vectors
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> This was not implemented on the C++ side, and came up in ARROW-1584. 
> Realistically arrays may be of null type, and we should be able to message 
> these correctly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1636) Integration tests for null type

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1636:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Integration tests for null type
> ---
>
> Key: ARROW-1636
> URL: https://issues.apache.org/jira/browse/ARROW-1636
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Java - Vectors
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> This was not implemented on the C++ side, and came up in ARROW-1584. 
> Realistically arrays may be of null type, and we should be able to message 
> these correctly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1837) [Java] Unable to read unsigned integers outside signed range for bit width in integration tests

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356266#comment-16356266
 ] 

Wes McKinney commented on ARROW-1837:
-

Moving this to 0.10.0. Maybe we can deal with unsigned integers in Java in the 
next release cycle

> [Java] Unable to read unsigned integers outside signed range for bit width in 
> integration tests
> ---
>
> Key: ARROW-1837
> URL: https://issues.apache.org/jira/browse/ARROW-1837
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 0.10.0
>
> Attachments: generated_primitive.json
>
>
> I believe this was introduced recently (perhaps in the refactors), but there 
> was a problem where the integration tests weren't being properly run that hid 
> the error from us
> see https://github.com/apache/arrow/pull/1294#issuecomment-345553066



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-640:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Arrow scalar values should have a sensible __hash__ and comparison
> ---
>
> Key: ARROW-640
> URL: https://issues.apache.org/jira/browse/ARROW-640
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Miki Tebeka
>Priority: Major
> Fix For: 0.10.0
>
>
> {noformat}
> In [86]: arr = pa.from_pylist([1, 1, 1, 2])
> In [87]: set(arr)
> Out[87]: {1, 2, 1, 1}
> In [88]: arr[0] == arr[1]
> Out[88]: False
> In [89]: arr
> Out[89]: 
> 
> [
>   1,
>   1,
>   1,
>   2
> ]
> {noformat}
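
The fix amounts to implementing Python's usual __eq__/__hash__ contract on the scalar wrappers, so that equal values hash equally and sets deduplicate. A minimal sketch of that contract (this class is illustrative, not pyarrow's actual implementation):

```python
class Int64Value:
    """Illustrative scalar wrapper with a sensible __hash__ and __eq__,
    so sets and comparisons behave like the wrapped Python value.
    Not pyarrow's real class."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Compare against both wrapped scalars and plain Python values.
        if isinstance(other, Int64Value):
            other = other.value
        return self.value == other

    def __hash__(self):
        # Delegate to the wrapped value so Int64Value(1) hashes like 1,
        # keeping the hash/eq contract consistent.
        return hash(self.value)

arr = [Int64Value(1), Int64Value(1), Int64Value(1), Int64Value(2)]
print(len(set(arr)))     # 2 -- duplicates collapse
print(arr[0] == arr[1])  # True
```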



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-448) [Python] Load HdfsClient default options from core-site.xml or hdfs-site.xml, if available

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-448:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Load HdfsClient default options from core-site.xml or hdfs-site.xml, 
> if available
> --
>
> Key: ARROW-448
> URL: https://issues.apache.org/jira/browse/ARROW-448
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> This will yield a nicer user experience for some users



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2083) Support skipping builds

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356267#comment-16356267
 ] 

ASF GitHub Bot commented on ARROW-2083:
---

kou commented on issue #1568: ARROW-2083: [CI] Detect changed components on 
Travis-CI
URL: https://github.com/apache/arrow/pull/1568#issuecomment-363956649
 
 
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support skipping builds
> ---
>
> Key: ARROW-2083
> URL: https://issues.apache.org/jira/browse/ARROW-2083
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> While AppVeyor supports a [skip appveyor] tag, you cannot skip only Travis. 
> What is the feeling about adding e.g. 
> [https://github.com/travis-ci/travis-ci/issues/5032#issuecomment-273626567] 
> to our build? We could also do some simple kind of change detection so that 
> we don't build the C++/Python parts, and only build Java and the integration 
> tests, if the only changes in the PR affect Java.
> I think it might be worthwhile to spend a bit of time on that to take some 
> load off the CI infrastructure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-473:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++/Python] Add public API for retrieving block locations for a particular 
> HDFS file
> -
>
> Key: ARROW-473
> URL: https://issues.apache.org/jira/browse/ARROW-473
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> This is necessary for applications looking to schedule data-local work. 
> libhdfs does not have APIs to request the block locations directly, so we 
> need to see if the {{hdfsGetHosts}} function will do what we need. For 
> libhdfs3 there is a public API function 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-889) [Python] Add nicer __repr__ for Column

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-889:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Add nicer __repr__ for Column
> --
>
> Key: ARROW-889
> URL: https://issues.apache.org/jira/browse/ARROW-889
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1035) [Python] Add ASV benchmarks for streaming columnar deserialization

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356261#comment-16356261
 ] 

Wes McKinney commented on ARROW-1035:
-

It would be good to do this for 0.9.0 to make sure we haven't had any major 
perf regressions over the last several major releases

> [Python] Add ASV benchmarks for streaming columnar deserialization
> --
>
> Key: ARROW-1035
> URL: https://issues.apache.org/jira/browse/ARROW-1035
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> We need to carefully monitor the performance of critical operations like 
> streaming format to pandas wall clock time a la 
> http://wesmckinney.com/blog/arrow-streaming-columnar/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-842) [Python] Handle more kinds of null sentinel objects from pandas 0.x

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-842:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Handle more kinds of null sentinel objects from pandas 0.x
> ---
>
> Key: ARROW-842
> URL: https://issues.apache.org/jira/browse/ARROW-842
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Follow-on work to ARROW-707. See 
> https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L193 
> and discussion in https://github.com/apache/arrow/pull/554



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1837) [Java] Unable to read unsigned integers outside signed range for bit width in integration tests

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1837:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Java] Unable to read unsigned integers outside signed range for bit width in 
> integration tests
> ---
>
> Key: ARROW-1837
> URL: https://issues.apache.org/jira/browse/ARROW-1837
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 0.10.0
>
> Attachments: generated_primitive.json
>
>
> I believe this was introduced recently (perhaps in the refactors), but there 
> was a problem where the integration tests weren't being properly run that hid 
> the error from us
> see https://github.com/apache/arrow/pull/1294#issuecomment-345553066



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1266) [Plasma] Move heap allocations to arrow memory pool

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1266:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Plasma] Move heap allocations to arrow memory pool
> ---
>
> Key: ARROW-1266
> URL: https://issues.apache.org/jira/browse/ARROW-1266
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.10.0
>
>
> At the moment we are allocating memory with std::vector and even new in some 
> places; this should be cleaned up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1786) [Format] List expected on-wire buffer layouts for each kind of Arrow physical type in specification

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1786:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Format] List expected on-wire buffer layouts for each kind of Arrow physical 
> type in specification
> ---
>
> Key: ARROW-1786
> URL: https://issues.apache.org/jira/browse/ARROW-1786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> see ARROW-1693, ARROW-1785



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1176) C++: Replace WrappedBinary with Tensorflow's StringPiece

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1176:

Fix Version/s: (was: 0.9.0)
   0.10.0

> C++: Replace WrappedBinary with Tensorflow's StringPiece
> 
>
> Key: ARROW-1176
> URL: https://issues.apache.org/jira/browse/ARROW-1176
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> Instead of using the very simple {{WrappedBinary}} class, we may want to use 
> Tensorflow's {{StringPiece}} to handle binary cells: 
> https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/lib/core/stringpiece.h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1896) [C++] Do not allocate memory for primitive outputs in CastKernel::Call implementation

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1896:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Do not allocate memory for primitive outputs in CastKernel::Call 
> implementation
> -
>
> Key: ARROW-1896
> URL: https://issues.apache.org/jira/browse/ARROW-1896
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> This is some refactoring / tidying. Unless an output of cast has a 
> non-determinate size (e.g. is Binary or something else), the 
> {{CastKernel::Call}} implementation should assume that it is writing into 
> pre-allocated memory. The corresponding memory allocation can be lifted into 
> the {{arrow::compute::Cast}} API



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1928) [C++] Add benchmarks comparing performance of internal::BitmapReader/Writer with naive approaches

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1928:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Add benchmarks comparing performance of internal::BitmapReader/Writer 
> with naive approaches
> -
>
> Key: ARROW-1928
> URL: https://issues.apache.org/jira/browse/ARROW-1928
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> The performance may also vary across platforms/compilers. This would be 
> helpful to know how much they help



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1899) [Python] Refactor handling of null sentinels in python/numpy_to_arrow.cc

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1899:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Refactor handling of null sentinels in python/numpy_to_arrow.cc
> 
>
> Key: ARROW-1899
> URL: https://issues.apache.org/jira/browse/ARROW-1899
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> See comments in 
> https://github.com/apache/arrow/commit/ad30138a0ec9be3dfb179d1e9425a4502d556085
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1789) [Format] Consolidate specification documents and improve clarity for new implementation authors

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1789:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Format] Consolidate specification documents and improve clarity for new 
> implementation authors
> ---
>
> Key: ARROW-1789
> URL: https://issues.apache.org/jira/browse/ARROW-1789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1296
> I believe the specification documents Layout.md, Metadata.md, and IPC.md 
> would benefit from being consolidated into a single Markdown document that 
> would be sufficient (along with the Flatbuffers schemas) to create a complete 
> Arrow implementation capable of reading and writing the binary format



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1988) [Python] Extend flavor=spark in Parquet writing to handle INT types

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1988:

Issue Type: Bug  (was: New Feature)

> [Python] Extend flavor=spark in Parquet writing to handle INT types
> ---
>
> Key: ARROW-1988
> URL: https://issues.apache.org/jira/browse/ARROW-1988
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.9.0
>
>
> See the relevant code sections at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L139
> We should cater for them in the {{pyarrow}} code and also reach out to the 
> Spark developers so that they are supported there in the long term.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2063) [C++] Implement variant of FixedSizeBufferWriter that also supports reading (like MemoryMappedFile)

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2063:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Implement variant of FixedSizeBufferWriter that also supports reading 
> (like MemoryMappedFile)
> ---
>
> Key: ARROW-2063
> URL: https://issues.apache.org/jira/browse/ARROW-2063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> This would be helpful for testing, among other things



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1975) [C++] Add abi-compliance-checker to build process

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356258#comment-16356258
 ] 

Wes McKinney commented on ARROW-1975:
-

[~xhochy] do you want to do this for 0.9.0?

> [C++] Add abi-compliance-checker to build process
> -
>
> Key: ARROW-1975
> URL: https://issues.apache.org/jira/browse/ARROW-1975
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.9.0
>
>
> I would like to check our baseline modules with 
> https://lvc.github.io/abi-compliance-checker/ to ensure that version upgrades 
> are much smoother and that we don't break the ABI in patch releases. 
> As we're still pre-1.0, I accept that there will be breakage, but I would like 
> to keep it to a minimum. Currently the biggest pain with Arrow is that you 
> always need to pin it in Python with {{==0.x.y}}, otherwise segfaults are 
> inevitable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1993) [Python] Add function for determining implied Arrow schema from pandas.DataFrame

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1993:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Add function for determining implied Arrow schema from 
> pandas.DataFrame
> 
>
> Key: ARROW-1993
> URL: https://issues.apache.org/jira/browse/ARROW-1993
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently the only option is to use {{Table/Array.from_pandas}} which does 
> significant unnecessary work and allocates memory. If only the schema is of 
> interest, then we could do less work and not allocate memory
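The idea can be sketched in pure Python: map each column's dtype to an Arrow type name without materializing any Arrow arrays. The dtype table and helper below are illustrative assumptions, not the eventual pyarrow API:

```python
# Hypothetical sketch: infer an Arrow-like schema from column dtype names
# without allocating any Arrow memory. The mapping is a small illustrative
# subset; object columns are naively assumed to hold strings.
NUMPY_TO_ARROW = {
    "int64": "int64",
    "float64": "double",
    "bool": "bool",
    "object": "string",
}

def infer_schema(column_dtypes):
    """column_dtypes: dict of column name -> numpy dtype string."""
    return [(name, NUMPY_TO_ARROW.get(dtype, "binary"))
            for name, dtype in column_dtypes.items()]

print(infer_schema({"a": "int64", "b": "object"}))
# [('a', 'int64'), ('b', 'string')]
```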



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1987) [Website] Enable Docker-based documentation generator to build at a specific Arrow commit

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1987:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Website] Enable Docker-based documentation generator to build at a specific 
> Arrow commit
> -
>
> Key: ARROW-1987
> URL: https://issues.apache.org/jira/browse/ARROW-1987
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently both the Docker setup and the Arrow repo have to be at the same 
> commit. It would be useful to create a checkout in the Docker image and 
> enable the build version to be passed in



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1870) [JS] Enable build scripts to work with NodeJS 6.10.2 LTS

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356257#comment-16356257
 ] 

Wes McKinney commented on ARROW-1870:
-

[~paul.e.taylor] is this important?

> [JS] Enable build scripts to work with NodeJS 6.10.2 LTS
> 
>
> Key: ARROW-1870
> URL: https://issues.apache.org/jira/browse/ARROW-1870
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1833) [Java] Add accessor methods for data buffers that skip null checking

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356256#comment-16356256
 ] 

Wes McKinney commented on ARROW-1833:
-

Does someone want to take a shot at this? 

> [Java] Add accessor methods for data buffers that skip null checking
> 
>
> Key: ARROW-1833
> URL: https://issues.apache.org/jira/browse/ARROW-1833
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1923) [C++] Make easier to use const ChunkedArray& with Datum in computation context

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1923:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Make easier to use const ChunkedArray& with Datum in computation context
> --
>
> Key: ARROW-1923
> URL: https://issues.apache.org/jira/browse/ARROW-1923
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently this only accepts a {{std::shared_ptr<ChunkedArray>}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1715) [Python] Implement pickling for Array, Column, ChunkedArray, RecordBatch, Table

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1715:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Implement pickling for Array, Column, ChunkedArray, RecordBatch, 
> Table
> ---
>
> Key: ARROW-1715
> URL: https://issues.apache.org/jira/browse/ARROW-1715
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1796) [Python] RowGroup filtering on file level

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1796:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] RowGroup filtering on file level
> -
>
> Key: ARROW-1796
> URL: https://issues.apache.org/jira/browse/ARROW-1796
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> We can build upon the API defined in {{fastparquet}} for defining RowGroup 
> filters: 
> https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 
> and translate them into the C++ enums we will define in 
> https://issues.apache.org/jira/browse/PARQUET-1158 . This should enable us to 
> provide the user with a simple predicate pushdown API that we can extend in 
> the background from RowGroup to Page level later on.
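A minimal sketch of the kind of filter evaluation being proposed, using fastparquet-style {{(column, op, value)}} triples against row-group min/max statistics. The statistics layout and helper are assumptions for illustration, not pyarrow's actual API:

```python
# Sketch of fastparquet-style row-group predicate pushdown: keep a row
# group only when its column statistics cannot rule the predicate out.
def row_group_may_match(stats, filters):
    """stats: {column: (min, max)}; filters: [(column, op, value)]."""
    for col, op, value in filters:
        lo, hi = stats[col]
        if op == "==" and not (lo <= value <= hi):
            return False
        if op == "<" and not lo < value:
            return False
        if op == ">" and not hi > value:
            return False
    return True

# A row group with x in [10, 20] cannot contain x == 5, so it is skipped
print(row_group_may_match({"x": (10, 20)}, [("x", "==", 5)]))  # False
```

The same interface could later be pushed down from RowGroup to Page level without changing the user-facing predicate syntax.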



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1722) [C++] Add linting script to look for C++/CLI issues

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1722:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Add linting script to look for C++/CLI issues
> ---
>
> Key: ARROW-1722
> URL: https://issues.apache.org/jira/browse/ARROW-1722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> This includes:
> * Using {{nullptr}} in header files (we must instead use an appropriate macro 
> to use {{__nullptr}} when the host compiler is C++/CLI)
> * Including {{<mutex>}} in a public header (e.g. header files without "impl" 
> or "internal" in their name)
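A rough sketch of such a lint pass in Python. The "public header" heuristic and the specific banned include (`<mutex>` shown here) are assumptions taken from the description, not an existing Arrow script:

```python
import re

# Illustrative sketch of the proposed lint pass for C++/CLI-unsafe
# constructs in headers. Rules mirror the issue description.
def lint_header(path, text):
    issues = []
    # Headers without "impl"/"internal" in the path are treated as public
    is_public = "impl" not in path and "internal" not in path
    for lineno, line in enumerate(text.splitlines(), 1):
        if re.search(r"\bnullptr\b", line):
            issues.append((path, lineno, "use the NULLPTR macro, not nullptr"))
        if is_public and re.search(r"#include\s*<mutex>", line):
            issues.append((path, lineno, "<mutex> included in public header"))
    return issues

sample = "int* p = nullptr;\n#include <mutex>\n"
print(lint_header("arrow/api.h", sample))
```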



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1509) [Python] Write serialized object as a stream of encapsulated IPC messages

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1509:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Write serialized object as a stream of encapsulated IPC messages
> -
>
> Key: ARROW-1509
> URL: https://issues.apache.org/jira/browse/ARROW-1509
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> The structure of the stream in {{arrow::py::WriteSerializedObject}} is 
> generated on an ad hoc basis -- the components of the stream would be easier 
> to manipulate if this were internally a generic stream of IPC messages. For 
> example, one would be able to examine only the union that represents the 
> structure of the serialized payload and leave the tensors undisturbed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1692) [Python, Java] UnionArray round trip not working

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1692:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python, Java] UnionArray round trip not working
> 
>
> Key: ARROW-1692
> URL: https://issues.apache.org/jira/browse/ARROW-1692
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.10.0
>
> Attachments: union_array.arrow
>
>
> I'm currently working on making pyarrow.serialization data available from the 
> Java side. One problem I ran into is that the Java implementation apparently 
> cannot read UnionArrays generated from C++. To make this easily reproducible I 
> created a clean Python implementation for creating UnionArrays: 
> https://github.com/apache/arrow/pull/1216
> The data is generated with the following script:
> {code}
> import pyarrow as pa
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
> int64 = pa.array([1, 2, 3], type='int64')
> types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)
> batch = pa.RecordBatch.from_arrays([result], ["test"])
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> sink.close()
> b = sink.get_result()
> with open("union_array.arrow", "wb") as f:
> f.write(b)
> # Sanity check: Read the batch in again
> with open("union_array.arrow", "rb") as f:
> b = f.read()
> reader = pa.RecordBatchStreamReader(pa.BufferReader(b))
> batch = reader.read_next_batch()
> print("union array is", batch.column(0))
> {code}
> I attached the file generated by that script. Then when I run the following 
> code in Java:
> {code}
> RootAllocator allocator = new RootAllocator(10);
> ByteArrayInputStream in = new 
> ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));
> ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
> reader.loadNextBatch()
> {code}
> I get the following error:
> {code}
> |  java.lang.IllegalArgumentException thrown: Could not load buffers for 
> field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error 
> message: can not truncate buffer to a larger size 7: 0
> |at VectorLoader.loadBuffers (VectorLoader.java:83)
> |at VectorLoader.load (VectorLoader.java:62)
> |at ArrowReader$1.visit (ArrowReader.java:125)
> |at ArrowReader$1.visit (ArrowReader.java:111)
> |at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
> |at ArrowReader.loadNextBatch (ArrowReader.java:137)
> |at (#7:1)
> {code}
> It seems like Java is not picking up that the UnionArray is Dense instead of 
> Sparse. After changing the default in 
> java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, 
> I get this:
> {code}
> jshell> reader.getVectorSchemaRoot().getSchema()
> $9 ==> Schema [0])<: Int(64, true)>
> {code}
> but then reading doesn't work:
> {code}
> jshell> reader.loadNextBatch()
> |  java.lang.IllegalArgumentException thrown: Could not load buffers for 
> field list: Union(Dense, [1])<: Struct Int(64, true). error message: can not truncate buffer to a larger size 1: > 0
> |at VectorLoader.loadBuffers (VectorLoader.java:83)
> |at VectorLoader.load (VectorLoader.java:62)
> |at ArrowReader$1.visit (ArrowReader.java:125)
> |at ArrowReader$1.visit (ArrowReader.java:111)
> |at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
> |at ArrowReader.loadNextBatch (ArrowReader.java:137)
> |at (#8:1)
> {code}
> Any help with this is appreciated!
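For context on the dense/sparse distinction at the heart of this report, here is a pure-Python sketch of how the script's {{types}} and {{value_offsets}} select values from the child arrays in a dense union (in a sparse union there are no offsets and every child is stored at full logical length):

```python
# Dense-union decoding sketch: logical slot i takes the value of child
# array types[i] at position value_offsets[i]. Data mirrors the script above.
binary = [b'a', b'b', b'c', b'd']
int64 = [1, 2, 3]
children = [binary, int64]
types = [0, 1, 0, 0, 1, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]

decoded = [children[t][o] for t, o in zip(types, value_offsets)]
print(decoded)  # [b'a', 1, b'c', b'b', 2, 3, b'd']
```

A reader that assumes the sparse layout would expect both children to have length 7 and no offsets buffer, which matches the "can not truncate buffer to a larger size" failure mode described above.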



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1639) [Python] More efficient serialization for RangeIndex in serialize_pandas

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1639:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] More efficient serialization for RangeIndex in serialize_pandas
> 
>
> Key: ARROW-1639
> URL: https://issues.apache.org/jira/browse/ARROW-1639
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1572) [C++] Implement "value counts" kernels for tabulating value frequencies

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1572:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Implement "value counts" kernels for tabulating value frequencies
> ---
>
> Key: ARROW-1572
> URL: https://issues.apache.org/jira/browse/ARROW-1572
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 0.10.0
>
>
> This is related to "match", "isin", and "unique" since hashing is generally 
> required
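The shared-hashing point can be shown in a few lines of pure Python: once the hash table of value frequencies exists, "unique" and "isin" fall out of the same structure (an illustrative sketch, not the C++ kernel design):

```python
# Hash-based tabulation sketch: "value counts", "unique", and "isin"
# all reduce to building the same hash table over the values.
def value_counts(values):
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts

data = ["a", "b", "a", "c", "a"]
counts = value_counts(data)
unique = list(counts)                     # keys of the same table
isin = [v in counts for v in ["a", "z"]]  # membership against the same table
print(counts, unique, isin)
```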



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1501) [JS] JavaScript integration tests

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356255#comment-16356255
 ] 

Wes McKinney commented on ARROW-1501:
-

[~bhulette] [~paul.e.taylor] can this be closed?

> [JS] JavaScript integration tests
> -
>
> Key: ARROW-1501
> URL: https://issues.apache.org/jira/browse/ARROW-1501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Tracking JIRA for integration test-related issues



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1570) [C++] Define API for creating a kernel instance from function of scalar input and output with a particular signature

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1570:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Define API for creating a kernel instance from function of scalar input 
> and output with a particular signature
> 
>
> Key: ARROW-1570
> URL: https://issues.apache.org/jira/browse/ARROW-1570
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 0.10.0
>
>
> This could include an {{std::function}} instance (but these cannot be inlined 
> by the C++ compiler), but should also permit use with inline-able functions 
> or functors
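A minimal Python analogy of the proposed API shape: lifting a scalar function into an element-wise kernel. The names and null handling are illustrative only, not the C++ design under discussion:

```python
# Sketch: wrap a function of scalar input/output into an "array kernel"
# that applies it element-wise and propagates nulls (None).
def make_kernel(scalar_fn):
    def kernel(values):
        return [None if v is None else scalar_fn(v) for v in values]
    return kernel

double = make_kernel(lambda x: x * 2)
print(double([1, None, 3]))  # [2, None, 6]
```

In C++, the inlining concern mentioned above is why a template over functors (rather than only {{std::function}}) would be desirable.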



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1470) [C++] Add BufferAllocator abstract interface

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1470:
---

Assignee: (was: Wes McKinney)

> [C++] Add BufferAllocator abstract interface
> 
>
> Key: ARROW-1470
> URL: https://issues.apache.org/jira/browse/ARROW-1470
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> There are some situations ({{arrow::ipc::SerializeRecordBatch}}) where we pass 
> a {{MemoryPool*}} solely to call {{AllocateBuffer}} using it. This is not as 
> flexible as it could be, since there are situations where we may wish to 
> allocate from shared memory instead. 
> So instead:
> {code}
> Func(..., BufferAllocator* allocator, ...) {
>   ...
>   std::shared_ptr<Buffer> buffer;
>   RETURN_NOT_OK(allocator->Allocate(nbytes, &buffer));
>   ...
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1470) [C++] Add BufferAllocator abstract interface

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1470:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Add BufferAllocator abstract interface
> 
>
> Key: ARROW-1470
> URL: https://issues.apache.org/jira/browse/ARROW-1470
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> There are some situations ({{arrow::ipc::SerializeRecordBatch}}) where we pass 
> a {{MemoryPool*}} solely to call {{AllocateBuffer}} using it. This is not as 
> flexible as it could be, since there are situations where we may wish to 
> allocate from shared memory instead. 
> So instead:
> {code}
> Func(..., BufferAllocator* allocator, ...) {
>   ...
>   std::shared_ptr<Buffer> buffer;
>   RETURN_NOT_OK(allocator->Allocate(nbytes, &buffer));
>   ...
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1362) [Integration] Validate vector type layout in IPC messages

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1362.
-
   Resolution: Won't Fix
Fix Version/s: (was: 0.9.0)
   0.8.0

Vector layout is no longer in the IPC metadata

> [Integration] Validate vector type layout in IPC messages
> -
>
> Key: ARROW-1362
> URL: https://issues.apache.org/jira/browse/ARROW-1362
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.8.0
>
>
> We do not rigorously check in the integration tests whether the vector buffer 
> layout is what we expect it to be in the schema metadata



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1382) [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1382:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize
> ---
>
> Key: ARROW-1382
> URL: https://issues.apache.org/jira/browse/ARROW-1382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.10.0
>
>
> If a Python object appears multiple times within a list/tuple/dictionary, 
> then when pyarrow serializes the object, it will duplicate the object many 
> times. This leads to a potentially huge expansion in the size of the object 
> (e.g., the serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100 
> times bigger than it needs to be).
> {code}
> import pyarrow as pa
> l = [0]
> original_object = [l, l]
> # Serialize and deserialize the object.
> buf = pa.serialize(original_object).to_buffer()
> new_object = pa.deserialize(buf)
> # This works.
> assert original_object[0] is original_object[1]
> # This fails.
> assert new_object[0] is new_object[1]
> {code}
> One potential way to address this is to use the Arrow dictionary encoding.
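One way to see the fix is pickle's memo-table trick: deduplicate by object identity during traversal and emit back-references. A pure-Python sketch follows; pyarrow itself may instead use Arrow dictionary encoding as the issue suggests, so this is illustrative only:

```python
# Identity-based deduplication sketch, in the spirit of pickle's memo
# table: each container is assigned an id on first visit, and repeat
# visits emit a ("ref", id) token instead of re-serializing the payload.
def serialize(obj, memo=None, out=None):
    if memo is None:
        memo, out = {}, []
    oid = id(obj)
    if oid in memo:
        out.append(("ref", memo[oid]))   # back-reference, no duplication
    elif isinstance(obj, list):
        memo[oid] = len(memo)
        out.append(("list", len(obj)))
        for item in obj:
            serialize(item, memo, out)
    else:
        out.append(("val", obj))
    return out

l = [0]
print(serialize([l, l]))
# [('list', 2), ('list', 1), ('val', 0), ('ref', 1)]
```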



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-987) [JS] Implement JSON writer for Integration tests

2018-02-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356252#comment-16356252
 ] 

Wes McKinney commented on ARROW-987:


[~paul.e.taylor] [~bhulette] where does this stand?

> [JS] Implement JSON writer for Integration tests
> 
>
> Key: ARROW-987
> URL: https://issues.apache.org/jira/browse/ARROW-987
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
> Fix For: 0.9.0
>
>
> Rather than storing generated binary files in the repo, we could just run the 
> integration tests on the JS implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1298) C++: Add prefix to jemalloc functions to guard against issues when using multiple allocators in the same process

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1298.
-
   Resolution: Fixed
Fix Version/s: (was: 0.9.0)
   0.8.0

Resolved in ARROW-1282, pls reopen if any issues

> C++: Add prefix to jemalloc functions to guard against issues when using 
> multiple allocators in the same process
> 
>
> Key: ARROW-1298
> URL: https://issues.apache.org/jira/browse/ARROW-1298
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
>Assignee: Jeff Knupp
>Priority: Major
> Fix For: 0.8.0
>
>
> Based on research done for ARROW-1282, when using jemalloc along with other 
> allocators, it is recommended to build jemalloc with a prefix to be used on 
> all calls (so as not to confuse those calls with those of other allocators). 
> See https://github.com/jemalloc/jemalloc/wiki/Getting-Started for more info. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-47) C++: Consider adding a scalar type object model

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-47:
--
Fix Version/s: (was: 0.9.0)
   0.10.0

> C++: Consider adding a scalar type object model
> ---
>
> Key: ARROW-47
> URL: https://issues.apache.org/jira/browse/ARROW-47
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Just did this on the Python side. In later analytics routines, passing in 
> scalar values (example: Array + Scalar) requires some kind of container. Some 
> systems, like the R language, solve this problem with length-1 arrays, but we 
> should do some analysis of use cases and figure out what will work best for 
> Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-976) [Python] Provide API for defining and reading Parquet datasets with more ad hoc partition schemes

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-976:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Provide API for defining and reading Parquet datasets with more ad 
> hoc partition schemes
> -
>
> Key: ARROW-976
> URL: https://issues.apache.org/jira/browse/ARROW-976
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-40) C++: Reinterpret Struct arrays as tables

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-40?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-40:
--
Fix Version/s: (was: 0.9.0)
   0.10.0

> C++: Reinterpret Struct arrays as tables
> 
>
> Key: ARROW-40
> URL: https://issues.apache.org/jira/browse/ARROW-40
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> This is mostly a question of layering container types, but will be provided 
> as an API convenience. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-973) [Website] Add FAQ page about project

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-973:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Website] Add FAQ page about project
> 
>
> Key: ARROW-973
> URL: https://issues.apache.org/jira/browse/ARROW-973
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> As some suggested initial topics for the FAQ:
> * How Apache Arrow is related to Apache Parquet (the difference between a 
> "storage format" and an "in-memory format" causes confusion)
> * How is Arrow similar to / different from Flatbuffers and Cap'n Proto
> * How Arrow uses Flatbuffers (I have had people incorrectly state to me 
> things like "Arrow is just Flatbuffers under the hood")
> Any other ideas?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-974) [Website] Add Use Cases section to the website

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-974:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Website] Add Use Cases section to the website
> --
>
> Key: ARROW-974
> URL: https://issues.apache.org/jira/browse/ARROW-974
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> This will contain a list of "canonical use cases" for Arrow:
> * In-memory data structure for vectorized analytics / SIMD, or creating a 
> column-oriented analytic database system
> * Reading and writing columnar storage formats like Apache Parquet
> * Faster alternative to Thrift, Protobuf, or Avro in RPC
> * Shared memory IPC (zero-copy in-situ analytics)
> Any other ideas?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-41) C++: Convert RecordBatch to StructArray, and back

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-41:
--
Fix Version/s: (was: 0.9.0)
   0.10.0

> C++: Convert RecordBatch to StructArray, and back
> -
>
> Key: ARROW-41
> URL: https://issues.apache.org/jira/browse/ARROW-41
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> With {{arrow::TableBatchReader}}, we can turn a Table into a sequence of one 
> or more RecordBatches. It would be useful to be able to easily convert 
> between RecordBatch and a StructArray (which can be semantically equivalent 
> in some contexts)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-549) [C++] Add function to concatenate like-typed arrays

2018-02-07 Thread Panchen Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panchen Xue reassigned ARROW-549:
-

Assignee: Panchen Xue

> [C++] Add function to concatenate like-typed arrays
> ---
>
> Key: ARROW-549
> URL: https://issues.apache.org/jira/browse/ARROW-549
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: Analytics
> Fix For: 0.9.0
>
>
> A la 
> {{Status arrow::Concatenate(const std::vector<std::shared_ptr<Array>>& 
> arrays, MemoryPool* pool, std::shared_ptr<Array>* out)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356217#comment-16356217
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, 
and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-363943606
 
 
   Well, the scope of ARROW-1425 is to explain to Python users what they need 
to know to make correct joint use of pandas, Arrow, and Spark. I have push 
rights on this branch so I can edit directly, maybe tonight or sometime tomorrow


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
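The hazard described above can be shown with the standard library alone: the same naive wall-clock value denotes different absolute instants depending on which "session" zone is attached to it. (The fixed -8 offset below is purely illustrative.)

```python
from datetime import datetime, timezone, timedelta

# A naive timestamp carries no zone; attaching different session zones to the
# same wall-clock value yields different absolute instants.
naive = datetime(2018, 2, 7, 12, 0, 0)

as_utc = naive.replace(tzinfo=timezone.utc)
as_pst = naive.replace(tzinfo=timezone(timedelta(hours=-8)))

assert as_utc != as_pst                        # same wall clock, different instants
assert as_pst - as_utc == timedelta(hours=8)   # noon at UTC-8 is 20:00 UTC
```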



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)



[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356196#comment-16356196
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

icexelloss commented on issue #1575: ARROW-1425: [Python] Document Arrow 
timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-363939714
 
 
   @wesm This is not a Python specific document. Is there a better place for 
this other than under python?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356195#comment-16356195
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

icexelloss commented on a change in pull request #1575: ARROW-1425: [Python] 
Document Arrow timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#discussion_r166783544
 
 

 ##
 File path: python/doc/source/timestamps.rst
 ##
 @@ -0,0 +1,433 @@
+All About Timestamps (work in progress)
 
 Review comment:
  It is a big document. It's pretty long right now because there are quite a 
few concepts to clarify: about 50% of the doc is about concepts and the other 
half is about Arrow <-> Spark.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356182#comment-16356182
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on a change in pull request #1575: ARROW-1425 [Doc] Document 
Arrow timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#discussion_r166779301
 
 

 ##
 File path: python/doc/source/timestamps.rst
 ##
 @@ -0,0 +1,433 @@
+All About Timestamps (work in progress)
 
 Review comment:
   This is a big document. I'd like to see if we can make this about 50% as 
long or less. I will review in more detail as soon as I can and make some 
comments to help


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-634) Add integration tests for FixedSizeBinary

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-634.

Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/f69e9dba1c610e2bce63b2b1b1cacdd7b5cd4000

> Add integration tests for FixedSizeBinary
> -
>
> Key: ARROW-634
> URL: https://issues.apache.org/jira/browse/ARROW-634
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-633) [Java] Add support for FixedSizeBinary type

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-633.

Resolution: Fixed

Issue resolved by pull request 1492
[https://github.com/apache/arrow/pull/1492]

> [Java] Add support for FixedSizeBinary type
> ---
>
> Key: ARROW-633
> URL: https://issues.apache.org/jira/browse/ARROW-633
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-633) [Java] Add support for FixedSizeBinary type

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356177#comment-16356177
 ] 

ASF GitHub Bot commented on ARROW-633:
--

wesm commented on issue #1492: ARROW-633/634: [Java] Add FixedSizeBinary 
support in Java and integration tests (Updated)
URL: https://github.com/apache/arrow/pull/1492#issuecomment-363934376
 
 
   thanks all, please open any follow-up JIRAs


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Java] Add support for FixedSizeBinary type
> ---
>
> Key: ARROW-633
> URL: https://issues.apache.org/jira/browse/ARROW-633
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-633) [Java] Add support for FixedSizeBinary type

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356172#comment-16356172
 ] 

ASF GitHub Bot commented on ARROW-633:
--

jacques-n commented on issue #1492: ARROW-633/634: [Java] Add FixedSizeBinary 
support in Java and integration tests (Updated)
URL: https://github.com/apache/arrow/pull/1492#issuecomment-363932089
 
 
   You're right. I forgot that the code is autogenerated and was looking for it 
here. I'm +1 on this PR. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Java] Add support for FixedSizeBinary type
> ---
>
> Key: ARROW-633
> URL: https://issues.apache.org/jira/browse/ARROW-633
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356158#comment-16356158
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

ts-dpb commented on issue #1575: ARROW-1425 [Doc] Document Arrow timestamps, 
and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-363929187
 
 
   cc: @icexelloss 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356154#comment-16356154
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

ts-dpb opened a new pull request #1575: ARROW-1425 [Doc] Document Arrow 
timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.

2018-02-07 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356138#comment-16356138
 ] 

Phillip Cloud commented on ARROW-1973:
--

Working on this.

> [Python] Memory leak when converting Arrow tables with array columns to 
> Pandas dataframes.
> --
>
> Key: ARROW-1973
> URL: https://issues.apache.org/jira/browse/ARROW-1973
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> There appears to be a memory leak when using PyArrow to convert tables 
> containing array columns to Pandas DataFrames.
>  See the `test_memory_leak.py` example here: 
> https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356039#comment-16356039
 ] 

ASF GitHub Bot commented on ARROW-1950:
---

cpcloud commented on issue #1571: ARROW-1950: [Python] pandas_type in pandas 
metadata incorrect for List types
URL: https://github.com/apache/arrow/pull/1571#issuecomment-363902634
 
 
  @xhochy do you mind if I merge this one when it passes? I want to see if my 
GitBox powers are working.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2110) [Python] Only require pytest-runner on test commands

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356035#comment-16356035
 ] 

ASF GitHub Bot commented on ARROW-2110:
---

wesm closed pull request #1570: ARROW-2110: [Python] Only require pytest-runner 
on test commands
URL: https://github.com/apache/arrow/pull/1570
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/setup.py b/python/setup.py
index 726bb51f2..849d1203b 100644
--- a/python/setup.py
+++ b/python/setup.py
@@ -428,6 +428,13 @@ def parse_version(root):
     else:
         return version
 
+
+# Only include pytest-runner in setup_requires if we're invoking tests
+if {'pytest', 'test', 'ptr'}.intersection(sys.argv):
+    setup_requires = ['pytest-runner']
+else:
+    setup_requires = []
+
 setup(
     name="pyarrow",
     packages=['pyarrow', 'pyarrow.tests'],
@@ -447,7 +454,7 @@ def parse_version(root):
         ]
     },
     use_scm_version={"root": "..", "relative_to": __file__, "parse": parse_version},
-    setup_requires=['setuptools_scm', 'cython >= 0.23', 'pytest-runner'],
+    setup_requires=['setuptools_scm', 'cython >= 0.23'] + setup_requires,
     install_requires=install_requires,
     tests_require=['pytest', 'pandas'],
     description="Python library for Apache Arrow",
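The trick in this diff is that pytest-runner becomes a build requirement only when setup.py is invoked with a test-like command, so plain installs don't pull it in. Extracted as a standalone sketch (the helper function name is illustrative; the command names come from the diff):

```python
import sys

def compute_setup_requires(argv):
    """Return pytest-runner only when a test-like command is present."""
    if {'pytest', 'test', 'ptr'}.intersection(argv):
        return ['pytest-runner']
    return []

assert compute_setup_requires(['setup.py', 'test']) == ['pytest-runner']
assert compute_setup_requires(['setup.py', 'ptr']) == ['pytest-runner']
assert compute_setup_requires(['setup.py', 'bdist_wheel']) == []
```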


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Only require pytest-runner on test commands
> 
>
> Key: ARROW-2110
> URL: https://issues.apache.org/jira/browse/ARROW-2110
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We only require it for tests, otherwise we should not depend on it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2110) [Python] Only require pytest-runner on test commands

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2110.
-
Resolution: Fixed

Issue resolved by pull request 1570
[https://github.com/apache/arrow/pull/1570]

> [Python] Only require pytest-runner on test commands
> 
>
> Key: ARROW-2110
> URL: https://issues.apache.org/jira/browse/ARROW-2110
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We only require it for tests, otherwise we should not depend on it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2087) [Python] Binaries of 3rdparty are not stripped in manylinux1 base image

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356030#comment-16356030
 ] 

ASF GitHub Bot commented on ARROW-2087:
---

wesm closed pull request #1564: ARROW-2087: [Python] Binaries of 3rdparty are 
not stripped in manylinux1 base image
URL: https://github.com/apache/arrow/pull/1564
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64
index 98b559535..1ade9ab10 100644
--- a/python/manylinux1/Dockerfile-x86_64
+++ b/python/manylinux1/Dockerfile-x86_64
@@ -14,7 +14,7 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
-FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-2086
+FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-2087
 
 ADD arrow /arrow
 WORKDIR /arrow/cpp
diff --git a/python/manylinux1/Dockerfile-x86_64_base b/python/manylinux1/Dockerfile-x86_64_base
index b7687533a..1f15f77d8 100644
--- a/python/manylinux1/Dockerfile-x86_64_base
+++ b/python/manylinux1/Dockerfile-x86_64_base
@@ -28,11 +28,9 @@ RUN /build_boost.sh
 ADD scripts/build_jemalloc.sh /
 RUN /build_jemalloc.sh
 
-WORKDIR /
 # Install cmake manylinux1 package
-RUN /opt/python/cp35-cp35m/bin/pip install cmake ninja
-RUN ln -s /opt/python/cp35-cp35m/bin/cmake /usr/bin/cmake
-RUN ln -s /opt/python/cp35-cp35m/bin/ninja /usr/bin/ninja
+ADD scripts/install_cmake.sh /
+RUN /install_cmake.sh
 
 ADD scripts/build_gtest.sh /
 RUN /build_gtest.sh
@@ -69,10 +67,7 @@ ADD scripts/build_ccache.sh /
 RUN /build_ccache.sh
 
 WORKDIR /
-RUN git clone https://github.com/matthew-brett/multibuild.git
-WORKDIR /multibuild
-RUN git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963
-WORKDIR /
+RUN git clone https://github.com/matthew-brett/multibuild.git && cd multibuild && git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963
 
 ADD scripts/build_virtualenvs.sh /
 RUN /build_virtualenvs.sh
diff --git a/python/manylinux1/scripts/build_virtualenvs.sh b/python/manylinux1/scripts/build_virtualenvs.sh
index e64157065..b54861126 100755
--- a/python/manylinux1/scripts/build_virtualenvs.sh
+++ b/python/manylinux1/scripts/build_virtualenvs.sh
@@ -33,18 +33,26 @@ for PYTHON in ${PYTHON_VERSIONS}; do
 PATH="$PATH:$(cpython_path $PYTHON)"
 
 echo "=== (${PYTHON}) Installing build dependencies ==="
-$PIPI_IO "numpy==1.10.1"
+$PIPI_IO "numpy==1.10.4"
 $PIPI_IO "cython==0.25.2"
-$PIPI_IO "pandas==0.20.1"
+$PIPI_IO "pandas==0.20.3"
 $PIPI_IO "virtualenv==15.1.0"
 
 echo "=== (${PYTHON}) Preparing virtualenv for tests ==="
 "$(cpython_path $PYTHON)/bin/virtualenv" -p ${PYTHON_INTERPRETER} --no-download /venv-test-${PYTHON}
 source /venv-test-${PYTHON}/bin/activate
-pip install pytest 'numpy==1.12.1' 'pandas==0.20.1'
+pip install pytest 'numpy==1.14.0' 'pandas==0.20.3'
 deactivate
 done
 
+# Remove debug symbols from libraries that were installed via wheel.
+find /venv-test-*/lib/*/site-packages/pandas -name '*.so' -exec strip '{}' ';'
+find /venv-test-*/lib/*/site-packages/numpy -name '*.so' -exec strip '{}' ';'
+find /opt/_internal/cpython-*/lib/*/site-packages/pandas -name '*.so' -exec strip '{}' ';'
+# Only Python 3.6 packages are strippable as they are built inside of the image
+find /opt/_internal/cpython-3.6.4/lib/python3.6/site-packages/numpy -name '*.so' -exec strip '{}' ';'
+find /opt/_internal/*/lib/*/site-packages/Cython -name '*.so' -exec strip '{}' ';'
+
 # Remove pip cache again. It's useful during the virtualenv creation but we
 # don't want it persisted in the docker layer, ~264MiB
 rm -rf /root/.cache
diff --git a/python/manylinux1/scripts/install_cmake.sh b/python/manylinux1/scripts/install_cmake.sh
new file mode 100755
index 0..864348514
--- /dev/null
+++ b/python/manylinux1/scripts/install_cmake.sh
@@ -0,0 +1,22 @@
+#!/bin/bash -e
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language 

[jira] [Resolved] (ARROW-2087) [Python] Binaries of 3rdparty are not stripped in manylinux1 base image

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2087.
-
Resolution: Fixed

Issue resolved by pull request 1564
[https://github.com/apache/arrow/pull/1564]

> [Python] Binaries of 3rdparty are not stripped in manylinux1 base image
> ---
>
> Key: ARROW-2087
> URL: https://issues.apache.org/jira/browse/ARROW-2087
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> CMake pip package: 
> [https://github.com/scikit-build/cmake-python-distributions/issues/32]
> Pandas pip package: [https://github.com/pandas-dev/pandas/issues/19531]
> NumPy pip package: https://github.com/numpy/numpy/issues/10519



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2073) [Python] Create StructArray from sequence of tuples given a known data type

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356016#comment-16356016
 ] 

ASF GitHub Bot commented on ARROW-2073:
---

pitrou commented on issue #1572: ARROW-2073: [Python] Create struct array from 
sequence of tuples
URL: https://github.com/apache/arrow/pull/1572#issuecomment-363898967
 
 
   OK, I fixed the typo.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Create StructArray from sequence of tuples given a known data type
> ---
>
> Key: ARROW-2073
> URL: https://issues.apache.org/jira/browse/ARROW-2073
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Following ARROW-1705, we should support calling {{pa.array}} with a sequence 
> of tuples, presuming a struct type is passed for the {{type}} parameter.
> We also probably want to disallow mixed inputs, e.g. a sequence of both dicts 
> and tuples. The user should use only one idiom at a time.
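The "one idiom at a time" rule above amounts to a simple validation step: given a struct type, the input sequence must be all dicts or all tuples (nulls aside), never a mix. A pure-Python sketch of that check (not the pyarrow implementation; the function name is hypothetical):

```python
# Validate that struct inputs use a single idiom: all dicts or all tuples.

def check_struct_inputs(seq):
    kinds = {type(item) for item in seq if item is not None}
    if kinds <= {dict} or kinds <= {tuple}:
        return True
    raise TypeError("cannot mix dicts and tuples in struct input: %r" % kinds)

assert check_struct_inputs([{"x": 1}, {"x": 2}, None])   # all dicts (plus null)
assert check_struct_inputs([(1, "a"), (2, "b")])         # all tuples
try:
    check_struct_inputs([{"x": 1}, (2, "b")])            # mixed: rejected
    raise AssertionError("expected TypeError")
except TypeError:
    pass
```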



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2087) [Python] Binaries of 3rdparty are not stripped in manylinux1 base image

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355995#comment-16355995
 ] 

ASF GitHub Bot commented on ARROW-2087:
---

wesm commented on issue #1564: ARROW-2087: [Python] Binaries of 3rdparty are 
not stripped in manylinux1 base image
URL: https://github.com/apache/arrow/pull/1564#issuecomment-363894613
 
 
   Retriggered build


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Binaries of 3rdparty are not stripped in manylinux1 base image
> ---
>
> Key: ARROW-2087
> URL: https://issues.apache.org/jira/browse/ARROW-2087
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> CMake pip package: 
> [https://github.com/scikit-build/cmake-python-distributions/issues/32]
> Pandas pip package: [https://github.com/pandas-dev/pandas/issues/19531]
> NumPy pip package: https://github.com/numpy/numpy/issues/10519



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1394) [Plasma] Add optional extension for allocating memory on GPUs

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355993#comment-16355993
 ] 

ASF GitHub Bot commented on ARROW-1394:
---

pcmoritz commented on issue #1445: ARROW-1394: [Plasma] Add optional extension 
for allocating memory on GPUs
URL: https://github.com/apache/arrow/pull/1445#issuecomment-363894149
 
 
   @Wapaul1 is working on Python integration and an end-to-end example that 
shows how to use this. There are also some loose ends that need to be fixed 
(eviction, hashing); we can create JIRAs for these.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Plasma] Add optional extension for allocating memory on GPUs
> -
>
> Key: ARROW-1394
> URL: https://issues.apache.org/jira/browse/ARROW-1394
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Plasma (C++)
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It would be useful to be able to allocate memory to be shared between 
> processes via Plasma using the CUDA IPC API



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-633) [Java] Add support for FixedSizeBinary type

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355983#comment-16355983
 ] 

ASF GitHub Bot commented on ARROW-633:
--

alphalfalfa commented on issue #1492: ARROW-633/634: [Java] Add FixedSizeBinary 
support in Java and integration tests (Updated)
URL: https://github.com/apache/arrow/pull/1492#issuecomment-363893785
 
 
   @jacques-n, to make sure that I understand the question properly, do you 
mean there should be proper interfaces defined for the FixedSizeBinary type 
inside AbstractFieldWriter and AbstractFieldReader? 
   Currently, AbstractFieldWriter has:
 `public void writeFixedSizeBinary(ArrowBuf buffer)`
   AbstractFieldReader has:
 `public byte[] readByteArray()`
   What would be the desired interfaces to add?
   
   If it is OK, I would prefer to add the necessary fix in a separate PR, as 
this one has grown too big. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Java] Add support for FixedSizeBinary type
> ---
>
> Key: ARROW-633
> URL: https://issues.apache.org/jira/browse/ARROW-633
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
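The interface question above hinges on the FixedSizeBinary layout: every value has the same byte width, so no per-element offsets buffer is needed. A minimal sketch of that indexing rule (the helper name is illustrative, not Arrow's Java API):

```python
def fixed_size_binary_get(buf: bytes, byte_width: int, i: int) -> bytes:
    # Every value occupies exactly byte_width bytes, so element i starts
    # at byte offset i * byte_width; no offsets buffer is required.
    return buf[i * byte_width:(i + 1) * byte_width]

data = b"abcdefgh"  # two 4-byte values: b"abcd", b"efgh"
print(fixed_size_binary_get(data, 4, 1))  # b'efgh'
```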


[jira] [Commented] (ARROW-2111) [C++] Linting could be faster

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355979#comment-16355979
 ] 

ASF GitHub Bot commented on ARROW-2111:
---

pitrou commented on a change in pull request #1573: ARROW-2111: [C++] Lint in 
parallel
URL: https://github.com/apache/arrow/pull/1573#discussion_r166738489
 
 

 ##
 File path: cpp/CMakeLists.txt
 ##
 @@ -455,11 +455,14 @@ if (UNIX)
   message(STATUS "Found cpplint executable at ${CPPLINT_BIN}")
 
   # Full lint
-  add_custom_target(lint ${CPPLINT_BIN}
+  # Balancing act: cpplint.py takes a non-trivial time to launch,
+  # so process 12 files per invocation, while still ensuring parallelism
+  add_custom_target(lint echo ${FILTERED_LINT_FILES} | xargs -n12 -P8
 
 Review comment:
   Indeed, there's an `if (UNIX)` a few lines above.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Linting could be faster
> -
>
> Key: ARROW-2111
> URL: https://issues.apache.org/jira/browse/ARROW-2111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently {{make lint}} style-checks C++ files sequentially (by calling 
> {{cpplint}}). We could instead style-check those files in parallel.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2106) pyarrow.array can't take a pandas Series of python datetime objects.

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2106:

Fix Version/s: 0.9.0

> pyarrow.array can't take a pandas Series of python datetime objects.
> 
>
> Key: ARROW-2106
> URL: https://issues.apache.org/jira/browse/ARROW-2106
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Naveen Michaud-Agrawal
>Priority: Minor
> Fix For: 0.9.0
>
>
> {{> import pyarrow}}
>  > from datetime import datetime
>  > import pandas
>  > dt = pandas.Series([datetime(2017, 12, 1), datetime(2017, 12, 3), 
> datetime(2017, 12, 15)], dtype=object)
>  > pyarrow.array(dt, from_pandas=True)
> Raises following:
> ---
>  ArrowInvalid Traceback (most recent call last)
>   in ()
>  > 1 pyarrow.array(dt, from_pandas=True)
> array.pxi in pyarrow.lib.array()
> array.pxi in pyarrow.lib._ndarray_to_array()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error inferring Arrow type for Python object array. Got Python 
> object of type datetime but can only handle these types: string, bool, float, 
> int, date, time, decimal, list, array
> As far as I can tell, the issue seems to be the call to PyDate_CheckExact 
> here (instead of using PyDate_Check):
> [https://github.com/apache/arrow/blob/3098c1411930259070efb571fb350304b18ddc70/cpp/src/arrow/python/numpy_to_arrow.cc#L1005]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
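The root cause suggested in the report can be demonstrated in plain Python: datetime is a subclass of date, so an exact type check (the PyDate_CheckExact semantics) rejects datetime instances, while a subclass-aware check (the PyDate_Check semantics) accepts them:

```python
from datetime import date, datetime

dt = datetime(2017, 12, 1)

# PyDate_Check semantics: subclass-aware, so datetime passes
assert isinstance(dt, date)

# PyDate_CheckExact semantics: exact type only, so datetime fails
assert type(dt) is not date
```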


[jira] [Commented] (ARROW-2111) [C++] Linting could be faster

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355978#comment-16355978
 ] 

ASF GitHub Bot commented on ARROW-2111:
---

wesm closed pull request #1573: ARROW-2111: [C++] Lint in parallel
URL: https://github.com/apache/arrow/pull/1573
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index 073471283..62c8e6590 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -455,11 +455,14 @@ if (UNIX)
   message(STATUS "Found cpplint executable at ${CPPLINT_BIN}")
 
   # Full lint
-  add_custom_target(lint ${CPPLINT_BIN}
+  # Balancing act: cpplint.py takes a non-trivial time to launch,
+  # so process 12 files per invocation, while still ensuring parallelism
+  add_custom_target(lint echo ${FILTERED_LINT_FILES} | xargs -n12 -P8
+  ${CPPLINT_BIN}
   --verbose=2
   --linelength=90
   
--filter=-whitespace/comments,-readability/todo,-build/header_guard,-build/c++11,-runtime/references,-build/include_order
-  ${FILTERED_LINT_FILES})
+  )
 endif (UNIX)
 
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Linting could be faster
> -
>
> Key: ARROW-2111
> URL: https://issues.apache.org/jira/browse/ARROW-2111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently {{make lint}} style-checks C++ files sequentially (by calling 
> {{cpplint}}). We could instead style-check those files in parallel.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
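The `xargs -n12 -P8` trade-off in the patch above (amortize per-invocation startup cost over a batch of files, while still running batches concurrently) can be sketched in Python; the helper names are illustrative, not part of the build:

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(files, n):
    # Like `xargs -n12`: group files into batches so each linter
    # invocation amortizes its (non-trivial) startup cost.
    return [files[i:i + n] for i in range(0, len(files), n)]

def lint_all(files, run_linter, batch_size=12, workers=8):
    # Like `-P8`: run up to `workers` batches concurrently,
    # preserving batch order in the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_linter, chunks(files, batch_size)))

sizes = lint_all([f"file{i}.cc" for i in range(30)], len)
print(sizes)  # [12, 12, 6]
```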


[jira] [Resolved] (ARROW-2111) [C++] Linting could be faster

2018-02-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2111.
-
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1573
[https://github.com/apache/arrow/pull/1573]

> [C++] Linting could be faster
> -
>
> Key: ARROW-2111
> URL: https://issues.apache.org/jira/browse/ARROW-2111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently {{make lint}} style-checks C++ files sequentially (by calling 
> {{cpplint}}). We could instead style-check those files in parallel.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1394) [Plasma] Add optional extension for allocating memory on GPUs

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355975#comment-16355975
 ] 

ASF GitHub Bot commented on ARROW-1394:
---

wesm commented on issue #1445: ARROW-1394: [Plasma] Add optional extension for 
allocating memory on GPUs
URL: https://github.com/apache/arrow/pull/1445#issuecomment-363891955
 
 
   Thank you for building this! Are there any follow-ups for the GPU support 
that you've thought of?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Plasma] Add optional extension for allocating memory on GPUs
> -
>
> Key: ARROW-1394
> URL: https://issues.apache.org/jira/browse/ARROW-1394
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Plasma (C++)
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It would be useful to be able to allocate memory to be shared between 
> processes via Plasma using the CUDA IPC API



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2111) [C++] Linting could be faster

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355974#comment-16355974
 ] 

ASF GitHub Bot commented on ARROW-2111:
---

wesm commented on a change in pull request #1573: ARROW-2111: [C++] Lint in 
parallel
URL: https://github.com/apache/arrow/pull/1573#discussion_r166737523
 
 

 ##
 File path: cpp/CMakeLists.txt
 ##
 @@ -455,11 +455,14 @@ if (UNIX)
   message(STATUS "Found cpplint executable at ${CPPLINT_BIN}")
 
   # Full lint
-  add_custom_target(lint ${CPPLINT_BIN}
+  # Balancing act: cpplint.py takes a non-trivial time to launch,
+  # so process 12 files per invocation, while still ensuring parallelism
+  add_custom_target(lint echo ${FILTERED_LINT_FILES} | xargs -n12 -P8
 
 Review comment:
   The lint target is only being created on non-Windows at the moment. I opened 
https://issues.apache.org/jira/browse/ARROW-2112 to make that possible


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Linting could be faster
> -
>
> Key: ARROW-2111
> URL: https://issues.apache.org/jira/browse/ARROW-2111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> Currently {{make lint}} style-checks C++ files sequentially (by calling 
> {{cpplint}}). We could instead style-check those files in parallel.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2112) [C++] Enable cpplint to be run on Windows

2018-02-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2112:
---

 Summary: [C++] Enable cpplint to be run on Windows
 Key: ARROW-2112
 URL: https://issues.apache.org/jira/browse/ARROW-2112
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


See discussion in ARROW-2111 https://github.com/apache/arrow/pull/1573



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355969#comment-16355969
 ] 

ASF GitHub Bot commented on ARROW-969:
--

wesm commented on issue #1574: ARROW-969: [C++] Add add/remove field functions 
for RecordBatch
URL: https://github.com/apache/arrow/pull/1574#issuecomment-363890696
 
 
   Will review when I can, thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Analogous to the Table equivalents



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355935#comment-16355935
 ] 

ASF GitHub Bot commented on ARROW-1950:
---

cpcloud commented on a change in pull request #1571: ARROW-1950: [Python] 
pandas_type in pandas metadata incorrect for List types
URL: https://github.com/apache/arrow/pull/1571#discussion_r166729132
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -1404,6 +1404,57 @@ def test_empty_list_roundtrip(self):
 
         tm.assert_frame_equal(result, df)
 
+    def test_empty_list_metadata(self):
+        # Create table with array of empty lists, forced to have type
+        # list(string) in pyarrow
+        c1 = [["test"], ["a", "b"], None]
+        c2 = [[], [], []]
+        arrays = {
 
 Review comment:
   Yep, thanks! I saw that on the CI :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1950) [Python] pandas_type in pandas metadata incorrect for List types

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355924#comment-16355924
 ] 

ASF GitHub Bot commented on ARROW-1950:
---

xhochy commented on a change in pull request #1571: ARROW-1950: [Python] 
pandas_type in pandas metadata incorrect for List types
URL: https://github.com/apache/arrow/pull/1571#discussion_r166728024
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -1404,6 +1404,57 @@ def test_empty_list_roundtrip(self):
 
         tm.assert_frame_equal(result, df)
 
+    def test_empty_list_metadata(self):
+        # Create table with array of empty lists, forced to have type
+        # list(string) in pyarrow
+        c1 = [["test"], ["a", "b"], None]
+        c2 = [[], [], []]
+        arrays = {
 
 Review comment:
   You will need to use `OrderedDict` here probably to get the same ordering of 
the results in all Python versions.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] pandas_type in pandas metadata incorrect for List types
> 
>
> Key: ARROW-1950
> URL: https://issues.apache.org/jira/browse/ARROW-1950
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://github.com/pandas-dev/pandas/pull/18201#issuecomment-353042438



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2111) [C++] Linting could be faster

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355923#comment-16355923
 ] 

ASF GitHub Bot commented on ARROW-2111:
---

xhochy commented on a change in pull request #1573: ARROW-2111: [C++] Lint in 
parallel
URL: https://github.com/apache/arrow/pull/1573#discussion_r166727508
 
 

 ##
 File path: cpp/CMakeLists.txt
 ##
 @@ -455,11 +455,14 @@ if (UNIX)
   message(STATUS "Found cpplint executable at ${CPPLINT_BIN}")
 
   # Full lint
-  add_custom_target(lint ${CPPLINT_BIN}
+  # Balancing act: cpplint.py takes a non-trivial time to launch,
+  # so process 12 files per invocation, while still ensuring parallelism
+  add_custom_target(lint echo ${FILTERED_LINT_FILES} | xargs -n12 -P8
 
 Review comment:
   This probably won't work on Windows, but I guess we don't execute this on 
Windows? Otherwise we could turn this into several targets that together form 
a single lint target at the end and let make/ninja handle the parallelisation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Linting could be faster
> -
>
> Key: ARROW-2111
> URL: https://issues.apache.org/jira/browse/ARROW-2111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> Currently {{make lint}} style-checks C++ files sequentially (by calling 
> {{cpplint}}). We could instead style-check those files in parallel.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2073) [Python] Create StructArray from sequence of tuples given a known data type

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355919#comment-16355919
 ] 

ASF GitHub Bot commented on ARROW-2073:
---

xhochy commented on a change in pull request #1572: ARROW-2073: [Python] Create 
struct array from sequence of tuples
URL: https://github.com/apache/arrow/pull/1572#discussion_r166726751
 
 

 ##
 File path: python/pyarrow/tests/test_convert_builtin.py
 ##
 @@ -531,6 +531,45 @@ def test_struct_from_dicts():
     assert arr.to_pylist() == expected
 
 
+def test_struct_from_tuples():
+    ty = pa.struct([pa.field('a', pa.int32()),
+                    pa.field('b', pa.string()),
+                    pa.field('c', pa.bool_())])
+
+    data = [(5, 'foo', True),
+            (6, 'bar', False)]
+    expected = [{'a': 5, 'b': 'foo', 'c': True},
+                {'a': 6, 'b': 'bar', 'c': False}]
+    arr = pa.array(data, type=ty)
+    assert arr.to_pylist() == expected
+
+    # With omitted values
+    data = [(5, 'foo', None),
+            None,
+            (6, None, False)]
+    expected = [{'a': 5, 'b': 'foo', 'c': None},
+                None,
+                {'a': 6, 'b': None, 'c': False}]
+    arr = pa.array(data, type=ty)
+    assert arr.to_pylist() == expected
+
+    # Invalid tuple size
+    for tup in [(5, 'foo'), (), ('5', 'foo', True, None)]:
+        with pytest.raises(ValueError, match="(?i)tuple size"):
+            pa.array([tup], type=ty)
+
+
+def test_struct_from_mixed_sequence():
+    # It is forgotten to mix dicts and tuples when initializing a struct array
 
 Review comment:
   Typo: s/forgotten/forbidden/


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Create StructArray from sequence of tuples given a known data type
> ---
>
> Key: ARROW-2073
> URL: https://issues.apache.org/jira/browse/ARROW-2073
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Following ARROW-1705, we should support calling {{pa.array}} with a sequence 
> of tuples, presuming a struct type is passed for the {{type}} parameter.
> We also probably want to disallow mixed inputs, e.g. a sequence of both dicts 
> and tuples. The user should use only one idiom at a time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
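A pure-Python sketch of the validation described in the issue (the helper is hypothetical, not the pyarrow implementation): tuples map positionally onto struct fields, a wrong tuple size is an error, and None rows stay null:

```python
def tuples_to_struct_rows(field_names, rows):
    # Hypothetical helper mirroring the proposed pa.array behavior for
    # sequences of tuples with a known struct type.
    out = []
    for row in rows:
        if row is None:
            out.append(None)  # null struct value
        elif len(row) != len(field_names):
            raise ValueError("tuple size must match number of struct fields")
        else:
            out.append(dict(zip(field_names, row)))
    return out

print(tuples_to_struct_rows(["a", "b"], [(5, "foo"), None]))
# [{'a': 5, 'b': 'foo'}, None]
```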


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355914#comment-16355914
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

xuepanchen commented on a change in pull request #1551: ARROW-1942: [C++] Hash 
table specializations for small integers
URL: https://github.com/apache/arrow/pull/1551#discussion_r166725654
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/util-internal.h
 ##
 @@ -47,11 +52,11 @@ using enable_if_timestamp =
     typename std::enable_if<std::is_base_of<TimestampType, T>::value>::type;
 
 template <typename T>
-using enable_if_has_c_type =
-    typename std::enable_if<std::is_base_of<PrimitiveCType, T>::value ||
-                            std::is_base_of<DateType, T>::value ||
-                            std::is_base_of<TimeType, T>::value ||
-                            std::is_base_of<TimestampType, T>::value>::type;
+using enable_if_has_c_type = typename std::enable_if<
+    !std::is_same<T, UInt8Type>::value && !std::is_same<T, Int8Type>::value &&
 
 Review comment:
   I am not sure how to exclude 8-bit integers in the functor declarations. I 
am thinking of creating a new template alias like 
"enable_if_has_c_type_excluding_8bit_integer" here and using it in the 
declaration of the hash table pass for primitive types.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
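The idea in the description (a fixed-size table instead of hashing for 8-bit keys) can be sketched as follows; this illustrates the technique only, not Arrow's C++ implementation:

```python
def unique_int8(values):
    # An int8 key space has only 256 possible values, so a fixed-size
    # boolean table replaces a dynamically-sized hash table entirely:
    # no hashing, no resizing, no collisions.
    seen = [False] * 256
    out = []
    for v in values:
        slot = v & 0xFF  # map -128..127 onto table slots 0..255
        if not seen[slot]:
            seen[slot] = True
            out.append(v)
    return out

print(unique_int8([1, -1, 1, 127, -128, -1]))  # [1, -1, 127, -128]
```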


[jira] [Updated] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1942:
--
Labels: pull-request-available  (was: )

> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2109) [C++] Boost 1.66 compilation fails on Windows on linkage stage

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355869#comment-16355869
 ] 

ASF GitHub Bot commented on ARROW-2109:
---

wesm commented on issue #1567: ARROW-2109: [C++] Completely disable boost 
autolink on MSVC build
URL: https://github.com/apache/arrow/pull/1567#issuecomment-363868606
 
 
   thanks @MaxRis!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Boost 1.66 compilation fails on Windows on linkage stage
> --
>
> Key: ARROW-2109
> URL: https://issues.apache.org/jira/browse/ARROW-2109
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Windows
>Reporter: Max Risuhin
>Assignee: Max Risuhin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Boost's autolinking should be disable on compilation with MSVC, since it 
> causes linkage with shared import libs, instead of expected static. Following 
> error occurs:
> `LINK : fatal error LNK1104: cannot open file 'boost_filesystem.lib'`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355847#comment-16355847
 ] 

ASF GitHub Bot commented on ARROW-969:
--

xuepanchen opened a new pull request #1574: ARROW-969: [C++] Add add/remove 
field functions for RecordBatch
URL: https://github.com/apache/arrow/pull/1574
 
 
   Add AddColumn and RemoveColumn methods for RecordBatch, as well as test cases


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Analogous to the Table equivalents



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch

2018-02-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-969:
-
Labels: pull-request-available  (was: )

> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Analogous to the Table equivalents



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2073) [Python] Create StructArray from sequence of tuples given a known data type

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355817#comment-16355817
 ] 

ASF GitHub Bot commented on ARROW-2073:
---

pitrou commented on issue #1572: ARROW-2073: [Python] Create struct array from 
sequence of tuples
URL: https://github.com/apache/arrow/pull/1572#issuecomment-363855180
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.45


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Create StructArray from sequence of tuples given a known data type
> ---
>
> Key: ARROW-2073
> URL: https://issues.apache.org/jira/browse/ARROW-2073
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Following ARROW-1705, we should support calling {{pa.array}} with a sequence 
> of tuples, presuming a struct type is passed for the {{type}} parameter.
> We also probably want to disallow mixed inputs, e.g. a sequence of both dicts 
> and tuples. The user should use only one idiom at a time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

