[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372520#comment-16372520
 ] 

ASF GitHub Bot commented on ARROW-2066:
---

rjrussell77 commented on a change in pull request #1544: ARROW-2066: [Python] 
Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544#discussion_r169875471
 
 

 ##
 File path: python/doc/source/parquet.rst
 ##
 @@ -237,3 +237,44 @@ throughput:
 
pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob Storage
+----------------------------------------------
+
+The code below shows how to use the Azure Storage SDK together with pyarrow to
+read a Parquet file into a pandas DataFrame.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies:
+
+* Python 3.6.2
+* azure-storage 0.36.0
+* pyarrow 0.8.0
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   import io
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   block_blob_service = BlockBlobService(account_name=account_name,
+                                         account_key=account_key)
+   byte_stream = io.BytesIO()
+   try:
+       block_blob_service.get_blob_to_stream(container_name=container_name,
+                                             blob_name=parquet_file,
+                                             stream=byte_stream)
+       df = pq.read_table(source=byte_stream).to_pandas()
+       df.head(10)
+   except Exception as err:
+       print("Error: {0}".format(err))
+   finally:
+       byte_stream.close()
+
 
 Review comment:
   Added a try/except/finally block to ensure the stream is closed


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document reading Parquet files from Azure Blob Store
> -
>
> Key: ARROW-2066
> URL: https://issues.apache.org/jira/browse/ARROW-2066
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> See https://github.com/apache/arrow/issues/1510



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372513#comment-16372513
 ] 

ASF GitHub Bot commented on ARROW-2066:
---

rjrussell77 commented on a change in pull request #1544: ARROW-2066: [Python] 
Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544#discussion_r169871502
 
 

 ##
 File path: python/doc/source/parquet.rst
 ##
 @@ -237,3 +237,40 @@ throughput:
 
pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob Storage
+----------------------------------------------
+
+The code below shows how to use the Azure Storage SDK together with pyarrow to
+read a Parquet file into a pandas DataFrame.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies:
+
+* Python 3.6.2
+* azure-storage 0.36.0
+* pyarrow 0.8.0
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   import io
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   block_blob_service = BlockBlobService(account_name=account_name,
+                                         account_key=account_key)
+   byte_stream = io.BytesIO()
+   block_blob_service.get_blob_to_stream(container_name=container_name,
+                                         blob_name=parquet_file,
+                                         stream=byte_stream)
+   df = pq.read_table(source=byte_stream).to_pandas()
+   df.head(10)
 
 Review comment:
   @xhochy Good feedback - I replaced the temp file buffer with a BytesIO stream instead.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document reading Parquet files from Azure Blob Store
> -
>
> Key: ARROW-2066
> URL: https://issues.apache.org/jira/browse/ARROW-2066
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> See https://github.com/apache/arrow/issues/1510



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2176) [C++] Extend DictionaryBuilder to support delta dictionaries

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372483#comment-16372483
 ] 

ASF GitHub Bot commented on ARROW-2176:
---

alendit commented on issue #1629: ARROW-2176: [C++] Extend DictionaryBuilder to 
support delta dictionaries
URL: https://github.com/apache/arrow/pull/1629#issuecomment-367583889
 
 
   There was a typo in the last commit. Fixed it and rebased again.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Extend DictionaryBuilder to support delta dictionaries
> 
>
> Key: ARROW-2176
> URL: https://issues.apache.org/jira/browse/ARROW-2176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] specifies the 
> possibility of sending additional dictionary batches with a previously seen 
> id and an isDelta flag to extend the existing dictionaries with new entries. 
> Right now, the DictionaryBuilder (as well as the IPC writer and reader) does 
> not support generation of delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder 
> with delta dictionary support. The usage API can be seen in the dictionary 
> tests (i.e. 
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
>  The basic idea is that the user simply reuses the builder object after calling 
> Finish(Array*) for the first time. Subsequent calls to Append will create new 
> entries only for unseen elements and reuse ids from previous dictionaries 
> for the seen ones.
> Some considerations:
>  # The API is pretty implicit; an additional flag for Finish that 
> explicitly indicates a desire to use the builder for delta dictionary 
> generation might be expedient from an error-avoidance point of view.
>  # Right now the implementation uses an additional "overflow dictionary" to 
> store the seen items. This adds a copy on each Finish call and an additional 
> lookup on each GetItem or Append call. I assume we might get away with 
> returning Array slices at Finish, which would remove the need for an 
> additional overflow dictionary. If the gist of the PR is approved, I can look 
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the 
> DictionaryBuilder API remains basically the same. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372270#comment-16372270
 ] 

ASF GitHub Bot commented on ARROW-2132:
---

robertnishihara commented on issue #1636: ARROW-2132: Add link to Plasma in 
main README
URL: https://github.com/apache/arrow/pull/1636#issuecomment-367529248
 
 
   This looks good to me. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1345) [Python] Conversion from nested NumPy arrays fails on integers other than int64, float32

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1345:
--
Labels: pull-request-available  (was: )

> [Python] Conversion from nested NumPy arrays fails on integers other than 
> int64, float32
> 
>
> Key: ARROW-1345
> URL: https://issues.apache.org/jira/browse/ARROW-1345
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The inferred types are the largest ones, and then later conversion fails on 
> any arrays with smaller types because only exact conversions are implemented 
> thus far.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1345) [Python] Conversion from nested NumPy arrays fails on integers other than int64, float32

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1637#comment-1637
 ] 

ASF GitHub Bot commented on ARROW-1345:
---

wesm opened a new pull request #1643: ARROW-1345: [Python] Test conversion from 
nested NumPy arrays with smaller int, float types
URL: https://github.com/apache/arrow/pull/1643
 
 
   This also resolves ARROW-2008


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from nested NumPy arrays fails on integers other than 
> int64, float32
> 
>
> Key: ARROW-1345
> URL: https://issues.apache.org/jira/browse/ARROW-1345
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The inferred types are the largest ones, and then later conversion fails on 
> any arrays with smaller types because only exact conversions are implemented 
> thus far.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2153) [C++] Decimal conversion not working for exponential notation

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372198#comment-16372198
 ] 

ASF GitHub Bot commented on ARROW-2153:
---

wesm commented on issue #1618: ARROW-2153/ARROW-2160: [C++/Python]  Fix decimal 
precision inference
URL: https://github.com/apache/arrow/pull/1618#issuecomment-367511884
 
 
   needs rebase


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Decimal conversion not working for exponential notation
> -
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in <listcomp>
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> In manual cases we can clearly write {{decimal.Decimal('20')}} instead of 
> {{decimal.Decimal('2E+1')}}, but during arithmetic operations inside an 
> application the exponential notation can be produced out of our control (it is 
> actually the _normalized_ form of the decimal number). Moreover, for some 
> values the exponential notation is the only form expressing the significance, 
> so it should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using the following transformation, but that's only possible when the 
> significance information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372200#comment-16372200
 ] 

ASF GitHub Bot commented on ARROW-2145:
---

wesm commented on issue #1610: ARROW-2145/ARROW-2157: [Python] Decimal 
conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1610#issuecomment-367511927
 
 
   needs rebase


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> <listcomp>
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.
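As a stopgap until special values are handled, a caller could map non-finite decimals to nulls before conversion. This is an illustrative workaround, not part of pyarrow:

```python
from decimal import Decimal

def sanitize(values):
    # Replace NaN/Infinity decimals with None (which pyarrow would store
    # as null); finite values pass through unchanged.
    return [None if not v.is_finite() else v for v in values]

print(sanitize([Decimal('1.1'), Decimal('NaN'), Decimal('Infinity')]))
# [Decimal('1.1'), None, None]
```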



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372194#comment-16372194
 ] 

ASF GitHub Bot commented on ARROW-1632:
---

wesm commented on issue #1620: ARROW-1632: [Python] Permit categorical 
conversions in Table.to_pandas on a per-column basis
URL: https://github.com/apache/arrow/pull/1620#issuecomment-367511649
 
 
   rebased


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Permit categorical conversions in Table.to_pandas on a per-column 
> basis
> 
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently this is all or nothing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2176) [C++] Extend DictionaryBuilder to support delta dictionaries

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372197#comment-16372197
 ] 

ASF GitHub Bot commented on ARROW-2176:
---

wesm commented on issue #1629: ARROW-2176: [C++] Extend DictionaryBuilder to 
support delta dictionaries
URL: https://github.com/apache/arrow/pull/1629#issuecomment-367511802
 
 
   rebased


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Extend DictionaryBuilder to support delta dictionaries
> 
>
> Key: ARROW-2176
> URL: https://issues.apache.org/jira/browse/ARROW-2176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] specifies the 
> possibility of sending additional dictionary batches with a previously seen 
> id and an isDelta flag to extend the existing dictionaries with new entries. 
> Right now, the DictionaryBuilder (as well as the IPC writer and reader) does 
> not support generation of delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder 
> with delta dictionary support. The usage API can be seen in the dictionary 
> tests (i.e. 
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
>  The basic idea is that the user simply reuses the builder object after calling 
> Finish(Array*) for the first time. Subsequent calls to Append will create new 
> entries only for unseen elements and reuse ids from previous dictionaries 
> for the seen ones.
> Some considerations:
>  # The API is pretty implicit; an additional flag for Finish that 
> explicitly indicates a desire to use the builder for delta dictionary 
> generation might be expedient from an error-avoidance point of view.
>  # Right now the implementation uses an additional "overflow dictionary" to 
> store the seen items. This adds a copy on each Finish call and an additional 
> lookup on each GetItem or Append call. I assume we might get away with 
> returning Array slices at Finish, which would remove the need for an 
> additional overflow dictionary. If the gist of the PR is approved, I can look 
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the 
> DictionaryBuilder API remains basically the same. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2176) [C++] Extend DictionaryBuilder to support delta dictionaries

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372196#comment-16372196
 ] 

ASF GitHub Bot commented on ARROW-2176:
---

wesm commented on issue #1629: ARROW-2176: [C++] Extend DictionaryBuilder to 
support delta dictionaries
URL: https://github.com/apache/arrow/pull/1629#issuecomment-367511802
 
 
   rebased


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Extend DictionaryBuilder to support delta dictionaries
> 
>
> Key: ARROW-2176
> URL: https://issues.apache.org/jira/browse/ARROW-2176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] specifies the 
> possibility of sending additional dictionary batches with a previously seen 
> id and an isDelta flag to extend the existing dictionaries with new entries. 
> Right now, the DictionaryBuilder (as well as the IPC writer and reader) does 
> not support generation of delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder 
> with delta dictionary support. The usage API can be seen in the dictionary 
> tests (i.e. 
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
>  The basic idea is that the user simply reuses the builder object after calling 
> Finish(Array*) for the first time. Subsequent calls to Append will create new 
> entries only for unseen elements and reuse ids from previous dictionaries 
> for the seen ones.
> Some considerations:
>  # The API is pretty implicit; an additional flag for Finish that 
> explicitly indicates a desire to use the builder for delta dictionary 
> generation might be expedient from an error-avoidance point of view.
>  # Right now the implementation uses an additional "overflow dictionary" to 
> store the seen items. This adds a copy on each Finish call and an additional 
> lookup on each GetItem or Append call. I assume we might get away with 
> returning Array slices at Finish, which would remove the need for an 
> additional overflow dictionary. If the gist of the PR is approved, I can look 
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the 
> DictionaryBuilder API remains basically the same. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2191) [C++] Only use specific version of jemalloc

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372193#comment-16372193
 ] 

ASF GitHub Bot commented on ARROW-2191:
---

wesm commented on issue #1633: ARROW-2191: [C++] Only use specific version of 
jemalloc
URL: https://github.com/apache/arrow/pull/1633#issuecomment-367511562
 
 
   rebased


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Only use specific version of jemalloc
> ---
>
> Key: ARROW-2191
> URL: https://issues.apache.org/jira/browse/ARROW-2191
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> As we want to avoid conflicts with system copies of jemalloc that are either 
> outdated or incompatible with older compilers (known bugs, known shortcomings, 
> or failure to compile), we want to use only our own prefixed version 
> of jemalloc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372192#comment-16372192
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on issue #1635: ARROW-2142: [Python] Allow conversion from Numpy 
struct array
URL: https://github.com/apache/arrow/pull/1635#issuecomment-367511441
 
 
   rebased


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372183#comment-16372183
 ] 

ASF GitHub Bot commented on ARROW-2184:
---

wesm commented on issue #1642: ARROW-2184: [C++]  Add static ctor for 
FileOutputStream returning shared_ptr to base OutputStream
URL: https://github.com/apache/arrow/pull/1642#issuecomment-367510988
 
 
   Just rebased this; please run:
   
   ```
   git fetch origin
   git reset --hard origin/ARROW-2184
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Add static ctor for FileOutputStream returning shared_ptr to base 
> OutputStream
> 
>
> Key: ARROW-2184
> URL: https://issues.apache.org/jira/browse/ARROW-2184
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It would be useful for most IO ctors to return pointers to the base interface 
> that they implement rather than the subclass. Whether we deprecate the 
> current ones will vary on a case-by-case basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2184:
--
Labels: pull-request-available  (was: )

> [C++] Add static ctor for FileOutputStream returning shared_ptr to base 
> OutputStream
> 
>
> Key: ARROW-2184
> URL: https://issues.apache.org/jira/browse/ARROW-2184
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It would be useful for most IO ctors to return pointers to the base interface 
> that they implement rather than the subclass. Whether we deprecate the 
> current ones will vary on a case-by-case basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372169#comment-16372169
 ] 

ASF GitHub Bot commented on ARROW-2184:
---

xuepanchen opened a new pull request #1642: ARROW-2184: [C++]  Add static ctor 
for FileOutputStream returning shared_ptr to base OutputStream
URL: https://github.com/apache/arrow/pull/1642
 
 
   Add constructors to return pointers to the base interface


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Add static ctor for FileOutputStream returning shared_ptr to base 
> OutputStream
> 
>
> Key: ARROW-2184
> URL: https://issues.apache.org/jira/browse/ARROW-2184
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It would be useful for most IO ctors to return pointers to the base interface 
> that they implement rather than the subclass. Whether we deprecate the 
> current ones will vary on a case-by-case basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371840#comment-16371840
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on issue #1619: ARROW-2162: [Python/C++] Decimal Values with 
too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#issuecomment-367427993
 
 
   @pitrou thanks for the thorough review, much appreciated!




> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an
> extra factor of 10**scale, but that doesn't seem to be the case. If I change
> the scale, it always seems to be a factor of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> {code}
> I see the same behavior if I use floating point to initialize the array
> rather than Python's decimal type.
> I searched GitHub and JIRA for open issues but didn't find anything related
> to this. I am using pyarrow 0.8.0 on OS X 10.12.6 with Python 2.7.14
> installed via Homebrew.
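The exception the reporter asks for can be sketched with the standard-library decimal module alone. This is a minimal sketch: `check_fits` is a hypothetical helper, not pyarrow's API, and it simply rejects a value whose digits exceed the declared (precision, scale) before it would reach an array builder.

```python
import decimal

def check_fits(value, precision, scale):
    """Reject a Decimal that does not fit decimal128(precision, scale).

    Hypothetical validation helper, not pyarrow's implementation.
    """
    quantum = decimal.Decimal(1).scaleb(-scale)   # e.g. scale=2 -> Decimal('0.01')
    q = value.quantize(quantum)                   # round to the declared scale
    if q != value:
        raise ValueError(f"{value} does not fit scale {scale}")
    if len(q.as_tuple().digits) > precision:
        raise ValueError(f"{value} exceeds precision {precision}")
    return q

print(check_fits(decimal.Decimal('1.23'), 10, 2))   # 1.23
```

With such a guard in front of the conversion, `Decimal('1.234')` against `(10, 2)` raises instead of being silently rescaled.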





[jira] [Assigned] (ARROW-2008) [Python] Type inference for int32 NumPy arrays (expecting list) returns int64 and then conversion fails

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2008:
---

Assignee: Wes McKinney

> [Python] Type inference for int32 NumPy arrays (expecting list) 
> returns int64 and then conversion fails
> --
>
> Key: ARROW-2008
> URL: https://issues.apache.org/jira/browse/ARROW-2008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> See report in [https://github.com/apache/arrow/issues/1430]
> {{arrow::py::InferArrowType}} is called, which traverses the array as though
> it were any other Python sequence, and NumPy int32 scalars are not
> recognized as such.
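The failure mode described above can be illustrated with a stdlib-only toy. `infer_type` and `FakeInt32` are hypothetical stand-ins, not Arrow's code: the point is only that a traversal which recognizes builtin Python types misses a foreign integer scalar, because a NumPy int32 scalar is not a subclass of Python's `int`.

```python
def infer_type(seq):
    """Toy sequence-based type inference (illustrative, not Arrow's)."""
    for v in seq:
        if isinstance(v, bool):       # bool first: bool is a subclass of int
            return 'bool'
        if isinstance(v, int):
            return 'int64'            # the widest integer type wins
        if isinstance(v, float):
            return 'double'
    return 'unknown'

class FakeInt32:
    """Stand-in for a NumPy int32 scalar, which is not a Python int."""
    def __init__(self, value):
        self.value = value

print(infer_type([1, 2, 3]))          # int64
print(infer_type([FakeInt32(1)]))     # unknown
```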





[jira] [Assigned] (ARROW-1345) [Python] Conversion from nested NumPy arrays fails on integers other than int64, float32

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1345:
---

Assignee: Wes McKinney

> [Python] Conversion from nested NumPy arrays fails on integers other than 
> int64, float32
> 
>
> Key: ARROW-1345
> URL: https://issues.apache.org/jira/browse/ARROW-1345
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> The inferred types are the largest ones, and conversion then fails on any
> arrays with smaller types because only exact conversions are implemented
> so far.
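The missing non-exact path can be sketched as a simple range-checked widening. This is a hedged sketch under the assumption that a value check is acceptable; `INT_RANGES` and `widen_to_int64` are illustrative names, not Arrow's implementation.

```python
# Value ranges of the common signed integer widths.
INT_RANGES = {
    'int8':  (-2**7,  2**7 - 1),
    'int16': (-2**15, 2**15 - 1),
    'int32': (-2**31, 2**31 - 1),
    'int64': (-2**63, 2**63 - 1),
}

def widen_to_int64(value, source_type):
    """Accept any integer that fits its claimed source width, then widen.

    Illustrative helper: every smaller signed integer type fits in int64,
    so after the range check the widening itself is always safe.
    """
    lo, hi = INT_RANGES[source_type]
    if not (lo <= value <= hi):
        raise OverflowError(f"{value} out of range for {source_type}")
    return int(value)
```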





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372149#comment-16372149
 ] 

Wes McKinney commented on ARROW-2193:
-

I opened https://issues.apache.org/jira/browse/ARROW-2196 about trying to
address this.

> [Plasma] plasma_store has runtime dependency on Boost shared libraries when 
> ARROW_BOOST_USE_SHARED=on
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Created] (ARROW-2196) [C++] Consider quarantining platform code with dependency on non-header Boost code

2018-02-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2196:
---

 Summary: [C++] Consider quarantining platform code with dependency 
on non-header Boost code
 Key: ARROW-2196
 URL: https://issues.apache.org/jira/browse/ARROW-2196
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


see discussion in ARROW-2193 for the motivation





[jira] [Updated] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2193:

Summary: [Plasma] plasma_store has runtime dependency on Boost shared 
libraries when ARROW_BOOST_USE_SHARED=on  (was: [Plasma] plasma_store forks 
endlessly)

> [Plasma] plasma_store has runtime dependency on Boost shared libraries when 
> ARROW_BOOST_USE_SHARED=on
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372148#comment-16372148
 ] 

Wes McKinney commented on ARROW-2193:
-

OK, I found the answer. There is no such thing as {{PRIVATE}} for static 
libraries:

https://cmake.org/pipermail/cmake/2016-May/063399.html

I guess this kind of makes sense. I don't think this is fixable then, but we 
should document the requirements if the user is using 
{{ARROW_BOOST_USE_SHARED=on}}, which is the default.

[~pitrou] you can make this problem go away by passing 
{{-DARROW_BOOST_USE_SHARED=off}}. I do all my development with Boost static 
linking.

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372141#comment-16372141
 ] 

Wes McKinney commented on ARROW-2193:
-

For the record, here is the command line that's being generated for me

{code}
/usr/bin/clang++-5.0  -ggdb -O0  -Weverything -Wno-c++98-compat 
-Wno-c++98-compat-pedantic -Wno-deprecated -Wno-weak-vtables -Wno-padded 
-Wno-comma -Wno-unused-parameter -Wno-unused-template -Wno-undef -Wno-shadow 
-Wno-switch-enum -Wno-exit-time-destructors -Wno-global-constructors 
-Wno-weak-template-vtables -Wno-undefined-reinterpret-cast 
-Wno-implicit-fallthrough -Wno-unreachable-code-return -Wno-float-equal 
-Wno-missing-prototypes -Wno-old-style-cast -Wno-covered-switch-default 
-Wno-cast-align -Wno-vla-extension -Wno-shift-sign-overflow 
-Wno-used-but-marked-unused -Wno-missing-variable-declarations 
-Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-sign-conversion 
-Wno-disabled-macro-expansion -Wno-gnu-folding-constant -Wno-reserved-id-macro 
-Wno-range-loop-analysis -Wno-double-promotion -Wno-undefined-func-template 
-Wno-zero-as-null-pointer-constant -Wno-unknown-warning-option -Werror 
-std=c++11 -msse3 -maltivec -Werror -Qunused-arguments  -D_XOPEN_SOURCE=500 
-D_POSIX_C_SOURCE=200809L -fPIC -g  -rdynamic 
src/plasma/CMakeFiles/plasma_store.dir/store.cc.o  -o debug/plasma_store  
-Wl,-rpath,/home/wesm/cpp-toolchain/lib: -lrt debug/libplasma.a 
debug/libarrow.a -lrt orc_ep-install/lib/liborc.a 
../thirdparty/protobuf_ep-install/lib/libprotobuf.a 
/home/wesm/cpp-toolchain/lib/libzstd.a /home/wesm/cpp-toolchain/lib/libz.a 
/home/wesm/cpp-toolchain/lib/libsnappy.a /home/wesm/cpp-toolchain/lib/liblz4.a 
/home/wesm/cpp-toolchain/lib/libbrotlidec.a 
/home/wesm/cpp-toolchain/lib/libbrotlienc.a 
/home/wesm/cpp-toolchain/lib/libbrotlicommon.a -lpthread 
/home/wesm/cpp-toolchain/lib/libboost_system.so 
/home/wesm/cpp-toolchain/lib/libboost_filesystem.so 
/home/wesm/cpp-toolchain/lib/libflatbuffers.a -lpthread && :
{code}

The linker problem isn't isolated to clang; here it is with gcc-4.9:

{code}
/usr/bin/g++-4.9  -ggdb -O0  -Wall -Wconversion -Wno-sign-conversion 
-Wno-unknown-warning-option -Werror -std=c++11 -msse3 -Werror 
-D_XOPEN_SOURCE=500 -D_POSIX_C_SOURCE=200809L -fPIC -g  -rdynamic 
src/plasma/CMakeFiles/plasma_store.dir/store.cc.o  -o debug/plasma_store  
-Wl,-rpath,/home/wesm/cpp-toolchain/lib: -lrt debug/libplasma.a 
debug/libarrow.a -lrt orc_ep-install/lib/liborc.a 
../thirdparty/protobuf_ep-install/lib/libprotobuf.a 
/home/wesm/cpp-toolchain/lib/libzstd.a /home/wesm/cpp-toolchain/lib/libz.a 
/home/wesm/cpp-toolchain/lib/libsnappy.a /home/wesm/cpp-toolchain/lib/liblz4.a 
/home/wesm/cpp-toolchain/lib/libbrotlidec.a 
/home/wesm/cpp-toolchain/lib/libbrotlienc.a 
/home/wesm/cpp-toolchain/lib/libbrotlicommon.a -lpthread 
/home/wesm/cpp-toolchain/lib/libboost_system.so 
/home/wesm/cpp-toolchain/lib/libboost_filesystem.so 
/home/wesm/cpp-toolchain/lib/libflatbuffers.a -lpthread && :
{code}

I guess that because no Boost symbols are used, the binary produced by gcc 
does not have a runtime dependency on the .so files:

{code}
$ ldd debug/plasma_store 
linux-vdso.so.1 =>  (0x7ffd8390b000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7f9ab8354000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 
(0x7f9ab8041000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f9ab7d3b000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
(0x7f9ab7b24000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f9ab775b000)
/lib64/ld-linux-x86-64.so.2 (0x7f9ab8572000)
{code}

But the Boost libraries shouldn't be passed to the linker at all.

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> 

[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372137#comment-16372137
 ] 

Wes McKinney commented on ARROW-2193:
-

I've spent almost an hour on this and I'm stumped

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Commented] (ARROW-1628) [Python] Incorrect serialization of numpy datetimes.

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372120#comment-16372120
 ] 

Wes McKinney commented on ARROW-1628:
-

Moved to 0.10.0

> [Python] Incorrect serialization of numpy datetimes.
> 
>
> Key: ARROW-1628
> URL: https://issues.apache.org/jira/browse/ARROW-1628
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.10.0
>
>
> See https://github.com/ray-project/ray/issues/1041.
> The issue can be reproduced as follows.
> {code}
> import datetime
> import pyarrow as pa
> import numpy as np
> t = np.datetime64(datetime.datetime.now())
> print(type(t), t)  #  2017-09-30T09:50:46.089952
> t_new = pa.deserialize(pa.serialize(t).to_buffer())
> print(type(t_new), t_new)  #  0
> {code}
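A lossless alternative to the lossy roundtrip shown above can be sketched with the standard library alone: carry the timestamp as integer microseconds since the Unix epoch. This is a hedged sketch; `to_micros`/`from_micros` are illustrative helpers, not pyarrow's serialization hooks.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

def to_micros(dt):
    """Exact integer microseconds since the epoch (no float rounding)."""
    td = dt - EPOCH
    return (td.days * 86400 + td.seconds) * 10**6 + td.microseconds

def from_micros(micros):
    """Inverse of to_micros."""
    return EPOCH + datetime.timedelta(microseconds=micros)

t = datetime.datetime(2017, 9, 30, 9, 50, 46, 89952)
assert from_micros(to_micros(t)) == t   # the roundtrip is exact
```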





[jira] [Updated] (ARROW-1628) [Python] Incorrect serialization of numpy datetimes.

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1628:

Fix Version/s: 0.10.0
   (was: 0.9.0)

> [Python] Incorrect serialization of numpy datetimes.
> 
>
> Key: ARROW-1628
> URL: https://issues.apache.org/jira/browse/ARROW-1628
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.10.0
>
>
> See https://github.com/ray-project/ray/issues/1041.
> The issue can be reproduced as follows.
> {code}
> import datetime
> import pyarrow as pa
> import numpy as np
> t = np.datetime64(datetime.datetime.now())
> print(type(t), t)  #  2017-09-30T09:50:46.089952
> t_new = pa.deserialize(pa.serialize(t).to_buffer())
> print(type(t_new), t_new)  #  0
> {code}





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372116#comment-16372116
 ] 

Wes McKinney commented on ARROW-2193:
-

I'm able to reproduce the Boost problem. For some reason the {{LINK_PRIVATE}} 
argument to {{arrow_static}} is not being respected -- the Boost libraries are 
being passed there. I'm not sure what is going on

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Commented] (ARROW-2069) [Python] Document that Plasma is not (yet) supported on Windows

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372093#comment-16372093
 ] 

ASF GitHub Bot commented on ARROW-2069:
---

wesm opened a new pull request #1641: ARROW-2069: [Python] Add note that Plasma 
is not supported on Windows
URL: https://github.com/apache/arrow/pull/1641
 
 
   




> [Python] Document that Plasma is not (yet) supported on Windows
> ---
>
> Key: ARROW-2069
> URL: https://issues.apache.org/jira/browse/ARROW-2069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1531





[jira] [Updated] (ARROW-2069) [Python] Document that Plasma is not (yet) supported on Windows

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2069:
--
Labels: pull-request-available  (was: )

> [Python] Document that Plasma is not (yet) supported on Windows
> ---
>
> Key: ARROW-2069
> URL: https://issues.apache.org/jira/browse/ARROW-2069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1531





[jira] [Assigned] (ARROW-2069) [Python] Document that Plasma is not (yet) supported on Windows

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2069:
---

Assignee: Wes McKinney

> [Python] Document that Plasma is not (yet) supported on Windows
> ---
>
> Key: ARROW-2069
> URL: https://issues.apache.org/jira/browse/ARROW-2069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1531





[jira] [Updated] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Philipp Moritz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz updated ARROW-2195:
--
Description: 
It can be reproduced with the following script:
{code:java}
import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()

    print(batch)
    print(batch.schema)
    print(batch[0])

    return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])
{code}

Preliminary backtrace:

{code}
stop reason = EXC_BAD_ACCESS (code=1, address=0x38158)
    frame #0: 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
    0x10e645800 <+32>: callq  0x10e698170    ; symbol stub for: PyInt_FromLong
    0x10e645805 <+37>: testq  %rax, %rax
    0x10e645808 <+40>: je     0x10e64580c    ; <+44>
(lldb) bt
* thread #1: tid = 0xf1378e, 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x38158)
  * frame #0: 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
    frame #2: 0x00010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
{code}


> [Plasma] Segfault when retrieving RecordBatch from plasma store
> 

[jira] [Commented] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372059#comment-16372059
 ] 

ASF GitHub Bot commented on ARROW-2131:
---

pitrou commented on a change in pull request #1640: ARROW-2131: [Python] 
Prepend module path to PYTHONPATH when spawning subprocess
URL: https://github.com/apache/arrow/pull/1640#discussion_r169786220
 
 

 ##
 File path: python/pyarrow/tests/test_serialization.py
 ##
 @@ -580,6 +580,21 @@ def deserialize_regex(serialized, q):
     p.join()
 
 
+def _get_modified_env_with_pythonpath():
+    # Prepend pyarrow root directory to PYTHONPATH
+    env = os.environ.copy()
+    existing_pythonpath = env.get('PYTHONPATH', '')
+    if sys.platform == 'win32':
+        sep = ';'
+    else:
+        sep = ':'
+
+    module_path, _ = os.path.split(pa.__path__[0])
 
 Review comment:
   `os.path.abspath(os.path.dirname(pa.__file__))`, no?




> [Python] Serialization test fails on Windows when library has been built in 
> place / not installed
> -
>
> Key: ARROW-2131
> URL: https://issues.apache.org/jira/browse/ARROW-2131
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure why this doesn't come up in Appveyor:
> {code}
> == FAILURES 
> ===
>  test_deserialize_buffer_in_different_process 
> _
> def test_deserialize_buffer_in_different_process():
> import tempfile
> import subprocess
> f = tempfile.NamedTemporaryFile(delete=False)
> b = pa.serialize(pa.frombuffer(b'hello')).to_buffer()
> f.write(b.to_pybytes())
> f.close()
> dir_path = os.path.dirname(os.path.realpath(__file__))
> python_file = os.path.join(dir_path, 'deserialize_buffer.py')
> >   subprocess.check_call([sys.executable, python_file, f.name])
> pyarrow\tests\test_serialization.py:596:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> popenargs = (['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att'],)
> kwargs = {}, retcode = 1
> cmd = ['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> The arguments are the same as for the call function.  Example:
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']' returned non-zero 
> exit status 1.
> C:\Miniconda3\envs\pyarrow-dev\lib\subprocess.py:291: CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "C:\Users\wesm\code\arrow\python\pyarrow\tests\deserialize_buffer.py", 
> line 22, in <module>
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> === 1 failed, 15 passed, 4 skipped in 0.40 seconds 
> 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372054#comment-16372054
 ] 

ASF GitHub Bot commented on ARROW-2131:
---

wesm opened a new pull request #1640: ARROW-2131: [Python] Prepend module path 
to PYTHONPATH when spawning subprocess
URL: https://github.com/apache/arrow/pull/1640
 
 
   This enables this test to pass in an in-place build without running 
`setup.py develop`




[jira] [Commented] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372056#comment-16372056
 ] 

ASF GitHub Bot commented on ARROW-2131:
---

wesm commented on a change in pull request #1640: ARROW-2131: [Python] Prepend 
module path to PYTHONPATH when spawning subprocess
URL: https://github.com/apache/arrow/pull/1640#discussion_r169785874
 
 

 ##
 File path: python/pyarrow/tests/test_serialization.py
 ##
 @@ -580,6 +580,21 @@ def deserialize_regex(serialized, q):
 p.join()
 
 
+def _get_modified_env_with_pythonpath():
+# Prepend pyarrow root directory to PYTHONPATH
+env = os.environ.copy()
+existing_pythonpath = env.get('PYTHONPATH', '')
+if sys.platform == 'win32':
+sep = ';'
+else:
+sep = ':'
+
+module_path, _ = os.path.split(pa.__path__[0])
 
 Review comment:
   @pitrou is there a more approved way to get the module absolute directory 
path?
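A sketch of one way the helper in the diff could be completed, per the review suggestion: derive the import root from the package's `__file__` and use `os.pathsep` instead of the manual `sys.platform` check. The helper name and the stdlib `json` stand-in (in place of `pyarrow`) are illustrative assumptions, not the merged code.

```python
import os
import subprocess
import sys

def env_with_module_on_pythonpath(module):
    """Return a copy of os.environ with the parent directory of
    ``module`` prepended to PYTHONPATH, so a spawned interpreter can
    import an in-place (not installed) build of the package.

    Hypothetical helper mirroring the diff above."""
    env = os.environ.copy()
    # Directory containing the package, then its parent (the import root)
    package_dir = os.path.abspath(os.path.dirname(module.__file__))
    import_root = os.path.dirname(package_dir)
    existing = env.get('PYTHONPATH', '')
    # os.pathsep is ';' on Windows and ':' elsewhere, replacing the
    # manual sys.platform check in the diff
    env['PYTHONPATH'] = import_root + (os.pathsep + existing if existing else '')
    return env

# Demo with a stdlib package standing in for pyarrow: the subprocess
# inherits the modified environment and can import the package.
import json
subprocess.check_call([sys.executable, '-c', 'import json'],
                      env=env_with_module_on_pythonpath(json))
```

This is the same pattern the failing `test_deserialize_buffer_in_different_process` needs: the child interpreter gets the build tree on its import path without requiring `setup.py develop`.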




[jira] [Updated] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2131:
--
Labels: pull-request-available  (was: )



[jira] [Commented] (ARROW-1621) [JAVA] Reduce Heap Usage per Vector

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371948#comment-16371948
 ] 

Wes McKinney commented on ARROW-1621:
-

Moving off 0.9.0. What more needs to be done here?

> [JAVA] Reduce Heap Usage per Vector
> ---
>
> Key: ARROW-1621
> URL: https://issues.apache.org/jira/browse/ARROW-1621
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Memory, Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
> Fix For: 0.10.0
>
>
> https://docs.google.com/document/d/1MU-ah_bBHIxXNrd7SkwewGCOOexkXJ7cgKaCis5f-PI/edit





[jira] [Updated] (ARROW-1463) [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated code

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1463:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated 
> code
> 
>
> Key: ARROW-1463
> URL: https://issues.apache.org/jira/browse/ARROW-1463
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jacques Nadeau
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The templates used in the java package are very high maintenance and the if 
> conditions are hard to track. As started in the discussion here: 
> https://github.com/apache/arrow/pull/1012, I'd like to propose that we modify 
> the structure of the internal value vectors and code generation dynamics.
> Create new abstract base vectors:
> BaseFixedVector
> BaseVariableVector
> BaseNullableVector
> For each of these, implement all the basic functionality of a vector without 
> using templating.
> Evaluate whether to use code generation to generate specific specializations 
> of this functionality for each type where needed for performance purposes 
> (probably constrained to mutator and accessor set/get methods). Giant and 
> complex if conditions in the templates are actually worse from my perspective 
> than a small amount of hand written duplicated code since templates are much 
> harder to work with. 





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371982#comment-16371982
 ] 

Antoine Pitrou commented on ARROW-2193:
---

{quote}Do you know that fork is being called?{quote}

I don't know, but the process tree suggests so (see report). That said, after 
recompiling I no longer see this; instead, plasma_store fails to launch.

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Updated] (ARROW-2194) [Python] Pandas columns metadata incorrect for empty string columns

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2194:

Summary: [Python] Pandas columns metadata incorrect for empty string 
columns  (was: Pandas columns metadata incorrect for empty string columns)

> [Python] Pandas columns metadata incorrect for empty string columns
> ---
>
> Key: ARROW-2194
> URL: https://issues.apache.org/jira/browse/ARROW-2194
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.9.0
>
>
> The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
> DataFrame is unexpectedly {{float64}}
>  
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import json
> empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': 
> np.array([], dtype=np.bytes_)})
> empty_table = pa.Table.from_pandas(empty_df)
> json.loads(empty_table.schema.metadata[b'pandas'])['columns']
> # Same behavior for input dtype np.unicode_
> [{u'field_name': u'bytes',
> u'metadata': None,
> u'name': u'bytes',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'unicode',
> u'metadata': None,
> u'name': u'unicode',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'__index_level_0__',
> u'metadata': None,
> u'name': None,
> u'numpy_type': u'int64',
> u'pandas_type': u'int64'}]{code}
>  
> Tested on Debian 8 with python2.7 and python 3.6.4
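A toy illustration of why an empty column can pick up a surprising `pandas_type`: value-based type inference has nothing to sample, so it falls back to a default. This is a hypothetical stand-in, not pyarrow's actual inference code; the function name and fallback logic are assumptions made for illustration.

```python
def infer_pandas_type(values):
    """Toy stand-in for value-based type inference (NOT pyarrow's
    real code): with an empty column there are no values to sample,
    so inference falls back to a default -- mirroring how the empty
    bytes/unicode columns above end up tagged 'float64'."""
    if len(values) == 0:
        return 'float64'  # the surprising fallback for empty input
    if all(isinstance(v, str) for v in values):
        return 'unicode'
    if all(isinstance(v, bytes) for v in values):
        return 'bytes'
    return 'object'

print(infer_pandas_type([]))          # empty column: falls back
print(infer_pandas_type(['a', 'b']))  # non-empty: inferred from values
```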





[jira] [Assigned] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2131:
---

Assignee: Wes McKinney



[jira] [Commented] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371993#comment-16371993
 ] 

ASF GitHub Bot commented on ARROW-2180:
---

wesm commented on issue #1638: ARROW-2180: [C++] Remove deprecated APIs from 
0.8.0 cycle
URL: https://github.com/apache/arrow/pull/1638#issuecomment-367467535
 
 
   Appveyor build: https://ci.appveyor.com/project/wesm/arrow/build/1.0.1706




> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-1833) [Java] Add accessor methods for data buffers that skip null checking

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1833:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Java] Add accessor methods for data buffers that skip null checking
> 
>
> Key: ARROW-1833
> URL: https://issues.apache.org/jira/browse/ARROW-1833
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
> Fix For: 0.10.0
>
>






[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371828#comment-16371828
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

pitrou commented on issue #1635: ARROW-2142: [Python] Allow conversion from 
Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#issuecomment-367414689
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.101




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}
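At its core, the unimplemented conversion amounts to splitting a row-wise structured array into one named column per field (with numpy, `arr[name]` per field). A pure-Python sketch of that split, with illustrative field names:

```python
# Row-wise structured data becomes named columns, one per field.
# Pure-Python stand-in for the per-field split a struct conversion
# performs; 'x' and 'y' are made-up field names for illustration.
rows = [(1.5, 2), (3.5, 4)]
fields = ['x', 'y']

columns = {name: [row[i] for row in rows]
           for i, name in enumerate(fields)}
print(columns)
```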





[jira] [Updated] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Philipp Moritz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz updated ARROW-2195:
--
Description: 
It can be reproduced with the following script:

{code:python}
import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
client = plasma.connect('test', "", 0)

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))

[buff] = client.get_buffers([pid])
batch = pa.RecordBatchStreamReader(buff).read_next_batch()

print(batch)
print(batch.schema)
print(batch[0])

return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])
{code}
 

Preliminary backtrace:

 

{code}

EXC_BAD_ACCESS (code=1, address=0x38158)

    frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:

->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi

    0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
PyInt_FromLong

    0x10e645805 <+37>: testq  %rax, %rax

    0x10e645808 <+40>: je     0x10e64580c               ; <+44>

(lldb) bt
 * thread #1: tid = 0xf1378e, 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
address=0x38158)

  * frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133

    frame #2: 0x00010e613b25 
lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933

    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60

    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305

{code}



> [Plasma] Segfault when retrieving RecordBatch from plasma store
> ---
>
> 

[jira] [Assigned] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-21 Thread Panchen Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panchen Xue reassigned ARROW-2184:
--

Assignee: Panchen Xue

> [C++] Add static ctor for FileOutputStream returning shared_ptr to base 
> OutputStream
> 
>
> Key: ARROW-2184
> URL: https://issues.apache.org/jira/browse/ARROW-2184
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
> Fix For: 0.9.0
>
>
> It would be useful for most IO ctors to return pointers to the base interface 
> that they implement rather than the subclass. Whether we deprecate the 
> current ones will vary on a case by case basis





[jira] [Commented] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371963#comment-16371963
 ] 

Wes McKinney commented on ARROW-2184:
-

I think we should make a decision on whether to deprecate the existing ctors in 
Arrow 0.9.0 

> [C++] Add static ctor for FileOutputStream returning shared_ptr to base 
> OutputStream
> 
>
> Key: ARROW-2184
> URL: https://issues.apache.org/jira/browse/ARROW-2184
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> It would be useful for most IO ctors to return pointers to the base interface 
> that they implement rather than the subclass. Whether we deprecate the 
> current ones will vary on a case by case basis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2166) [GLib] Implement Slice for Column

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2166:
---

Assignee: yosuke shiro

> [GLib] Implement Slice for Column
> -
>
> Key: ARROW-2166
> URL: https://issues.apache.org/jira/browse/ARROW-2166
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: yosuke shiro
>Assignee: yosuke shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Add a {{Slice}} API to Column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2036) NativeFile should support standard IOBase methods

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2036:
---

Assignee: Jim Crist

> NativeFile should support standard IOBase methods
> -
>
> Key: ARROW-2036
> URL: https://issues.apache.org/jira/browse/ARROW-2036
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Jim Crist
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If `NativeFile` supported most/all of the standard IOBase methods 
> ([https://docs.python.org/3/library/io.html#io.IOBase]), then it'd be easier 
> to use arrow files with other python libraries. Would at least be nice to 
> support enough operations to use `io.TextIOWrapper`.
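The compatibility surface at stake can be sketched with stdlib streams alone: `io.TextIOWrapper` only needs the standard IOBase methods, so any `NativeFile` exposing them would compose the same way (a sketch with a stand-in stream, not pyarrow's actual 0.8 API):

```python
import io

# TextIOWrapper consumes any binary stream exposing the IOBase surface:
# readable(), read(), seekable(), close(), and friends.
raw = io.BytesIO(b"alpha\nbeta\n")  # stand-in for an Arrow NativeFile
text = io.TextIOWrapper(raw, encoding="utf-8")

print(text.read().splitlines())  # ['alpha', 'beta']
```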



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2111) [C++] Linting could be faster

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2111:
---

Assignee: Antoine Pitrou

> [C++] Linting could be faster
> -
>
> Key: ARROW-2111
> URL: https://issues.apache.org/jira/browse/ARROW-2111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently {{make lint}} style-checks C++ files sequentially (by calling 
> {{cpplint}}). We could instead style-check those files in parallel.
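The parallel variant is a small change; a sketch in Python, where the checker callable is a placeholder for the real `cpplint` subprocess call:

```python
import concurrent.futures

def lint_all(paths, check, jobs=8):
    # Run the style checker over all files concurrently instead of one
    # after another; `check` stands in for a cpplint subprocess call.
    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(check, paths))

# A hypothetical real checker would shell out, e.g.:
#   check = lambda path: (path, subprocess.call(["cpplint", path]))
print(lint_all(["a.cc", "b.cc"], check=lambda p: (p, 0)))
# [('a.cc', 0), ('b.cc', 0)]
```

`pool.map` preserves input order, so the report reads the same as the sequential version.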



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2128) [Python] Cannot serialize array of empty lists

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2128:
---

Assignee: Uwe L. Korn

> [Python] Cannot serialize array of empty lists
> --
>
> Key: ARROW-2128
> URL: https://issues.apache.org/jira/browse/ARROW-2128
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This currently failing:
> {code:java}
> data = pd.Series([[], [], []])
> arr = pa.array(data)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> array.pxi:181: in pyarrow.lib.array
> ???
> array.pxi:26: in pyarrow.lib._sequence_to_array
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> > ???
> E pyarrow.lib.ArrowTypeError: Unable to determine data type
> {code}
> The code in {{SeqVisitor::GetType}} suggests that we don't want to support 
> this, but I would have expected the above to result in {{List}}.
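Why the inference fails is easy to model: with nothing but empty lists there is no element whose type can seed the result. A toy stdlib model of value-based inference (an illustration, not pyarrow's actual code):

```python
def infer_list_type(rows):
    """Toy model of value-based type inference over a column of lists.

    Returns 'list<T>' once some element reveals T, else None -- which is
    the situation pa.array() hits for [[], [], []].
    """
    item_type = None
    for row in rows:
        for item in row:
            item_type = type(item).__name__
            break
        if item_type:
            break
    return f"list<{item_type}>" if item_type else None

print(infer_list_type([[], [1], []]))  # list<int>
print(infer_list_type([[], [], []]))  # None: no element to inspect
```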



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-549) [C++] Add function to concatenate like-typed arrays

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-549:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Add function to concatenate like-typed arrays
> ---
>
> Key: ARROW-549
> URL: https://issues.apache.org/jira/browse/ARROW-549
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: Analytics
> Fix For: 0.10.0
>
>
> A la 
> {{Status arrow::Concatenate(const std::vector<std::shared_ptr<Array>>& 
> arrays, MemoryPool* pool, std::shared_ptr<Array>* out)}}
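The contract can be modeled in Python with `(type, values)` pairs standing in for typed Arrow arrays — a sketch of the semantics, not the C++ implementation:

```python
def concatenate(chunks):
    """chunks: list of (dtype, values) pairs standing in for typed arrays.

    All chunks must share one dtype, mirroring the 'like-typed' requirement.
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    dtype = chunks[0][0]
    if any(t != dtype for t, _ in chunks):
        raise TypeError("all chunks must share one type")
    values = [v for _, vals in chunks for v in vals]
    return dtype, values

print(concatenate([("int32", [1, 2]), ("int32", [3])]))
# ('int32', [1, 2, 3])
```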



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1463) [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated code

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371949#comment-16371949
 ] 

Wes McKinney commented on ARROW-1463:
-

Where does this work stand?

> [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated 
> code
> 
>
> Key: ARROW-1463
> URL: https://issues.apache.org/jira/browse/ARROW-1463
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jacques Nadeau
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The templates used in the java package are very high maintenance and the if 
> conditions are hard to track. As stated in the discussion here: 
> https://github.com/apache/arrow/pull/1012, I'd like to propose that we modify 
> the structure of the internal value vectors and code generation dynamics.
> Create new abstract base vectors:
> BaseFixedVector
> BaseVariableVector
> BaseNullableVector
> For each of these, implement all the basic functionality of a vector without 
> using templating.
> Evaluate whether to use code generation to generate specific specializations 
> of this functionality for each type where needed for performance purposes 
> (probably constrained to mutator and accessor set/get methods). Giant and 
> complex if conditions in the templates are actually worse from my perspective 
> than a small amount of hand written duplicated code since templates are much 
> harder to work with. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1394) [Plasma] Add optional extension for allocating memory on GPUs

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1394:
---

Assignee: William Paul

> [Plasma] Add optional extension for allocating memory on GPUs
> -
>
> Key: ARROW-1394
> URL: https://issues.apache.org/jira/browse/ARROW-1394
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Plasma (C++)
>Reporter: Wes McKinney
>Assignee: William Paul
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It would be useful to be able to allocate memory to be shared between 
> processes via Plasma using the CUDA IPC API



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1621) [JAVA] Reduce Heap Usage per Vector

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1621:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [JAVA] Reduce Heap Usage per Vector
> ---
>
> Key: ARROW-1621
> URL: https://issues.apache.org/jira/browse/ARROW-1621
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Memory, Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
> Fix For: 0.10.0
>
>
> https://docs.google.com/document/d/1MU-ah_bBHIxXNrd7SkwewGCOOexkXJ7cgKaCis5f-PI/edit



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2121) [Python] Consider special casing object arrays in pandas serializers.

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2121:
---

Assignee: Robert Nishihara

> [Python] Consider special casing object arrays in pandas serializers.
> -
>
> Key: ARROW-2121
> URL: https://issues.apache.org/jira/browse/ARROW-2121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2195:
-

 Summary: [Plasma] Segfault when retrieving RecordBatch from plasma 
store
 Key: ARROW-2195
 URL: https://issues.apache.org/jira/browse/ARROW-2195
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


It can be reproduced with the following script:

```
import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()

    print(batch)
    print(batch.schema)
    print(batch[0])

    return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])

```
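The script's two-pass pattern — serialize once into a counting sink to learn the size, then again into a fixed-size buffer — can be modeled with stdlib streams; `serialized_size` plays the role of `pa.MockOutputStream`, and `write_into_fixed` the role of `client.create` plus `pa.FixedSizeBufferWriter`:

```python
import io

def serialized_size(write):
    """First pass: run `write` against a throwaway sink just to learn
    how many bytes the payload needs."""
    sink = io.BytesIO()
    write(sink)
    return sink.tell()

def write_into_fixed(write, size):
    """Second pass: write the payload and verify it matches the size the
    buffer was preallocated with."""
    sink = io.BytesIO()
    write(sink)
    if sink.tell() != size:
        raise ValueError("payload does not match the preallocated size")
    return sink.getvalue()

write = lambda s: s.write(b"record batch bytes")
size = serialized_size(write)
data = write_into_fixed(write, size)
print(len(data) == size)  # True
```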

 

Preliminary backtrace:

 

```

EXC_BAD_ACCESS (code=1, address=0x38158)

    frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:

->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi

    0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
PyInt_FromLong

    0x10e645805 <+37>: testq  %rax, %rax

    0x10e645808 <+40>: je     0x10e64580c               ; <+44>

(lldb) bt

* thread #1: tid = 0xf1378e, 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
address=0x38158)

  * frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133

    frame #2: 0x00010e613b25 
lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933

    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60

    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305

```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2168) [C++] Build toolchain builds with jemalloc

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2168:
---

Assignee: Uwe L. Korn

> [C++] Build toolchain builds with jemalloc
> --
>
> Key: ARROW-2168
> URL: https://issues.apache.org/jira/browse/ARROW-2168
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We have fixed all known problems in the jemalloc 4.x branch and should be 
> able to gradually reactivate it in our builds to get its performance boost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371973#comment-16371973
 ] 

Robert Nishihara commented on ARROW-2193:
-

Do you know that {{fork}} is being called? Another way this could happen is if 
the tests fail to kill the plasma store and leave a bunch of them running.

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}
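If leaked stores are the cause, a cleanup guarantee in the test harness would catch it. A stdlib sketch — the command here is a stand-in for launching `plasma_store`:

```python
import contextlib
import subprocess
import sys

@contextlib.contextmanager
def managed_store(cmd):
    """Start a helper process and guarantee it is terminated afterwards,
    so failing tests cannot leave store processes behind."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        proc.terminate()
        proc.wait()

# Stand-in command; a real fixture would launch plasma_store with -s/-m flags.
with managed_store([sys.executable, "-c", "import time; time.sleep(60)"]) as p:
    pass
print(p.poll() is not None)  # True: process reaped
```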



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2029) [Python] Program crash on `HdfsFile.tell` if file is closed

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2029:
---

Assignee: Jim Crist

> [Python] Program crash on `HdfsFile.tell` if file is closed
> ---
>
> Key: ARROW-2029
> URL: https://issues.apache.org/jira/browse/ARROW-2029
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jim Crist
>Assignee: Jim Crist
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Of all the `NativeFile` methods, `tell` is the only one that doesn't check if 
> the file is still open before running. This can lead to crashes when using 
> hdfs:
>  
> {code:java}
> >>> import pyarrow as pa
> >>> h = pa.hdfs.connect()
> 18/01/24 22:31:35 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 18/01/24 22:31:36 WARN shortcircuit.DomainSocketFactory: The short-circuit 
> local reads feature cannot be used because libhadoop cannot be loaded.
> >>> with h.open("/tmp/test.txt", mode='wb') as f:
> ... pass
> ...
> >>> f.tell()
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f52ccb6733d, pid=14868, tid=0x7f52de2b9700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 
> 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
> # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x67c33d]
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /working/python/hs_err_pid14868.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Aborted
> {code}
> In python, most file-like objects raise a `ValueError` if the file is closed:
> {code:java}
> >>> f = open("test.py", mode='wb')
> >>> f.close()
> >>> f.tell()
> Traceback (most recent call last):
>   File "", line 1, in 
> ValueError: I/O operation on closed file
> >>> import io
> >>> buf = io.BytesIO()
> >>> buf.close()
> >>> buf.tell()
> Traceback (most recent call last):
>   File "", line 1, in 
> ValueError: I/O operation on closed file.{code}
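The guard the other `NativeFile` methods already perform can be shown with a minimal stdlib wrapper (names illustrative, not pyarrow's internals):

```python
import io

class GuardedFile:
    """Minimal wrapper showing the closed-file check `tell` was missing."""

    def __init__(self, raw):
        self._raw = raw
        self._closed = False

    def close(self):
        self._closed = True
        self._raw.close()

    def _assert_open(self):
        if self._closed:
            raise ValueError("I/O operation on closed file")

    def tell(self):
        self._assert_open()  # the check ARROW-2029 asks for
        return self._raw.tell()

f = GuardedFile(io.BytesIO(b"data"))
f.close()
try:
    f.tell()
except ValueError as e:
    print(e)  # I/O operation on closed file
```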



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1171:

Fix Version/s: (was: 0.9.0)
   0.10.0

> C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and 
> self-compiled turbodbc
> 
>
> Key: ARROW-1171
> URL: https://issues.apache.org/jira/browse/ARROW-1171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.4.1
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Original issue: https://github.com/blue-yonder/turbodbc/issues/102
> When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora 
> 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute 
> here is that the compiler version used for building Turbodbc is newer than 
> the one used for Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2024) [Python] Remove global SerializationContext variables

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2024:

Summary: [Python] Remove global SerializationContext variables  (was: 
Remove global SerializationContext variables.)

> [Python] Remove global SerializationContext variables
> -
>
> Key: ARROW-2024
> URL: https://issues.apache.org/jira/browse/ARROW-2024
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We should get rid of the global variables 
> _default_serialization_context and 
> pandas_serialization_context and replace them with functions 
> default_serialization_context() and 
> pandas_serialization_context().
> This will also make it faster to do import pyarrow.
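The replacement pattern is a lazily constructed module-level singleton; the context object below is a stand-in for the real `SerializationContext`:

```python
_default_context = None

def default_serialization_context():
    """Build the context on first use instead of at import time, which
    keeps `import pyarrow` cheap (names here are illustrative)."""
    global _default_context
    if _default_context is None:
        _default_context = {"type_handlers": {}}  # stand-in for the real object
    return _default_context

a = default_serialization_context()
b = default_serialization_context()
print(a is b)  # True: constructed once, then reused
```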



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1744) [Plasma] Provide TensorFlow operator to read tensors from plasma

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1744:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Plasma] Provide TensorFlow operator to read tensors from plasma
> 
>
> Key: ARROW-1744
> URL: https://issues.apache.org/jira/browse/ARROW-1744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> see https://www.tensorflow.org/extend/adding_an_op



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371953#comment-16371953
 ] 

ASF GitHub Bot commented on ARROW-2132:
---

wesm commented on issue #1636: ARROW-2132: Add link to Plasma in main README
URL: https://github.com/apache/arrow/pull/1636#issuecomment-367459086
 
 
   @robertnishihara @pcmoritz could you review language and tweak as desired? 
(feel free to push to this branch)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Deleted] (ARROW-1645) Access HDFS with read_table() automatically

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney deleted ARROW-1645:



> Access HDFS with read_table() automatically
> ---
>
> Key: ARROW-1645
> URL: https://issues.apache.org/jira/browse/ARROW-1645
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ehsan Totoni
>Priority: Major
>
> It'd be great to support accessing HDFS automatically like: 
> `pq.read_table('hdfs://example.parquet')`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371875#comment-16371875
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

pitrou commented on issue #1635: ARROW-2142: [Python] Allow conversion from 
Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#issuecomment-367435997
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.102


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}
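Until a converter lands, one workaround is to split the record array field by field and convert each column separately. The column-splitting step, modeled with plain tuples so neither numpy nor pyarrow is assumed:

```python
def records_to_columns(rows, fields):
    """Split row-wise records into per-field columns -- the shape a
    struct-array converter produces (stdlib stand-in)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}

cols = records_to_columns([(1.5,), (2.5,)], fields=["x"])
print(cols)  # {'x': [1.5, 2.5]}
```

With numpy, the equivalent per-field access is simply `arr['x']`, as shown in the traceback's session.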



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2195:

Fix Version/s: 0.9.0

> [Plasma] Segfault when retrieving RecordBatch from plasma store
> ---
>
> Key: ARROW-2195
> URL: https://issues.apache.org/jira/browse/ARROW-2195
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.9.0
>
>
> It can be reproduced with the following script:
> {code:python}
> import pyarrow as pa
> import pyarrow.plasma as plasma
> def retrieve1():
>     client = plasma.connect('test', "", 0)
>     key = "keynumber1keynumber1"
>     pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
>     [buff] = client.get_buffers([pid])
>     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>     print(batch)
>     print(batch.schema)
>     print(batch[0])
>     return batch
> client = plasma.connect('test', "", 0)
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
> bff = client.create(pid, sink.size())
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
> {code}
>  
> Preliminary backtrace:
>  
> {code}
> EXC_BAD_ACCESS (code=1, address=0x38158)
>     frame #0: 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
> ->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
>     0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
> PyInt_FromLong
>     0x10e645805 <+37>: testq  %rax, %rax
>     0x10e645808 <+40>: je     0x10e64580c               ; <+44>
> (lldb) bt
>  * thread #1: tid = 0xf1378e, 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
> queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
> address=0x38158)
>   * frame #0: 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
>     frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 
> 133
>     frame #2: 0x00010e613b25 
> lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
>     frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
>     frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2024) Remove global SerializationContext variables.

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2024:
---

Assignee: Robert Nishihara

> Remove global SerializationContext variables.
> -
>
> Key: ARROW-2024
> URL: https://issues.apache.org/jira/browse/ARROW-2024
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We should get rid of the global variables 
> _default_serialization_context and 
> pandas_serialization_context and replace them with functions 
> default_serialization_context() and 
> pandas_serialization_context().
> This will also make it faster to do import pyarrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-971:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++/Python] Implement Array.isvalid/notnull/isnull
> ---
>
> Key: ARROW-971
> URL: https://issues.apache.org/jira/browse/ARROW-971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For arrays with nulls, this amounts to returning the validity bitmap. Without 
> nulls, an array of all 1 bits must be constructed. For isnull, the bits must 
> be flipped (in this case, the un-set part of the new bitmap must stay 0, 
> though).
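The bit manipulation described above, as a stdlib sketch (LSB-first bit order, as Arrow uses; an illustration, not Arrow's implementation):

```python
def isnull_bitmap(validity, length):
    """Flip a validity bitmap into an is-null bitmap, keeping the unused
    trailing bits of the last byte zeroed, as the issue requires."""
    out = bytearray(b ^ 0xFF for b in validity)
    trailing = len(validity) * 8 - length
    if trailing:
        out[-1] &= 0xFF >> trailing
    return bytes(out)

# 5 values, validity 0b00010110 (bit i = value i): nulls at positions 0 and 3.
print(format(isnull_bitmap(bytes([0b00010110]), 5)[0], "08b"))  # 00001001
```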



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2121) [Python] Consider special casing object arrays in pandas serializers.

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2121:

Summary: [Python] Consider special casing object arrays in pandas 
serializers.  (was: Consider special casing object arrays in pandas 
serializers.)

> [Python] Consider special casing object arrays in pandas serializers.
> -
>
> Key: ARROW-2121
> URL: https://issues.apache.org/jira/browse/ARROW-2121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1937) [Python] Add documentation for different forms of constructing nested arrays from Python data structures

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371960#comment-16371960
 ] 

Wes McKinney commented on ARROW-1937:
-

Since we have done a bunch of work on this for 0.9.0, it would be a real shame 
to not have documentation showcasing the results. I'm leaving this on 0.9.0

> [Python] Add documentation for different forms of constructing nested arrays 
> from Python data structures 
> -
>
> Key: ARROW-1937
> URL: https://issues.apache.org/jira/browse/ARROW-1937
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371938#comment-16371938
 ] 

Wes McKinney commented on ARROW-2193:
-

OK, this seems buggy. I marked for 0.9.0

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEM    VSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7    S    13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Updated] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2193:

Fix Version/s: 0.9.0

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEM    VSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7    S    13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371820#comment-16371820
 ] 

Antoine Pitrou commented on ARROW-2193:
---

Ok, this is because I recently switched from gcc-4.9 to clang-5.0. With gcc, 
plasma_store doesn't have a runtime dependency on boost:
{code:bash}
$ ldd `which plasma_store`
linux-vdso.so.1 =>  (0x7ffc8b318000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7fdc79bbe000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 
(0x7fdc7983c000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fdc79533000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
(0x7fdc7931d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fdc78f53000)
/lib64/ld-linux-x86-64.so.2 (0x7fdc79ddb000)
{code}

But with clang I get:
{code:bash}
$ ldd `which plasma_store`
linux-vdso.so.1 =>  (0x7fff21ba4000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f0d04d5d000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7f0d04b4)
libboost_system.so.1.66.0 => not found
libboost_filesystem.so.1.66.0 => not found
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 
(0x7f0d047be000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f0d044b5000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
(0x7f0d0429f000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f0d03ed5000)
/lib64/ld-linux-x86-64.so.2 (0x7f0d04f65000)
{code}
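The `not found` entries in the second listing are the root cause: the clang build picked up a runtime dependency on boost shared libraries that are not on the loader path. The same check can be scripted for quick diagnosis; this is a minimal sketch assuming a Linux system where `ldd` is available:

```python
import shutil
import subprocess

def missing_shared_libs(binary_path):
    """Run ldd on a binary and return the libraries the loader cannot resolve."""
    result = subprocess.run(["ldd", binary_path],
                            capture_output=True, text=True)
    return [line.split()[0]
            for line in result.stdout.splitlines()
            if "not found" in line]

# A correctly linked binary reports no unresolved libraries; the clang-built
# plasma_store above would report the two boost libraries instead.
print(missing_shared_libs(shutil.which("ls")))
```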

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEM    VSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7    S    13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7    S    13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Resolved] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-21 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-2162.
--
Resolution: Fixed

Issue resolved by pull request 1619
[https://github.com/apache/arrow/pull/1619]

> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an 
> extra factor of 10**scale, but that doesn't seem to be the case. If I change 
> the scale, it always seems to be a factor of 100
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> {code}
> I see the same behavior if I use floating point to initialize the array 
> rather than Python's decimal type.
> I searched GitHub and JIRA for open issues but didn't find anything 
> related to this. I am using pyarrow 0.8.0 on OS X 10.12.6 with Python 2.7.14 
> installed via Homebrew.
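The fix in pull request 1619 makes the rescale step fail instead of silently truncating. Below is a minimal Python sketch of that data-loss check, assuming the usual integer-plus-scale decimal representation; the function and names are illustrative, not pyarrow API, and overflow handling is omitted since Python integers are unbounded.

```python
def rescale(unscaled: int, original_scale: int, new_scale: int) -> int:
    """Rescale an unscaled decimal integer, refusing lossy truncation.

    A decimal is stored as an integer plus a scale: Decimal('1.23') at
    scale 2 is the integer 123.
    """
    delta = new_scale - original_scale
    multiplier = 10 ** abs(delta)
    if delta >= 0:
        # Increasing the scale only appends zeros and never loses data.
        return unscaled * multiplier
    quotient, remainder = divmod(unscaled, multiplier)
    if remainder != 0:
        # e.g. 1.001 (scale 3) cannot be represented exactly at scale 2
        raise ValueError("rescaling would lose data")
    return quotient

print(rescale(1230, 3, 2))  # 123: 1.230 -> 1.23, safe truncation
print(rescale(123, 2, 4))   # 12300: 1.23 -> 1.2300
```

With this check, the original bug report's `1.234` at scale 2 raises instead of silently producing a value off by a factor of 100.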





[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371838#comment-16371838
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud closed pull request #1619: ARROW-2162: [Python/C++] Decimal Values with 
too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/python-test.cc 
b/cpp/src/arrow/python/python-test.cc
index a2b832bdb..b76caaece 100644
--- a/cpp/src/arrow/python/python-test.cc
+++ b/cpp/src/arrow/python/python-test.cc
@@ -201,5 +201,45 @@ TEST(BuiltinConversionTest, TestMixedTypeFails) {
   ASSERT_RAISES(UnknownError, ConvertPySequence(list, pool, &arr));
 }
 
+TEST_F(DecimalTest, FromPythonDecimalRescaleNotTruncateable) {
+  // We fail when truncating values that would lose data if cast to a decimal 
type with
+  // lower scale
+  Decimal128 value;
+  OwnedRef python_decimal(this->CreatePythonDecimal("1.001"));
+  auto type = ::arrow::decimal(10, 2);
+  const auto& decimal_type = static_cast<const DecimalType&>(*type);
+  ASSERT_RAISES(Invalid, 
internal::DecimalFromPythonDecimal(python_decimal.obj(),
+decimal_type, 
&value));
+}
+
+TEST_F(DecimalTest, FromPythonDecimalRescaleTruncateable) {
+  // We allow truncation of values that do not lose precision when dividing by 
10 * the
+  // difference between the scales, e.g., 1.000 -> 1.00
+  Decimal128 value;
+  OwnedRef python_decimal(this->CreatePythonDecimal("1.000"));
+  auto type = ::arrow::decimal(10, 2);
+  const auto& decimal_type = static_cast<const DecimalType&>(*type);
+  ASSERT_OK(
+  internal::DecimalFromPythonDecimal(python_decimal.obj(), decimal_type, 
&value));
+  ASSERT_EQ(100, value.low_bits());
+}
+
+TEST_F(DecimalTest, TestOverflowFails) {
+  Decimal128 value;
+  int32_t precision;
+  int32_t scale;
+  OwnedRef python_decimal(
+  this->CreatePythonDecimal("9.9"));
+  ASSERT_OK(
+  internal::InferDecimalPrecisionAndScale(python_decimal.obj(), 
&precision, &scale));
+  ASSERT_EQ(38, precision);
+  ASSERT_EQ(1, scale);
+
+  auto type = ::arrow::decimal(38, 38);
+  const auto& decimal_type = static_cast<const DecimalType&>(*type);
+  ASSERT_RAISES(Invalid, 
internal::DecimalFromPythonDecimal(python_decimal.obj(),
+decimal_type, 
&value));
+}
+
 }  // namespace py
 }  // namespace arrow
diff --git a/cpp/src/arrow/util/decimal.cc b/cpp/src/arrow/util/decimal.cc
index e999854b1..a3c8cda76 100644
--- a/cpp/src/arrow/util/decimal.cc
+++ b/cpp/src/arrow/util/decimal.cc
@@ -854,26 +854,46 @@ static const Decimal128 ScaleMultipliers[] = {
 Decimal128("10"),
 Decimal128("100")};
 
+static bool RescaleWouldCauseDataLoss(const Decimal128& value, int32_t 
delta_scale,
+  int32_t abs_delta_scale, Decimal128* 
result) {
+  Decimal128 multiplier(ScaleMultipliers[abs_delta_scale]);
+
+  if (delta_scale < 0) {
+DCHECK_NE(multiplier, 0);
+Decimal128 remainder;
+Status status = value.Divide(multiplier, result, &remainder);
+DCHECK(status.ok()) << status.message();
+return remainder != 0;
+  }
+
+  *result = value * multiplier;
+  return *result < value;
+}
+
 Status Decimal128::Rescale(int32_t original_scale, int32_t new_scale,
Decimal128* out) const {
-  DCHECK_NE(out, NULLPTR);
-  DCHECK_NE(original_scale, new_scale);
-  const int32_t delta_scale = original_scale - new_scale;
+  DCHECK_NE(out, NULLPTR) << "out is nullptr";
+  DCHECK_NE(original_scale, new_scale) << "original_scale != new_scale";
+
+  const int32_t delta_scale = new_scale - original_scale;
   const int32_t abs_delta_scale = std::abs(delta_scale);
+
   DCHECK_GE(abs_delta_scale, 1);
   DCHECK_LE(abs_delta_scale, 38);
 
-  const Decimal128 scale_multiplier = ScaleMultipliers[abs_delta_scale];
-  const Decimal128 result = *this * scale_multiplier;
+  Decimal128 result(*this);
+  const bool rescale_would_cause_data_loss =
+  RescaleWouldCauseDataLoss(result, delta_scale, abs_delta_scale, out);
 
-  if (ARROW_PREDICT_FALSE(result < *this)) {
+  // Fail if we overflow or truncate
+  if (ARROW_PREDICT_FALSE(rescale_would_cause_data_loss)) {
 std::stringstream buf;
-buf << "Rescaling decimal value from original scale " << original_scale
-<< " to new scale " << new_scale << " would cause overflow";
+buf << "Rescaling decimal value " << ToString(original_scale)
+<< " from original scale of " << original_scale << " to new scale o

[jira] [Updated] (ARROW-2194) Pandas columns metadata incorrect for empty string columns

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2194:

Fix Version/s: 0.9.0

> Pandas columns metadata incorrect for empty string columns
> --
>
> Key: ARROW-2194
> URL: https://issues.apache.org/jira/browse/ARROW-2194
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.9.0
>
>
> The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
> DataFrame is unexpectedly {{float64}}
>  
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import json
> empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': 
> np.array([], dtype=np.bytes_)})
> empty_table = pa.Table.from_pandas(empty_df)
> json.loads(empty_table.schema.metadata[b'pandas'])['columns']
> # Same behavior for input dtype np.unicode_
> [{u'field_name': u'bytes',
> u'metadata': None,
> u'name': u'bytes',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'unicode',
> u'metadata': None,
> u'name': u'unicode',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'__index_level_0__',
> u'metadata': None,
> u'name': None,
> u'numpy_type': u'int64',
> u'pandas_type': u'int64'}]{code}
>  
> Tested on Debian 8 with python2.7 and python 3.6.4





[jira] [Commented] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371771#comment-16371771
 ] 

ASF GitHub Bot commented on ARROW-2192:
---

wesm closed pull request #1634: ARROW-2192: [CI] Always build on master branch 
and repository
URL: https://github.com/apache/arrow/pull/1634
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/ci/travis_detect_changes.py b/ci/travis_detect_changes.py
index 2aeb34fa0..d60b13227 100644
--- a/ci/travis_detect_changes.py
+++ b/ci/travis_detect_changes.py
@@ -147,19 +147,25 @@ def get_unix_shell_eval(env):
 
 
 def run_from_travis():
-desc = get_travis_commit_description()
-if '[skip travis]' in desc:
-# Skip everything
-affected = dict.fromkeys(ALL_TOPICS, False)
-elif '[force ci]' in desc or '[force travis]' in desc:
-# Test everything
+if (os.environ['TRAVIS_REPO_SLUG'] == 'apache/arrow' and
+os.environ['TRAVIS_BRANCH'] == 'master' and
+os.environ['TRAVIS_EVENT_TYPE'] != 'pull_request'):
+# Never skip anything on master builds in the official repository
 affected = dict.fromkeys(ALL_TOPICS, True)
 else:
-# Test affected topics
-affected_files = list_travis_affected_files()
-perr("Affected files:", affected_files)
-affected = get_affected_topics(affected_files)
-assert set(affected) <= set(ALL_TOPICS), affected
+desc = get_travis_commit_description()
+if '[skip travis]' in desc:
+# Skip everything
+affected = dict.fromkeys(ALL_TOPICS, False)
+elif '[force ci]' in desc or '[force travis]' in desc:
+# Test everything
+affected = dict.fromkeys(ALL_TOPICS, True)
+else:
+# Test affected topics
+affected_files = list_travis_affected_files()
+perr("Affected files:", affected_files)
+affected = get_affected_topics(affected_files)
+assert set(affected) <= set(ALL_TOPICS), affected
 
 perr("Affected topics:")
 perr(pprint.pformat(affected))


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> After ARROW-2083, we are only running builds related to changed components 
> with each patch in Travis CI and Appveyor. 
> The problem with this is that when we merge patches to master, our Travis CI 
> configuration (implemented by ASF infra to help alleviate clogged up build 
> queues) is set up to cancel in-progress builds whenever a new commit is 
> merged.
> So basically we could have in our timeline:
> * Patch merged affecting C++, Python
> * Patch merged affecting Java
> * Patch merged affecting JS
> So when the Java patch is merged, any in-progress C++/Python builds will be 
> cancelled. And if the JS patch comes in, the Java builds would be immediately 
> cancelled.
> In light of this, I believe we should always run all of the builds 
> unconditionally on the master branch.
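The patched `ci/travis_detect_changes.py` logic boils down to one extra guard before the directive handling. A condensed sketch, with an illustrative topic list rather than the real one:

```python
import os

ALL_TOPICS = ["cpp", "python", "java", "js", "integration"]  # illustrative

def topics_to_build(desc, changed_topics, env=os.environ):
    """Sketch of the patched logic: on official master builds, never skip
    anything; otherwise honor commit directives and the changed topics."""
    on_official_master = (
        env.get("TRAVIS_REPO_SLUG") == "apache/arrow"
        and env.get("TRAVIS_BRANCH") == "master"
        and env.get("TRAVIS_EVENT_TYPE") != "pull_request"
    )
    if on_official_master:
        return dict.fromkeys(ALL_TOPICS, True)
    if "[skip travis]" in desc:
        return dict.fromkeys(ALL_TOPICS, False)
    if "[force ci]" in desc or "[force travis]" in desc:
        return dict.fromkeys(ALL_TOPICS, True)
    return {topic: topic in changed_topics for topic in ALL_TOPICS}

master_env = {"TRAVIS_REPO_SLUG": "apache/arrow", "TRAVIS_BRANCH": "master",
              "TRAVIS_EVENT_TYPE": "push"}
# Directives are ignored on master builds: every topic comes back True.
print(topics_to_build("[skip travis]", set(), env=master_env))
```

Passing the environment explicitly keeps the decision testable; the real script reads `os.environ` directly.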





[jira] [Resolved] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2192.
-
Resolution: Fixed

Issue resolved by pull request 1634
[https://github.com/apache/arrow/pull/1634]

> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> After ARROW-2083, we are only running builds related to changed components 
> with each patch in Travis CI and Appveyor. 
> The problem with this is that when we merge patches to master, our Travis CI 
> configuration (implemented by ASF infra to help alleviate clogged up build 
> queues) is set up to cancel in-progress builds whenever a new commit is 
> merged.
> So basically we could have in our timeline:
> * Patch merged affecting C++, Python
> * Patch merged affecting Java
> * Patch merged affecting JS
> So when the Java patch is merged, any in-progress C++/Python builds will be 
> cancelled. And if the JS patch comes in, the Java builds would be immediately 
> cancelled.
> In light of this, I believe we should always run all of the builds 
> unconditionally on the master branch.





[jira] [Updated] (ARROW-1887) [Python] More efficient serialization of pandas Index types in custom serialization from ARROW-1784

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1887:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] More efficient serialization of pandas Index types in custom 
> serialization from ARROW-1784
> ---
>
> Key: ARROW-1887
> URL: https://issues.apache.org/jira/browse/ARROW-1887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>






[jira] [Commented] (ARROW-1994) [Python] Test against Pandas master

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371769#comment-16371769
 ] 

Wes McKinney commented on ARROW-1994:
-

This would be nice to have. Are there nightly pandas conda builds we could use? 
Otherwise this will increase our build times too much.

> [Python] Test against Pandas master
> ---
>
> Key: ARROW-1994
> URL: https://issues.apache.org/jira/browse/ARROW-1994
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> We have recently seen a lot of breakage with pandas master. This is an 
> annoyance to our users; it should break in our builds rather than in their 
> pipelines. There is no need to add another entry to the build matrix; one of 
> the existing builds can simply re-run the tests against pandas master after 
> they pass.





[jira] [Updated] (ARROW-1994) [Python] Test against Pandas master

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1994:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Test against Pandas master
> ---
>
> Key: ARROW-1994
> URL: https://issues.apache.org/jira/browse/ARROW-1994
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> We have recently seen a lot of breakage with pandas master. This is an 
> annoyance to our users; it should break in our builds rather than in their 
> pipelines. There is no need to add another entry to the build matrix; one of 
> the existing builds can simply re-run the tests against pandas master after 
> they pass.





[jira] [Assigned] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2185:
---

Assignee: Wes McKinney

> Remove CI directives from squashed commit messages
> --
>
> Key: ARROW-2185
> URL: https://issues.apache.org/jira/browse/ARROW-2185
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In our PR squash tool, we are potentially picking up CI directives like 
> {{[skip appveyor]}} from intermediate commits. We should regex these away and 
> instead use directives in the PR title if we wish the commit to master to 
> behave in a certain way
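The stripping itself is a small regular-expression pass; the sketch below assumes a particular set of directives to match, and the exact pattern used by the merge tool may differ:

```python
import re

# Directives such as "[skip appveyor]" or "[force travis]" that may be
# picked up from intermediate commits; the directive list is an assumption.
CI_DIRECTIVE = re.compile(r"\[\s*(?:skip|force)\s+(?:ci|travis|appveyor)\s*\]",
                          re.IGNORECASE)

def strip_ci_directives(message: str) -> str:
    """Remove CI directives from a squashed commit message."""
    return " ".join(CI_DIRECTIVE.sub("", message).split())

print(strip_ci_directives("ARROW-2185: Fix the tool [skip appveyor]"))
# ARROW-2185: Fix the tool
```

Note that the whitespace normalization here also collapses line breaks; a real implementation would strip directives line by line to preserve multi-line commit bodies.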





[jira] [Commented] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371768#comment-16371768
 ] 

ASF GitHub Bot commented on ARROW-2185:
---

wesm opened a new pull request #1639: ARROW-2185: Strip CI directives from 
commit messages
URL: https://github.com/apache/arrow/pull/1639
 
 
   




> Remove CI directives from squashed commit messages
> --
>
> Key: ARROW-2185
> URL: https://issues.apache.org/jira/browse/ARROW-2185
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In our PR squash tool, we are potentially picking up CI directives like 
> {{[skip appveyor]}} from intermediate commits. We should regex these away and 
> instead use directives in the PR title if we wish the commit to master to 
> behave in a certain way





[jira] [Updated] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2185:
--
Labels: pull-request-available  (was: )

> Remove CI directives from squashed commit messages
> --
>
> Key: ARROW-2185
> URL: https://issues.apache.org/jira/browse/ARROW-2185
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In our PR squash tool, we are potentially picking up CI directives like 
> {{[skip appveyor]}} from intermediate commits. We should regex these away and 
> instead use directives in the PR title if we wish the commit to master to 
> behave in a certain way





[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371763#comment-16371763
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

pitrou commented on issue #1635: ARROW-2142: [Python] Allow conversion from 
Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#issuecomment-367414689
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.101




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}
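Until the conversion is implemented, a structured array can be split into per-field flat arrays, which pyarrow already handles; a sketch (the field layout mirrors the example above):

```python
import numpy as np

# Each field of a structured array is a flat array that can be converted
# on its own, e.g. with pa.array(columns['x']).
arr = np.array([(1.5,), (2.5,)], dtype=np.dtype([("x", np.float32)]))
columns = {name: arr[name] for name in arr.dtype.names}
print(columns["x"])  # [1.5 2.5]
```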





[jira] [Commented] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371758#comment-16371758
 ] 

ASF GitHub Bot commented on ARROW-2180:
---

wesm opened a new pull request #1638: ARROW-2180: [C++] Remove deprecated APIs 
from 0.8.0 cycle
URL: https://github.com/apache/arrow/pull/1638
 
 
   




> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2180:
--
Labels: pull-request-available  (was: )

> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2093:
--
Labels: pull-request-available  (was: )

> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400MB in binaries
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  
> conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Assigned] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2180:
---

Assignee: Wes McKinney

> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371748#comment-16371748
 ] 

ASF GitHub Bot commented on ARROW-2093:
---

wesm opened a new pull request #1637: ARROW-2093: [Python] Do not install 
PyTorch in Travis CI
URL: https://github.com/apache/arrow/pull/1637
 
 
   




> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400MB in binaries
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Assigned] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2093:
---

Assignee: Wes McKinney

> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400MB in binaries
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Updated] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2132:
--
Labels: pull-request-available  (was: )

> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation





[jira] [Commented] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371740#comment-16371740
 ] 

ASF GitHub Bot commented on ARROW-2132:
---

wesm opened a new pull request #1636: ARROW-2132: Add link to Plasma in main 
README
URL: https://github.com/apache/arrow/pull/1636
 
 
   




> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation





[jira] [Created] (ARROW-2194) Pandas columns metadata incorrect for empty string columns

2018-02-21 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-2194:
-

 Summary: Pandas columns metadata incorrect for empty string columns
 Key: ARROW-2194
 URL: https://issues.apache.org/jira/browse/ARROW-2194
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Florian Jetter


The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
DataFrame is unexpectedly {{float64}}

 
{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import json

empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_),
                         'bytes': np.array([], dtype=np.bytes_)})
empty_table = pa.Table.from_pandas(empty_df)
json.loads(empty_table.schema.metadata[b'pandas'])['columns']

# Same behavior for input dtype np.unicode_
[{u'field_name': u'bytes',
  u'metadata': None,
  u'name': u'bytes',
  u'numpy_type': u'object',
  u'pandas_type': u'float64'},
 {u'field_name': u'unicode',
  u'metadata': None,
  u'name': u'unicode',
  u'numpy_type': u'object',
  u'pandas_type': u'float64'},
 {u'field_name': u'__index_level_0__',
  u'metadata': None,
  u'name': None,
  u'numpy_type': u'int64',
  u'pandas_type': u'int64'}]
{code}
 

Tested on Debian 8 with Python 2.7 and Python 3.6.4.
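The inference gap is visible with pandas alone: an empty string column carries no values from which the logical type can be recovered, which is why pyarrow's metadata falls back to a default. A minimal sketch (using np.str_, the portable spelling of np.unicode_):

```python
import numpy as np
import pandas as pd

# An empty string column is stored by pandas as a generic object array,
# so there are no values left from which the logical type can be inferred.
empty = pd.Series(np.array([], dtype=np.str_))
assert empty.dtype == object
assert pd.api.types.infer_dtype(empty) == 'empty'

# With at least one value, the same inference recovers the string type.
nonempty = pd.Series(np.array(['a'], dtype=np.str_))
assert pd.api.types.infer_dtype(nonempty) == 'string'
```

Passing an explicit schema to Table.from_pandas is a possible workaround, since it bypasses value-based inference for the affected columns.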





[jira] [Assigned] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2132:
---

Assignee: Wes McKinney

> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation





[jira] [Updated] (ARROW-1380) [C++] Fix "still reachable" valgrind warnings in Plasma Python unit tests

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1380:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Fix "still reachable" valgrind warnings in Plasma Python unit tests
> -
>
> Key: ARROW-1380
> URL: https://issues.apache.org/jira/browse/ARROW-1380
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> I thought I fixed this, but they seem to have recurred:
> https://travis-ci.org/apache/arrow/jobs/266421430#L5220





[jira] [Updated] (ARROW-1848) [Python] Add documentation examples for reading single Parquet files and datasets from HDFS

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1848:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Add documentation examples for reading single Parquet files and 
> datasets from HDFS
> ---
>
> Key: ARROW-1848
> URL: https://issues.apache.org/jira/browse/ARROW-1848
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> see 
> https://stackoverflow.com/questions/47443151/read-a-parquet-files-from-hdfs-using-pyarrow
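A sketch of the kind of example this issue requests. The function name, host, and port below are placeholders; pa.hdfs.connect and ParquetDataset(filesystem=...) are used as they existed around pyarrow 0.8, and running it requires a reachable HDFS cluster with libhdfs available:

```python
def read_parquet_from_hdfs(path, host='default', port=0):
    """Read a single Parquet file or a directory dataset from HDFS.

    Sketch only: needs a reachable HDFS cluster and libhdfs; the
    import is deferred so the definition itself has no dependencies.
    """
    import pyarrow as pa
    import pyarrow.parquet as pq

    fs = pa.hdfs.connect(host, port)  # uses libhdfs under the hood
    if fs.isfile(path):
        # Single file: open it and hand the file object to read_table.
        with fs.open(path, 'rb') as f:
            return pq.read_table(f)
    # Directory of files: ParquetDataset discovers the individual pieces.
    return pq.ParquetDataset(path, filesystem=fs).read()
```

The result is a pyarrow.Table in either case, so `.to_pandas()` can be chained on for a DataFrame.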





[jira] [Updated] (ARROW-1963) [Python] Create Array from sequence of numpy.datetime64

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1963:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Create Array from sequence of numpy.datetime64
> ---
>
> Key: ARROW-1963
> URL: https://issues.apache.org/jira/browse/ARROW-1963
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently we only support {{datetime.datetime}} and {{datetime.date}} but 
> {{numpy.datetime64}} also occurs quite often in the numpy/pandas-related 
> world.
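Pending native support, one hedged workaround is to cast to microsecond precision and let numpy hand back datetime.datetime objects, which pyarrow already accepts as a sequence type:

```python
import datetime
import numpy as np

# numpy.datetime64 values convert to datetime.datetime via an explicit
# microsecond cast; .tolist() then yields plain datetime objects.
values = np.array(['2018-02-21T12:30:00', '2018-02-21T13:00:00'],
                  dtype='datetime64[s]')
as_datetimes = values.astype('datetime64[us]').tolist()

assert as_datetimes[0] == datetime.datetime(2018, 2, 21, 12, 30)
# pa.array(as_datetimes) could then infer a timestamp type from these.
```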





[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371726#comment-16371726
 ] 

Antoine Pitrou commented on ARROW-2142:
---

I ended up applying your suggestion to array vectors rather than chunked arrays 
(see attached PR).

> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}
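As an interim sketch (not Arrow's eventual implementation), a structured array can be decomposed into one plain array per field, and plain numeric arrays are already convertible. The field names below come from the report's dtype:

```python
import numpy as np

arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))

# Decompose the structured array into a contiguous plain array per field;
# each of these is a type pa.array() already handles.
fields = {name: np.ascontiguousarray(arr[name]) for name in arr.dtype.names}

assert list(fields) == ['x']
assert fields['x'].dtype == np.float32
assert fields['x'][0] == np.float32(1.5)
# The per-field arrays could then be reassembled into an Arrow struct,
# e.g. with pa.StructArray.from_arrays(...) in pyarrow versions providing it.
```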




