[jira] [Comment Edited] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on

2018-02-22 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372545#comment-16372545
 ] 

Uwe L. Korn edited comment on ARROW-2193 at 2/22/18 8:54 AM:
-

The problem is possibly the different default flags that are passed from the 
compiler to the linker: GCC passes {{--as-needed}} by default, whereas clang 
does not. Setting {{CXXFLAGS}} and related variables to include 
{{-Wl,--as-needed}} should be the proper fix. We actually don't want to link 
statically to Boost (I will open a separate issue about the problems I had 
with that).


was (Author: xhochy):
The problem is possibly the different default flags that are passed from the 
compiler to the linker. GCC passes `--as-needed` by default whereas clang does 
not. Setting {{CXXFLAGS}} and so to {{-Wl,--as-needed}} should be the proper 
fix. We actually don't want to link statically to Boost (I will open a separate 
issue about the problems I had with this).
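As an illustrative sketch only (a hypothetical build driver, not part of Arrow's build system), the flag could be injected through the environment before invoking CMake; the CMake arguments shown are assumptions for illustration:

```python
import os

# Append the linker flag to CXXFLAGS so that clang, like GCC on many
# distributions, records only shared libraries whose symbols are actually
# used (avoiding spurious runtime dependencies on Boost shared libraries).
env = dict(os.environ)
env["CXXFLAGS"] = (env.get("CXXFLAGS", "") + " -Wl,--as-needed").strip()

# A real build script would now run CMake with this environment, e.g.:
# subprocess.run(["cmake", "-DARROW_BOOST_USE_SHARED=on", ".."],
#                env=env, check=True)
print(env["CXXFLAGS"])
```

Note that whether GCC defaults to {{--as-needed}} is distribution-specific (some distributions patch their toolchain to pass it), which is consistent with the behavior difference described above.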

> [Plasma] plasma_store has runtime dependency on Boost shared libraries when 
> ARROW_BOOST_USE_SHARED=on
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
> $ ps fuwww
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7    S    13:41   0:01 /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7    S    13:41   0:01  \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7    S    13:41   0:01  \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7    S    13:41   0:01  \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1
> [etc.]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on

2018-02-22 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372538#comment-16372538
 ] 

Antoine Pitrou commented on ARROW-2193:
---

{quote}you can make this problem go away by passing 
-DARROW_BOOST_USE_SHARED=off{quote}

Great, thank you!



[jira] [Commented] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on

2018-02-22 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372545#comment-16372545
 ] 

Uwe L. Korn commented on ARROW-2193:


The problem is possibly the different default flags that are passed from the 
compiler to the linker: GCC passes {{--as-needed}} by default, whereas clang 
does not. Setting {{CXXFLAGS}} and related variables to include 
{{-Wl,--as-needed}} should be the proper fix. We actually don't want to link 
statically to Boost (I will open a separate issue about the problems I had 
with that).



[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373068#comment-16373068
 ] 

ASF GitHub Bot commented on ARROW-2066:
---

rjrussell77 commented on a change in pull request #1544: ARROW-2066: [Python] 
Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544#discussion_r170025416
 
 

 ##
 File path: python/doc/source/parquet.rst
 ##
 @@ -237,3 +237,44 @@ throughput:
 
pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+--
+
+The code below shows how to use Azure's storage sdk along with pyarrow to read
+a parquet file into a Pandas dataframe.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies: 
+
+* python 3.6.2 
+* azure-storage 0.36.0 
+* pyarrow 0.8.0 
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   import io
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   block_blob_service = BlockBlobService(account_name=account_name, 
account_key=account_key)
+   try:
+  block_blob_service.get_blob_to_stream(container_name=container_name, 
blob_name=parquet_file, stream=byte_stream)
+  pd = pq.read_table(source=byte_stream).to_pandas()
+  pd.head(10)
 
 Review comment:
   How about, `# Do work on df` (lower case)?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document reading Parquet files from Azure Blob Store
> -
>
> Key: ARROW-2066
> URL: https://issues.apache.org/jira/browse/ARROW-2066
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> See https://github.com/apache/arrow/issues/1510





[jira] [Resolved] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-22 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2180.
-
Resolution: Fixed

Issue resolved by pull request 1638
[https://github.com/apache/arrow/pull/1638]

> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373090#comment-16373090
 ] 

ASF GitHub Bot commented on ARROW-2132:
---

wesm closed pull request #1636: ARROW-2132: Add link to Plasma in main README
URL: https://github.com/apache/arrow/pull/1636
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/README.md b/README.md
index a7fed9865..846a4030e 100644
--- a/README.md
+++ b/README.md
@@ -38,10 +38,12 @@ set of technologies that enable big data systems to process 
and move data fast.
 Major components of the project include:
 
  - [The Arrow Columnar In-Memory 
Format](https://github.com/apache/arrow/tree/master/format)
- - [C++ implementation](https://github.com/apache/arrow/tree/master/cpp)
+ - [C++ libraries](https://github.com/apache/arrow/tree/master/cpp)
+ - [Plasma Object 
Store](https://github.com/apache/arrow/tree/master/cpp/src/plasma): a
+   shared-memory blob store, part of the C++ codebase
  - [C bindings using GLib](https://github.com/apache/arrow/tree/master/c_glib)
- - [Java implementation](https://github.com/apache/arrow/tree/master/java)
- - [JavaScript implementation](https://github.com/apache/arrow/tree/master/js)
+ - [Java libraries](https://github.com/apache/arrow/tree/master/java)
+ - [JavaScript libraries](https://github.com/apache/arrow/tree/master/js)
  - [Python bindings to C++](https://github.com/apache/arrow/tree/master/python)
 
 Arrow is an [Apache Software Foundation](https://www.apache.org) project. 
Learn more at
@@ -49,8 +51,7 @@ Arrow is an [Apache Software 
Foundation](https://www.apache.org) project. Learn
 
 ### What's in the Arrow libraries?
 
-The reference Arrow implementations contain a number of distinct software
-components:
+The reference Arrow libraries contain a number of distinct software components:
 
 - Columnar vector and table-like containers (similar to data frames) supporting
   flat or nested types


 




> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation





[jira] [Resolved] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-22 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2132.
-
Resolution: Fixed

Issue resolved by pull request 1636
[https://github.com/apache/arrow/pull/1636]



[jira] [Resolved] (ARROW-2008) [Python] Type inference for int32 NumPy arrays (expecting list) returns int64 and then conversion fails

2018-02-22 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2008.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/897cc4d917e375b180147856baa9c5da2e6173e5

> [Python] Type inference for int32 NumPy arrays (expecting list) 
> returns int64 and then conversion fails
> --
>
> Key: ARROW-2008
> URL: https://issues.apache.org/jira/browse/ARROW-2008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> See report in [https://github.com/apache/arrow/issues/1430]
> {{arrow::py::InferArrowType}} is called, which traverses the array as though 
> it were any other Python sequence, and NumPy int32 scalars are not recognized 
> as such
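The traversal described above can be modeled with a toy, stdlib-only sketch (Arrow's real inference is C++ code in {{arrow::py::InferArrowType}}; the {{FakeInt32}} class below is a hypothetical stand-in for {{numpy.int32}}, and the widening rules are simplified for illustration):

```python
# Toy model of sequence-based type inference. Scanning a generic Python
# sequence, it widens to the largest type it knows; scalar types it does
# not recognize (like NumPy's int32 at the time) fall through as "unknown".
class FakeInt32:  # stand-in for numpy.int32, which is not a plain int here
    def __init__(self, v):
        self.v = v

def infer_type(seq):
    inferred = None
    for x in seq:
        if isinstance(x, bool):          # bool before int: bool is an int subclass
            kind = "bool"
        elif isinstance(x, int):
            kind = "int64"               # plain Python ints widen to int64
        elif isinstance(x, float):
            kind = "double"
        else:
            kind = "unknown"             # FakeInt32 lands here, like np.int32 did
        inferred = kind if inferred in (None, kind) else "mixed"
    return inferred

print(infer_type([1, 2, 3]))       # -> int64
print(infer_type([FakeInt32(1)]))  # -> unknown
```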





[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373078#comment-16373078
 ] 

ASF GitHub Bot commented on ARROW-2066:
---

rjrussell77 commented on a change in pull request #1544: ARROW-2066: [Python] 
Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544#discussion_r170027717
 
 

 ##
 File path: python/doc/source/parquet.rst
 ##
 @@ -237,3 +237,43 @@ throughput:
 
pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+--
+
+The code below shows how to use Azure's storage sdk along with pyarrow to read
+a parquet file into a Pandas dataframe.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies: 
+
+* python 3.6.2 
+* azure-storage 0.36.0 
+* pyarrow 0.8.0 
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   import io
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   block_blob_service = BlockBlobService(account_name=account_name, 
account_key=account_key)
+   try:
+  block_blob_service.get_blob_to_stream(container_name=container_name, 
blob_name=parquet_file, stream=byte_stream)
+  df = pq.read_table(source=byte_stream).to_pandas()
+  # Do work on df ...
+   finally:
+  # Add finally block to ensure closure of the stream
+  byte_stream.close()
+
 
 Review comment:
   @xhochy Ok, I've responded to your last set of feedback.  How are we looking 
now?
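The pattern being discussed here, guaranteed closure of the byte stream even when the read fails, can be sketched with the standard library alone; the {{fetch}} and {{parse}} callables below are hypothetical stand-ins for {{get_blob_to_stream}} and {{pq.read_table(...).to_pandas()}}:

```python
import io

def read_payload(fetch, parse):
    # fetch() writes the blob's bytes into the stream; parse() consumes them.
    byte_stream = io.BytesIO()
    try:
        fetch(byte_stream)
        byte_stream.seek(0)
        return parse(byte_stream.read())
    finally:
        # The finally block guarantees closure even when fetch()/parse()
        # raise, which is the point of the review feedback.
        byte_stream.close()

# Stand-ins for the Azure download and the pyarrow read:
result = read_payload(lambda s: s.write(b"hello"), lambda b: b.decode())
print(result)  # -> hello
```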






[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372965#comment-16372965
 ] 

ASF GitHub Bot commented on ARROW-2066:
---

xhochy commented on a change in pull request #1544: ARROW-2066: [Python] 
Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544#discussion_r169998846
 
 

 ##
 File path: python/doc/source/parquet.rst
 ##
 @@ -237,3 +237,44 @@ throughput:
 
pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+--
+
+The code below shows how to use Azure's storage sdk along with pyarrow to read
+a parquet file into a Pandas dataframe.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies: 
+
+* python 3.6.2 
+* azure-storage 0.36.0 
+* pyarrow 0.8.0 
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   import io
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   block_blob_service = BlockBlobService(account_name=account_name, 
account_key=account_key)
+   try:
+  block_blob_service.get_blob_to_stream(container_name=container_name, 
blob_name=parquet_file, stream=byte_stream)
+  pd = pq.read_table(source=byte_stream).to_pandas()
+  pd.head(10)
+   except Exception as err:
+  print("Error: {0}".format(err))
+   finally:
+  byte_stream.close()
+
 
 Review comment:
   Can you add this comment to the code? That will also be helpful for the 
reader later.






[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372964#comment-16372964
 ] 

ASF GitHub Bot commented on ARROW-2066:
---

xhochy commented on a change in pull request #1544: ARROW-2066: [Python] 
Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544#discussion_r169998270
 
 

 ##
 File path: python/doc/source/parquet.rst
 ##
 @@ -237,3 +237,44 @@ throughput:
 
pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+--
+
+The code below shows how to use Azure's storage sdk along with pyarrow to read
+a parquet file into a Pandas dataframe.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies: 
+
+* python 3.6.2 
+* azure-storage 0.36.0 
+* pyarrow 0.8.0 
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   import io
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   block_blob_service = BlockBlobService(account_name=account_name, 
account_key=account_key)
+   try:
+  block_blob_service.get_blob_to_stream(container_name=container_name, 
blob_name=parquet_file, stream=byte_stream)
+  pd = pq.read_table(source=byte_stream).to_pandas()
 
 Review comment:
   The result is typically written into a variable called `df` whereas `pd` is 
the abbreviation you use when you import pandas (`import pandas as pd`)






[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372966#comment-16372966
 ] 

ASF GitHub Bot commented on ARROW-2066:
---

xhochy commented on a change in pull request #1544: ARROW-2066: [Python] 
Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544#discussion_r169998695
 
 

 ##
 File path: python/doc/source/parquet.rst
 ##
 @@ -237,3 +237,44 @@ throughput:
 
pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+--
+
+The code below shows how to use Azure's storage sdk along with pyarrow to read
+a parquet file into a Pandas dataframe.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies: 
+
+* python 3.6.2 
+* azure-storage 0.36.0 
+* pyarrow 0.8.0 
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   import io
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   block_blob_service = BlockBlobService(account_name=account_name, 
account_key=account_key)
+   try:
+  block_blob_service.get_blob_to_stream(container_name=container_name, 
blob_name=parquet_file, stream=byte_stream)
+  pd = pq.read_table(source=byte_stream).to_pandas()
+  pd.head(10)
+   except Exception as err:
 
 Review comment:
   Please don't catch exceptions like this, just let it throw.






[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372967#comment-16372967
 ] 

ASF GitHub Bot commented on ARROW-2066:
---

xhochy commented on a change in pull request #1544: ARROW-2066: [Python] 
Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544#discussion_r169998387
 
 

 ##
 File path: python/doc/source/parquet.rst
 ##
 @@ -237,3 +237,44 @@ throughput:
 
pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+--
+
+The code below shows how to use Azure's storage sdk along with pyarrow to read
+a parquet file into a Pandas dataframe.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies: 
+
+* python 3.6.2 
+* azure-storage 0.36.0 
+* pyarrow 0.8.0 
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   import io
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   block_blob_service = BlockBlobService(account_name=account_name, 
account_key=account_key)
+   try:
+  block_blob_service.get_blob_to_stream(container_name=container_name, 
blob_name=parquet_file, stream=byte_stream)
+  pd = pq.read_table(source=byte_stream).to_pandas()
+  pd.head(10)
 
 Review comment:
   Better replace this with `# Do work on DF …`






[jira] [Commented] (ARROW-2197) Document "undefined symbol" issue and workaround

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372952#comment-16372952
 ] 

ASF GitHub Bot commented on ARROW-2197:
---

wesm commented on a change in pull request #1644: ARROW-2197: Document C++ ABI 
issue and workaround
URL: https://github.com/apache/arrow/pull/1644#discussion_r169995511
 
 

 ##
 File path: python/doc/source/development.rst
 ##
 @@ -246,6 +246,20 @@ To build a self-contained wheel (include Arrow C++ and 
Parquet C++), one can set
 Again, if you did not build parquet-cpp, you should omit ``--with-parquet`` and
 if you did not build with plasma, you should omit ``--with-plasma``.
 
+Known issues
+
+
+If you're getting some "undefined symbol" errors when importing pyarrow,
+you may have to fix the ABI version used for C++ libraries:
 
 Review comment:
   I think this should be more specific about issues using the conda-forge 
packages with a newer gcc toolchain. If the user is using a version of gcc, or 
clang with a base gcc at 5.0 or higher (e.g. with Ubuntu 16.04 or higher), then 
this flag must be added. 
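A minimal sketch of the workaround under discussion (a hypothetical user build script, not Arrow's documented procedure; the macro itself, {{_GLIBCXX_USE_CXX11_ABI}}, is the real libstdc++ dual-ABI switch):

```python
import os

# conda-forge binaries of this era were built against the old (pre-C++11)
# libstdc++ ABI. gcc >= 5 defaults to the new ABI, so code compiled against
# those binaries must opt out explicitly, or symbols involving std::string
# and friends fail to resolve ("undefined symbol" at import time).
flags = os.environ.get("CXXFLAGS", "")
if "-D_GLIBCXX_USE_CXX11_ABI" not in flags:
    flags = (flags + " -D_GLIBCXX_USE_CXX11_ABI=0").strip()
print(flags)
```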




> Document "undefined symbol" issue and workaround
> 
>
> Key: ARROW-2197
> URL: https://issues.apache.org/jira/browse/ARROW-2197
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> See [https://github.com/apache/arrow/issues/1612]





[jira] [Created] (ARROW-2198) [Python] Docstring for parquet.read_table is misleading or incorrect

2018-02-22 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2198:
---

 Summary: [Python] Docstring for parquet.read_table is misleading 
or incorrect
 Key: ARROW-2198
 URL: https://issues.apache.org/jira/browse/ARROW-2198
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


See https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L872

One should be able to pass a Python file object directly. The docstring 
suggests otherwise.





[jira] [Resolved] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-22 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2131.
-
Resolution: Fixed

Issue resolved by pull request 1640
[https://github.com/apache/arrow/pull/1640]

> [Python] Serialization test fails on Windows when library has been built in 
> place / not installed
> -
>
> Key: ARROW-2131
> URL: https://issues.apache.org/jira/browse/ARROW-2131
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure why this doesn't come up in Appveyor:
> {code}
> == FAILURES 
> ===
>  test_deserialize_buffer_in_different_process 
> _
> def test_deserialize_buffer_in_different_process():
> import tempfile
> import subprocess
> f = tempfile.NamedTemporaryFile(delete=False)
> b = pa.serialize(pa.frombuffer(b'hello')).to_buffer()
> f.write(b.to_pybytes())
> f.close()
> dir_path = os.path.dirname(os.path.realpath(__file__))
> python_file = os.path.join(dir_path, 'deserialize_buffer.py')
> >   subprocess.check_call([sys.executable, python_file, f.name])
> pyarrow\tests\test_serialization.py:596:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> popenargs = (['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att'],)
> kwargs = {}, retcode = 1
> cmd = ['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> The arguments are the same as for the call function.  Example:
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']' returned non-zero 
> exit status 1.
> C:\Miniconda3\envs\pyarrow-dev\lib\subprocess.py:291: CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "C:\Users\wesm\code\arrow\python\pyarrow\tests\deserialize_buffer.py", 
> line 22, in <module>
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> === 1 failed, 15 passed, 4 skipped in 0.40 seconds 
> 
> {code}
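The failure above is a child interpreter that cannot import pyarrow. The shape of the test — serialize a payload to a temp file, then deserialize it in a freshly spawned interpreter — can be sketched with the standard library's pickle as a stand-in for pa.serialize (names and the pickle substitution are illustrative, not the pyarrow API):

```python
import os
import pickle
import subprocess
import sys
import tempfile

def roundtrip_in_child(payload):
    """Pickle payload to a temp file, then load it in a fresh interpreter
    (a stand-in for the pa.serialize / deserialize_buffer.py round trip)."""
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(pickle.dumps(payload))
    f.close()
    child = ("import pickle, sys; "
             "obj = pickle.load(open(sys.argv[1], 'rb')); "
             "sys.stdout.write(repr(obj))")
    try:
        # A failing import in the child (as in the traceback above) would
        # surface here as subprocess.CalledProcessError in the parent.
        return subprocess.check_output([sys.executable, '-c', child, f.name])
    finally:
        os.unlink(f.name)
```

The child process only sees what is on disk and in its environment, which is why an uninstalled, built-in-place library fails to import there while the parent test process imports it fine.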



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1345) [Python] Conversion from nested NumPy arrays fails on integers other than int64, float32

2018-02-22 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1345.
-
Resolution: Fixed

Issue resolved by pull request 1643
[https://github.com/apache/arrow/pull/1643]

> [Python] Conversion from nested NumPy arrays fails on integers other than 
> int64, float32
> 
>
> Key: ARROW-1345
> URL: https://issues.apache.org/jira/browse/ARROW-1345
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The inferred types are the largest ones, and then later conversion fails on 
> any arrays with smaller types because only exact conversions are implemented 
> thus far
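The bug pattern described above — inference always picks the widest type, but the conversion path only accepts exact matches — can be illustrated without pyarrow. This is a hedged pure-Python sketch; the dtype names and helper functions are illustrative, not pyarrow's real inference code:

```python
# Byte widths of the integer dtypes involved (illustrative subset).
WIDTH = {'i1': 1, 'i2': 2, 'i4': 4, 'i8': 8}

def infer_widest(dtypes):
    """Type inference that always picks the largest candidate dtype."""
    return max(dtypes, key=WIDTH.get)

def convert(values_dtype, target_dtype, exact_only=True):
    """Conversion that, like the pre-fix code path, only accepts an exact
    dtype match; widening e.g. i1 -> i8 is rejected with an error."""
    if exact_only and values_dtype != target_dtype:
        raise TypeError('no exact conversion from %s to %s'
                        % (values_dtype, target_dtype))
    return target_dtype
```

With this shape, inferring over mixed arrays yields 'i8', and converting an 'i1' array to that inferred type then fails — the mismatch the fix addresses by supporting non-exact (widening) conversions.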





[jira] [Commented] (ARROW-1345) [Python] Conversion from nested NumPy arrays fails on integers other than int64, float32

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372971#comment-16372971
 ] 

ASF GitHub Bot commented on ARROW-1345:
---

wesm closed pull request #1643: ARROW-1345: [Python] Test conversion from 
nested NumPy arrays with smaller int, float types
URL: https://github.com/apache/arrow/pull/1643
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/tests/test_convert_pandas.py 
b/python/pyarrow/tests/test_convert_pandas.py
index 6b62622f5..6e68dd961 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -1269,6 +1269,21 @@ def test_nested_lists_all_empty(self):
 assert arr.equals(expected)
 assert arr.type == pa.list_(pa.null())
 
+def test_nested_smaller_ints(self):
+# ARROW-1345, ARROW-2008, there were some type inference bugs happening
+# before
+data = pd.Series([np.array([1, 2, 3], dtype='i1'), None])
+result = pa.array(data)
+result2 = pa.array(data.values)
+expected = pa.array([[1, 2, 3], None], type=pa.list_(pa.int8()))
+assert result.equals(expected)
+assert result2.equals(expected)
+
+data3 = pd.Series([np.array([1, 2, 3], dtype='f4'), None])
+result3 = pa.array(data3)
+expected3 = pa.array([[1, 2, 3], None], type=pa.list_(pa.float32()))
+assert result3.equals(expected3)
+
 def test_infer_lists(self):
 data = OrderedDict([
 ('nan_ints', [[None, 1], [2, 3]]),


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from nested NumPy arrays fails on integers other than 
> int64, float32
> 
>
> Key: ARROW-1345
> URL: https://issues.apache.org/jira/browse/ARROW-1345
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The inferred types are the largest ones, and then later conversion fails on 
> any arrays with smaller types because only exact conversions are implemented 
> thus far





[jira] [Commented] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373088#comment-16373088
 ] 

ASF GitHub Bot commented on ARROW-2180:
---

wesm closed pull request #1638: ARROW-2180: [C++] Remove deprecated APIs from 
0.8.0 cycle
URL: https://github.com/apache/arrow/pull/1638
 
 
   


diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc
index a8043d69b..83142dfef 100644
--- a/cpp/src/arrow/array.cc
+++ b/cpp/src/arrow/array.cc
@@ -140,15 +140,6 @@ PrimitiveArray::PrimitiveArray(const std::shared_ptr<DataType>& type, int64_t le
   SetData(ArrayData::Make(type, length, {null_bitmap, data}, null_count, offset));
 }
 
-#ifndef ARROW_NO_DEPRECATED_API
-
-const uint8_t* PrimitiveArray::raw_values() const {
-  return raw_values_ +
-         offset() * static_cast<const FixedWidthType&>(*type()).bit_width() / CHAR_BIT;
-}
-
-#endif
-
 template <typename T>
 NumericArray<T>::NumericArray(const std::shared_ptr<ArrayData>& data)
     : PrimitiveArray(data) {
@@ -752,17 +743,6 @@ class ArrayDataWrapper {
 
 }  // namespace internal
 
-#ifndef ARROW_NO_DEPRECATED_API
-
-Status MakeArray(const std::shared_ptr<ArrayData>& data, std::shared_ptr<Array>* out) {
-  internal::ArrayDataWrapper wrapper_visitor(data, out);
-  RETURN_NOT_OK(VisitTypeInline(*data->type, &wrapper_visitor));
-  DCHECK(out);
-  return Status::OK();
-}
-
-#endif
-
 std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) {
   std::shared_ptr<Array> out;
   internal::ArrayDataWrapper wrapper_visitor(data, &out);
diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h
index 5b9ce9a01..faa9211c6 100644
--- a/cpp/src/arrow/array.h
+++ b/cpp/src/arrow/array.h
@@ -146,13 +146,6 @@ struct ARROW_EXPORT ArrayData {
 
   std::shared_ptr<ArrayData> Copy() const { return std::make_shared<ArrayData>(*this); }
 
-#ifndef ARROW_NO_DEPRECATED_API
-
-  // Deprecated since 0.8.0
-  std::shared_ptr<ArrayData> ShallowCopy() const { return Copy(); }
-
-#endif
-
   std::shared_ptr<DataType> type;
   int64_t length;
   int64_t null_count;
@@ -161,19 +154,6 @@ struct ARROW_EXPORT ArrayData {
   std::vector<std::shared_ptr<ArrayData>> child_data;
 };
 
-#ifndef ARROW_NO_DEPRECATED_API
-
-/// \brief Create a strongly-typed Array instance from generic ArrayData
-/// \param[in] data the array contents
-/// \param[out] out the resulting Array instance
-/// \return Status
-///
-/// \note Deprecated since 0.8.0
-ARROW_EXPORT
-Status MakeArray(const std::shared_ptr<ArrayData>& data, std::shared_ptr<Array>* out);
-
-#endif
-
 /// \brief Create a strongly-typed Array instance from generic ArrayData
 /// \param[in] data the array contents
 /// \return the resulting Array instance
@@ -335,15 +315,6 @@ class ARROW_EXPORT PrimitiveArray : public FlatArray {
   /// Does not account for any slice offset
   std::shared_ptr<Buffer> values() const { return data_->buffers[1]; }
 
-#ifndef ARROW_NO_DEPRECATED_API
-
-  /// \brief Return pointer to start of raw data
-  ///
-  /// \note Deprecated since 0.8.0
-  const uint8_t* raw_values() const;
-
-#endif
-
  protected:
   PrimitiveArray() {}
 
diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h
index 74a3c680d..cf25ccd03 100644
--- a/cpp/src/arrow/buffer.h
+++ b/cpp/src/arrow/buffer.h
@@ -371,22 +371,6 @@ ARROW_EXPORT
 Status AllocateResizableBuffer(MemoryPool* pool, const int64_t size,
                                std::shared_ptr<ResizableBuffer>* out);
 
-#ifndef ARROW_NO_DEPRECATED_API
-
-/// \brief Create Buffer referencing std::string memory
-///
-/// Warning: string instance must stay alive
-///
-/// \param str std::string instance
-/// \return std::shared_ptr<Buffer>
-///
-/// \note Deprecated Since 0.8.0
-static inline std::shared_ptr<Buffer> GetBufferFromString(const std::string& str) {
-  return std::make_shared<Buffer>(str);
-}
-
-#endif  // ARROW_NO_DEPRECATED_API
-
 }  // namespace arrow
 
 #endif  // ARROW_BUFFER_H
diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc
index 9ed54ca3a..69cacbfac 100644
--- a/cpp/src/arrow/compare.cc
+++ b/cpp/src/arrow/compare.cc
@@ -783,30 +783,4 @@ bool TypeEquals(const DataType& left, const DataType& right) {
   return are_equal;
 }
 
-Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) {
-  *are_equal = ArrayEquals(left, right);
-  return Status::OK();
-}
-
-Status TensorEquals(const Tensor& left, const Tensor& right, bool* are_equal) {
-  *are_equal = TensorEquals(left, right);
-  return Status::OK();
-}
-
-Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) {
-  *are_equal = ArrayApproxEquals(left, right);
-  return Status::OK();
-}
-
-Status ArrayRangeEquals(const Array& left, const Array& right, int64_t start_idx,
-                        int64_t end_idx, int64_t other_start_idx, bool* are_equal) {
-  *are_equal = ArrayRangeEquals(left, right, 

[jira] [Commented] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373086#comment-16373086
 ] 

ASF GitHub Bot commented on ARROW-2180:
---

wesm commented on issue #1638: ARROW-2180: [C++] Remove deprecated APIs from 
0.8.0 cycle
URL: https://github.com/apache/arrow/pull/1638#issuecomment-367751466
 
 
   +1




> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373083#comment-16373083
 ] 

ASF GitHub Bot commented on ARROW-2131:
---

wesm closed pull request #1640: ARROW-2131: [Python] Prepend module path to 
PYTHONPATH when spawning subprocess
URL: https://github.com/apache/arrow/pull/1640
 
 
   


diff --git a/python/pyarrow/tests/test_serialization.py 
b/python/pyarrow/tests/test_serialization.py
index 0917172d2..feccebbde 100644
--- a/python/pyarrow/tests/test_serialization.py
+++ b/python/pyarrow/tests/test_serialization.py
@@ -580,6 +580,22 @@ def deserialize_regex(serialized, q):
 p.join()
 
 
+def _get_modified_env_with_pythonpath():
+# Prepend pyarrow root directory to PYTHONPATH
+env = os.environ.copy()
+existing_pythonpath = env.get('PYTHONPATH', '')
+if sys.platform == 'win32':
+sep = ';'
+else:
+sep = ':'
+
+module_path = os.path.abspath(
+os.path.dirname(os.path.dirname(pa.__file__)))
+
+env['PYTHONPATH'] = sep.join((module_path, existing_pythonpath))
+return env
+
+
 def test_deserialize_buffer_in_different_process():
 import tempfile
 import subprocess
@@ -589,9 +605,12 @@ def test_deserialize_buffer_in_different_process():
 f.write(b.to_pybytes())
 f.close()
 
+subprocess_env = _get_modified_env_with_pythonpath()
+
 dir_path = os.path.dirname(os.path.realpath(__file__))
 python_file = os.path.join(dir_path, 'deserialize_buffer.py')
-subprocess.check_call([sys.executable, python_file, f.name])
+subprocess.check_call([sys.executable, python_file, f.name],
+  env=subprocess_env)
 
 
 def test_set_pickle():
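The fix in the diff above prepends the pyarrow module directory to PYTHONPATH before spawning the child interpreter, so a built-in-place (not installed) library is importable in the subprocess. A minimal standalone sketch of the same technique — the module path argument is a hypothetical directory, not pyarrow's real install path:

```python
import os
import sys

def env_with_prepended_pythonpath(module_path):
    """Return a copy of the environment with module_path prepended to
    PYTHONPATH, using the platform's separator (';' on Windows, ':'
    elsewhere), mirroring _get_modified_env_with_pythonpath above."""
    env = os.environ.copy()
    existing = env.get('PYTHONPATH', '')
    sep = ';' if sys.platform == 'win32' else ':'
    env['PYTHONPATH'] = sep.join((module_path, existing))
    return env

# A child started with this env can then import packages from module_path:
# subprocess.check_call([sys.executable, script], env=env_with_prepended_pythonpath(path))
```

Copying the environment (rather than mutating os.environ) keeps the change scoped to the spawned subprocess, which is why the test passes the result via check_call's env= keyword.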


 




> [Python] Serialization test fails on Windows when library has been built in 
> place / not installed
> -
>
> Key: ARROW-2131
> URL: https://issues.apache.org/jira/browse/ARROW-2131
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure why this doesn't come up in Appveyor:
> {code}
> == FAILURES 
> ===
>  test_deserialize_buffer_in_different_process 
> _
> def test_deserialize_buffer_in_different_process():
> import tempfile
> import subprocess
> f = tempfile.NamedTemporaryFile(delete=False)
> b = pa.serialize(pa.frombuffer(b'hello')).to_buffer()
> f.write(b.to_pybytes())
> f.close()
> dir_path = os.path.dirname(os.path.realpath(__file__))
> python_file = os.path.join(dir_path, 'deserialize_buffer.py')
> >   subprocess.check_call([sys.executable, python_file, f.name])
> pyarrow\tests\test_serialization.py:596:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> popenargs = (['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att'],)
> kwargs = {}, retcode = 1
> cmd = ['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> The arguments are the same as for the call function.  Example:
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 

[jira] [Commented] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373080#comment-16373080
 ] 

ASF GitHub Bot commented on ARROW-2131:
---

wesm commented on issue #1640: ARROW-2131: [Python] Prepend module path to 
PYTHONPATH when spawning subprocess
URL: https://github.com/apache/arrow/pull/1640#issuecomment-367750917
 
 
   Appveyor build https://ci.appveyor.com/project/wesm/arrow/build/1.0.1712




> [Python] Serialization test fails on Windows when library has been built in 
> place / not installed
> -
>
> Key: ARROW-2131
> URL: https://issues.apache.org/jira/browse/ARROW-2131
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure why this doesn't come up in Appveyor:
> {code}
> == FAILURES 
> ===
>  test_deserialize_buffer_in_different_process 
> _
> def test_deserialize_buffer_in_different_process():
> import tempfile
> import subprocess
> f = tempfile.NamedTemporaryFile(delete=False)
> b = pa.serialize(pa.frombuffer(b'hello')).to_buffer()
> f.write(b.to_pybytes())
> f.close()
> dir_path = os.path.dirname(os.path.realpath(__file__))
> python_file = os.path.join(dir_path, 'deserialize_buffer.py')
> >   subprocess.check_call([sys.executable, python_file, f.name])
> pyarrow\tests\test_serialization.py:596:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> popenargs = (['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att'],)
> kwargs = {}, retcode = 1
> cmd = ['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> The arguments are the same as for the call function.  Example:
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']' returned non-zero 
> exit status 1.
> C:\Miniconda3\envs\pyarrow-dev\lib\subprocess.py:291: CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "C:\Users\wesm\code\arrow\python\pyarrow\tests\deserialize_buffer.py", 
> line 22, in <module>
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> === 1 failed, 15 passed, 4 skipped in 0.40 seconds 
> 
> {code}





[jira] [Created] (ARROW-2199) Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-2199:
---

 Summary: Follow up fixes for ARROW-2019. Ensure density driven 
capacity is never less than 1 and propagate density throughout the vector tree
 Key: ARROW-2199
 URL: https://issues.apache.org/jira/browse/ARROW-2199
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia
 Fix For: 0.9.0








[jira] [Updated] (ARROW-1463) [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated code

2018-02-22 Thread Siddharth Teotia (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia updated ARROW-1463:

Component/s: Java - Vectors
 Java - Memory

> [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated 
> code
> 
>
> Key: ARROW-1463
> URL: https://issues.apache.org/jira/browse/ARROW-1463
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Memory, Java - Vectors
>Reporter: Jacques Nadeau
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The templates used in the java package are very high maintenance and the if 
> conditions are hard to track. As started in the discussion here: 
> https://github.com/apache/arrow/pull/1012, I'd like to propose that we modify 
> the structure of the internal value vectors and code generation dynamics.
> Create new abstract base vectors:
> BaseFixedVector
> BaseVariableVector
> BaseNullableVector
> For each of these, implement all the basic functionality of a vector without 
> using templating.
> Evaluate whether to use code generation to generate specific specializations 
> of this functionality for each type where needed for performance purposes 
> (probably constrained to mutator and accessor set/get methods). Giant and 
> complex if conditions in the templates are actually worse from my perspective 
> than a small amount of hand written duplicated code since templates are much 
> harder to work with. 





[jira] [Updated] (ARROW-1807) [JAVA] Reduce Heap Usage (Phase 3): consolidate buffers

2018-02-22 Thread Siddharth Teotia (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia updated ARROW-1807:

Fix Version/s: 0.10.0

> [JAVA] Reduce Heap Usage (Phase 3): consolidate buffers
> ---
>
> Key: ARROW-1807
> URL: https://issues.apache.org/jira/browse/ARROW-1807
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
> Fix For: 0.10.0
>
>
> Consolidate buffers for reducing the volume of objects and heap usage
> <validity + data> => single buffer for fixed width
> <validity + offsets + data> => single buffer for var width, list vector





[jira] [Updated] (ARROW-2019) Control the memory allocated for inner vector in LIST

2018-02-22 Thread Siddharth Teotia (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia updated ARROW-2019:

Component/s: Java - Vectors

> Control the memory allocated for inner vector in LIST
> -
>
> Key: ARROW-2019
> URL: https://issues.apache.org/jira/browse/ARROW-2019
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We have observed cases in our external sort code where the amount of memory 
> actually allocated for a record batch sometimes turns out to be more than 
> necessary and also more than what was reserved by the operator for special 
> purposes. Thus queries fail with OOM.
> Usually the way to control the memory allocated by vector.allocateNew() is to 
> call setInitialCapacity(); the latter modifies the vector state variables which 
> are then used to allocate memory. However, due to the multiplier of 5 used in 
> List Vector, we end up asking for more memory than necessary. For example, 
> for a value count of 4095, we asked for 128KB of memory for an offset buffer 
> of VarCharVector for a field which was list of varchars. 
> We did ((4095 * 5) + 1) * 4 => 80KB . => 128KB (rounded off to power of 2 
> allocation). 
> We had earlier made changes to setInitialCapacity() of ListVector when we 
> were facing problems with deeply nested lists and decided to use the 
> multiplier only for the leaf scalar vector. 
> It looks like there is a need for a specialized setInitialCapacity() for 
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity() where the allocation of 
> validity buffer doesn't obey the capacity specified in setInitialCapacity(). 
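The arithmetic in the description can be checked directly: with the multiplier of 5 and 4-byte offsets, 4095 values request ((4095 × 5) + 1) × 4 = 81,904 bytes (just under 80KB), which a power-of-two allocator rounds up to 131,072 bytes (128KB). A sketch of that over-allocation — the function names are illustrative, only the multiplier and rounding mirror the description:

```python
def offset_buffer_request(value_count, multiplier=5, offset_width=4):
    """Bytes requested for a list-vector offset buffer when the inner
    capacity is inflated by `multiplier`, as described above."""
    return ((value_count * multiplier) + 1) * offset_width

def round_up_pow2(n):
    """Round n up to the next power of two, as a buddy-style allocator does."""
    return 1 << (n - 1).bit_length()

requested = offset_buffer_request(4095)   # 81904 bytes, ~80KB
allocated = round_up_pow2(requested)      # 131072 bytes = 128KB
```

This is why a caller-controlled (density-driven) setInitialCapacity for ListVector matters: dropping the blanket multiplier keeps the request below the next power-of-two boundary and halves the actual allocation.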





[jira] [Updated] (ARROW-2199) Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2199:
--
Labels: pull-request-available  (was: )

> Follow up fixes for ARROW-2019. Ensure density driven capacity is never less 
> than 1 and propagate density throughout the vector tree
> 
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2199) Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373391#comment-16373391
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia opened a new pull request #1646: ARROW-2199: [JAVA] Control the 
memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646
 
 
   Use density based setInitialCapacity and propagate density down the vector
   tree from complex vectors. Also ensure that density driven initial capacity
   is never less than 1.
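The guard described here — a density-driven capacity that is never less than 1, propagated from a container vector down to its children — can be sketched as follows (the names are illustrative Python, not the Java vector API):

```python
import math

def inner_capacity(value_count, density):
    """Capacity for an inner vector, sized from the parent's value count
    and an average elements-per-entry density. The max(1, ...) clamp
    ensures a small density (e.g. 0.1 with few values) never yields a
    zero-sized allocation."""
    return max(1, int(math.ceil(value_count * density)))
```

Without the clamp, a fractional density multiplied by a small value count can floor to zero, and a zero-capacity child buffer then fails on the first write.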




> Follow up fixes for ARROW-2019. Ensure density driven capacity is never less 
> than 1 and propagate density throughout the vector tree
> 
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373390#comment-16373390
 ] 

ASF GitHub Bot commented on ARROW-2184:
---

wesm commented on a change in pull request #1642: ARROW-2184: [C++]  Add static 
constructor for FileOutputStream returning shared_ptr to OutputStream
URL: https://github.com/apache/arrow/pull/1642#discussion_r170084165
 
 

 ##
 File path: cpp/src/arrow/io/file.h
 ##
 @@ -39,6 +39,21 @@ class ARROW_EXPORT FileOutputStream : public OutputStream {
  public:
   ~FileOutputStream() override;
 
+  /// \brief Open a local file for writing, truncating any existing file
+  /// \param[in] path with UTF8 encoding
+  /// \param[out] out a base interface OutputStream instance
+  ///
+  /// When opening a new file, any existing file with the indicated path is
+  /// truncated to 0 bytes, deleting any existing memory
+  static Status Open(const std::string& path, std::shared_ptr<OutputStream>* out);
+
+  /// \brief Open a local file for writing
+  /// \param[in] path with UTF8 encoding
+  /// \param[in] append append to existing file, otherwise truncate to 0 bytes
+  /// \param[out] out a base interface OutputStream instance
+  static Status Open(const std::string& path, bool append,
+                     std::shared_ptr<OutputStream>* out);
 
 Review comment:
   We need some unit tests for these. We should also change the Python bindings 
to use these APIs




> [C++] Add static ctor for FileOutputStream returning shared_ptr to base 
> OutputStream
> 
>
> Key: ARROW-2184
> URL: https://issues.apache.org/jira/browse/ARROW-2184
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It would be useful for most IO ctors to return pointers to the base interface 
> that they implement rather than the subclass. Whether we deprecate the 
> current ones will vary on a case-by-case basis.





[jira] [Commented] (ARROW-2199) Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373409#comment-16373409
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on issue #1646: ARROW-2199: [JAVA] Control the memory 
allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#issuecomment-367814904
 
 
   This has fixes and improvements we did in Dremio as follow-up changes to 
ARROW-2019 https://github.com/apache/arrow/pull/1497.
   
   cc @vkorukanti , @BryanCutler , @icexelloss , @jacques-n 




> Follow up fixes for ARROW-2019. Ensure density driven capacity is never less 
> than 1 and propagate density throughout the vector tree
> 
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread Siddharth Teotia (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia updated ARROW-2199:

Summary: [JAVA] Follow up fixes for ARROW-2019. Ensure density driven 
capacity is never less than 1 and propagate density throughout the vector tree  
(was: Follow up fixes for ARROW-2019. Ensure density driven capacity is never 
less than 1 and propagate density throughout the vector tree)

> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-2019) Control the memory allocated for inner vector in LIST

2018-02-22 Thread Siddharth Teotia (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia updated ARROW-2019:

Fix Version/s: 0.9.0

> Control the memory allocated for inner vector in LIST
> -
>
> Key: ARROW-2019
> URL: https://issues.apache.org/jira/browse/ARROW-2019
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We have observed cases in our external sort code where the amount of memory 
> actually allocated for a record batch sometimes turns out to be more than 
> necessary and also more than what was reserved by the operator for special 
> purposes. Thus queries fail with OOM.
> The usual way to control the memory allocated by vector.allocateNew() is to 
> call setInitialCapacity() first; the latter modifies the vector state variables 
> which are then used to allocate memory. However, due to the multiplier of 5 used 
> in ListVector, we end up asking for more memory than necessary. For example, 
> for a value count of 4095, we asked for 128KB of memory for the offset buffer 
> of the VarCharVector for a field which was a list of varchars: 
> ((4095 * 5) + 1) * 4 bytes => ~80KB, rounded up to 128KB (power-of-2 
> allocation).
> We had earlier made changes to setInitialCapacity() of ListVector when we 
> were facing problems with deeply nested lists and decided to use the 
> multiplier only for the leaf scalar vector. 
> It looks like there is a need for a specialized setInitialCapacity() for 
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity() where the allocation of 
> validity buffer doesn't obey the capacity specified in setInitialCapacity(). 
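The power-of-2 over-allocation described above can be sketched numerically (a hypothetical Python illustration; the function names and the multiplier/width constants are assumptions mirroring the issue description, not Arrow's actual Java code):

```python
def round_up_pow2(n):
    """Smallest power of two >= n, mimicking the allocator's rounding."""
    p = 1
    while p < n:
        p *= 2
    return p

def offset_buffer_bytes(value_count, multiplier=5, offset_width=4):
    # Mirrors ((4095 * 5) + 1) * 4 from the description above
    raw = ((value_count * multiplier) + 1) * offset_width
    return raw, round_up_pow2(raw)

raw, allocated = offset_buffer_bytes(4095)
print(raw, allocated)  # 81904 (~80KB) requested, 131072 (128KB) allocated
```

This makes the waste visible: the multiplier inflates the request past the 64KB boundary, so the power-of-2 rounding costs an extra ~48KB per such buffer.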





[jira] [Resolved] (ARROW-2019) Control the memory allocated for inner vector in LIST

2018-02-22 Thread Siddharth Teotia (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia resolved ARROW-2019.
-
Resolution: Fixed

> Control the memory allocated for inner vector in LIST
> -
>
> Key: ARROW-2019
> URL: https://issues.apache.org/jira/browse/ARROW-2019
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We have observed cases in our external sort code where the amount of memory 
> actually allocated for a record batch sometimes turns out to be more than 
> necessary and also more than what was reserved by the operator for special 
> purposes. Thus queries fail with OOM.
> The usual way to control the memory allocated by vector.allocateNew() is to 
> call setInitialCapacity() first; the latter modifies the vector state variables 
> which are then used to allocate memory. However, due to the multiplier of 5 used 
> in ListVector, we end up asking for more memory than necessary. For example, 
> for a value count of 4095, we asked for 128KB of memory for the offset buffer 
> of the VarCharVector for a field which was a list of varchars: 
> ((4095 * 5) + 1) * 4 bytes => ~80KB, rounded up to 128KB (power-of-2 
> allocation).
> We had earlier made changes to setInitialCapacity() of ListVector when we 
> were facing problems with deeply nested lists and decided to use the 
> multiplier only for the leaf scalar vector. 
> It looks like there is a need for a specialized setInitialCapacity() for 
> ListVector where the caller dictates the repeatedness.
> Also, there is another bug in setInitialCapacity() where the allocation of 
> validity buffer doesn't obey the capacity specified in setInitialCapacity(). 





[jira] [Commented] (ARROW-1463) [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated code

2018-02-22 Thread Siddharth Teotia (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373438#comment-16373438
 ] 

Siddharth Teotia commented on ARROW-1463:
-

This work and all the follow-up refactoring are complete. I am marking it as 
resolved. Setting the fix-version as 0.9.0 although some of the tasks were 
completed in 0.8.0.

> [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated 
> code
> 
>
> Key: ARROW-1463
> URL: https://issues.apache.org/jira/browse/ARROW-1463
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jacques Nadeau
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The templates used in the java package are very high maintenance and the if 
> conditions are hard to track. As started in the discussion here: 
> https://github.com/apache/arrow/pull/1012, I'd like to propose that we modify 
> the structure of the internal value vectors and code generation dynamics.
> Create new abstract base vectors:
> BaseFixedVector
> BaseVariableVector
> BaseNullableVector
> For each of these, implement all the basic functionality of a vector without 
> using templating.
> Evaluate whether to use code generation to generate specific specializations 
> of this functionality for each type where needed for performance purposes 
> (probably constrained to mutator and accessor set/get methods). Giant and 
> complex if conditions in the templates are actually worse from my perspective 
> than a small amount of hand written duplicated code since templates are much 
> harder to work with. 





[jira] [Resolved] (ARROW-1463) [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated code

2018-02-22 Thread Siddharth Teotia (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia resolved ARROW-1463.
-
   Resolution: Fixed
Fix Version/s: (was: 0.10.0)
   0.9.0

> [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated 
> code
> 
>
> Key: ARROW-1463
> URL: https://issues.apache.org/jira/browse/ARROW-1463
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jacques Nadeau
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The templates used in the java package are very high maintenance and the if 
> conditions are hard to track. As started in the discussion here: 
> https://github.com/apache/arrow/pull/1012, I'd like to propose that we modify 
> the structure of the internal value vectors and code generation dynamics.
> Create new abstract base vectors:
> BaseFixedVector
> BaseVariableVector
> BaseNullableVector
> For each of these, implement all the basic functionality of a vector without 
> using templating.
> Evaluate whether to use code generation to generate specific specializations 
> of this functionality for each type where needed for performance purposes 
> (probably constrained to mutator and accessor set/get methods). Giant and 
> complex if conditions in the templates are actually worse from my perspective 
> than a small amount of hand written duplicated code since templates are much 
> harder to work with. 





[jira] [Commented] (ARROW-1621) [JAVA] Reduce Heap Usage per Vector

2018-02-22 Thread Siddharth Teotia (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373442#comment-16373442
 ] 

Siddharth Teotia commented on ARROW-1621:
-

ARROW-1807 needs to be done.

> [JAVA] Reduce Heap Usage per Vector
> ---
>
> Key: ARROW-1621
> URL: https://issues.apache.org/jira/browse/ARROW-1621
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Memory, Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
> Fix For: 0.10.0
>
>
> https://docs.google.com/document/d/1MU-ah_bBHIxXNrd7SkwewGCOOexkXJ7cgKaCis5f-PI/edit





[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373598#comment-16373598
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

vkorukanti commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r170115978
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java
 ##
 @@ -166,13 +168,23 @@ public void setInitialCapacity(int numRecords) {
*This helps in tightly controlling the memory we provision
*for inner data vector.
*/
+  @Override
   public void setInitialCapacity(int numRecords, double density) {
+if ((numRecords * density) >= 2_000_000_000) {
 
 Review comment:
   why are we using a different constant here than MAX_ALLOCATION_SIZE?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373597#comment-16373597
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

vkorukanti commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r170116190
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java
 ##
 @@ -166,13 +168,23 @@ public void setInitialCapacity(int numRecords) {
*This helps in tightly controlling the memory we provision
*for inner data vector.
*/
+  @Override
   public void setInitialCapacity(int numRecords, double density) {
+if ((numRecords * density) >= 2_000_000_000) {
+  throw new OversizedAllocationException("Requested amount of memory is 
more than max allowed");
 
 Review comment:
   If possible can we add some context here like the current capacity control 
variables? Useful in debugging.




> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2201) [Website] Publish JS API Docs

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373678#comment-16373678
 ] 

ASF GitHub Bot commented on ARROW-2201:
---

TheNeuralBit opened a new pull request #1647: ARROW-2201: [Website] Publish JS 
API Docs
URL: https://github.com/apache/arrow/pull/1647
 
 
   




> [Website] Publish JS API Docs
> -
>
> Key: ARROW-2201
> URL: https://issues.apache.org/jira/browse/ARROW-2201
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Minor
>  Labels: pull-request-available
>
> ARROW-951 isn't yet resolved because the generated API docs don't reflect the 
> full project hierarchy, but we can still publish the current docs as they 
> stand.





[jira] [Updated] (ARROW-2201) [Website] Publish JS API Docs

2018-02-22 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2201:
--
Labels: pull-request-available  (was: )

> [Website] Publish JS API Docs
> -
>
> Key: ARROW-2201
> URL: https://issues.apache.org/jira/browse/ARROW-2201
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Minor
>  Labels: pull-request-available
>
> ARROW-951 isn't yet resolved because the generated API docs don't reflect the 
> full project hierarchy, but we can still publish the current docs as they 
> stand.





[jira] [Created] (ARROW-2201) [Website] Publish JS API Docs

2018-02-22 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2201:


 Summary: [Website] Publish JS API Docs
 Key: ARROW-2201
 URL: https://issues.apache.org/jira/browse/ARROW-2201
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Brian Hulette
Assignee: Brian Hulette


ARROW-951 isn't yet resolved because the generated API docs don't reflect the 
full project hierarchy, but we can still publish the current docs as they stand.





[jira] [Commented] (ARROW-2069) [Python] Document that Plasma is not (yet) supported on Windows

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373624#comment-16373624
 ] 

ASF GitHub Bot commented on ARROW-2069:
---

wesm closed pull request #1641: ARROW-2069: [Python] Add note that Plasma is 
not supported on Windows
URL: https://github.com/apache/arrow/pull/1641
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst
index 74837b96c..b64b4c260 100644
--- a/python/doc/source/plasma.rst
+++ b/python/doc/source/plasma.rst
@@ -24,6 +24,9 @@ The Plasma In-Memory Object Store
 .. contents:: Contents
   :depth: 3
 
+.. note::
+
+   At present, Plasma is only supported for use on Linux and macOS.
 
 The Plasma API
 --


 




> [Python] Document that Plasma is not (yet) supported on Windows
> ---
>
> Key: ARROW-2069
> URL: https://issues.apache.org/jira/browse/ARROW-2069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1531





[jira] [Commented] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373743#comment-16373743
 ] 

ASF GitHub Bot commented on ARROW-2185:
---

wesm closed pull request #1639: ARROW-2185: Strip CI directives from commit 
messages
URL: https://github.com/apache/arrow/pull/1639
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py
index 6c0e66376..74f0762c3 100755
--- a/dev/merge_arrow_pr.py
+++ b/dev/merge_arrow_pr.py
@@ -175,7 +175,8 @@ def merge_pr(pr_num, target_ref):
 "Closes #%s from %s and squashes the following commits:"
 % (pr_num, pr_repo_desc)]
 for c in commits:
-merge_message_flags += ["-m", c]
+stripped_message = strip_ci_directives(c).strip()
+merge_message_flags += ["-m", stripped_message]
 
 run_cmd(['git', 'commit',
  '--no-verify',  # do not run commit hooks
@@ -199,6 +200,15 @@ def merge_pr(pr_num, target_ref):
 return merge_hash
 
 
+_REGEX_CI_DIRECTIVE = re.compile('\[[^\]]*\]')
+
+
+def strip_ci_directives(commit_message):
+# Remove things like '[force ci]', '[skip appveyor]' from the assembled
+# commit message
+return _REGEX_CI_DIRECTIVE.sub('', commit_message)
+
+
 def fix_version_from_branch(branch, versions):
 # Note: Assumes this is a sorted (newest->oldest) list of un-released
 # versions
@@ -209,7 +219,7 @@ def fix_version_from_branch(branch, versions):
 return [x for x in versions if x.name.startswith(branch_ver)][-1]
 
 
-def exctract_jira_id(title):
+def extract_jira_id(title):
 m = re.search(r'^(ARROW-[0-9]+)\b.*$', title)
 if m:
 return m.group(1)
@@ -219,7 +229,7 @@ def exctract_jira_id(title):
 
 
 def check_jira(title):
-jira_id = exctract_jira_id(title)
+jira_id = extract_jira_id(title)
 asf_jira = jira.client.JIRA({'server': JIRA_API_BASE},
 basic_auth=(JIRA_USERNAME, JIRA_PASSWORD))
 try:
@@ -232,7 +242,7 @@ def resolve_jira(title, merge_branches, comment):
 asf_jira = jira.client.JIRA({'server': JIRA_API_BASE},
 basic_auth=(JIRA_USERNAME, JIRA_PASSWORD))
 
-default_jira_id = exctract_jira_id(title)
+default_jira_id = extract_jira_id(title)
 
 jira_id = input("Enter a JIRA id [%s]: " % default_jira_id)
 if jira_id == "":


 




> Remove CI directives from squashed commit messages
> --
>
> Key: ARROW-2185
> URL: https://issues.apache.org/jira/browse/ARROW-2185
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In our PR squash tool, we are potentially picking up CI directives like 
> {{[skip appveyor]}} from intermediate commits. We should regex these away and 
> instead use directives in the PR title if we wish the commit to master to 
> behave in a certain way
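The directive-stripping regex from the merged diff can be exercised in isolation (a sketch; the sample commit message is made up):

```python
import re

# Same pattern as _REGEX_CI_DIRECTIVE in the dev/merge_arrow_pr.py diff above
ci_directive = re.compile(r'\[[^\]]*\]')

def strip_ci_directives(commit_message):
    # Remove things like '[force ci]', '[skip appveyor]' from a squashed
    # intermediate commit message
    return ci_directive.sub('', commit_message)

msg = "Fix flaky plasma test [skip appveyor] [force ci]"
print(strip_ci_directives(msg).strip())  # Fix flaky plasma test
```

Note the pattern removes any bracketed span, which is why it is applied only to the intermediate commit messages being squashed, not to the PR title.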





[jira] [Commented] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373753#comment-16373753
 ] 

ASF GitHub Bot commented on ARROW-2093:
---

wesm closed pull request #1637: ARROW-2093: [Python] Do not install PyTorch in 
Travis CI
URL: https://github.com/apache/arrow/pull/1637
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
index a487da596..9ed5825bb 100755
--- a/ci/travis_script_python.sh
+++ b/ci/travis_script_python.sh
@@ -43,11 +43,14 @@ conda install -y -q pip \
   pandas \
   cython
 
-if [ "$PYTHON_VERSION" != "2.7" ] || [ $TRAVIS_OS_NAME != "osx" ]; then
-  # Install pytorch for torch tensor conversion tests
-  # PyTorch seems to be broken on Python 2.7 on macOS so we skip it
-  conda install -y -q pytorch torchvision -c soumith
-fi
+# ARROW-2093: PyTorch increases the size of our conda dependency stack
+# significantly, and so we have disabled these tests in Travis CI for now
+
+# if [ "$PYTHON_VERSION" != "2.7" ] || [ $TRAVIS_OS_NAME != "osx" ]; then
+#   # Install pytorch for torch tensor conversion tests
+#   # PyTorch seems to be broken on Python 2.7 on macOS so we skip it
+#   conda install -y -q pytorch torchvision -c soumith
+# fi
 
 # Build C++ libraries
 mkdir -p $ARROW_CPP_BUILD_DIR


 




> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400MB in binaries
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  
> conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373664#comment-16373664
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r170124436
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java
 ##
 @@ -166,13 +168,23 @@ public void setInitialCapacity(int numRecords) {
*This helps in tightly controlling the memory we provision
*for inner data vector.
*/
+  @Override
   public void setInitialCapacity(int numRecords, double density) {
+if ((numRecords * density) >= 2_000_000_000) {
 
 Review comment:
   Done to safeguard against truncation. We can use Integer.MAX_VALUE instead.
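The truncation hazard being discussed can be demonstrated outside Java (a Python sketch; the guard constant and the function name are assumptions based on the diff above, not the actual Arrow code):

```python
import ctypes

CAP = 2_000_000_000  # guard from the diff; stays below Integer.MAX_VALUE (2**31 - 1)

def density_capacity(num_records, density):
    requested = num_records * density  # floating-point, may exceed the int32 range
    if requested >= CAP:
        raise OverflowError("Requested amount of memory is more than max allowed")
    # density-driven capacity is never allowed to fall below 1
    return max(int(requested), 1)

# Narrowing an unguarded large value to 32 bits silently wraps negative:
print(ctypes.c_int32(3_000_000_000).value)  # -1294967296
print(density_capacity(1000, 0.0001))       # 1
```

The guard therefore serves two purposes: it rejects requests that would wrap when narrowed to a Java int, and the `max(..., 1)` clamp enforces the "never less than 1" invariant from the issue title.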




> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373662#comment-16373662
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r170124342
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java
 ##
 @@ -166,13 +168,23 @@ public void setInitialCapacity(int numRecords) {
*This helps in tightly controlling the memory we provision
*for inner data vector.
*/
+  @Override
   public void setInitialCapacity(int numRecords, double density) {
+if ((numRecords * density) >= 2_000_000_000) {
+  throw new OversizedAllocationException("Requested amount of memory is 
more than max allowed");
 
 Review comment:
   Sure, I will file a JIRA for this since this sort of problem needs to be 
addressed throughout the code.




> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Resolved] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-22 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2093.
-
Resolution: Fixed

Issue resolved by pull request 1637
[https://github.com/apache/arrow/pull/1637]

> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400MB in binaries
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  
> conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Created] (ARROW-2200) Arrow + plasma build issues

2018-02-22 Thread Zongheng Yang (JIRA)
Zongheng Yang created ARROW-2200:


 Summary: Arrow + plasma build issues
 Key: ARROW-2200
 URL: https://issues.apache.org/jira/browse/ARROW-2200
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Zongheng Yang


I'm looking into Plasma's use of the XXH64 hash library, and whether we can replace 
it with google/crc32c.
 
Here's my build 
[change|https://github.com/concretevitamin/arrow/commit/e4abaddf55255bf2e773b1094287bfd99a6dfb69].
 
 
With this change, for some reason, libcrc32c.a (which is successfully built) did 
NOT get linked into the plasma_static library, whereas plasma_shared and 
plasma_store did link with it:
 
---
 » tail ./src/plasma/CMakeFiles/plasma_\{static,shared,store}.dir/link.txt
==> ./src/plasma/CMakeFiles/plasma_static.dir/link.txt <==
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ar
 qc ../../release/libplasma.a  CMakeFiles/plasma_objlib.dir/client.cc.o 
CMakeFiles/plasma_objlib.dir/common.cc.o 
CMakeFiles/plasma_objlib.dir/eviction_policy.cc.o 
CMakeFiles/plasma_objlib.dir/events.cc.o 
CMakeFiles/plasma_objlib.dir/fling.cc.o CMakeFiles/plasma_objlib.dir/io.cc.o 
CMakeFiles/plasma_objlib.dir/malloc.cc.o 
CMakeFiles/plasma_objlib.dir/plasma.cc.o 
CMakeFiles/plasma_objlib.dir/protocol.cc.o 
CMakeFiles/plasma_objlib.dir/thirdparty/ae/ae.c.o 
CMakeFiles/plasma_objlib.dir/thirdparty/xxhash.cc.o
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib
 ../../release/libplasma.a
 
==> ./src/plasma/CMakeFiles/plasma_shared.dir/link.txt <==
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
 -g -O3 -O3 -DNDEBUG  -Wall -std=c++11 -msse3 -stdlib=libc++  
-Qunused-arguments  -D_XOPEN_SOURCE=500 -D_POSIX_C_SOURCE=200809L -fPIC -O3 
-DNDEBUG -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup 
 -o ../../release/libplasma.0.0.0.dylib -install_name @rpath/libplasma.0.dylib 
...
../../crc32c_ep/src/crc32c_ep-install/lib/libcrc32c.a ../../release/libarrow.a 
/usr/lib/libpthread.dylib /usr/local/lib/libboost_system-mt.a 
/usr/local/lib/libboost_filesystem-mt.a
 
==> ./src/plasma/CMakeFiles/plasma_store.dir/link.txt <==
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
  -g -O3 -O3 -DNDEBUG  -Wall -std=c++11 -msse3 -stdlib=libc++  
-Qunused-arguments  -D_XOPEN_SOURCE=500 -D_POSIX_C_SOURCE=200809L -fPIC -O3 
-DNDEBUG -Wl,-search_paths_first -Wl,-headerpad_max_install_names  
CMakeFiles/plasma_store.dir/store.cc.o  -o ../../release/plasma_store 
../../release/libplasma.a ../../crc32c_ep/src/crc32c_ep-install/lib/libcrc32c.a 
../../release/libarrow.a /usr/lib/libpthread.dylib 
/usr/local/lib/libboost_system-mt.a /usr/local/lib/libboost_filesystem-mt.a

---
 
Do you see what's going on?  What am I doing wrong to not have "plasma_static" 
depend on "crc32c_ep"?
 
Any advice will be greatly appreciated,
Zongheng





[jira] [Created] (ARROW-2202) [JS] Add DataFrame.toJSON

2018-02-22 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2202:


 Summary: [JS] Add DataFrame.toJSON
 Key: ARROW-2202
 URL: https://issues.apache.org/jira/browse/ARROW-2202
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Brian Hulette


Currently, {{CountByResult}} has its own [{{toJSON}} 
method|https://github.com/apache/arrow/blob/master/js/src/table.ts#L282], but 
there should be a more general one for every {{DataFrame}}.

{{CountByResult.toJSON}} returns:
{code:json}
{
  "keyA": 10,
  "keyB": 10,
  ...
}{code}

A more general {{toJSON}} could just return a list of objects with an entry for 
each column. For the above {{CountByResult}}, the output would look like:
{code:json}
[
  {"value": "keyA", "count": 10},
  {"value": "keyB", "count": 10},
  ...
]{code}
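As a language-neutral sketch of the proposed column-to-row transformation (the function and variable names below are illustrative, not the Arrow JS API), the general {{toJSON}} behavior could look like:

```python
import json

def table_to_json(columns):
    """Convert a column-oriented table (name -> list of values) into the
    row-oriented list-of-objects shape proposed above.  This is only a
    sketch of the transformation, not the actual DataFrame implementation."""
    names = list(columns)
    # All columns are assumed to have the same length.
    length = len(next(iter(columns.values()), []))
    return [{name: columns[name][i] for name in names} for i in range(length)]

# A CountByResult-like table with "value" and "count" columns.
count_by = {"value": ["keyA", "keyB"], "count": [10, 10]}
print(json.dumps(table_to_json(count_by)))
```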





[jira] [Created] (ARROW-2203) [C++] StderrStream class

2018-02-22 Thread Rares Vernica (JIRA)
Rares Vernica created ARROW-2203:


 Summary: [C++] StderrStream class
 Key: ARROW-2203
 URL: https://issues.apache.org/jira/browse/ARROW-2203
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Rares Vernica


The C++ API supports reading and writing data from and to STDIN and STDOUT via 
the arrow::io::StdinStream and arrow::io::StdoutStream classes. In some 
scenarios it might be useful to write data to STDERR. Adding a StderrStream 
class should be a trivial addition given the StdoutStream class.

If you think a StderrStream class is a good idea, I am more than happy to 
submit a PR.
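For illustration only (the proposed class would be C++ and mirror arrow::io::StdoutStream; the names and methods below are hypothetical, not the Arrow API), the shape of such a stream can be sketched as:

```python
import io
import sys

class StderrStream:
    """Illustrative sketch of the proposed StderrStream: an output stream
    whose sink is standard error, tracking the bytes written so far.
    The sink is injectable here purely to make the sketch testable."""

    def __init__(self, sink=None):
        # Default to the raw stderr byte stream.
        self._sink = sink if sink is not None else sys.stderr.buffer
        self._position = 0

    def write(self, data: bytes) -> None:
        self._sink.write(data)
        self._position += len(data)

    def tell(self) -> int:
        return self._position
```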





[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373583#comment-16373583
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on issue #1646: ARROW-2199: [JAVA] Control the memory 
allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#issuecomment-367814904
 
 
   This contains the fixes and improvements we made in Dremio as follow-up 
changes to ARROW-2019 (https://github.com/apache/arrow/pull/1497).
   
   cc @vkorukanti , @BryanCutler , @icexelloss , @jacques-n, @StevenMPhillips 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373626#comment-16373626
 ] 

ASF GitHub Bot commented on ARROW-2185:
---

wesm commented on issue #1639: ARROW-2185: Strip CI directives from commit 
messages
URL: https://github.com/apache/arrow/pull/1639#issuecomment-367851003
 
 
   +1. I'll merge this using the modified script as validation that "it works"




> Remove CI directives from squashed commit messages
> --
>
> Key: ARROW-2185
> URL: https://issues.apache.org/jira/browse/ARROW-2185
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In our PR squash tool, we are potentially picking up CI directives like 
> {{[skip appveyor]}} from intermediate commits. We should strip these with a 
> regex and instead use directives in the PR title if we want the commit to 
> master to behave in a certain way.
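A minimal sketch of the kind of regex scrubbing this proposes (the directive patterns and helper name are hypothetical, not the actual merge-tool code):

```python
import re

# Hypothetical set of CI directives to scrub; the real tool's list may differ.
CI_DIRECTIVE_RE = re.compile(
    r"\[\s*skip\s+(?:appveyor|ci|travis)\s*\]", re.IGNORECASE)

def strip_ci_directives(commit_message: str) -> str:
    """Remove CI directives such as [skip appveyor] from a squashed commit
    message, collapsing any doubled spaces left behind."""
    cleaned = CI_DIRECTIVE_RE.sub("", commit_message)
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()
```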





[jira] [Resolved] (ARROW-2069) [Python] Document that Plasma is not (yet) supported on Windows

2018-02-22 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2069.
-
Resolution: Fixed

Issue resolved by pull request 1641
[https://github.com/apache/arrow/pull/1641]

> [Python] Document that Plasma is not (yet) supported on Windows
> ---
>
> Key: ARROW-2069
> URL: https://issues.apache.org/jira/browse/ARROW-2069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1531





[jira] [Commented] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373750#comment-16373750
 ] 

ASF GitHub Bot commented on ARROW-2093:
---

wesm commented on issue #1637: ARROW-2093: [Python] Do not install PyTorch in 
Travis CI
URL: https://github.com/apache/arrow/pull/1637#issuecomment-367873718
 
 
   +1




> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400 MB of binaries:
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Created] (ARROW-2197) Document "undefined symbol" issue and workaround

2018-02-22 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2197:
-

 Summary: Document "undefined symbol" issue and workaround
 Key: ARROW-2197
 URL: https://issues.apache.org/jira/browse/ARROW-2197
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


See [https://github.com/apache/arrow/issues/1612]





[jira] [Commented] (ARROW-2197) Document "undefined symbol" issue and workaround

2018-02-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372883#comment-16372883
 ] 

ASF GitHub Bot commented on ARROW-2197:
---

pitrou opened a new pull request #1644: ARROW-2197: Document C++ ABI issue and 
workaround
URL: https://github.com/apache/arrow/pull/1644
 
 
   




> Document "undefined symbol" issue and workaround
> 
>
> Key: ARROW-2197
> URL: https://issues.apache.org/jira/browse/ARROW-2197
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> See [https://github.com/apache/arrow/issues/1612]





[jira] [Updated] (ARROW-2197) Document "undefined symbol" issue and workaround

2018-02-22 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2197:
--
Labels: pull-request-available  (was: )

> Document "undefined symbol" issue and workaround
> 
>
> Key: ARROW-2197
> URL: https://issues.apache.org/jira/browse/ARROW-2197
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> See [https://github.com/apache/arrow/issues/1612]


