[jira] [Assigned] (ARROW-714) [C++] Add import_pyarrow C API in the style of NumPy for thirdparty C++ users

2017-05-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-714:
--

Assignee: Wes McKinney

> [C++] Add import_pyarrow C API in the style of NumPy for thirdparty C++ users
> -
>
> Key: ARROW-714
> URL: https://issues.apache.org/jira/browse/ARROW-714
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.4.0
>
>
> See the implementation of import_array in NumPy for this purpose:
> https://github.com/numpy/numpy/blob/c90d7c94fd2077d0beca48fa89a423da2b0bb663/numpy/core/code_generators/generate_numpy_api.py#L46



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (ARROW-1016) Python: Include C++ headers (optionally) in wheels

2017-05-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1016:

Fix Version/s: 0.4.0

> Python: Include C++ headers (optionally) in wheels
> --
>
> Key: ARROW-1016
> URL: https://issues.apache.org/jira/browse/ARROW-1016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
> Fix For: 0.4.0
>
>
> This is not the most beautiful solution (that would be using conda :D), but it 
> is a first step toward wheels that other Python packages with native code can 
> build against.





[jira] [Resolved] (ARROW-1016) Python: Include C++ headers (optionally) in wheels

2017-05-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1016.
-
Resolution: Fixed

Issue resolved by pull request 678
[https://github.com/apache/arrow/pull/678]

> Python: Include C++ headers (optionally) in wheels
> --
>
> Key: ARROW-1016
> URL: https://issues.apache.org/jira/browse/ARROW-1016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> This is not the most beautiful solution (that would be using conda :D), but it 
> is a first step toward wheels that other Python packages with native code can 
> build against.





[jira] [Updated] (ARROW-1008) [C++] Define abstract interface for stream iteration

2017-05-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1008:

Fix Version/s: 0.4.0

> [C++] Define abstract interface for stream iteration
> 
>
> Key: ARROW-1008
> URL: https://issues.apache.org/jira/browse/ARROW-1008
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.4.0
>
>
> The purpose of this JIRA is to decouple the physical structure of the stream 
> from the StreamReader API. If we wanted to put the stream components into 
> different physical files on disk, this would permit the construction of a 
> different kind of StreamIterator that knows how to read the respective stream 
> components from those files.
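
A minimal Python sketch of the decoupling described above, offered only as an illustration and not as the C++ design. The names MessageSource, SingleBufferSource, MultiFileSource and read_all are hypothetical, and the 4-byte length prefix is an assumed toy framing, not the Arrow stream format. The reader depends only on the abstract interface, so the physical layout can change without touching it.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Optional


class MessageSource(ABC):
    """Hypothetical interface: hands out the stream's components in order."""

    @abstractmethod
    def next_message(self) -> Optional[bytes]:
        """Return the next component, or None at end of stream."""

    def __iter__(self) -> Iterator[bytes]:
        while True:
            message = self.next_message()
            if message is None:
                return
            yield message


class SingleBufferSource(MessageSource):
    """All components sit back to back in one buffer (toy length-prefixed framing)."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._pos = 0

    def next_message(self) -> Optional[bytes]:
        if self._pos >= len(self._data):
            return None
        size = int.from_bytes(self._data[self._pos:self._pos + 4], "little")
        start = self._pos + 4
        self._pos = start + size
        return self._data[start:self._pos]


class MultiFileSource(MessageSource):
    """Each component is stored in its own file on disk."""

    def __init__(self, paths: List[str]) -> None:
        self._paths = list(paths)
        self._index = 0

    def next_message(self) -> Optional[bytes]:
        if self._index >= len(self._paths):
            return None
        with open(self._paths[self._index], "rb") as f:
            payload = f.read()
        self._index += 1
        return payload


def read_all(source: MessageSource) -> List[bytes]:
    """A reader that relies on the interface only, not on the physical layout."""
    return list(source)
```

Both sources can be passed to read_all unchanged, which is the property the issue asks for.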





[jira] [Assigned] (ARROW-1008) [C++] Define abstract interface for stream iteration

2017-05-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1008:
---

Assignee: Wes McKinney

> [C++] Define abstract interface for stream iteration
> 
>
> Key: ARROW-1008
> URL: https://issues.apache.org/jira/browse/ARROW-1008
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.4.0
>
>
> The purpose of this JIRA is to decouple the physical structure of the stream 
> from the StreamReader API. If we wanted to put the stream components into 
> different physical files on disk, this would permit the construction of a 
> different kind of StreamIterator that knows how to read the respective stream 
> components from those files.





[jira] [Resolved] (ARROW-1010) [Website] Only show English posts in /blog/

2017-05-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1010.
-
Resolution: Fixed

Issue resolved by pull request 675
[https://github.com/apache/arrow/pull/675]

> [Website] Only show English posts in /blog/
> ---
>
> Key: ARROW-1010
> URL: https://issues.apache.org/jira/browse/ARROW-1010
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> Translated blog posts can link to each other, but I think each blog post 
> should only appear once in the blogroll.





[jira] [Created] (ARROW-1019) [C++] Implement input stream and output stream with Gzip codec

2017-05-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1019:
---

 Summary: [C++] Implement input stream and output stream with Gzip 
codec
 Key: ARROW-1019
 URL: https://issues.apache.org/jira/browse/ARROW-1019
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


After incorporating the compression code and toolchain from parquet-cpp, we 
should be able to add a codec layer for on-the-fly compression and 
decompression.
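
To illustrate what a codec layer over a raw stream looks like, here is a rough Python sketch using only the standard library's gzip module. It is not the proposed C++ API; the function names write_gzip_stream and read_gzip_stream are hypothetical. The wrapper compresses or decompresses chunk by chunk while the underlying stream stays an ordinary byte sink or source.

```python
import gzip
import io


def write_gzip_stream(raw_sink, chunks):
    """Compress chunks on the fly as they are written to raw_sink."""
    # Closing the GzipFile flushes the gzip trailer but leaves raw_sink open.
    with gzip.GzipFile(fileobj=raw_sink, mode="wb") as out:
        for chunk in chunks:
            out.write(chunk)


def read_gzip_stream(raw_source, chunk_size=4096):
    """Yield decompressed chunks as they are read from raw_source."""
    with gzip.GzipFile(fileobj=raw_source, mode="rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    sink = io.BytesIO()
    write_gzip_stream(sink, [b"record batch 0\n", b"record batch 1\n"])
    for piece in read_gzip_stream(io.BytesIO(sink.getvalue())):
        print(piece)
```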





[jira] [Created] (ARROW-1018) [C++] Add option to create FileOutputStream, ReadableFile from OS file descriptor

2017-05-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1018:
---

 Summary: [C++] Add option to create FileOutputStream, ReadableFile 
from OS file descriptor
 Key: ARROW-1018
 URL: https://issues.apache.org/jira/browse/ARROW-1018
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


Currently we require a file path. It should also be possible to initialize from 
a file descriptor.
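
As an analogy for the requested option, Python's built-in open() already accepts either a path or an integer file descriptor. The sketch below is a hedged illustration of that idea, not the Arrow C++ interface; open_output is a hypothetical helper name, and leaving descriptor ownership with the caller is an assumption made for the example.

```python
import os


def open_output(target):
    """Open a binary output stream from a path (str) or an OS file descriptor (int)."""
    if isinstance(target, int):
        # closefd=False: the caller keeps ownership of the descriptor
        # and remains responsible for closing it.
        return open(target, "wb", closefd=False)
    return open(target, "wb")


if __name__ == "__main__":
    fd = os.open("example.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    with open_output(fd) as stream:
        stream.write(b"written through a descriptor")
    os.close(fd)
```

Whether the stream or the caller should close the descriptor is a design choice the issue leaves open.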





[jira] [Commented] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory

2017-05-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008513#comment-16008513
 ] 

Wes McKinney commented on ARROW-1017:
-

[~xhochy] we should try to investigate this before approving a parquet-cpp 
1.1.0 release in case there is a memory leak in libparquet

> Python: Calling to_pandas on a Parquet file in HDFS leaks memory
> 
>
> Key: ARROW-1017
> URL: https://issues.apache.org/jira/browse/ARROW-1017
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.3.0
>Reporter: James Porritt
> Fix For: 0.4.0
>
>
> Running the following code results in ever-increasing memory usage, even 
> though I would expect the dataframe to be garbage collected when it goes out 
> of scope. For the size of my Parquet file, I see usage increase by about 
> 3 GB per loop iteration:
> {code}
> from pyarrow import HdfsClient
>
> def read_parquet_file(client, parquet_file):
>     # Read the Parquet file from HDFS, then convert the result to pandas.
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
>
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?





[jira] [Commented] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory

2017-05-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008481#comment-16008481
 ] 

Wes McKinney commented on ARROW-1017:
-

Thanks [~jporritt], I will take a look and see if I can reproduce the issue. 

> Python: Calling to_pandas on a Parquet file in HDFS leaks memory
> 
>
> Key: ARROW-1017
> URL: https://issues.apache.org/jira/browse/ARROW-1017
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.3.0
>Reporter: James Porritt
> Fix For: 0.4.0
>
>
> Running the following code results in ever-increasing memory usage, even 
> though I would expect the dataframe to be garbage collected when it goes out 
> of scope. For the size of my Parquet file, I see usage increase by about 
> 3 GB per loop iteration:
> {code}
> from pyarrow import HdfsClient
>
> def read_parquet_file(client, parquet_file):
>     # Read the Parquet file from HDFS, then convert the result to pandas.
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
>
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?





[jira] [Updated] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory

2017-05-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1017:

Fix Version/s: 0.4.0

> Python: Calling to_pandas on a Parquet file in HDFS leaks memory
> 
>
> Key: ARROW-1017
> URL: https://issues.apache.org/jira/browse/ARROW-1017
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.3.0
>Reporter: James Porritt
> Fix For: 0.4.0
>
>
> Running the following code results in ever-increasing memory usage, even 
> though I would expect the dataframe to be garbage collected when it goes out 
> of scope. For the size of my Parquet file, I see usage increase by about 
> 3 GB per loop iteration:
> {code}
> from pyarrow import HdfsClient
>
> def read_parquet_file(client, parquet_file):
>     # Read the Parquet file from HDFS, then convert the result to pandas.
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
>
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?





[jira] [Created] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory

2017-05-12 Thread James Porritt (JIRA)
James Porritt created ARROW-1017:


 Summary: Python: Calling to_pandas on a Parquet file in HDFS leaks 
memory
 Key: ARROW-1017
 URL: https://issues.apache.org/jira/browse/ARROW-1017
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.3.0
Reporter: James Porritt


Running the following code results in ever-increasing memory usage, even though 
I would expect the dataframe to be garbage collected when it goes out of scope. 
For the size of my Parquet file, I see usage increase by about 3 GB per loop 
iteration:

{code}
from pyarrow import HdfsClient

def read_parquet_file(client, parquet_file):
    # Read the Parquet file from HDFS, then convert the result to pandas.
    parquet = client.read_parquet(parquet_file)
    df = parquet.to_pandas()

client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
parquet_file = '/my/parquet/file'
while True:
    read_parquet_file(client, parquet_file)
{code}

Is there a reference count issue similar to ARROW-362?
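
One way to put numbers on the growth reported above is to sample the process's peak resident set size after each call. The following is a rough sketch under stated assumptions: peak_rss_per_iteration is a hypothetical helper, the resource module is Unix-only, and ru_maxrss is reported in kilobytes on Linux but in bytes on macOS. It forces a garbage collection after every call, so growth that persists is not just uncollected Python garbage.

```python
import gc
import resource


def peak_rss_per_iteration(func, iterations=5):
    """Call func repeatedly and record the process's peak RSS after each call."""
    samples = []
    for _ in range(iterations):
        func()
        # Collect first, so pending Python garbage is not mistaken for a leak.
        gc.collect()
        samples.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return samples


if __name__ == "__main__":
    # Stand-in workload; substitute the call under investigation.
    print(peak_rss_per_iteration(lambda: [0] * 1_000_000))
```

With the reproduction above, the workload would be lambda: read_parquet_file(client, parquet_file). Because ru_maxrss is a peak value, a steadily rising series suggests memory that is never returned, while a flat series after the first call suggests reuse.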





[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-05-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008315#comment-16008315
 ] 

Kazuaki Ishizaki commented on ARROW-300:


Thank you for your response. I was also busy preparing materials for GTC. 
Now is a good time to write a document.

Preparing a Google document to collect public comments sounds good. I 
will start drafting a document covering the purpose, scope, and design.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>
> If data is to be sent over the wire, it may be useful to compress the data 
> buffers themselves as they are being written in the file layout.
> I would propose that we keep this extremely simple, with a global buffer 
> compression setting in the file Footer. Probably the only two compressors 
> worth supporting out of the box would be zlib (higher compression ratios) and 
> lz4 (better performance).
> What does everyone think?
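
For discussion purposes, here is a toy Python sketch of the idea of one global compression setting recorded in a footer, with each buffer compressed individually. Everything in it is an assumption made for illustration: the JSON footer, the trailing 4-byte footer length, and the names write_buffers and read_buffers are invented and are not the Arrow IPC file layout. Only zlib is shown because lz4 is not in the Python standard library.

```python
import json
import zlib


def write_buffers(buffers, compression=None):
    """Serialize buffers, then a JSON footer recording the global compression setting."""
    if compression not in (None, "zlib"):
        raise ValueError("unsupported compression: %r" % (compression,))
    body = bytearray()
    layout = []
    for buf in buffers:
        payload = zlib.compress(buf) if compression == "zlib" else bytes(buf)
        layout.append({"offset": len(body), "length": len(payload)})
        body += payload
    footer = json.dumps({"compression": compression, "buffers": layout}).encode()
    # The trailing 4 bytes hold the footer length so a reader can locate it.
    return bytes(body) + footer + len(footer).to_bytes(4, "little")


def read_buffers(data):
    """Read the footer first, then decode every buffer with the one global setting."""
    footer_len = int.from_bytes(data[-4:], "little")
    footer = json.loads(data[-4 - footer_len:-4].decode())
    compressed = footer["compression"] == "zlib"
    result = []
    for entry in footer["buffers"]:
        payload = data[entry["offset"]:entry["offset"] + entry["length"]]
        result.append(zlib.decompress(payload) if compressed else payload)
    return result
```

A single footer-level flag keeps the reader simple: it never has to decide per buffer which codec applies.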





[jira] [Assigned] (ARROW-988) [JS] Add entry to Travis CI matrix

2017-05-12 Thread Brian Hulette (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette reassigned ARROW-988:
---

Assignee: Brian Hulette

> [JS] Add entry to Travis CI matrix
> --
>
> Key: ARROW-988
> URL: https://issues.apache.org/jira/browse/ARROW-988
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>






[jira] [Created] (ARROW-1016) Python: Include C++ headers (optionally) in wheels

2017-05-12 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-1016:
--

 Summary: Python: Include C++ headers (optionally) in wheels
 Key: ARROW-1016
 URL: https://issues.apache.org/jira/browse/ARROW-1016
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn


This is not the most beautiful solution (that would be using conda :D), but it 
is a first step toward wheels that other Python packages with native code can 
build against.





[jira] [Created] (ARROW-1015) [Java] Implement schema-level metadata

2017-05-12 Thread Emilio Lahr-Vivaz (JIRA)
Emilio Lahr-Vivaz created ARROW-1015:


 Summary: [Java] Implement schema-level metadata
 Key: ARROW-1015
 URL: https://issues.apache.org/jira/browse/ARROW-1015
 Project: Apache Arrow
  Issue Type: Task
Reporter: Emilio Lahr-Vivaz
Assignee: Emilio Lahr-Vivaz
 Fix For: 0.4.0


The Arrow format already defines schema-level metadata; implement it in Java.


