[jira] [Assigned] (ARROW-714) [C++] Add import_pyarrow C API in the style of NumPy for thirdparty C++ users
[ https://issues.apache.org/jira/browse/ARROW-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-714:
----------------------------------

    Assignee: Wes McKinney

> [C++] Add import_pyarrow C API in the style of NumPy for thirdparty C++ users
> -----------------------------------------------------------------------------
>
>         Key: ARROW-714
>         URL: https://issues.apache.org/jira/browse/ARROW-714
>     Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>    Reporter: Wes McKinney
>    Assignee: Wes McKinney
>     Fix For: 0.4.0
>
> See the implementation of import_array in NumPy for this purpose:
> https://github.com/numpy/numpy/blob/c90d7c94fd2077d0beca48fa89a423da2b0bb663/numpy/core/code_generators/generate_numpy_api.py#L46

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Updated] (ARROW-1016) Python: Include C++ headers (optionally) in wheels
[ https://issues.apache.org/jira/browse/ARROW-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1016:
--------------------------------

    Fix Version/s: 0.4.0

> Python: Include C++ headers (optionally) in wheels
> --------------------------------------------------
>
>         Key: ARROW-1016
>         URL: https://issues.apache.org/jira/browse/ARROW-1016
>     Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>    Reporter: Uwe L. Korn
>    Assignee: Uwe L. Korn
>     Fix For: 0.4.0
>
> This is not the most beautiful solution (that would be using conda :D), but it
> is a first step toward wheels that can be used to build other Python packages
> with native code against Arrow.
[jira] [Resolved] (ARROW-1016) Python: Include C++ headers (optionally) in wheels
[ https://issues.apache.org/jira/browse/ARROW-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1016.
---------------------------------

    Resolution: Fixed

Issue resolved by pull request 678
[https://github.com/apache/arrow/pull/678]

> Python: Include C++ headers (optionally) in wheels
> --------------------------------------------------
>
>         Key: ARROW-1016
>         URL: https://issues.apache.org/jira/browse/ARROW-1016
>     Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>    Reporter: Uwe L. Korn
>    Assignee: Uwe L. Korn
>
> This is not the most beautiful solution (that would be using conda :D), but it
> is a first step toward wheels that can be used to build other Python packages
> with native code against Arrow.
[jira] [Updated] (ARROW-1008) [C++] Define abstract interface for stream iteration
[ https://issues.apache.org/jira/browse/ARROW-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1008:
--------------------------------

    Fix Version/s: 0.4.0

> [C++] Define abstract interface for stream iteration
> ----------------------------------------------------
>
>         Key: ARROW-1008
>         URL: https://issues.apache.org/jira/browse/ARROW-1008
>     Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>    Reporter: Wes McKinney
>    Assignee: Wes McKinney
>     Fix For: 0.4.0
>
> The purpose of this JIRA is to decouple the physical structure of the stream
> from the StreamReader API. If we wanted to put the stream components into
> different physical files on disk, this would permit the construction of a
> different kind of StreamIterator that knows how to read the respective stream
> components from those files.
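The decoupling described in ARROW-1008 can be sketched as an abstract message-source interface. The names below (`MessageSource`, `InMemorySource`, `read_all`) are illustrative Python stand-ins, not the actual Arrow C++ API; the point is only that a reader written against the interface is independent of the physical layout:

```python
from abc import ABC, abstractmethod


class MessageSource(ABC):
    """Abstract source of stream components, independent of physical layout.

    Concrete implementations could draw components from a single file,
    from one file per component, or from in-memory buffers.
    """

    @abstractmethod
    def next_message(self):
        """Return the next stream component, or None when exhausted."""


class InMemorySource(MessageSource):
    """One possible physical layout: components held in memory."""

    def __init__(self, messages):
        self._messages = iter(messages)

    def next_message(self):
        return next(self._messages, None)


def read_all(source):
    # A reader written against MessageSource never inspects the physical
    # layout; swapping in a file-per-component source requires no changes here.
    out = []
    while True:
        msg = source.next_message()
        if msg is None:
            return out
        out.append(msg)
```

A reader built this way could consume `read_all(InMemorySource(["schema", "batch0", "batch1"]))` or any other `MessageSource` implementation without modification.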
[jira] [Assigned] (ARROW-1008) [C++] Define abstract interface for stream iteration
[ https://issues.apache.org/jira/browse/ARROW-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-1008:
-----------------------------------

    Assignee: Wes McKinney

> [C++] Define abstract interface for stream iteration
> ----------------------------------------------------
>
>         Key: ARROW-1008
>         URL: https://issues.apache.org/jira/browse/ARROW-1008
>     Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>    Reporter: Wes McKinney
>    Assignee: Wes McKinney
>     Fix For: 0.4.0
>
> The purpose of this JIRA is to decouple the physical structure of the stream
> from the StreamReader API. If we wanted to put the stream components into
> different physical files on disk, this would permit the construction of a
> different kind of StreamIterator that knows how to read the respective stream
> components from those files.
[jira] [Resolved] (ARROW-1010) [Website] Only show English posts in /blog/
[ https://issues.apache.org/jira/browse/ARROW-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-1010.
---------------------------------

    Resolution: Fixed

Issue resolved by pull request 675
[https://github.com/apache/arrow/pull/675]

> [Website] Only show English posts in /blog/
> -------------------------------------------
>
>         Key: ARROW-1010
>         URL: https://issues.apache.org/jira/browse/ARROW-1010
>     Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>    Reporter: Wes McKinney
>    Assignee: Wes McKinney
>
> Translated blog posts can link to each other, but I think each blog post
> should only appear once in the blogroll.
[jira] [Created] (ARROW-1019) [C++] Implement input stream and output stream with Gzip codec
Wes McKinney created ARROW-1019:
-------------------------------

     Summary: [C++] Implement input stream and output stream with Gzip codec
         Key: ARROW-1019
         URL: https://issues.apache.org/jira/browse/ARROW-1019
     Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
    Reporter: Wes McKinney

After incorporating the compression code and toolchain from parquet-cpp, we
should be able to add a codec layer for on-the-fly compression and
decompression.
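The codec-layer idea in ARROW-1019 — a compressing wrapper around an ordinary output stream, and a matching decompressing wrapper around an input stream — can be illustrated with Python's standard library as an analogy (this is not the proposed C++ API, just the same layering):

```python
import gzip
import io

payload = b"hello arrow" * 100

# Write through a gzip codec layer wrapped around an in-memory sink;
# the underlying sink only ever sees compressed bytes.
sink = io.BytesIO()
with gzip.GzipFile(fileobj=sink, mode="wb") as compressed_out:
    compressed_out.write(payload)

# Read back through the matching decompressing wrapper around a source
# stream, recovering the original bytes on the fly.
source = io.BytesIO(sink.getvalue())
with gzip.GzipFile(fileobj=source, mode="rb") as compressed_in:
    restored = compressed_in.read()
```

The C++ layer would presumably compose the same way: any `OutputStream` or `InputStream` could be wrapped by a codec stream without the wrapped stream knowing about compression.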
[jira] [Created] (ARROW-1018) [C++] Add option to create FileOutputStream, ReadableFile from OS file descriptor
Wes McKinney created ARROW-1018:
-------------------------------

     Summary: [C++] Add option to create FileOutputStream, ReadableFile from OS file descriptor
         Key: ARROW-1018
         URL: https://issues.apache.org/jira/browse/ARROW-1018
     Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
    Reporter: Wes McKinney

Currently we require a file path. It should also be possible to initialize
from a file descriptor.
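The path-versus-descriptor distinction in ARROW-1018 mirrors what Python's `os.fdopen` already provides: a stream object constructed directly from an OS file descriptor rather than by reopening a path. A minimal sketch (the temp file is only a convenient way to obtain a descriptor; in practice it could be a pipe or a duplicated handle):

```python
import os
import tempfile

# Obtain a raw OS file descriptor; mkstemp returns (fd, path).
fd, path = tempfile.mkstemp()

# Initialize the output stream from the descriptor itself, not the path.
with os.fdopen(fd, "wb") as out:
    out.write(b"written via fd")

# Reading back by path confirms both routes reach the same file.
with open(path, "rb") as f:
    contents = f.read()

os.remove(path)
```

A descriptor-based constructor is useful precisely where no path exists, e.g. anonymous pipes or descriptors inherited from a parent process.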
[jira] [Commented] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory
[ https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008513#comment-16008513 ]

Wes McKinney commented on ARROW-1017:
-------------------------------------

[~xhochy] we should try to investigate this before approving a parquet-cpp
1.1.0 release, in case there is a memory leak in libparquet.

> Python: Calling to_pandas on a Parquet file in HDFS leaks memory
> ----------------------------------------------------------------
>
>             Key: ARROW-1017
>             URL: https://issues.apache.org/jira/browse/ARROW-1017
>         Project: Apache Arrow
>      Issue Type: Bug
>      Components: Python
> Affects Versions: 0.3.0
>        Reporter: James Porritt
>         Fix For: 0.4.0
>
> Running the following code results in ever-increasing memory usage, even
> though I would expect the dataframe to be garbage collected when it goes out
> of scope. For the size of my parquet file, I see the usage increasing by about
> 3GB per loop:
> {code}
> from pyarrow import HdfsClient
>
> def read_parquet_file(client, parquet_file):
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
>
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
>
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?
[jira] [Commented] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory
[ https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008481#comment-16008481 ]

Wes McKinney commented on ARROW-1017:
-------------------------------------

Thanks [~jporritt], I will take a look and see if I can reproduce the issue.

> Python: Calling to_pandas on a Parquet file in HDFS leaks memory
> ----------------------------------------------------------------
>
>             Key: ARROW-1017
>             URL: https://issues.apache.org/jira/browse/ARROW-1017
>         Project: Apache Arrow
>      Issue Type: Bug
>      Components: Python
> Affects Versions: 0.3.0
>        Reporter: James Porritt
>         Fix For: 0.4.0
>
> Running the following code results in ever-increasing memory usage, even
> though I would expect the dataframe to be garbage collected when it goes out
> of scope. For the size of my parquet file, I see the usage increasing by about
> 3GB per loop:
> {code}
> from pyarrow import HdfsClient
>
> def read_parquet_file(client, parquet_file):
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
>
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
>
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?
[jira] [Updated] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory
[ https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1017:
--------------------------------

    Fix Version/s: 0.4.0

> Python: Calling to_pandas on a Parquet file in HDFS leaks memory
> ----------------------------------------------------------------
>
>             Key: ARROW-1017
>             URL: https://issues.apache.org/jira/browse/ARROW-1017
>         Project: Apache Arrow
>      Issue Type: Bug
>      Components: Python
> Affects Versions: 0.3.0
>        Reporter: James Porritt
>         Fix For: 0.4.0
>
> Running the following code results in ever-increasing memory usage, even
> though I would expect the dataframe to be garbage collected when it goes out
> of scope. For the size of my parquet file, I see the usage increasing by about
> 3GB per loop:
> {code}
> from pyarrow import HdfsClient
>
> def read_parquet_file(client, parquet_file):
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
>
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
>
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?
[jira] [Created] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory
James Porritt created ARROW-1017:
--------------------------------

     Summary: Python: Calling to_pandas on a Parquet file in HDFS leaks memory
         Key: ARROW-1017
         URL: https://issues.apache.org/jira/browse/ARROW-1017
     Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.3.0
    Reporter: James Porritt

Running the following code results in ever-increasing memory usage, even
though I would expect the dataframe to be garbage collected when it goes out
of scope. For the size of my parquet file, I see the usage increasing by about
3GB per loop:

{code}
from pyarrow import HdfsClient

def read_parquet_file(client, parquet_file):
    parquet = client.read_parquet(parquet_file)
    df = parquet.to_pandas()

client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
parquet_file = '/my/parquet/file'

while True:
    read_parquet_file(client, parquet_file)
{code}

Is there a reference count issue similar to ARROW-362?
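A leak of this shape can be checked without an HDFS cluster by measuring whether per-iteration allocations are reclaimed. The sketch below uses `tracemalloc` and a plain allocation as a stand-in for the `to_pandas()` call (the real reproduction needs HDFS and a large Parquet file, so the pyarrow calls are deliberately not reproduced here):

```python
import gc
import tracemalloc


def read_and_discard():
    # Stand-in for read_parquet(...).to_pandas(): allocate ~1 MB and
    # drop the reference, as the repro above does with the dataframe.
    return bytes(1_000_000)


tracemalloc.start()
for _ in range(50):
    read_and_discard()
gc.collect()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# In the non-leaking case, current usage stays near zero after the loop
# even though peak reflects the transient 1 MB allocations; a leak would
# show current growing roughly linearly with the iteration count.
```

Replacing the stand-in with the actual pyarrow calls and watching `current` grow per iteration would confirm whether references are being retained on the C++ side, as in ARROW-362.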
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008315#comment-16008315 ]

Kazuaki Ishizaki commented on ARROW-300:
----------------------------------------

Thank you for your response. I was also busy preparing materials for GTC. Now
is a good time to make a document. It sounds good to prepare a Google document
for collecting public comments. I will start creating a document covering
purpose, scope, and design.

> [Format] Add buffer compression option to IPC file format
> ---------------------------------------------------------
>
>         Key: ARROW-300
>         URL: https://issues.apache.org/jira/browse/ARROW-300
>     Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>    Reporter: Wes McKinney
>
> It may be useful, if data is to be sent over the wire, to compress the data
> buffers themselves as they are being written in the file layout.
> I would propose that we keep this extremely simple, with a global buffer
> compression setting in the file Footer. Probably the only two compressors
> worth supporting out of the box would be zlib (higher compression ratios)
> and lz4 (better performance).
> What does everyone think?
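The per-buffer compression proposed in ARROW-300 amounts to a codec round-trip applied to each data buffer before it is written. A small sketch using Python's stdlib zlib (lz4, the other candidate, is a third-party package and is not shown; the example buffer is hypothetical, chosen to resemble run-heavy columnar data such as validity bitmaps):

```python
import zlib

# A run-heavy buffer, similar in character to many columnar data buffers.
buf = b"\x00" * 4096 + bytes(range(256)) * 16

# zlib trades speed for higher compression ratios; lz4 would make the
# opposite trade-off, which is why the proposal mentions both.
compressed = zlib.compress(buf, 6)
restored = zlib.decompress(compressed)

ratio = len(buf) / len(compressed)
```

With a single global setting in the file Footer, a reader would apply the matching `decompress` to every buffer on load, keeping the format change minimal.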
[jira] [Assigned] (ARROW-988) [JS] Add entry to Travis CI matrix
[ https://issues.apache.org/jira/browse/ARROW-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Hulette reassigned ARROW-988:
-----------------------------------

    Assignee: Brian Hulette

> [JS] Add entry to Travis CI matrix
> ----------------------------------
>
>         Key: ARROW-988
>         URL: https://issues.apache.org/jira/browse/ARROW-988
>     Project: Apache Arrow
>  Issue Type: Improvement
>    Reporter: Brian Hulette
>    Assignee: Brian Hulette
[jira] [Created] (ARROW-1016) Python: Include C++ headers (optionally) in wheels
Uwe L. Korn created ARROW-1016:
-------------------------------

     Summary: Python: Include C++ headers (optionally) in wheels
         Key: ARROW-1016
         URL: https://issues.apache.org/jira/browse/ARROW-1016
     Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
    Reporter: Uwe L. Korn
    Assignee: Uwe L. Korn

This is not the most beautiful solution (that would be using conda :D), but it
is a first step toward wheels that can be used to build other Python packages
with native code against Arrow.
[jira] [Created] (ARROW-1015) [Java] Implement schema-level metadata
Emilio Lahr-Vivaz created ARROW-1015:
------------------------------------

     Summary: [Java] Implement schema-level metadata
         Key: ARROW-1015
         URL: https://issues.apache.org/jira/browse/ARROW-1015
     Project: Apache Arrow
  Issue Type: Task
    Reporter: Emilio Lahr-Vivaz
    Assignee: Emilio Lahr-Vivaz
     Fix For: 0.4.0

Schema already defines metadata in the Arrow format; implement it in Java.