Re: Efficient Pandas serialization for mixed object and numeric DataFrames
How are you serializing the DataFrame? If you use *pyarrow.serialize(df)*, each column is serialized separately, and numeric columns are handled efficiently.

On Thu, Oct 18, 2018 at 9:10 PM Mitar wrote:
> Hi!
>
> It seems that if a DataFrame contains both numeric and object columns,
> the whole DataFrame is pickled, rather than only the object columns.
> Is this right? Are there any plans to improve this?
>
> Mitar
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
Efficient Pandas serialization for mixed object and numeric DataFrames
Hi!

It seems that if a DataFrame contains both numeric and object columns, the whole DataFrame is pickled, rather than only the object columns. Is this right? Are there any plans to improve this?

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m
[jira] [Created] (ARROW-3559) Statically link libraries for plasma_store_server executable.
Robert Nishihara created ARROW-3559:
---
Summary: Statically link libraries for plasma_store_server executable.
Key: ARROW-3559
URL: https://issues.apache.org/jira/browse/ARROW-3559
Project: Apache Arrow
Issue Type: Improvement
Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara

{code:java}
cd ~
git clone https://github.com/apache/arrow
cd arrow/cpp
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DARROW_PYTHON=on -DARROW_PLASMA=on ..
make -j16
sudo make install
cd ~
cp arrow/cpp/build/release/plasma_store_server .
mv arrow arrow-temp

# Try to start the store
./plasma_store_server -s /tmp/store -m 10
{code}

The last line crashes with

{code:java}
./plasma_store_server: error while loading shared libraries: libplasma.so.12: cannot open shared object file: No such file or directory
{code}

For usability, it's important that people can copy the plasma store executable around and run it.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [VOTE] Accept donation of Ruby bindings to Parquet GLib
+1

In "[VOTE] Accept donation of Ruby bindings to Parquet GLib" on Thu, 18 Oct 2018 16:59:41 -0400, Wes McKinney wrote:

> hello,
>
> Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
> library, which was received as a donation in September. This Ruby
> library was originally developed at
>
> https://github.com/red-data-tools/red-parquet/
>
> Kou has submitted the work as a pull request
> https://github.com/apache/arrow/pull/2772
>
> This vote is to determine if the Arrow PMC is in favor of accepting
> this donation, subject to the fulfillment of the ASF IP Clearance process.
>
> [ ] +1 : Accept contribution of Ruby Parquet bindings
> [ ] 0 : No opinion
> [ ] -1 : Reject contribution because...
>
> Here is my vote: +1
>
> The vote will be open for at least 72 hours.
>
> Thanks,
> Wes
[jira] [Created] (ARROW-3558) Remove fatal error when plasma client calls get on an unsealed object that it created.
Robert Nishihara created ARROW-3558:
---
Summary: Remove fatal error when plasma client calls get on an unsealed object that it created.
Key: ARROW-3558
URL: https://issues.apache.org/jira/browse/ARROW-3558
Project: Apache Arrow
Issue Type: Improvement
Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara

In the case when Get is called with a timeout, this should simply behave as if the object hasn't been created yet.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [VOTE] Accept donation of Ruby bindings to Parquet GLib
+1

> On 18.10.2018, at 22:59, Wes McKinney wrote:
>
> hello,
>
> Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
> library, which was received as a donation in September. This Ruby
> library was originally developed at
>
> https://github.com/red-data-tools/red-parquet/
>
> Kou has submitted the work as a pull request
> https://github.com/apache/arrow/pull/2772
>
> This vote is to determine if the Arrow PMC is in favor of accepting
> this donation, subject to the fulfillment of the ASF IP Clearance process.
>
> [ ] +1 : Accept contribution of Ruby Parquet bindings
> [ ] 0 : No opinion
> [ ] -1 : Reject contribution because...
>
> Here is my vote: +1
>
> The vote will be open for at least 72 hours.
>
> Thanks,
> Wes
[VOTE] Accept donation of Ruby bindings to Parquet GLib
hello,

Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib library, which was received as a donation in September. This Ruby library was originally developed at

https://github.com/red-data-tools/red-parquet/

Kou has submitted the work as a pull request:
https://github.com/apache/arrow/pull/2772

This vote is to determine if the Arrow PMC is in favor of accepting this donation, subject to the fulfillment of the ASF IP Clearance process.

[ ] +1 : Accept contribution of Ruby Parquet bindings
[ ] 0 : No opinion
[ ] -1 : Reject contribution because...

Here is my vote: +1

The vote will be open for at least 72 hours.

Thanks,
Wes
[jira] [Created] (ARROW-3556) [CI] Disable optimizations on Windows
Antoine Pitrou created ARROW-3556:
---
Summary: [CI] Disable optimizations on Windows
Key: ARROW-3556
URL: https://issues.apache.org/jira/browse/ARROW-3556
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Continuous Integration
Reporter: Antoine Pitrou

Disabling compiler optimizations, even in release mode, should allow CI builds to become a bit faster.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3555) [Plasma] Unify plasma client get function using metadata.
Yuhong Guo created ARROW-3555:
---
Summary: [Plasma] Unify plasma client get function using metadata.
Key: ARROW-3555
URL: https://issues.apache.org/jira/browse/ARROW-3555
Project: Apache Arrow
Issue Type: New Feature
Reporter: Yuhong Guo

Sometimes it is very hard for the data consumer to know whether an object is a plain buffer or some other kind of object. If we use try-catch to catch the pyarrow deserialization exception and then fall back to `plasma_client.get_buffer`, the code is not clean. We could leverage the metadata, which is currently not used at all, to mark buffer data. In clients for other languages, this would be simple to implement.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Support for TIMESTAMP_NANOS in parquet-cpp
Hi everyone,

in parquet-format, there is now support for TIMESTAMP_NANOS: https://github.com/apache/parquet-format/pull/102

For parquet-cpp, this is not yet supported. I have a few questions:

• Is there an overview of which release of parquet-format is currently fully supported in parquet-cpp (something like a feature support matrix)?
• How quickly are new features in parquet-format adopted?

I think having a document describing how completely the spec is implemented would be very helpful for users of the parquet-cpp library.

Thanks,
Roman
[jira] [Created] (ARROW-3554) [C++] Reverse traits for C++
Wolf Vollprecht created ARROW-3554:
---
Summary: [C++] Reverse traits for C++
Key: ARROW-3554
URL: https://issues.apache.org/jira/browse/ARROW-3554
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wolf Vollprecht

This might be more of a question that I would have asked on a chat, so sorry if this is inappropriate here as an issue. I am trying to get the Arrow type from a native C++ type. I would like to use something like `arrow_type<uint8_t>::type -> UInt8Type` or `arrow_type<uint8_t>() -> shared_ptr<DataType>`. Is that implemented somewhere?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Making a bugfix 0.11.1 release
Hi Antoine,

Thanks for the quick response! This helps to clear up my confusion.

Best Regards,
Kevin Gurney

From: Antoine Pitrou
Sent: Thursday, October 18, 2018 9:54:47 AM
To: dev@arrow.apache.org
Subject: Re: Making a bugfix 0.11.1 release

On 18/10/2018 at 15:44, Kevin Gurney wrote:
> Hi All,
>
> We are working with the arrow version 0.9.0 C++ libraries in conjunction with
> separate parquet-cpp version 1.4.0.
>
> Questions:
>
> 1. Does this zlib issue affect all clients of the arrow C++ libraries or
> just the Python PyArrow code?

To be clear: this is a packaging issue and only affects the PyArrow binary wheels (i.e. if you type "pip install pyarrow"). It should not affect people who self-compile Arrow or PyArrow; and it probably doesn't affect people who download other binaries, either (such as Conda packages).

Regards
Antoine.
[jira] [Created] (ARROW-3553) [R] Error when losing data on int64, uint64 conversions to double
Wes McKinney created ARROW-3553:
---
Summary: [R] Error when losing data on int64, uint64 conversions to double
Key: ARROW-3553
URL: https://issues.apache.org/jira/browse/ARROW-3553
Project: Apache Arrow
Issue Type: New Feature
Components: R
Reporter: Wes McKinney
Fix For: 0.12.0

Conversions outside the representable range should probably warn or error. See the check we do in Python: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/helpers.cc#L350

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
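The linked Python check guards against exactly this: above 2^53, consecutive int64 values are no longer representable as doubles. A small demonstration in plain Python (where `float` is an IEEE 754 double):

```python
# 2**53 is the last point where doubles still have unit spacing.
exact = 2**53
assert float(exact) == exact              # representable exactly
assert float(exact + 1) == float(exact)   # 2**53 + 1 rounds back down

# An int64 at the top of the range loses precision outright:
big = 2**63 - 1                           # INT64_MAX
assert int(float(big)) != big             # round-trip does not recover it
```

This is the class of silent data loss the R bindings should warn or error on.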
Re: Making a bugfix 0.11.1 release
On 18/10/2018 at 15:44, Kevin Gurney wrote:
> Hi All,
>
> We are working with the arrow version 0.9.0 C++ libraries in conjunction with
> separate parquet-cpp version 1.4.0.
>
> Questions:
>
> 1. Does this zlib issue affect all clients of the arrow C++ libraries or
> just the Python PyArrow code?

To be clear: this is a packaging issue and only affects the PyArrow binary wheels (i.e. if you type "pip install pyarrow"). It should not affect people who self-compile Arrow or PyArrow; and it probably doesn't affect people who download other binaries, either (such as Conda packages).

Regards
Antoine.
[jira] [Created] (ARROW-3552) [Python] Implement pa.RecordBatch.serialize_to to write single message to an OutputStream
Wes McKinney created ARROW-3552:
---
Summary: [Python] Implement pa.RecordBatch.serialize_to to write single message to an OutputStream
Key: ARROW-3552
URL: https://issues.apache.org/jira/browse/ARROW-3552
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Reporter: Wes McKinney

{{RecordBatch.serialize}} writes to an in-memory buffer; writing a single message directly to an OutputStream would help with shared memory workflows. See also pyarrow.ipc.write_tensor.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Making a bugfix 0.11.1 release
Hi All,

We are working with the arrow version 0.9.0 C++ libraries in conjunction with separate parquet-cpp version 1.4.0.

Questions:

1. Does this zlib issue affect all clients of the arrow C++ libraries or just the Python PyArrow code?
2. Does this zlib compression issue also affect the arrow version 0.9.0 C++ libraries (before parquet-cpp was merged in), or only the latest arrow version 0.11.0 C++ libraries (with parquet-cpp merged in)?

Best Regards,
Kevin Gurney

From: Krisztián Szűcs
Sent: Thursday, October 18, 2018 5:31:01 AM
To: dev@arrow.apache.org
Subject: Re: Making a bugfix 0.11.1 release

I've added the two zlib issues to the 0.11.1 version: https://issues.apache.org/jira/projects/ARROW/versions/12344316

On Wed, Oct 17, 2018 at 10:51 PM Wes McKinney wrote:
> Got it, thank you for clarifying. It wasn't clear whether the bug
> would occur in the build environment (CentOS 5 + devtoolset-2) as well
> as other Linux environments.
>
> On Wed, Oct 17, 2018 at 4:16 PM Antoine Pitrou wrote:
> >
> > On 17/10/2018 at 20:38, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Since the Python wheels are being installed 10,000 times per day or
> > > more, I don't think we should allow them to be broken for much longer.
> > >
> > > What additional patches need to be done before an RC can be cut? Since
> > > I'm concerned about the broken packages undermining the project's
> > > reputation, I can adjust my priorities to start a release vote later
> > > today or first thing tomorrow morning. Seems like
> > > https://issues.apache.org/jira/browse/ARROW-3535 might be the last
> > > item, and I can prepare a maintenance branch with the cherry-picked
> > > fixes.
> > >
> > > Was there a determination as to why our CI systems did not catch the
> > > blocker ARROW-3514?
> >
> > Because it was not exercised by the test suite. My take is that the bug
> > would only happen with specific data, e.g. tiny and/or entirely
> > incompressible. I don't think general gzip compression of Parquet files
> > was broken.
> >
> > Regards
> >
> > Antoine.
[jira] [Created] (ARROW-3551) Change MapD to OmniSci on Powered By page
Todd Mostak created ARROW-3551:
---
Summary: Change MapD to OmniSci on Powered By page
Key: ARROW-3551
URL: https://issues.apache.org/jira/browse/ARROW-3551
Project: Apache Arrow
Issue Type: Improvement
Reporter: Todd Mostak

MapD recently changed its name to OmniSci. We should update the Powered By page to reflect this.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3550) [C++] Use kUnknownNullCount in NumericArray constructor
Wolf Vollprecht created ARROW-3550:
---
Summary: [C++] Use kUnknownNullCount in NumericArray constructor
Key: ARROW-3550
URL: https://issues.apache.org/jira/browse/ARROW-3550
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wolf Vollprecht

Currently, the default value for null_count in the NumericArray constructor is 0. I wonder whether it would be better to use kUnknownNullCount instead? A user could still choose to supply a null_count of 0, or a nullptr as the validity bitmap, which would imply a null_count of 0 as well.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
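The trade-off behind kUnknownNullCount is lazy computation: an unknown count is resolved from the validity bitmap on first access and then cached. A hypothetical Python sketch of that semantics (class and sentinel names are stand-ins, not the real C++ API):

```python
UNKNOWN = -1  # stand-in for kUnknownNullCount


class NumericArray:
    """Toy model of an Arrow array with a lazily computed null count."""

    def __init__(self, values, validity_bitmap=None, null_count=UNKNOWN):
        self.values = values
        self.validity = validity_bitmap  # list of bools, or None
        # No bitmap implies no nulls, regardless of the default argument:
        self._null_count = 0 if validity_bitmap is None else null_count

    @property
    def null_count(self):
        if self._null_count == UNKNOWN:
            # Compute once from the validity bitmap, then cache.
            self._null_count = self.validity.count(False)
        return self._null_count


arr = NumericArray([1, 2, 3], validity_bitmap=[True, False, True])
```

With this default, a caller who does not know the count pays for it only when the count is actually requested, while passing an explicit 0 still skips the scan entirely.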
Re: Making a bugfix 0.11.1 release
I've added the two zlib issues to the 0.11.1 version: https://issues.apache.org/jira/projects/ARROW/versions/12344316

On Wed, Oct 17, 2018 at 10:51 PM Wes McKinney wrote:
> Got it, thank you for clarifying. It wasn't clear whether the bug
> would occur in the build environment (CentOS 5 + devtoolset-2) as well
> as other Linux environments.
>
> On Wed, Oct 17, 2018 at 4:16 PM Antoine Pitrou wrote:
> >
> > On 17/10/2018 at 20:38, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Since the Python wheels are being installed 10,000 times per day or
> > > more, I don't think we should allow them to be broken for much longer.
> > >
> > > What additional patches need to be done before an RC can be cut? Since
> > > I'm concerned about the broken packages undermining the project's
> > > reputation, I can adjust my priorities to start a release vote later
> > > today or first thing tomorrow morning. Seems like
> > > https://issues.apache.org/jira/browse/ARROW-3535 might be the last
> > > item, and I can prepare a maintenance branch with the cherry-picked
> > > fixes.
> > >
> > > Was there a determination as to why our CI systems did not catch the
> > > blocker ARROW-3514?
> >
> > Because it was not exercised by the test suite. My take is that the bug
> > would only happen with specific data, e.g. tiny and/or entirely
> > incompressible. I don't think general gzip compression of Parquet files
> > was broken.
> >
> > Regards
> >
> > Antoine.