Re: Efficient Pandas serialization for mixed object and numeric DataFrames

2018-10-18 Thread Robert Nishihara
How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
then each column should be serialized separately and numeric columns will
be handled efficiently.
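
For reference, a minimal sketch of that approach (the column names and values
below are made up for illustration):

import pandas as pd
import pyarrow as pa

# A mixed DataFrame: two numeric columns plus one object (string) column.
df = pd.DataFrame({
    "ints": [1, 2, 3],
    "floats": [0.1, 0.2, 0.3],
    "strings": ["a", "b", "c"],
})

# pyarrow.serialize should handle the columns individually, so the numeric
# columns avoid the pickle path; only the object column falls back to pickle.
buf = pa.serialize(df).to_buffer()
df_roundtrip = pa.deserialize(buf)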

On Thu, Oct 18, 2018 at 9:10 PM Mitar  wrote:

> Hi!
>
> It seems that if a DataFrame contains both numeric and object columns,
> the whole DataFrame is pickled, rather than only the object columns being
> pickled. Is this right? Are there any plans to improve this?
>
>
> Mitar
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
>


Efficient Pandas serialization for mixed object and numeric DataFrames

2018-10-18 Thread Mitar
Hi!

It seems that if a DataFrame contains both numeric and object columns,
the whole DataFrame is pickled, rather than only the object columns being
pickled. Is this right? Are there any plans to improve this?


Mitar

-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m


[jira] [Created] (ARROW-3559) Statically link libraries for plasma_store_server executable.

2018-10-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3559:
---

 Summary: Statically link libraries for plasma_store_server 
executable.
 Key: ARROW-3559
 URL: https://issues.apache.org/jira/browse/ARROW-3559
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


{code:java}
cd ~
git clone https://github.com/apache/arrow
cd arrow/cpp
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DARROW_PYTHON=on -DARROW_PLASMA=on ..
make -j16
sudo make install

cd ~
cp arrow/cpp/build/release/plasma_store_server .
mv arrow arrow-temp

# Try to start the store
./plasma_store_server -s /tmp/store -m 10
{code}
The last line crashes with
{code:java}
./plasma_store_server: error while loading shared libraries: libplasma.so.12: cannot open shared object file: No such file or directory
{code}
For usability, it's important that people can copy around the plasma store 
executable and run it.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Accept donation of Ruby bindings to Parquet GLib

2018-10-18 Thread Kouhei Sutou
+1

In "[VOTE] Accept donation of Ruby bindings to Parquet GLib" on Thu, 18 Oct 2018 16:59:41 -0400,
  Wes McKinney wrote:

> hello,
> 
> Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
> library, which was received as a donation in September. This Ruby
> library was originally developed at
> 
> https://github.com/red-data-tools/red-parquet/
> 
> Kou has submitted the work as a pull request
> https://github.com/apache/arrow/pull/2772
> 
> This vote is to determine if the Arrow PMC is in favor of accepting
> this donation, subject to the fulfillment of the ASF IP Clearance process.
> 
> [ ] +1 : Accept contribution of Ruby Parquet bindings
> [ ]  0 : No opinion
> [ ] -1 : Reject contribution because...
> 
> Here is my vote: +1
> 
> The vote will be open for at least 72 hours.
> 
> Thanks,
> Wes


[jira] [Created] (ARROW-3558) Remove fatal error when plasma client calls get on an unsealed object that it created.

2018-10-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3558:
---

 Summary: Remove fatal error when plasma client calls get on an 
unsealed object that it created.
 Key: ARROW-3558
 URL: https://issues.apache.org/jira/browse/ARROW-3558
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


When Get is called with a timeout, it should simply behave as if the object 
hasn't been created yet.
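
A rough Python sketch of the scenario (the socket path and object ID are 
placeholders, and the plasma.connect arguments may differ between pyarrow 
versions):
{code:python}
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/store")    # placeholder socket path
object_id = plasma.ObjectID(20 * b"a")   # placeholder object ID

client.create(object_id, 100)            # the object now exists but is unsealed

# Today this can trigger a fatal error in the client; with a timeout it
# should instead return as if the object had not been created yet.
result = client.get([object_id], timeout_ms=100)
{code}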



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Accept donation of Ruby bindings to Parquet GLib

2018-10-18 Thread Uwe L. Korn
+1 

> On 18.10.2018 at 22:59, Wes McKinney wrote:
> 
> hello,
> 
> Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
> library, which was received as a donation in September. This Ruby
> library was originally developed at
> 
> https://github.com/red-data-tools/red-parquet/
> 
> Kou has submitted the work as a pull request
> https://github.com/apache/arrow/pull/2772
> 
> This vote is to determine if the Arrow PMC is in favor of accepting
> this donation, subject to the fulfillment of the ASF IP Clearance process.
> 
>[ ] +1 : Accept contribution of Ruby Parquet bindings
>[ ]  0 : No opinion
>[ ] -1 : Reject contribution because...
> 
> Here is my vote: +1
> 
> The vote will be open for at least 72 hours.
> 
> Thanks,
> Wes


[VOTE] Accept donation of Ruby bindings to Parquet GLib

2018-10-18 Thread Wes McKinney
hello,

Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
library, which was received as a donation in September. This Ruby
library was originally developed at

https://github.com/red-data-tools/red-parquet/

Kou has submitted the work as a pull request
https://github.com/apache/arrow/pull/2772

This vote is to determine if the Arrow PMC is in favor of accepting
this donation, subject to the fulfillment of the ASF IP Clearance process.

[ ] +1 : Accept contribution of Ruby Parquet bindings
[ ]  0 : No opinion
[ ] -1 : Reject contribution because...

Here is my vote: +1

The vote will be open for at least 72 hours.

Thanks,
Wes


[jira] [Created] (ARROW-3556) [CI] Disable optimizations on Windows

2018-10-18 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3556:
-

 Summary: [CI] Disable optimizations on Windows
 Key: ARROW-3556
 URL: https://issues.apache.org/jira/browse/ARROW-3556
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


Disabling compiler optimizations, even in release mode, should make the builds 
a bit faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3555) [Plasma] Unify plasma client get function using metadata.

2018-10-18 Thread Yuhong Guo (JIRA)
Yuhong Guo created ARROW-3555:
-

 Summary: [Plasma] Unify plasma client get function using metadata.
 Key: ARROW-3555
 URL: https://issues.apache.org/jira/browse/ARROW-3555
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Yuhong Guo


Sometimes it is very hard for the data consumer to know whether an object is a 
raw buffer or some other kind of object. If we use try-catch to catch the 
pyarrow deserialization exception and then fall back to 
`plasma_client.get_buffer`, the code is not clean.
We could instead leverage the metadata, which is currently unused, to mark 
buffer data. In clients for other languages, this would be simple to implement.
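
A minimal sketch of the awkward fallback pattern described above (the socket 
path and object ID are placeholders, get_buffers is used here for the 
raw-buffer path, and the exact exception type depends on the pyarrow version):
{code:python}
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/store")      # placeholder socket path
object_id = plasma.ObjectID(20 * b"0")     # placeholder object ID

try:
    # First assume the object is a serialized Python object ...
    value = client.get([object_id])[0]
except Exception:
    # ... and if deserialization fails, fall back to the raw buffer.
    [value] = client.get_buffers([object_id])
{code}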



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Support for TIMESTAMP_NANOS in parquet-cpp

2018-10-18 Thread Roman Karlstetter
Hi everyone,
in parquet-format, there is now support for TIMESTAMP_NANOS: 
https://github.com/apache/parquet-format/pull/102 
For parquet-cpp, this is not yet supported. I have a few questions now:
• is there an overview of which release of parquet-format is currently fully 
supported in parquet-cpp (something like a feature support matrix)?
• how fast are new features in parquet-format adopted?
I think having a document describing the current completeness of implementation 
of the spec would be very helpful for users of the parquet-cpp library.
Thanks,
Roman




[jira] [Created] (ARROW-3554) [C++] Reverse traits for C++

2018-10-18 Thread Wolf Vollprecht (JIRA)
Wolf Vollprecht created ARROW-3554:
--

 Summary: [C++] Reverse traits for C++
 Key: ARROW-3554
 URL: https://issues.apache.org/jira/browse/ARROW-3554
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wolf Vollprecht


This is more of a question that I would have asked in a chat, so sorry if it is 
inappropriate here as an issue.

 

I am trying to get the Arrow type from a native C++ type. 

I would like to use something like

 

`arrow_type::type -> UInt8Type` or `arrow_type() -> shared_ptr`

 

Is that implemented somewhere?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Making a bugfix 0.11.1 release

2018-10-18 Thread Kevin Gurney
Hi Antoine,


Thanks for the quick response!


This helps to clear up my confusion.


Best Regards,


Kevin Gurney


From: Antoine Pitrou 
Sent: Thursday, October 18, 2018 9:54:47 AM
To: dev@arrow.apache.org
Subject: Re: Making a bugfix 0.11.1 release


On 18/10/2018 at 15:44, Kevin Gurney wrote:
> Hi All,
>
> We are working with the arrow version 0.9.0 C++ libraries in conjunction with 
> separate parquet-cpp version 1.4.0.
>
> Questions:
>
>   1.  Does this zlib issue affect all clients of the arrow C++ libraries or 
> just the Python PyArrow code?

To be clear: this is a packaging issue and only affects the PyArrow
binary wheels (i.e. if you type "pip install pyarrow").  It should not
affect people who self-compile Arrow or PyArrow; and it probably doesn't
affect people who download other binaries, either (such as Conda packages).

Regards

Antoine.



[jira] [Created] (ARROW-3553) [R] Error when losing data on int64, uint64 conversions to double

2018-10-18 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3553:
---

 Summary: [R] Error when losing data on int64, uint64 conversions 
to double
 Key: ARROW-3553
 URL: https://issues.apache.org/jira/browse/ARROW-3553
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney
 Fix For: 0.12.0


Conversions outside the representable range should probably warn or error. See 
the check we do for Python:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/helpers.cc#L350
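
For context, a small Python illustration of the data loss at stake (doubles can 
only represent integers exactly up to 2**53):
{code:python}
# Integers above 2**53 cannot all be represented exactly as doubles,
# so a silent int64 -> double conversion can change the value.
big = 2**53 + 1
print(float(big) == big)   # False: the value was rounded during conversion
{code}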



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Making a bugfix 0.11.1 release

2018-10-18 Thread Antoine Pitrou


On 18/10/2018 at 15:44, Kevin Gurney wrote:
> Hi All,
> 
> We are working with the arrow version 0.9.0 C++ libraries in conjunction with 
> separate parquet-cpp version 1.4.0.
> 
> Questions:
> 
>   1.  Does this zlib issue affect all clients of the arrow C++ libraries or 
> just the Python PyArrow code?

To be clear: this is a packaging issue and only affects the PyArrow
binary wheels (i.e. if you type "pip install pyarrow").  It should not
affect people who self-compile Arrow or PyArrow; and it probably doesn't
affect people who download other binaries, either (such as Conda packages).

Regards

Antoine.


[jira] [Created] (ARROW-3552) [Python] Implement pa.RecordBatch.serialize_to to write single message to an OutputStream

2018-10-18 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3552:
---

 Summary: [Python] Implement pa.RecordBatch.serialize_to to write 
single message to an OutputStream
 Key: ARROW-3552
 URL: https://issues.apache.org/jira/browse/ARROW-3552
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


{{RecordBatch.serialize}} writes the message into memory; a {{serialize_to}} 
method would help with shared-memory workflows. See also pyarrow.ipc.write_tensor
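
For illustration, a sketch of the two-step workaround that {{serialize_to}} 
would replace (the final write target here is just a BufferOutputStream 
stand-in):
{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])

# Today the IPC message is first materialized as an in-memory Buffer ...
buf = batch.serialize()

# ... and then copied into the target stream by hand. The proposed
# serialize_to would write directly to the OutputStream instead.
sink = pa.BufferOutputStream()
sink.write(buf)
result = sink.getvalue()
{code}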



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Making a bugfix 0.11.1 release

2018-10-18 Thread Kevin Gurney
Hi All,

We are working with the arrow version 0.9.0 C++ libraries in conjunction with 
separate parquet-cpp version 1.4.0.

Questions:

  1.  Does this zlib issue affect all clients of the arrow C++ libraries or 
just the Python PyArrow code?
  2.  Does this zlib compression issue also affect the arrow version 0.9.0 C++ 
libraries (before parquet-cpp was merged in), or only the latest arrow version 
0.11.0 C++ libraries (with parquet-cpp merged in)?

Best Regards,

Kevin Gurney


From: Krisztián Szűcs 
Sent: Thursday, October 18, 2018 5:31:01 AM
To: dev@arrow.apache.org
Subject: Re: Making a bugfix 0.11.1 release

I've added the two zlib issues to the 0.11.1 version:
https://issues.apache.org/jira/projects/ARROW/versions/12344316

On Wed, Oct 17, 2018 at 10:51 PM Wes McKinney  wrote:

> Got it, thank you for clarifying. It wasn't clear whether the bug
> would occur in the build environment (CentOS 5 + devtoolset-2) as well
> as other Linux environments.
> On Wed, Oct 17, 2018 at 4:16 PM Antoine Pitrou  wrote:
> >
> >
> > On 17/10/2018 at 20:38, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Since the Python wheels are being installed 10,000 times per day or
> > > more, I don't think we should allow them to be broken for much longer.
> > >
> > > What additional patches need to be done before an RC can be cut? Since
> > > I'm concerned about the broken patches undermining the project's
> > > reputation, I can adjust my priorities to start a release vote later
> > > today or first thing tomorrow morning. Seems like
> > > https://issues.apache.org/jira/browse/ARROW-3535 might be the last
> > > item, and I can prepare a maintenance branch with the cherry-picked
> > > fixes
> > >
> > > Was there a determination as to why our CI systems did not catch the
> > > blocker ARROW-3514?
> >
> > Because it was not exercised by the test suite.  My take is that the bug
> > would only happen with specific data, e.g. tiny and/or entirely
> > incompressible.  I don't think general gzip compression of Parquet files
> > was broken.
> >
> > Regards
> >
> > Antoine.
>


[jira] [Created] (ARROW-3551) Change MapD to OmniSci on Powered By page

2018-10-18 Thread Todd Mostak (JIRA)
Todd Mostak created ARROW-3551:
--

 Summary: Change MapD to OmniSci on Powered By page
 Key: ARROW-3551
 URL: https://issues.apache.org/jira/browse/ARROW-3551
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Todd Mostak


MapD recently changed its name to OmniSci. We should update the Powered By page 
to reflect this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3550) [C++] Use kUnknownNullCount in NumericArray constructor

2018-10-18 Thread Wolf Vollprecht (JIRA)
Wolf Vollprecht created ARROW-3550:
--

 Summary: [C++] Use kUnknownNullCount in NumericArray constructor
 Key: ARROW-3550
 URL: https://issues.apache.org/jira/browse/ARROW-3550
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wolf Vollprecht


Currently, the default value in the NumericArray constructor for the null_count 
is 0.

I wonder whether it would be better to use kUnknownNullCount instead? A user 
could still choose to supply a null_count of 0, or a nullptr as bitmask, which 
would imply a null_count of 0 as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Making a bugfix 0.11.1 release

2018-10-18 Thread Krisztián Szűcs
I've added the two zlib issues to the 0.11.1 version:
https://issues.apache.org/jira/projects/ARROW/versions/12344316

On Wed, Oct 17, 2018 at 10:51 PM Wes McKinney  wrote:

> Got it, thank you for clarifying. It wasn't clear whether the bug
> would occur in the build environment (CentOS 5 + devtoolset-2) as well
> as other Linux environments.
> On Wed, Oct 17, 2018 at 4:16 PM Antoine Pitrou  wrote:
> >
> >
> > On 17/10/2018 at 20:38, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Since the Python wheels are being installed 10,000 times per day or
> > > more, I don't think we should allow them to be broken for much longer.
> > >
> > > What additional patches need to be done before an RC can be cut? Since
> > > I'm concerned about the broken patches undermining the project's
> > > reputation, I can adjust my priorities to start a release vote later
> > > today or first thing tomorrow morning. Seems like
> > > https://issues.apache.org/jira/browse/ARROW-3535 might be the last
> > > item, and I can prepare a maintenance branch with the cherry-picked
> > > fixes
> > >
> > > Was there a determination as to why our CI systems did not catch the
> > > blocker ARROW-3514?
> >
> > Because it was not exercised by the test suite.  My take is that the bug
> > would only happen with specific data, e.g. tiny and/or entirely
> > incompressible.  I don't think general gzip compression of Parquet files
> > was broken.
> >
> > Regards
> >
> > Antoine.
>