Re: Closing Plasma issues?

2020-09-07 Thread Robert Nishihara
I think that makes sense. They can be reopened if necessary.

On Mon, Sep 7, 2020 at 9:49 AM Antoine Pitrou  wrote:

>
> Hello,
>
> The Plasma component in our C++ codebase is now unmaintained, with the
> original authors and maintainers having forked the codebase on their
> side.  I propose to close the open Plasma issues in JIRA as "Won't fix".
>  Is there any concern about this?
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-08-17 Thread Robert Nishihara
To answer Wes's question, the Plasma inside of Ray is not currently usable
in a C++ library context, though it wouldn't be impossible to make that
happen.

I (or someone) could conduct a simple poll via Google Forms on the user
mailing list to gauge demand if we are concerned about breaking a lot of
people's workflow.

On Mon, Aug 17, 2020 at 3:21 AM Antoine Pitrou  wrote:

>
> Le 15/08/2020 à 17:56, Wes McKinney a écrit :
> >
> > What isn't clear is whether the Plasma that's in Ray is usable in a
> > C++ library context (e.g. what we currently ship as libplasma-dev e.g.
> > on Ubuntu/Debian). That seems still useful, but if the project isn't
> > being actively maintained / developed (which, given the series of
> > stale PRs over the last year or two, it doesn't seem to be) it's
> > unclear whether we want to keep shipping it.
>
> At least on GitHub, the C++ API seems to be getting little use.  Most
> search results below are forks/copies of the Arrow or Ray codebases.
> There are also a couple stale experiments:
> https://github.com/search?l=C%2B%2B&p=1&q=PlasmaClient&type=Code
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-07-21 Thread Robert Nishihara
Hi all,

Regarding Plasma, you're right we should have started this conversation
earlier! The way it's being developed in Ray currently isn't useful as a
standalone project. We realized that tighter integration with Ray's object
lifetime tracking could be important, and removing IPCs and making it a
separate thread in the same process as our scheduler could make a big
difference for performance. Some of these optimizations wouldn't be easy
without a tight integration, so there are some trade-offs here.

Regarding the Python serialization format, I agree with Antoine that it
should be deprecated. We began developing it before pickle 5, but now that
pickle 5 has taken off, it makes less sense (it's useful in its own right,
but at the end of the day, we were interested in it as a way to serialize
arbitrary Python objects).
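
For context, a minimal sketch of the pickle protocol 5 mechanism referred to
above (out-of-band buffers that avoid copying large binary payloads); requires
Python >= 3.8 or the pickle5 backport, and the payload here is a made-up example:

import pickle
import numpy as np

# Out-of-band buffers let large arrays bypass the pickle byte stream, which is
# what makes protocol 5 a substitute for pyarrow's custom serialization format.
payload = {'weights': np.zeros(1_000_000)}
buffers = []
data = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(data, buffers=buffers)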

-Robert

On Sun, Jul 12, 2020 at 5:26 PM Wes McKinney  wrote:

> I'll add deprecation warnings to the pyarrow.serialize functions in
> question, it will be pretty simple.
>
> On Sun, Jul 12, 2020, 6:34 PM Neal Richardson  >
> wrote:
>
> > This seems like something to investigate after the 1.0 release.
> >
> > Neal
> >
> > On Sun, Jul 12, 2020 at 11:53 AM Antoine Pitrou 
> > wrote:
> >
> > >
> > > I'd certainly like to deprecate our custom Python serialization format,
> > > and using pickle protocol 5 instead is a very good idea.
> > >
> > > We can probably keep it in 1.0 while raising a FutureWarning.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 12/07/2020 à 19:22, Wes McKinney a écrit :
> > > > It appears that the Ray developers have decided to fork Plasma and
> > > > decouple from the Arrow codebase:
> > > >
> > > > https://github.com/ray-project/ray/pull/9154
> > > >
> > > > This is a disappointing development to occur without any discussion
> on
> > > > this mailing list but given the lack of development activity on
> Plasma
> > > > I would like to see how others in the community would like to
> proceed.
> > > >
> > > > It appears additionally that the Union-based serialization format
> > > > implemented by arrow/python/serialize.h and the pyarrow/serialize.py
> > > > has been dropped in favor of pickle5. If there is not value in
> > > > maintaining this code then it would probably be preferable for us to
> > > > remove this from the codebase.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Francois Saint-Jacques

2019-06-12 Thread Robert Nishihara
Congratulations!

On Wed, Jun 12, 2019 at 4:16 PM Philipp Moritz  wrote:

> Congrats François :)
>
> On Wed, Jun 12, 2019 at 3:37 PM Antoine Pitrou  wrote:
>
> >
> > Welcome on the team François :-)
> >
> >
> > Le 12/06/2019 à 17:45, Wes McKinney a écrit :
> > > On behalf of the Arrow PMC I'm happy to announce that Francois has
> > > accepted an invitation to become an Arrow committer!
> > >
> > > Welcome, and thank you for your contributions!
> > >
> >
>


[jira] [Created] (ARROW-5099) Compiling Plasma TensorFlow op has Python 2 bug.

2019-04-03 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-5099:
---

 Summary: Compiling Plasma TensorFlow op has Python 2 bug.
 Key: ARROW-5099
 URL: https://issues.apache.org/jira/browse/ARROW-5099
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma, Python
Reporter: Robert Nishihara


I've seen the following error when compiling the Plasma TensorFlow op.
TensorFlow version: 1.13.1
Compiling Plasma TensorFlow Op...
Traceback (most recent call last):
  File "/ray/python/ray/experimental/sgd/test_sgd.py", line 48, in 
all_reduce_alg=args.all_reduce_alg)
  File "/ray/python/ray/experimental/sgd/sgd.py", line 110, in __init__
shard_shapes = ray.get(self.workers[0].shard_shapes.remote())
  File "/ray/python/ray/worker.py", line 2307, in get
raise value
ray.exceptions.RayTaskError: ray_worker (pid=81, 
host=629a7997c823)
NameError: global name 'FileNotFoundError' is not defined
{{FileNotFoundError}} doesn't exist in Python 2.
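
A common Python 2/3 compatibility shim for this (illustrative; not necessarily
the exact fix applied in the Arrow build script):
{code}
# FileNotFoundError was added in Python 3.3; alias it to IOError on Python 2
# so the same except-clause works on both interpreters.
try:
    FileNotFoundError
except NameError:  # Python 2
    FileNotFoundError = IOError
{code}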



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Python] The next manylinux specification

2019-03-25 Thread Robert Nishihara
Thanks for posting the thread. This is great!

On Mon, Mar 25, 2019 at 9:04 AM Wes McKinney  wrote:

> Thanks Antoine for alerting us to this thread. It's important that our
> interests are represented in this discussion given the problems we've
> had with interactions with the TensorFlow and PyTorch wheels. Please
> let me know if I can help.
>
> Robert and Philipp, can you keep an eye out on this also?
>
> On Fri, Mar 22, 2019 at 4:12 PM Antoine Pitrou  wrote:
> >
> >
> > For those who are interested in discussing it:
> >
> > https://discuss.python.org/t/the-next-manylinux-specification/1043
> >
> > Regards
> >
> > Antoine.
>


Re: TensorFlow, PyTorch, and manylinux1

2019-02-04 Thread Robert Nishihara
Replying to the thread because the last two messages got dropped.

On Mon, Feb 4, 2019 at 10:00 AM soumith  wrote:

> > I think trying to package CUDA is the wrong way to think about it.
> Instead, perhaps you should try to make the package compatible with
> system CUDA installs.
>
> I agree in principle.
> The problem fundamentally stems from user expectation.
>
> In my ~6+ years of supporting Torch and PyTorch, installing CUDA on a
> system can take days, with a mean of roughly half a day per user. It might
> be userland incompetence, or that CUDA is a magical snowflake, but the
> reality is that installing CUDA is never great.
> So, a huge number of the issues reported by userland are side-effects of
> broken CUDA installs.
> It doesn't help that PyPI users expect that "my package should just
> work after a pip install".
>
> If we could reliably install an up-to-date CUDA in a standardized way, and
> NVIDIA didn't simply sidestep the userland issues by saying "use our
> docker" or "our PPA is 100% reliable", we would be in a better state.
>
> Until then, I think it's best that we find a solution for PyPI users that
> can work out of box with PyPI.
>
> On Mon, Feb 4, 2019 at 12:52 PM Antoine Pitrou 
> wrote:
>
> > On Tue, 5 Feb 2019 01:45:34 +0800
> > Jason Zaman  wrote:
> > > On Tue, 5 Feb 2019 at 01:30, soumith  wrote:
> > > >
> > > > Unfortunately I'll be on a long flight, and cannot make it to the
> > SIGBuild meeting.
> > > > I'm definitely interested in the meeting notes and any follow-up
> > meeting.
> > > >
> > > > > I think we should leave CUDA out of the
> > > > discussion initially and see if we can get the cpu-only wheel working
> > > > correctly. Hopefully cpu-only is viable on manylinux2014 then we can
> > > > tackle CUDA afterwards.
> > > >
> > > > 50% of the complexity is in the CUDA packaging.
> > > > The other 50% is in shipping a more modern libstdc++.so
> > > > I believe we'll make progress if we ignore CUDA, but we'll not
> address
> > half of the issue.
> > >
> > > Yeah, we'll definitely need both to solve it fully. My thinking is
> > > that all packages need at least C++11 but only some need CUDA. Or
> > > might we end up where the libstcc++.so is incompatible with CUDA if we
> > > don't work on everything together?
> >
> > I think trying to package CUDA is the wrong way to think about it.
> > Instead, perhaps you should try to make the package compatible with
> > system CUDA installs.
> >
> > For example, the Numba pip wheel almost works out-of-the-box with a
> > system CUDA install on Ubuntu 18.04.  I say "almost" because I had to
> > set two environment variables:
> > https://github.com/numba/numba/issues/3738
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>


Re: [ANNOUNCE] New Arrow committer: Ravindra Pindikura

2019-02-04 Thread Robert Nishihara
Congratulations!

On Mon, Feb 4, 2019 at 10:06 AM Antoine Pitrou  wrote:

>
> Congratulations and thanks for all the work on Gandiva :-)
>
> Regards
>
> Antoine.
>
>
> Le 04/02/2019 à 16:40, Wes McKinney a écrit :
> > On behalf of the Arrow PMC, I'm happy to announce that Ravindra has
> > accepted an invitation to become a committer on Apache Arrow.
> >
> > Welcome, and thank you for your contributions!
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Andy Grove

2019-02-04 Thread Robert Nishihara
Congratulations!

On Mon, Feb 4, 2019 at 10:02 AM paddy horan  wrote:

> Congrats Andy
>
> Get Outlook for iOS
>
> 
> From: Wes McKinney 
> Sent: Monday, February 4, 2019 10:39 AM
> To: dev@arrow.apache.org
> Subject: [ANNOUNCE] New Arrow PMC member: Andy Grove
>
> The Project Management Committee (PMC) for Apache Arrow has invited
> Andy Grove to become a PMC member and we are pleased to announce that
> Andy has accepted.
>
> Congratulations and welcome!
>


[jira] [Created] (ARROW-4379) Register pyarrow serializers for collections.Counter and collections.deque.

2019-01-25 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-4379:
---

 Summary: Register pyarrow serializers for collections.Counter and 
collections.deque.
 Key: ARROW-4379
 URL: https://issues.apache.org/jira/browse/ARROW-4379
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: TensorFlow, PyTorch, and manylinux1

2018-12-18 Thread Robert Nishihara
polyfill
>>>> what we needed to support older distros better.
>>>>
>>>> A fully fleshed out C++11 implementation landed in gcc in various
>>>> stages, with gradual ABI changes [2]. Unfortunately, the libstdc++ that
>>>> ships with centos6 (and hence manylinux2010) isn't sufficient to cover all
>>>> of C++11. For example, the binaries we built with devtoolset3 (gcc 4.9.2)
>>>> on CentOS6 didn't run with the default libstdc++ on CentOS6 either due to
>>>> ABI changes or minimum GLIBCXX version for some of the symbols being
>>>> unavailable.
>>>>
>>>> We tried our best to support our binaries running on CentOS6 and above
>>>> with various ranges of static linking hacks until 0.3.1 (January 2018), but
>>>> at some point hacks over hacks was only getting more fragile. Hence we
>>>> moved to a CentOS7-based image in April 2018 [3], and relied only on
>>>> dynamic linking to the system-shipped libstdc++.
>>>>
>>>> As Wes mentions [4], an option is to host a modern C++ standard library
>>>> via PyPI, which would put manylinux2010 on the table. There are however subtle
>>>> consequences with this -- if this package gets installed into a conda
>>>> environment, it'll clobber anaconda-shipped libstdc++, possibly corrupting
>>>> environments for thousands of anaconda users (this is actually similar to
>>>> the issues with `mkl` shipped via PyPI and Conda clobbering each other).
>>>>
>>>>
>>>> References:
>>>>
>>>> [1] https://github.com/NVIDIA/nvidia-docker/issues/348
>>>> [2] https://gcc.gnu.org/wiki/Cxx11AbiCompatibility
>>>> [3]
>>>> https://github.com/pytorch/builder/commit/44d9bfa607a7616c66fe6492fadd8f05f3578b93
>>>> [4] https://github.com/apache/arrow/pull/3177#issuecomment-447515982
>>>>
>>>> ..
>>>>
>>>> On Sun, Dec 16, 2018 at 2:57 PM Wes McKinney 
>>>> wrote:
>>>>
>>>>> Reposting since I wasn't subscribed to develop...@tensorflow.org. I
>>>>> also didn't see Soumith's response since it didn't come through to
>>>>> dev@arrow.apache.org
>>>>>
>>>>> In response to the non-conforming ABI in the TF and PyTorch wheels, we
>>>>> have attempted to hack around the issue with some elaborate
>>>>> workarounds [1] [2] that have ultimately proved to not work
>>>>> universally. The bottom line is that this is burdening other projects
>>>>> in the Python ecosystem and causing confusing application crashes.
>>>>>
>>>>> First, to state what should hopefully be obvious to many of you, Python
>>>>> wheels are not a robust way to deploy complex C++ projects, even
>>>>> setting aside the compiler toolchain issue. If a project has
>>>>> non-trivial third party dependencies, you either have to statically
>>>>> link them or bundle shared libraries with the wheel (we do a bit of
>>>>> both in Apache Arrow). Neither solution is foolproof in all cases.
>>>>> There are other downsides to wheels when it comes to numerical
>>>>> computing -- it is difficult to utilize things like the Intel MKL
>>>>> which may be used by multiple projects. If two projects have the same
>>>>> third party C++ dependency (e.g. let's use gRPC or libprotobuf as a
>>>>> straw man example), it's hard to guarantee that versions or ABI will
>>>>> not conflict with each other.
>>>>>
>>>>> In packaging with conda, we pin all dependencies when building
>>>>> projects that depend on them, then package and deploy the dependencies
>>>>> as separate shared libraries instead of bundling. To resolve the need
>>>>> for newer compilers or newer C++ standard library, libstdc++.so and
>>>>> other system shared libraries are packaged and installed as
>>>>> dependencies. In manylinux1, the RedHat devtoolset compiler toolchain
>>>>> is used as it performs selective static linking of symbols to enable
>>>>> C++11 libraries to be deployed on older Linuxes like RHEL5/6. A conda
>>>>> environment functions as sort of portable miniature Linux
>>>>> distribution.
>>>>>
>>>>> Given 

[jira] [Created] (ARROW-3920) Plasma reference counting not properly done in TensorFlow custom operator.

2018-11-30 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3920:
---

 Summary: Plasma reference counting not properly done in TensorFlow 
custom operator.
 Key: ARROW-3920
 URL: https://issues.apache.org/jira/browse/ARROW-3920
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


We never call {{Release}} in the custom op code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3611) Give error more quickly when pyarrow serialization context is used incorrectly.

2018-10-24 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3611:
---

 Summary: Give error more quickly when pyarrow serialization 
context is used incorrectly.
 Key: ARROW-3611
 URL: https://issues.apache.org/jira/browse/ARROW-3611
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara
Assignee: Robert Nishihara


When {{type_id}} is not a string or can't be cast to a string, 
{{register_type}} will succeed, but {{_deserialize_callback}} can fail.
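
For reference, a minimal sketch of the registration path involved (pyarrow-era
SerializationContext API; the {{Foo}} class and string {{type_id}} are made-up
examples):
{code}
import pyarrow as pa

class Foo:
    def __init__(self, x):
        self.x = x

context = pa.SerializationContext()
# type_id is expected to be a string; a bad type_id is only caught later,
# in the deserialization callback, which is what this issue is about.
context.register_type(Foo, 'Foo',
                      custom_serializer=lambda obj: obj.x,
                      custom_deserializer=lambda data: Foo(data))

buf = pa.serialize(Foo(1), context=context).to_buffer()
restored = pa.deserialize(buf, context=context)
{code}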



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3574) Fix remaining bug with plasma static versus shared libraries.

2018-10-20 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3574:
---

 Summary: Fix remaining bug with plasma static versus shared 
libraries.
 Key: ARROW-3574
 URL: https://issues.apache.org/jira/browse/ARROW-3574
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Address a few missing pieces in https://github.com/apache/arrow/pull/2792. On 
Mac, moving the {{plasma_store_server}} executable around and then executing it 
leads to

 
{code:java}
dyld: Library not loaded: @rpath/libarrow.12.dylib

  Referenced from: 
/Users/rkn/Workspace/ray/./python/ray/core/src/plasma/plasma_store_server

  Reason: image not found

Abort trap: 6{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Efficient Pandas serialization for mixed object and numeric DataFrames

2018-10-18 Thread Robert Nishihara
How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
then each column should be serialized separately and numeric columns will
be handled efficiently.
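
For example, something along these lines (a sketch assuming a pyarrow version
that still ships pyarrow.serialize; the DataFrame is made up):

import numpy as np
import pandas as pd
import pyarrow as pa

# Mixed numeric/object DataFrame; pyarrow.serialize handles columns individually,
# so the numeric column is not pickled wholesale.
df = pd.DataFrame({'a': np.arange(1000), 'b': ['x'] * 1000})
buf = pa.serialize(df).to_buffer()
df2 = pa.deserialize(buf)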

On Thu, Oct 18, 2018 at 9:10 PM Mitar  wrote:

> Hi!
>
> It seems that if a DataFrame contains both numeric and object columns,
> the whole DataFrame is pickled and not that only object columns are
> pickled? Is this right? Are there any plans to improve this?
>
>
> Mitar
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
>


[jira] [Created] (ARROW-3559) Statically link libraries for plasma_store_server executable.

2018-10-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3559:
---

 Summary: Statically link libraries for plasma_store_server 
executable.
 Key: ARROW-3559
 URL: https://issues.apache.org/jira/browse/ARROW-3559
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


{code:java}
cd ~
git clone https://github.com/apache/arrow
cd arrow/cpp
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DARROW_PYTHON=on -DARROW_PLASMA=on ..
make -j16
sudo make install

cd ~
cp arrow/cpp/build/release/plasma_store_server .
mv arrow arrow-temp

# Try to start the store
./plasma_store_server -s /tmp/store -m 10{code}
The last line crashes with
{code:java}
./plasma_store_server: error while loading shared libraries: libplasma.so.12: 
cannot open shared object file: No such file or directory{code}
For usability, it's important that people can copy around the plasma store 
executable and run it.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3558) Remove fatal error when plasma client calls get on an unsealed object that it created.

2018-10-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3558:
---

 Summary: Remove fatal error when plasma client calls get on an 
unsealed object that it created.
 Key: ARROW-3558
 URL: https://issues.apache.org/jira/browse/ARROW-3558
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


In the case when Get is called with a timeout, this should simply behave as if 
the object hasn't been created yet.
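
A sketch of the intended behavior (the store socket path and object ID are
placeholders, and a plasma store is assumed to be running):
{code}
import pyarrow.plasma as plasma

client = plasma.connect('/tmp/store', '', 0)
object_id = plasma.ObjectID(20 * b'\x01')
client.create(object_id, 10)                    # created by this client, not yet sealed
# With a timeout this should simply time out (e.g. report the object as not
# available) instead of hitting a fatal error.
result = client.get(object_id, timeout_ms=100)
{code}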



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3548) Speed up storing small objects in the object store.

2018-10-17 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3548:
---

 Summary: Speed up storing small objects in the object store.
 Key: ARROW-3548
 URL: https://issues.apache.org/jira/browse/ARROW-3548
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Currently, to store an object in the plasma object store, there are a lot of 
IPCs. We first call "Create", which does an IPC round trip. Then we call 
"Seal", which is one IPC. Then we call "Release", which is another IPC.

For small objects, we can just inline the object and metadata directly into the 
message to the store, and wait for the response (the response tells us if the 
object was successfully created). This is just a single IPC round trip, which 
can be much faster.
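
For reference, the current multi-IPC path looks roughly like this from the
client side (socket path, object ID, and size are placeholders):
{code}
import pyarrow.plasma as plasma

client = plasma.connect('/tmp/store', '', 0)
object_id = plasma.ObjectID(20 * b'\x02')
buf = client.create(object_id, 100)   # IPC round trip to allocate shared memory
# ... fill in the 100-byte buffer ...
client.seal(object_id)                # another IPC to make the object visible
# a Release IPC follows when the buffer is dropped
{code}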



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3454) Tab complete doesn't work for plasma client.

2018-10-06 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3454:
---

 Summary: Tab complete doesn't work for plasma client.
 Key: ARROW-3454
 URL: https://issues.apache.org/jira/browse/ARROW-3454
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara


In IPython, tab complete on a plasma client object should reveal the client's 
methods. I think this is the same thing as making sure {{dir(client)}} returns 
all of the relevant methods/fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3373) Fix bug in which plasma store can die when client gets multiple objects and object becomes available.

2018-09-30 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3373:
---

 Summary: Fix bug in which plasma store can die when client gets 
multiple objects and object becomes available.
 Key: ARROW-3373
 URL: https://issues.apache.org/jira/browse/ARROW-3373
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara
 Fix For: 0.11.0


This bug was recently introduced in 
https://github.com/apache/arrow/pull/2650. The store can die when a client 
calls "get" on multiple object IDs and then the first object ID becomes 
available.

Will have a patch momentarily.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3348) Plasma store dies when an object that a dead client is waiting for gets created.

2018-09-27 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3348:
---

 Summary: Plasma store dies when an object that a dead client is 
waiting for gets created.
 Key: ARROW-3348
 URL: https://issues.apache.org/jira/browse/ARROW-3348
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


I will have a patch for this soon.

To reproduce the bug do the following:
 # Start plasma store
 # Create client 1 and have it call {{get(object_id)}}
 # Kill client 1
 # Create client 2 and have it create an object with ID {{object_id}}

This will cause the plasma store to crash.
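
A minimal sketch of the reproduction steps above (assumes a plasma store is
already running; the socket path and object ID are placeholders):
{code}
import subprocess, sys, time
import pyarrow.plasma as plasma

object_id = plasma.ObjectID(20 * b'\x03')

# Client 1 in a child process: blocks in get() on an object that does not exist yet.
child = subprocess.Popen([sys.executable, '-c',
    "import pyarrow.plasma as plasma; "
    "c = plasma.connect('/tmp/store', '', 0); "
    "c.get(plasma.ObjectID(20 * b'\\x03'))"])
time.sleep(1)
child.kill()                          # kill client 1 while it is waiting

# Client 2 creates and seals the object the dead client was waiting for.
client2 = plasma.connect('/tmp/store', '', 0)
buf = client2.create(object_id, 10)
client2.seal(object_id)               # before the fix, this could crash the store
{code}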



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Robert Nishihara
Thanks Wes.

As for Python 3.5, 3.6, and 3.7, I think testing any one of them should be
sufficient (I can't recall any errors that happened with one version and
not the other).

On Mon, Aug 6, 2018 at 12:01 PM Wes McKinney  wrote:

> @Robert, it looks like NumPy is making LTS releases until Jan 1, 2020
>
>
> https://docs.scipy.org/doc/numpy-1.14.0/neps/dropping-python2.7-proposal.html
>
> Based on this, I think it's fine for us to continue to support Python
> 2.7 until then. It's only 16 months away; are you all ready for the
> next decade?
>
> We should also discuss if we want to continue to build and test Python
> 3.5. From download statistics it appears that there are 5-10x as many
> Python 3.6 users as 3.5. I would prefer to drop 3.5 and begin
> supporting 3.7 soon.
>
> @Antoine, I think we can avoid building the C++ codebase 3 times, but
> it will require a bit of retooling of the scripts. The reason that
> ccache isn't working properly is probably because the Python include
> directory is being included even for compilation units that do not use
> the Python C API.
> https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L721.
> I'm opening a JIRA about fixing this
> https://issues.apache.org/jira/browse/ARROW-2994
>
> Created https://issues.apache.org/jira/browse/ARROW-2995 about
> removing the redundant build cycle
>
> On Mon, Aug 6, 2018 at 2:19 PM, Robert Nishihara
>  wrote:
> >>
> >> Also, at this point we're sometimes hitting the 50 minutes time limit on
> >> our slowest Travis-CI matrix job, which means we have to restart it...
> >> making the build even slower.
> >>
> > Only a short-term fix, but Travis can lengthen the max build time if you
> > email them and ask them to.
>


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Robert Nishihara
>
> Also, at this point we're sometimes hitting the 50 minutes time limit on
> our slowest Travis-CI matrix job, which means we have to restart it...
> making the build even slower.
>
Only a short-term fix, but Travis can lengthen the max build time if you
email them and ask them to.


Re: [DISCUSS] Re-think CI strategy?

2018-08-06 Thread Robert Nishihara
Wes, do you primarily want to drop Python 2 to speed up Travis or to reduce
the development overhead? In my experience the development overhead is
minimal and well worth it. For Travis, we could consider looking into other
options like paying for more concurrency.

January 2019 is very soon and Python 2 is still massively popular.

On Mon, Aug 6, 2018 at 5:11 AM Wes McKinney  wrote:

> > The 40+ minutes Travis-CI job already uses the toolchain packages AFAIK.
> >  Don't they include thrift?
>
> I was referring to your comment about "parquet-cpp AppVeyor builds are
> abysmally slow". I think the slowness is in significant part due to
> the ExternalProject builds, where Thrift is the worst offender.
>


Re: Pyarrow Plasma client.release() fault

2018-07-20 Thread Robert Nishihara
Hi Corey,

It is possible that the current eviction policy will evict a ton of objects
at once. Since the plasma store is single threaded, this could cause the
plasma store to be unresponsive while the eviction is happening (though it
should not hang permanently, just temporarily).

You could always try starting the plasma store with a smaller amount of
memory (using the "-m" flag) and see if that changes things.

Glad to hear that ray is simplifying things.

-Robert

On Fri, Jul 20, 2018 at 1:30 PM Corey Nolet  wrote:

> Robert,
>
> Yes I am using separate Plasma clients in each different thread. I also
> verified that I am not using up all the file descriptors or reaching the
> overcommit limit.
>
> I do see that the Plasma server is evicting objects every so often. I'm
> assuming this eviction may be going on in the background? Is it possible
> that the locking up may be the result of a massive eviction? I am
> allocating over 8TB for the Plasma server.
>
> Wes,
>
> Best practices would be great. I did find that the @ray.remote scheduler
> from the Ray project has drastically simplified my code.
>
> I also attempted using single-node PySpark but the type conversion I need
> for going from CSV->Dataframes was orders of magnitude slower than Pandas
> and Python.
>
>
>
> On Mon, Jul 16, 2018 at 8:17 PM Wes McKinney  wrote:
>
> > Seems like we might want to write down some best practices for this
> > level of large scale usage, essentially a supercomputer-like rig. I
> > wouldn't even know where to come by a machine with >
> > 2TB memory for scalability / concurrency load testing
> >
> > On Mon, Jul 16, 2018 at 2:59 PM, Robert Nishihara
> >  wrote:
> > > Are you using the same plasma client from all of the different threads?
> > If
> > > so, that could cause race conditions as the client is not thread safe.
> > >
> > > Alternatively, if you have a separate plasma client for each thread,
> then
> > > you may be running out of file descriptors somewhere (either the client
> > > process or the store).
> > >
> > > Can you check if the object store is evicting objects (it prints something
> > to
> > > stdout/stderr when this happens)? Could you be running out of memory
> but
> > > failing to release the objects?
> > >
> > > On Tue, Jul 10, 2018 at 9:48 AM Corey Nolet  wrote:
> > >
> > >> Update:
> > >>
> > >> I'm investigating the possibility that I've reached the overcommit
> > limit in
> > >> the kernel as a result of all the parallel processes.
> > >>
> > >> This still doesn't fix the client.release() problem but it might
> explain
> > >> why the processing appears to halt, after some time, until I restart
> the
> > >> Jupyter kernel.
> > >>
> > >> On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet 
> wrote:
> > >>
> > >> > Wes,
> > >> >
> > >> > Unfortunately, my code is on a separate network. I'll try to explain
> > what
> > >> > I'm doing and if you need further detail, I can certainly pseudocode
> > >> > specifics.
> > >> >
> > >> > I am using multiprocessing.Pool() to fire up a bunch of threads for
> > >> > different filenames. In each thread, I'm performing a pd.read_csv(),
> > >> > sorting by the timestamp field (rounded to the day) and chunking the
> > >> > Dataframe into separate Dataframes. I create a new Plasma ObjectID
> for
> > >> each
> > >> > of the chunked Dataframes, convert them to RecordBuffer objects,
> > stream
> > >> the
> > >> > bytes to Plasma and seal the objects. Only the objectIDs are
> returned
> > to
> > >> > the orchestration thread.
> > >> >
> > >> > In follow-on processing, I'm combining the ObjectIDs for each of the
> > >> > unique day timestamps into lists and I'm passing those into a
> > function in
> > >> > parallel using multiprocessing.Pool(). In this function, I'm
> iterating
> > >> > through the lists of objectIds, loading them back into Dataframes,
> > >> > appending them together until their size
> > >> > is > some predefined threshold, and performing a df.to_parquet().
> > >> >
> > >> > The steps in the 2 paragraphs above are performing in a loop,
> > batching up
> > >> > 500-1k files at a time for each iteration.
> > >> >
> > >&

Re: Pyarrow Plasma client.release() fault

2018-07-16 Thread Robert Nishihara
Are you using the same plasma client from all of the different threads? If
so, that could cause race conditions as the client is not thread safe.

Alternatively, if you have a separate plasma client for each thread, then
you may be running out of file descriptors somewhere (either the client
process or the store).

Can you check if the object store is evicting objects (it prints something to
stdout/stderr when this happens)? Could you be running out of memory but
failing to release the objects?
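
A minimal sketch of the one-client-per-worker pattern (the '/tmp/store' socket
path is an assumption, and a plasma store must already be running):

import multiprocessing as mp
import numpy as np
import pyarrow.plasma as plasma

def worker(_):
    # Connect inside the worker process; plasma clients should not be shared
    # across processes, but object IDs are cheap to pass back.
    client = plasma.connect('/tmp/store', '', 0)
    object_id = client.put(np.zeros(10))
    return object_id.binary()          # 20 raw bytes, easy to pickle

if __name__ == '__main__':
    with mp.Pool(4) as pool:
        ids = [plasma.ObjectID(b) for b in pool.map(worker, range(4))]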

On Tue, Jul 10, 2018 at 9:48 AM Corey Nolet  wrote:

> Update:
>
> I'm investigating the possibility that I've reached the overcommit limit in
> the kernel as a result of all the parallel processes.
>
> This still doesn't fix the client.release() problem but it might explain
> why the processing appears to halt, after some time, until I restart the
> Jupyter kernel.
>
> On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet  wrote:
>
> > Wes,
> >
> > Unfortunately, my code is on a separate network. I'll try to explain what
> > I'm doing and if you need further detail, I can certainly pseudocode
> > specifics.
> >
> > I am using multiprocessing.Pool() to fire up a bunch of threads for
> > different filenames. In each thread, I'm performing a pd.read_csv(),
> > sorting by the timestamp field (rounded to the day) and chunking the
> > Dataframe into separate Dataframes. I create a new Plasma ObjectID for
> each
> > of the chunked Dataframes, convert them to RecordBuffer objects, stream
> the
> > bytes to Plasma and seal the objects. Only the objectIDs are returned to
> > the orchestration thread.
> >
> > In follow-on processing, I'm combining the ObjectIDs for each of the
> > unique day timestamps into lists and I'm passing those into a function in
> > parallel using multiprocessing.Pool(). In this function, I'm iterating
> > through the lists of objectIds, loading them back into Dataframes,
> > appending them together until their size
> > is > some predefined threshold, and performing a df.to_parquet().
> >
> > The steps in the 2 paragraphs above are performing in a loop, batching up
> > 500-1k files at a time for each iteration.
> >
> > When I run this iteration a few times, it eventually locks up the Plasma
> > client. With regards to the release() fault, it doesn't seem to matter
> when
> > or where I run it (in the orchestration thread or in other threads), it
> > always seems to crash the Jupyter kernel. I'm thinking I might be using
> it
> > wrong, I'm just trying to figure out where and what I'm doing.
> >
> > Thanks again!
> >
> > On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney 
> wrote:
> >
> >> hi Corey,
> >>
> >> Can you provide the code (or a simplified version thereof) that shows
> >> how you're using Plasma?
> >>
> >> - Wes
> >>
> >> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet 
> wrote:
> >> > I'm on a system with 12TB of memory and attempting to use Pyarrow's
> >> Plasma
> >> > client to convert a series of CSV files (via Pandas) into a Parquet
> >> store.
> >> >
> >> > I've got a little over 20k CSV files to process which are about 1-2gb
> >> each.
> >> > I'm loading 500 to 1000 files at a time.
> >> >
> >> > In each iteration, I'm loading a series of files, partitioning them
> by a
> >> > time field into separate dataframes, then writing parquet files in
> >> > directories for each day.
> >> >
> >> > The problem I'm having is that the Plasma client & server appear to
> >> lock up
> >> > after about 2-3 iterations. It locks up to the point where I can't
> even
> >> > CTRL+C the server. I am able to stop the notebook and re-trying the
> code
> >> > just continues to lock up when interacting with Jupyter. There are no
> >> > errors in my logs to tell me something's wrong.
> >> >
> >> > Just to make sure I'm not just being impatient and possibly need to
> wait
> >> > for some background services to finish, I allowed the code to run
> >> overnight
> >> > and it was still in the same state when I came in to work this
> morning.
> >> I'm
> >> > running the Plasma server with 4TB max.
> >> >
> >> > In an attempt to pro-actively free up some of the object ids that I no
> >> > longer need, I also attempted to use the client.release() function
> but I
> >> > cannot seem to figure out how to make this work properly. It crashes
> my
> >> > Jupyter kernel each time I try.
> >> >
> >> > I'm using Pyarrow 0.9.0
> >> >
> >> > Thanks in advance.
> >>
> >
>


Re: bug? pyarrow deserialize_components doesn't work in multiple processes

2018-07-06 Thread Robert Nishihara
Can you reproduce it without all of the multiprocessing code? E.g., just
call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes
into another interpreter and call *pyarrow.deserialize* or
*pyarrow.deserialize_components*?
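
For example (a sketch using the same pyarrow serialization API as the script
quoted below; the payload mirrors that script):

import pyarrow as pa
import numpy as np

payload = ['message', 123, np.random.uniform(-100, 100, (4, 4))]
components = pa.serialize(payload).to_components()
# ... hand 'components' (the metadata dict plus its data buffers) to the other
# interpreter by whatever means, then there:
restored = pa.deserialize_components(components)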
On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley 
wrote:

> Attachment inline:
>
> import pyarrow as pa
> import multiprocessing as mp
> import numpy as np
>
> def make_payload():
> """Common function - make data to send"""
> return ['message', 123, np.random.uniform(-100, 100, (4, 4))]
>
> def send_payload(payload, connection):
> """Common function - serialize & send data through a socket"""
> s = pa.serialize(payload)
> c = s.to_components()
>
> # Send
> data = c.pop('data')
> connection.send(c)
> for d in data:
> connection.send_bytes(d)
> connection.send_bytes(b'')
>
>
> def recv_payload(connection):
> """Common function - recv data through a socket & deserialize"""
> c = connection.recv()
> c['data'] = []
> while True:
> r = connection.recv_bytes()
> if len(r) == 0:
> break
> c['data'].append(pa.py_buffer(r))
>
> print('...deserialize')
> return pa.deserialize_components(c)
>
>
> def run_same_process():
> """Same process: Send data down a socket, then read data from the
> matching socket"""
> print('run_same_process')
> recv_conn,send_conn = mp.Pipe(duplex=False)
> payload = make_payload()
> print(payload)
> send_payload(payload, send_conn)
> payload2 = recv_payload(recv_conn)
> print(payload2)
>
>
> def receiver(recv_conn):
> """Separate process: runs in a different process, recv data &
> deserialize"""
> print('Receiver started')
> payload = recv_payload(recv_conn)
> print(payload)
>
>
> def run_separate_process():
> """Separate process: launch the child process, then send data"""
>
>
> print('run_separate_process')
> recv_conn,send_conn = mp.Pipe(duplex=False)
> process = mp.Process(target=receiver, args=(recv_conn,))
> process.start()
>
> payload = make_payload()
> print(payload)
> send_payload(payload, send_conn)
>
> process.join()
>
> if __name__ == '__main__':
> run_same_process()
> run_separate_process()
>
>
> On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley <
> josh.quig...@lifetrading.com.au>
> wrote:
>
> > A reproducible program attached - it first runs serialize/deserialize
> from
> > the same process, then it does the same work using a separate process for
> > the deserialize.
> >
> > The behaviour I see (after the same-process code executes happily) is
> > hanging / child-process crashing during the call to deserialize.
> >
> > Is this expected, and if not, is there a known workaround?
> >
> > Running Windows 10, conda distribution,  with package versions listed
> > below. I'll also see what happens if I run on *nix.
> >
> >   - arrow-cpp=0.9.0=py36_vc14_7
> >   - boost-cpp=1.66.0=vc14_1
> >   - bzip2=1.0.6=vc14_1
> >   - hdf5=1.10.2=vc14_0
> >   - lzo=2.10=vc14_0
> >   - parquet-cpp=1.4.0=vc14_0
> >   - snappy=1.1.7=vc14_1
> >   - zlib=1.2.11=vc14_0
> >   - blas=1.0=mkl
> >   - blosc=1.14.3=he51fdeb_0
> >   - cython=0.28.3=py36hfa6e2cd_0
> >   - icc_rt=2017.0.4=h97af966_0
> >   - intel-openmp=2018.0.3=0
> >   - numexpr=2.6.5=py36hcd2f87e_0
> >   - numpy=1.14.5=py36h9fa60d3_2
> >   - numpy-base=1.14.5=py36h5c71026_2
> >   - pandas=0.23.1=py36h830ac7b_0
> >   - pyarrow=0.9.0=py36hfe5e424_2
> >   - pytables=3.4.4=py36he6f6034_0
> >   - python=3.6.6=hea74fb7_0
> >   - vc=14=h0510ff6_3
> >   - vs2015_runtime=14.0.25123=3
> >
> >
>


[jira] [Created] (ARROW-2657) Segfault when importing TensorFlow after Pyarrow

2018-05-31 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2657:
---

 Summary: Segfault when importing TensorFlow after Pyarrow
 Key: ARROW-2657
 URL: https://issues.apache.org/jira/browse/ARROW-2657
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Robert Nishihara
You're welcome!

On Wed, May 16, 2018 at 6:13 PM Corey Nolet <cjno...@gmail.com> wrote:

> I must say, I’m super excited about using Arrow and Plasma.
>
> The code you just posted worked for me at home and I’m sure I’ll figure
> out what I was doing wrong tomorrow at work.
>
> Anyways, thanks so much for your help and fast replies!
>
> Sent from my iPhone
>
> > On May 16, 2018, at 7:42 PM, Robert Nishihara <robertnishih...@gmail.com>
> wrote:
> >
> > You should be able to do something like the following.
> >
> > # Start the store.
> > plasma_store -s /tmp/store -m 10
> >
> > Then in Python, do the following:
> >
> > import pandas as pd
> > import pyarrow.plasma as plasma
> > import numpy as np
> >
> > client = plasma.connect('/tmp/store', '', 0)
> > series = pd.Series(np.zeros(100))
> > object_id = client.put(series)
> >
> > And yes, I would create a separate Plasma client for each process. I
> don't
> > think you'll be able to pickle a Plasma client object successfully (it
> has
> > a socket connection to the store).
> >
> > On Wed, May 16, 2018 at 3:43 PM Corey Nolet <cjno...@gmail.com> wrote:
> >
> >> Robert,
> >>
> >> Thank you for the quick response. I've been playing around for a few
> hours
> >> to get a feel for how this works.
> >>
> >> If I understand correctly, it's better to have the Plasma client objects
> >> instantiated within each separate process? Weird things seemed to happen
> >> when I attempted to share a single one. I was assuming that the pickle
> >> serialization by python multiprocessing would have been serializing the
> >> connection info and re-instantiating on the other side but that didn't
> seem
> >> to be the case.
> >>
> >> I managed to load up a gigantic set of CSV files into Dataframes. Now
> I'm
> >> attempting to read the chunks, perform a groupby-aggregate, and write
> the
> >> results back to the Plasma store. Unless I'm mistaken, there doesn't
> seem
> >> to be a very direct way of accomplishing this. When I tried converting
> the
> >> Series object into a Plasma Array and just doing a client.put(array) I
> get
> >> a pickling error. Unless maybe I'm misunderstanding the architecture
> here,
> >> I believe that error would have been referring to attempts to serialize
> the
> >> object into a file? I would hope that the data isn't all being sent to
> the
> >> single Plasma server (or sent over sockets for that matter).
> >>
> >> What would be the recommended strategy for serializing Pandas Series
> >> objects? I really like the StreamWriter concept here but there does not
> >> seem to be a direct way (or documentation) to accomplish this.
> >>
> >> Thanks again.
> >>
> >> On Wed, May 16, 2018 at 1:28 PM, Robert Nishihara <
> >> robertnishih...@gmail.com
> >>> wrote:
> >>
> >>> Take a look at the Plasma object store
> >>> https://arrow.apache.org/docs/python/plasma.html.
> >>>
> >>> Here's an example using it (along with multiprocessing to sort a pandas
> >>> dataframe)
> >>> https://github.com/apache/arrow/blob/master/python/
> >>> examples/plasma/sorting/sort_df.py.
> >>> It's possible the example is a bit out of date.
> >>>
> >>> You may be interested in taking a look at Ray
> >>> https://github.com/ray-project/ray. We use Plasma/Arrow under the hood
> >> to
> >>> do all of these things but hide a lot of the bookkeeping (like object
> ID
> >>> generation). For your setting, you can think of it as a replacement for
> >>> Python multiprocessing that automatically uses shared memory and Arrow
> >> for
> >>> serialization.
> >>>
> >>>> On Wed, May 16, 2018 at 10:02 AM Corey Nolet <cjno...@gmail.com>
> wrote:
> >>>>
> >>>> I've been reading through the PyArrow documentation and trying to
> >>>> understand how to use the tool effectively for IPC (using zero-copy).
> >>>>
> >>>> I'm on a system with 586 cores & 1TB of ram. I'm using Pandas
> >> Dataframes
> >>>> to process several 10's of gigs of data in memory and the pickling
> that
> >>> is
> >>>> done by Python's multiprocessing API is very wasteful.
> >>>>
> >>>> I'm running a little hand-built map-reduce where I chunk the dataframe
> >>> into
> >>>> N_mappers number of chunks, run some processing on them, then run some
> >>>> number N_reducers to finalize the operation. What I'd like to be able
> >> to
> >>> do
> >>>> is chunk up the dataframe into Arrow Buffer objects and just have each
> >>>> mapped task read their respective Buffer object with the guarantee of
> >>>> zero-copy.
> >>>>
> >>>> I see there's a couple Filesystem abstractions for doing memory-mapped
> >>>> files. Durability isn't something I need and I'm willing to forego the
> >>>> expense of putting the files on disk.
> >>>>
> >>>> Is it possible to write the data directly to memory and pass just the
> >>>> reference around to the different processes? What's the recommended
> way
> >>> to
> >>>> accomplish my goal here?
> >>>>
> >>>>
> >>>> Thanks in advance!
> >>>>
> >>>
> >>
>


Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Robert Nishihara
You should be able to do something like the following.

# Start the store.
plasma_store -s /tmp/store -m 10

Then in Python, do the following:

import pandas as pd
import pyarrow.plasma as plasma
import numpy as np

client = plasma.connect('/tmp/store', '', 0)
series = pd.Series(np.zeros(100))
object_id = client.put(series)

And yes, I would create a separate Plasma client for each process. I don't
think you'll be able to pickle a Plasma client object successfully (it has
a socket connection to the store).

On Wed, May 16, 2018 at 3:43 PM Corey Nolet <cjno...@gmail.com> wrote:

> Robert,
>
> Thank you for the quick response. I've been playing around for a few hours
> to get a feel for how this works.
>
> If I understand correctly, it's better to have the Plasma client objects
> instantiated within each separate process? Weird things seemed to happen
> when I attempted to share a single one. I was assuming that the pickle
> serialization by python multiprocessing would have been serializing the
> connection info and re-instantiating on the other side but that didn't seem
> to be the case.
>
> I managed to load up a gigantic set of CSV files into Dataframes. Now I'm
> attempting to read the chunks, perform a groupby-aggregate, and write the
> results back to the Plasma store. Unless I'm mistaken, there doesn't seem
> to be a very direct way of accomplishing this. When I tried converting the
> Series object into a Plasma Array and just doing a client.put(array) I get
> a pickling error. Unless maybe I'm misunderstanding the architecture here,
> I believe that error would have been referring to attempts to serialize the
> object into a file? I would hope that the data isn't all being sent to the
> single Plasma server (or sent over sockets for that matter).
>
> What would be the recommended strategy for serializing Pandas Series
> objects? I really like the StreamWriter concept here but there does not
> seem to be a direct way (or documentation) to accomplish this.
>
> Thanks again.
>
> On Wed, May 16, 2018 at 1:28 PM, Robert Nishihara <
> robertnishih...@gmail.com
> > wrote:
>
> > Take a look at the Plasma object store
> > https://arrow.apache.org/docs/python/plasma.html.
> >
> > Here's an example using it (along with multiprocessing to sort a pandas
> > dataframe)
> > https://github.com/apache/arrow/blob/master/python/
> > examples/plasma/sorting/sort_df.py.
> > It's possible the example is a bit out of date.
> >
> > You may be interested in taking a look at Ray
> > https://github.com/ray-project/ray. We use Plasma/Arrow under the hood
> to
> > do all of these things but hide a lot of the bookkeeping (like object ID
> > generation). For your setting, you can think of it as a replacement for
> > Python multiprocessing that automatically uses shared memory and Arrow
> for
> > serialization.
> >
> > On Wed, May 16, 2018 at 10:02 AM Corey Nolet <cjno...@gmail.com> wrote:
> >
> > > I've been reading through the PyArrow documentation and trying to
> > > understand how to use the tool effectively for IPC (using zero-copy).
> > >
> > > I'm on a system with 586 cores & 1TB of ram. I'm using Pandas
> Dataframes
> > > to process several 10's of gigs of data in memory and the pickling that
> > is
> > > done by Python's multiprocessing API is very wasteful.
> > >
> > > I'm running a little hand-built map-reduce where I chunk the dataframe
> > into
> > > N_mappers number of chunks, run some processing on them, then run some
> > > number N_reducers to finalize the operation. What I'd like to be able
> to
> > do
> > > is chunk up the dataframe into Arrow Buffer objects and just have each
> > > mapped task read their respective Buffer object with the guarantee of
> > > zero-copy.
> > >
> > > I see there's a couple Filesystem abstractions for doing memory-mapped
> > > files. Durability isn't something I need and I'm willing to forego the
> > > expense of putting the files on disk.
> > >
> > > Is it possible to write the data directly to memory and pass just the
> > > reference around to the different processes? What's the recommended way
> > to
> > > accomplish my goal here?
> > >
> > >
> > > Thanks in advance!
> > >
> >
>


Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Robert Nishihara
Take a look at the Plasma object store
https://arrow.apache.org/docs/python/plasma.html.

Here's an example using it (along with multiprocessing to sort a pandas
dataframe)
https://github.com/apache/arrow/blob/master/python/examples/plasma/sorting/sort_df.py.
It's possible the example is a bit out of date.

You may be interested in taking a look at Ray
https://github.com/ray-project/ray. We use Plasma/Arrow under the hood to
do all of these things but hide a lot of the bookkeeping (like object ID
generation). For your setting, you can think of it as a replacement for
Python multiprocessing that automatically uses shared memory and Arrow for
serialization.

On Wed, May 16, 2018 at 10:02 AM Corey Nolet  wrote:

> I've been reading through the PyArrow documentation and trying to
> understand how to use the tool effectively for IPC (using zero-copy).
>
> I'm on a system with 586 cores & 1TB of ram. I'm using Pandas Dataframes
> to process several 10's of gigs of data in memory and the pickling that is
> done by Python's multiprocessing API is very wasteful.
>
> I'm running a little hand-built map-reduce where I chunk the dataframe into
> N_mappers number of chunks, run some processing on them, then run some
> number N_reducers to finalize the operation. What I'd like to be able to do
> is chunk up the dataframe into Arrow Buffer objects and just have each
> mapped task read their respective Buffer object with the guarantee of
> zero-copy.
>
> I see there's a couple Filesystem abstractions for doing memory-mapped
> files. Durability isn't something I need and I'm willing to forego the
> expense of putting the files on disk.
>
> Is it possible to write the data directly to memory and pass just the
> reference around to the different processes? What's the recommended way to
> accomplish my goal here?
>
>
> Thanks in advance!
>


[jira] [Created] (ARROW-2469) Make out arguments last in ReadMessage API.

2018-04-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2469:
---

 Summary: Make out arguments last in ReadMessage API.
 Key: ARROW-2469
 URL: https://issues.apache.org/jira/browse/ARROW-2469
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Robert Nishihara
Assignee: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2451) Handle more dtypes efficiently in custom numpy array serializer.

2018-04-12 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2451:
---

 Summary: Handle more dtypes efficiently in custom numpy array 
serializer.
 Key: ARROW-2451
 URL: https://issues.apache.org/jira/browse/ARROW-2451
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


Right now certain dtypes like bool or fixed length strings are serialized as 
lists, which is inefficient. We can handle these more efficiently by casting 
them to uint8 and saving the original dtype as additional data.
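
A sketch of the cast-to-uint8 idea in plain NumPy (illustrative only, not the
actual pyarrow code path):
{code}
import numpy as np

def encode(arr):
    # View the raw bytes as uint8 and keep the original dtype/shape as metadata.
    return arr.view(np.uint8), str(arr.dtype), arr.shape

def decode(raw, dtype, shape):
    return raw.view(dtype).reshape(shape)

x = np.array([True, False, True])          # works the same for e.g. dtype='S5'
raw, dtype, shape = encode(x)
assert (decode(raw, dtype, shape) == x).all()
{code}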



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2448) Segfault when plasma client goes out of scope before buffer.

2018-04-11 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2448:
---

 Summary: Segfault when plasma client goes out of scope before 
buffer.
 Key: ARROW-2448
 URL: https://issues.apache.org/jira/browse/ARROW-2448
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++), Python
Reporter: Robert Nishihara


The following causes a segfault.

 

First start a plasma store with
{code:java}
plasma_store -s /tmp/store -m 100{code}
Then run the following in Python.
{code}
import pyarrow.plasma as plasma
import numpy as np

client = plasma.connect('/tmp/store', '', 0)

object_id = client.put(np.zeros(3))

buf = client.get(object_id)

del client

del buf  # This segfaults.{code}
The backtrace is 
{code:java}
(lldb) bt

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0xfffc)

  * frame #0: 0x0001056deaee 
libplasma.0.dylib`plasma::PlasmaClient::Release(plasma::UniqueID const&) + 142

    frame #1: 0x0001056de9e9 
libplasma.0.dylib`plasma::PlasmaBuffer::~PlasmaBuffer() + 41

    frame #2: 0x0001056dec9f libplasma.0.dylib`arrow::Buffer::~Buffer() + 63

    frame #3: 0x000106206661 
lib.cpython-36m-darwin.so`std::__1::shared_ptr::~shared_ptr() 
[inlined] std::__1::__shared_count::__release_shared(this=0x0001019b7d20) 
at memory:3444

    frame #4: 0x000106206617 
lib.cpython-36m-darwin.so`std::__1::shared_ptr::~shared_ptr() 
[inlined] 
std::__1::__shared_weak_count::__release_shared(this=0x0001019b7d20) at 
memory:3486

    frame #5: 0x000106206617 
lib.cpython-36m-darwin.so`std::__1::shared_ptr::~shared_ptr(this=0x000100791780)
 at memory:4412

    frame #6: 0x000106002b35 
lib.cpython-36m-darwin.so`std::__1::shared_ptr::~shared_ptr(this=0x000100791780)
 at memory:4410

    frame #7: 0x0001061052c5 lib.cpython-36m-darwin.so`void 
__Pyx_call_destructor<std::__1::shared_ptr 
>(x=std::__1::shared_ptr::element_type @ 0x0001019b7d38 
strong=0 weak=1) at lib.cxx:486

    frame #8: 0x000106104f93 
lib.cpython-36m-darwin.so`__pyx_tp_dealloc_7pyarrow_3lib_Buffer(o=0x000100791768)
 at lib.cxx:107704

    frame #9: 0x0001069fcd54 multiarray.cpython-36m-darwin.so`array_dealloc 
+ 292

    frame #10: 0x0001000e8daf libpython3.6m.dylib`_PyDict_DelItem_KnownHash 
+ 463

    frame #11: 0x000100171899 libpython3.6m.dylib`_PyEval_EvalFrameDefault 
+ 13321

    frame #12: 0x0001001791ef libpython3.6m.dylib`_PyEval_EvalCodeWithName 
+ 2447

    frame #13: 0x00010016e3d4 libpython3.6m.dylib`PyEval_EvalCode + 100

    frame #14: 0x0001001a3bd6 
libpython3.6m.dylib`PyRun_InteractiveOneObject + 582

    frame #15: 0x0001001a350e 
libpython3.6m.dylib`PyRun_InteractiveLoopFlags + 222

    frame #16: 0x0001001a33fc libpython3.6m.dylib`PyRun_AnyFileExFlags + 60

    frame #17: 0x0001001bc835 libpython3.6m.dylib`Py_Main + 3829

    frame #18: 0x00010df8 python`main + 232

    frame #19: 0x7fff6cd80015 libdyld.dylib`start + 1

    frame #20: 0x7fff6cd80015 libdyld.dylib`start + 1{code}
Basically, the issue is that when the buffer goes out of scope, it calls 
{{Release}} on the plasma client, but the client has already been deallocated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2397) Document changes in Tensor encoding in IPC.md.

2018-04-04 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2397:
---

 Summary: Document changes in Tensor encoding in IPC.md.
 Key: ARROW-2397
 URL: https://issues.apache.org/jira/browse/ARROW-2397
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Robert Nishihara


Update IPC.md to reflect the changes in 
https://github.com/apache/arrow/pull/1802.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-13 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2308:
---

 Summary: Serialized tensor data should be 64-byte aligned.
 Key: ARROW-2308
 URL: https://issues.apache.org/jira/browse/ARROW-2308
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


See [https://github.com/ray-project/ray/issues/1658] for an example of this 
issue. Non-aligned data can trigger a copy when fed into TensorFlow and things 
like that.
{code}
import pyarrow as pa
import numpy as np

x = np.zeros(10)
y = pa.deserialize(pa.serialize(x).to_buffer())

x.ctypes.data % 64  # 0 (it starts out aligned)
y.ctypes.data % 64  # 48 (it is no longer aligned)
{code}
It should be possible to fix this by calling something like 
{{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. 
Note that we already do this before writing the tensor header, but the tensor 
header is not necessarily a multiple of 64 bytes, so the subsequent data can be 
unaligned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: How to properly serialize subclasses of supported classes

2018-03-05 Thread Robert Nishihara
We just chatted offline. Should be fixed by
https://github.com/apache/arrow/pull/1704.
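
The quoted thread below turns on CPython's PyList_Check vs PyList_CheckExact;
a rough Python-level analogy of that distinction:

class MyList(list):
    pass

x = MyList([1, 2, 3])
isinstance(x, list)   # True  -- PyList_Check-style: subclasses take the plain-list path
type(x) is list       # False -- PyList_CheckExact-style: subclasses can hit a custom serializer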

On Mon, Mar 5, 2018 at 3:42 AM Mitar <mmi...@gmail.com> wrote:

> Hi!
>
> You mean, this explains why a subclass of list is not being matched? Maybe.
>
> But I do not get why my custom serialization for ndarray subclass is
> never called.
>
> Or how hard would it be to automatically serialize/deserialize into
> subclasses so that I would not have to have a custom serialization for
> ndarray but the existing ndarray serialization would work, casting it
> into a proper subclass.
>
>
> Mitar
>
> On Sun, Mar 4, 2018 at 2:39 PM, Robert Nishihara
> <robertnishih...@gmail.com> wrote:
> > The issue is probably this line
> >
> >
> https://github.com/apache/arrow/blob/8b1c8118b017a941f0102709d72df7e5a9783aa4/cpp/src/arrow/python/python_to_arrow.cc#L504
> >
> > which uses PyList_Check instead of PyList_CheckExact. Changing it to the
> > exact form will cause it to use the custom serializer for subclasses of
> > list.
> >
> > On Sun, Mar 4, 2018 at 1:08 AM Mitar <mmi...@gmail.com> wrote:
> >>
> >> Hi!
> >>
> >> I have a subclass of numpy and another of pandas which add a metadata
> >> attribute to them. Moreover, I have a subclass of typing.List as a
> >> Python generic with this metadata attribute as well.
> >>
> >> Now, it seems if I serialize this to plasma store and back I get
> >> standard numpy, pandas, or list back, respectively.
> >>
> >> My question is: how can I make it so that proper subclasses are
> >> returned, including the custom metadata attribute?
> >>
> >> I tried to use pyarrow_lib._default_serialization_context.register_type
> >> but it does not seem to work. Moreover, I still worry that even if I
> >> create a serialization for a custom class, if anyone makes a subclass
> >> and tries to store it plasma store they will get back the custom class
> >> and not a subclass.
> >>
> >> This is how I am testing:
> >>
> >>
> >>
> https://gitlab.com/datadrivendiscovery/metadata/blob/plasma/tests/test_plasma.py#L50
> >>
> >> And here is the code for custom numpy class and attempt at registering
> >> custom serialization:
> >>
> >>
> >>
> https://gitlab.com/datadrivendiscovery/metadata/blob/plasma/d3m_metadata/container/numpy.py#L135
> >>
> >> It looks like custom serialization is not called.
> >>
> >>
> >> Mitar
> >>
> >> --
> >> http://mitar.tnode.com/
> >> https://twitter.com/mitar_m
>
>
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
>


[jira] [Created] (ARROW-2265) Serializing subclasses of np.ndarray returns a np.ndarray.

2018-03-05 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2265:
---

 Summary: Serializing subclasses of np.ndarray returns a np.ndarray.
 Key: ARROW-2265
 URL: https://issues.apache.org/jira/browse/ARROW-2265
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara
Assignee: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [ANNOUNCE] New Arrow committers

2018-02-14 Thread Robert Nishihara
Thanks a lot Wes!
On Wed, Feb 14, 2018 at 7:28 AM Wes McKinney <wesmck...@gmail.com> wrote:

> On behalf of the Arrow PMC, I'm pleased to announce that Brian Hulette
> (@TheNeuralBit) and Robert Nishihara (@robertnishihara) are now Arrow
> committers. Thank you for all your contributions!
>
> Welcome, and congrats!
>
> - Wes
>


[jira] [Created] (ARROW-2122) Pyarrow fails to serialize dataframe with timestamp.

2018-02-08 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2122:
---

 Summary: Pyarrow fails to serialize dataframe with timestamp.
 Key: ARROW-2122
 URL: https://issues.apache.org/jira/browse/ARROW-2122
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara


The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd


s = pa.serialize({code}
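
The snippet above is truncated in the archive; a plausible shape for the reproduction implied by the title (serializing a DataFrame that carries a timestamp column) would be something like the following hypothetical sketch:

{code}
import pandas as pd
import pyarrow as pa

# Hypothetical reproduction sketch, not the original snippet.
df = pd.DataFrame({"t": [pd.Timestamp("2018-02-08 12:00:00")]})
s = pa.serialize(df)  # serialization of the timestamp column was the reported failure
{code}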



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2121) Consider special casing object arrays in pandas serializers.

2018-02-08 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2121:
---

 Summary: Consider special casing object arrays in pandas 
serializers.
 Key: ARROW-2121
 URL: https://issues.apache.org/jira/browse/ARROW-2121
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2065) Fix bug in SerializationContext.clone().

2018-01-30 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2065:
---

 Summary: Fix bug in SerializationContext.clone().
 Key: ARROW-2065
 URL: https://issues.apache.org/jira/browse/ARROW-2065
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


We currently fail to copy over one of the fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2024) Remove global SerializationContext variables.

2018-01-23 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2024:
---

 Summary: Remove global SerializationContext variables.
 Key: ARROW-2024
 URL: https://issues.apache.org/jira/browse/ARROW-2024
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


We should get rid of the global variables 
_default_serialization_context and pandas_serialization_context 
and replace them with functions default_serialization_context() and 
pandas_serialization_context().

This will also make it faster to do import pyarrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Exploring the possibility of creating a persistent cache by arrow/plasma

2018-01-21 Thread Robert Nishihara
Evicted objects are gone for good, although it would certainly be possible
to add the ability to persist them to disk.

The Plasma store does reference counting to figure out which clients are
using which objects. Clients can "release" objects through the client API
to decrement the reference count. The Plasma store also keeps track of when
a client exits/dies and automatically gets rid of the reference counts for
that client.
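
In the Python client this mostly happens implicitly. A small sketch of the behavior (assuming the get_buffers API; the release call is issued by the buffer's destructor rather than by an explicit method):

    import numpy as np
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma", "", 0)
    oid = client.put(np.arange(1000))

    [buf] = client.get_buffers([oid])  # reference count for this client goes up
    # ... the object cannot be evicted while buf is alive ...
    del buf                            # reference released; the object is evictable again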

On Sun, Jan 21, 2018 at 4:09 PM Mike Sam <mikesam...@gmail.com> wrote:

> Great, thank you very much.
>
> What happens to the evicted objects? are they
> gone for good or are they persisted locally?
>
> Also, what defines "objects that are not currently in use by any client"?
> reference counting?
>
>
>
> On Sat, Jan 20, 2018 at 1:53 PM, Robert Nishihara <
> robertnishih...@gmail.com
> > wrote:
>
> > When Plasma is started up, you specify the total amount of memory it is
> > allowed to use (in bytes) with the -m flag.
> >
> > When a Plasma client attempts to create a new object and there is not
> > enough memory in the store, the store will evict a bunch of unused
> objects
> > to free up memory (objects that are not currently in use by any client).
> > This is done in a least-recently-used fashion as defined in the eviction
> > policy
> > https://github.com/apache/arrow/blob/master/cpp/src/
> > plasma/eviction_policy.h.
> > In principle, this eviction policy could be made more configurable or a
> > different eviction policy could be plugged in, though we haven't
> > experimented with that much.
> >
> > If you want to manually delete an object from Plasma, that can be done
> with
> > the "Delete" command
> > https://github.com/apache/arrow/blob/d135974a0d3dd9a9fbbb10da4c5dbc
> > 65f9324234/cpp/src/plasma/client.h#L186,
> > which is part of the C++ Plasma client API but has not been exposed
> through
> > Python yet.
> >
> > For now, if you want to make sure that an object will not be evicted
> (e.g.,
> > from the C++ Client API), you can call Get on the object ID and then it
> > will not be evicted before you call Release from the same client.
> >
> > On Fri, Jan 19, 2018 at 5:17 PM Mike Sam <mikesam...@gmail.com> wrote:
> >
> > > Thank you, Robert, for your answer.
> > >
> > > Could you kindly further elaborate on number 1 as I am not
> > > familiar with Plasma codebase yet?
> > > Are you saying persistence is available out of the box? else what
> > > specific things need to be added
> > > to Plasma codebase to make this happen?
> > >
> > > Thank you,
> > > Mike
> > >
> > >
> > >
> > > On Thu, Jan 18, 2018 at 11:43 PM, Robert Nishihara <
> > > robertnishih...@gmail.com> wrote:
> > >
> > > > Hi Mike,
> > > >
> > > > 1. I think yes, though we'd need to turn off the automatic LRU
> eviction
> > > > that happens when the store fills up.
> > > >
> > > > 3. I think there are some edge cases and it depends what is in your
> > > > DataFrame, but at least if it consists of numerical data then the two
> > > > representations should use the same underlying data in shared memory.
> > > >
> > > > On Thu, Jan 18, 2018 at 11:37 PM Mike Sam <mikesam...@gmail.com>
> > wrote:
> > > >
> > > > > I am interested to implement an arrow based persisted cache store
> > and I
> > > > > have a few related questions:
> > > > >
> > > > >1.
> > > > >
> > > > >Is it possible just to use Plasma for this goal?
> > > > >(My understanding is that it is not persistable)
> > > > >Else, what is the recommended way to do so?
> > > > >2.
> > > > >
> > > > >Is feather the better file format for persistence to avoid
> > > > >re-transcoding hot chunks?
> > > > >3.
> > > > >
> > > > >When Pandas load data from plasma/arrow, is it doubling the
> memory
> > > > >usage? (One for the arrow representation, one for pandas
> > > > representation)
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Mike
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Mike
> > >
> >
>
>
>
> --
> Thanks,
> Mike
>


Re: Exploring the possibility of creating a persistent cache by arrow/plasma

2018-01-20 Thread Robert Nishihara
When Plasma is started up, you specify the total amount of memory it is
allowed to use (in bytes) with the -m flag.

When a Plasma client attempts to create a new object and there is not
enough memory in the store, the store will evict a bunch of unused objects
to free up memory (objects that are not currently in use by any client).
This is done in a least-recently-used fashion as defined in the eviction
policy
https://github.com/apache/arrow/blob/master/cpp/src/plasma/eviction_policy.h.
In principle, this eviction policy could be made more configurable or a
different eviction policy could be plugged in, though we haven't
experimented with that much.

If you want to manually delete an object from Plasma, that can be done with
the "Delete" command
https://github.com/apache/arrow/blob/d135974a0d3dd9a9fbbb10da4c5dbc65f9324234/cpp/src/plasma/client.h#L186,
which is part of the C++ Plasma client API but has not been exposed through
Python yet.

For now, if you want to make sure that an object will not be evicted (e.g.,
from the C++ Client API), you can call Get on the object ID and then it
will not be evicted before you call Release from the same client.
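
As a rough illustration of the LRU behavior (assuming a store started with a deliberately small capacity, e.g. plasma_store -s /tmp/plasma -m 50000000):

    import numpy as np
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma", "", 0)

    ids = [client.put(np.zeros(1000000)) for _ in range(100)]  # ~8 MB per object

    # The oldest, unused objects have been evicted to make room for the newer ones.
    print([client.contains(oid) for oid in ids[:5]])   # mostly False
    print(client.contains(ids[-1]))                    # True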

On Fri, Jan 19, 2018 at 5:17 PM Mike Sam <mikesam...@gmail.com> wrote:

> Thank you, Robert, for your answer.
>
> Could you kindly further elaborate on number 1 as I am not
> familiar with Plasma codebase yet?
> Are you saying persistence is available out of the box? else what
> specific things need to be added
> to Plasma codebase to make this happen?
>
> Thank you,
> Mike
>
>
>
> On Thu, Jan 18, 2018 at 11:43 PM, Robert Nishihara <
> robertnishih...@gmail.com> wrote:
>
> > Hi Mike,
> >
> > 1. I think yes, though we'd need to turn off the automatic LRU eviction
> > that happens when the store fills up.
> >
> > 3. I think there are some edge cases and it depends what is in your
> > DataFrame, but at least if it consists of numerical data then the two
> > representations should use the same underlying data in shared memory.
> >
> > On Thu, Jan 18, 2018 at 11:37 PM Mike Sam <mikesam...@gmail.com> wrote:
> >
> > > I am interested to implement an arrow based persisted cache store and I
> > > have a few related questions:
> > >
> > >1.
> > >
> > >Is it possible just to use Plasma for this goal?
> > >(My understanding is that it is not persistable)
> > >Else, what is the recommended way to do so?
> > >2.
> > >
> > >Is feather the better file format for persistence to avoid
> > >re-transcoding hot chunks?
> > >3.
> > >
> > >When Pandas load data from plasma/arrow, is it doubling the memory
> > >usage? (One for the arrow representation, one for pandas
> > representation)
> > >
> > > --
> > > Thanks,
> > > Mike
> > >
> >
>
>
>
> --
> Thanks,
> Mike
>


Re: Exploring the possibility of creating a persistent cache by arrow/plasma

2018-01-18 Thread Robert Nishihara
Hi Mike,

1. I think yes, though we'd need to turn off the automatic LRU eviction
that happens when the store fills up.

3. I think there are some edge cases and it depends what is in your
DataFrame, but at least if it consists of numerical data then the two
representations should use the same underlying data in shared memory.

On Thu, Jan 18, 2018 at 11:37 PM Mike Sam  wrote:

> I am interested to implement an arrow based persisted cache store and I
> have a few related questions:
>
>1.
>
>Is it possible just to use Plasma for this goal?
>(My understanding is that it is not persistable)
>Else, what is the recommended way to do so?
>2.
>
>Is feather the better file format for persistence to avoid
>re-transcoding hot chunks?
>3.
>
>When Pandas load data from plasma/arrow, is it doubling the memory
>usage? (One for the arrow representation, one for pandas representation)
>
> --
> Thanks,
> Mike
>


[jira] [Created] (ARROW-2011) Allow setting the pickler to use in pyarrow serialization.

2018-01-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2011:
---

 Summary: Allow setting the pickler to use in pyarrow serialization.
 Key: ARROW-2011
 URL: https://issues.apache.org/jira/browse/ARROW-2011
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara
Assignee: Robert Nishihara


We currently try to import cloudpickle and failing that fall back to pickle. 
However, given that there are many versions of cloudpickle and they are 
typically incompatible with one another, the caller may want to specify a 
specific version, so we should allow them to set the specific pickler to use.
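
A sketch of the import-fallback pattern described above, together with the kind of hook this issue proposes ({{set_pickler}} is hypothetical here, not an existing pyarrow API):

{code}
try:
    import cloudpickle as _default_pickler
except ImportError:
    import pickle as _default_pickler

_pickler = _default_pickler

def set_pickler(pickler_module):
    # Let the caller pin a specific pickler, e.g. a particular cloudpickle version.
    global _pickler
    _pickler = pickler_module
{code}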



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Arrow policy on rewriting git history?

2018-01-17 Thread Robert Nishihara
Got it (I remember that discussion actually). The status quo is OK for us;
longer term we'll switch to using releases.

On Wed, Jan 17, 2018 at 7:50 AM Wes McKinney <wesmck...@gmail.com> wrote:

> We have been rebasing master after releases so that the release tag
> (and commits for the changelog, Java package metadata, etc.) appears
> in master. This only affects PRs merged while the release vote is
> open, but it's understandably not ideal.
>
> There was a prior mailing list thread where we discussed this. The
> alternative is to not merge PRs while a release vote is open, but this
> has the effect of artificially slowing down the development cadence.
>
> I would suggest we do a 0.8.1 bug fix release sometime in the next 2
> weeks with the goal of helping Ray get onto a tagged release, and
> establish some process to help us validate master before cutting a
> release candidates to avoid having to cancel a release vote. We also
> need to be able validate the Spark integration more easily (this is
> ongoing in https://github.com/apache/arrow/pull/1319 -- Bryan do you
> have time to work on this?)
>
> thanks
> Wes
>
> On Wed, Jan 17, 2018 at 12:39 AM, Robert Nishihara
> <robertnishih...@gmail.com> wrote:
> > I've noticed that specific commits sometimes disappear from the master
> > branch. Is this an inevitable consequence of the way Arrow does releases?
> > Or would it be possible to avoid removing commits from the master branch?
> >
> > Of course once we start using Arrow releases this won't be an issue. At
> the
> > moment we check out specific Arrow commits, and so there are a number of
> > commits in our history that no longer build because the corresponding
> > commits in Arrow have disappeared.
>


Arrow policy on rewriting git history?

2018-01-16 Thread Robert Nishihara
I've noticed that specific commits sometimes disappear from the master
branch. Is this an inevitable consequence of the way Arrow does releases?
Or would it be possible to avoid removing commits from the master branch?

Of course once we start using Arrow releases this won't be an issue. At the
moment we check out specific Arrow commits, and so there are a number of
commits in our history that no longer build because the corresponding
commits in Arrow have disappeared.


[jira] [Created] (ARROW-2000) Deduplicate file descriptors when plasma store replies to get request.

2018-01-15 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2000:
---

 Summary: Deduplicate file descriptors when plasma store replies to 
get request.
 Key: ARROW-2000
 URL: https://issues.apache.org/jira/browse/ARROW-2000
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Right now when the plasma store replies to a GetRequest from a client, it sends 
many file descriptors over the relevant socket (by calling {{send_fd}}). 
However, many of these file descriptors are redundant and so we should 
deduplicate them before sending.

 

Note that I often see the error "Failed to send file descriptor, retrying." 
printed when getting around 100 objects from the store. This may alleviate that.
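
The deduplication itself is straightforward; an illustrative sketch of the idea (the real change lives in the C++ store around the {{send_fd}} calls):

{code}
def deduplicate_fds(fds):
    # Send each distinct file descriptor only once, preserving first-seen order.
    seen = set()
    unique = []
    for fd in fds:
        if fd not in seen:
            seen.add(fd)
            unique.append(fd)
    return unique

assert deduplicate_fds([5, 5, 7, 5, 9, 7]) == [5, 7, 9]
{code}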



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-1972) Deserialization of buffer objects (and pandas dataframes) segfaults on different processes.

2018-01-06 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1972:
---

 Summary: Deserialization of buffer objects (and pandas dataframes) 
segfaults on different processes.
 Key: ARROW-1972
 URL: https://issues.apache.org/jira/browse/ARROW-1972
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara


To see the issue, first serialize a pyarrow buffer.

{code}
import pyarrow as pa

serialized = pa.serialize(pa.frombuffer(b'hello')).to_buffer().to_pybytes()

print(serialized)  # 
b'\x00\x00\x00\x00\x01\x00\x00\x00\xcc\x00\x00\x00\x10\x00\x00\x00\x0c\x00\x0e\x00\x06\x00\x05\x00\x08\x00\x00\x00\x0c\x00\x00\x00\x00\x01\x03\x00\x10\x00\x00\x00\x00\x00\n\x00\x08\x00\x00\x00\x04\x00\x00\x00\n\x00\x00\x00\x04\x00\x00\x00\x01\x00\x00\x00\x04\x00\x00\x00\xc6\xff\xff\xff\x00\x00\x01\x0e|\x00\x00\x00\x18\x00\x00\x00\x04\x00\x00\x00\x01\x00\x00\x004\x00\x00\x00\x08\x00\x0c\x00\x06\x00\x08\x00\x08\x00\x00\x00\x00\x00\x01\x00\x04\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12\x00\x14\x00\x08\x00\x06\x00\x07\x00\x0c\x00\x00\x00\x10\x00\x00\x00\x12\x00\x00\x00\x00\x00\x01\x02$\x00\x00\x00\x14\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x08\x00\x0c\x00\x08\x00\x07\x00\x08\x00\x00\x00\x00\x00\x00\x01
 
\x00\x00\x00\x06\x00\x00\x00buffer\x00\x00\x04\x00\x00\x00list\x00\x00\x00\x00\x00\x00\x00\x00\xcc\x00\x00\x00\x14\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x16\x00\x06\x00\x05\x00\x08\x00\x0c\x00\x0c\x00\x00\x00\x00\x03\x03\x00\x18\x00\x00\x00h\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x18\x00\x0c\x00\x04\x00\x08\x00\n\x00\x00\x00l\x00\x00\x00\x10\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\x00\x00\x00\x00 
\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00hello'
{code}

Deserializing it within the same process succeeds, however deserializing it in 
a **separate process** causes a segfault. E.g.,

{code}
import pyarrow as pa

pa.deserialize(b'\x00\x00\x00\x00\x01...')  # This segfaults
{code}

The backtrace is

{code}
(lldb) bt
* thread #1, queue = ‘com.apple.main-thread’, stop reason = EXC_BAD_ACCESS 
(code=1, address=0x0)
  * frame #0: 0x
frame #1: 0x000105605534 
libarrow_python.0.dylib`arrow::py::wrap_buffer(buffer=std::__1::shared_ptr::element_type
 @ 0x00010060c348 strong=1 weak=1) at pyarrow.cc:48
frame #2: 0x00010554fdee 
libarrow_python.0.dylib`arrow::py::GetValue(context=0x000108f17818, 
parent=0x000100645438, arr=0x000100622938, index=0, type=0, 
base=0x000108f0e528, blobs=0x000108f09588, result=0x7fff5fbfd218) 
at arrow_to_python.cc:173
frame #3: 0x00010554d93a 
libarrow_python.0.dylib`arrow::py::DeserializeList(context=0x000108f17818, 
array=0x000100645438, start_idx=0, stop_idx=2, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfd470) at arrow_to_python.cc:208
frame #4: 0x00010554d302 
libarrow_python.0.dylib`arrow::py::DeserializeDict(context=0x000108f17818, 
array=0x000100645338, start_idx=0, stop_idx=2, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfddd8) at arrow_to_python.cc:74
frame #5: 0x00010554f249 
libarrow_python.0.dylib`arrow::py::GetValue(context=0x000108f17818, 
parent=0x0001006377a8, arr=0x000100645298, index=0, type=0, 
base=0x000108f0e528, blobs=0x000108f09588, result=0x7fff5fbfddd8) 
at arrow_to_python.cc:158
frame #6: 0x00010554d93a 
libarrow_python.0.dylib`arrow::py::DeserializeList(context=0x000108f17818, 
array=0x0001006377a8, start_idx=0, stop_idx=1, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfdfe8) at arrow_to_python.cc:208
frame #7: 0x000105551fbf 
libarrow_python.0.dylib`arrow::py::DeserializeObject(context=0x000108f17818,
 obj=0x000108f09588, base=0x000108f0e528, out=0x7fff5fbfdfe8) at 
arrow_to_python.cc:287
frame #8: 0x000104abecae 
lib.cpython-36m-darwin.so`__pyx_pf_7pyarrow_3lib_18SerializedPyObject_2deserialize(__pyx_v_self

Re: What factors influence pyarrow.__version__?

2017-12-27 Thread Robert Nishihara
Makes sense, thanks!

On Wed, Dec 27, 2017 at 10:39 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Robert,
>
> the version number is determined with
> https://github.com/pypa/setuptools_scm. Somewhere between 0.7.1 and 0.8,
> we accidentally also used the JS tags for determining the next version,
> thus the 0.3.1 version. This should have been fixed by
> https://github.com/apache/arrow/commit/d64947e8c650687856bd221ea4ff15c86db0ebc1.
> In a post-0.8 master you should actually get 0.8.1.devX+gXXX as the
> version.
>
> Uwe
>
> On Wed, Dec 27, 2017, at 4:50 AM, Robert Nishihara wrote:
> > If you have more insight into how pyarrow.__version__ gets computed,
> please
> > let me know!
> >
> > When recompiling pyarrow multiple times at the same commit, sometimes I
> see
> > different values for pyarrow.__version__ and sometimes it is None. And
> the
> > versions often seem way off. For example, we are currently around 0.8 or
> > 0.9, but the version I just got from compiling was 0.3.1.dev51+gb599b9e.
> >
> > I'd expect it to only depend on the current git hash, but it seems to
> > depend on other factors that I can't quite pin down.
> >
> > I think the relevant lines are
> >
> https://github.com/apache/arrow/blob/8986521255f48a2aa775921eac0175b4e7afaa16/python/setup.py#L402-L410
> > .
> >
> > Those lines seem to run some variant of the command
> >
> > git describe --dirty --tags --long --match *.* --match
> > 'apache-arrow-[0-9]*'
> >
> > Though not quite in that form because when I run that locally I see
> >
> > fatal: --dirty is incompatible with commit-ishes
> >
> > Thanks for your help!
>


What factors influence pyarrow.__version__?

2017-12-26 Thread Robert Nishihara
If you have more insight into how pyarrow.__version__ gets computed, please
let me know!

When recompiling pyarrow multiple times at the same commit, sometimes I see
different values for pyarrow.__version__ and sometimes it is None. And the
versions often seem way off. For example, we are currently around 0.8 or
0.9, but the version I just got from compiling was 0.3.1.dev51+gb599b9e.

I'd expect it to only depend on the current git hash, but it seems to
depend on other factors that I can't quite pin down.

I think the relevant lines are
https://github.com/apache/arrow/blob/8986521255f48a2aa775921eac0175b4e7afaa16/python/setup.py#L402-L410
.

Those lines seem to run some variant of the command

git describe --dirty --tags --long --match *.* --match
'apache-arrow-[0-9]*'

Though not quite in that form because when I run that locally I see

fatal: --dirty is incompatible with commit-ishes

Thanks for your help!


[jira] [Created] (ARROW-1951) Add memcopy_threads to serialization context

2017-12-26 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1951:
---

 Summary: Add memcopy_threads to serialization context
 Key: ARROW-1951
 URL: https://issues.apache.org/jira/browse/ARROW-1951
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Robert Nishihara
Assignee: Robert Nishihara
Priority: Minor


Right now, when calling {{put}} with a plasma client, we set 
{{memcopy_threads}} to 4. We should expose this so it can be changed through 
Python.
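
A sketch of the kind of API being requested (treat the keyword argument as proposed here, not as an existing signature):

{code}
import numpy as np
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma", "", 0)

# Hypothetical: let the caller tune the number of memcpy threads per put.
object_id = client.put(np.zeros(10 ** 7), memcopy_threads=8)
{code}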



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1829) [Plasma] Clean up eviction policy bookkeeping

2017-11-16 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1829:
---

 Summary: [Plasma] Clean up eviction policy bookkeeping
 Key: ARROW-1829
 URL: https://issues.apache.org/jira/browse/ARROW-1829
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Currently, the eviction policy has a field {{memory_used_}} which keeps track 
of how much memory the store is currently using. However, this field is only 
updated when {{require_space}} is called, and it should be updated every time 
an object is created.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Modeling N-Dim Arrays in Arrow

2017-11-16 Thread Robert Nishihara
Great!

On Thu, Nov 16, 2017 at 3:04 PM Lewis John McGibbney <lewi...@apache.org>
wrote:

> Fantastic Robert, thank you for the pointers.
> The documentation and graphics on ray github pages is very helpful.
> Lewis
>
> On 2017-11-16 11:20, Robert Nishihara <robertnishih...@gmail.com> wrote:
> > Yes definitely! You can do this through high level Python APIs, e.g.,
> > something like
> >
> https://github.com/apache/arrow/blob/ca3acdc138b1ac27c9111b236d33593988689a20/python/pyarrow/tests/test_serialization.py#L214-L216
> > .
> >
> > You can also share the numpy arrays using shared memory, e.g.,
> >
> https://issues.apache.org/jira/browse/ARROW-1792?focusedCommentId=16252940=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16252940
> >
> > You can also do this through C++.
> >
> > Some benchmarks at
> >
> https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html
> > .
> > On Thu, Nov 16, 2017 at 10:49 AM Lewis John McGibbney <
> lewi...@apache.org>
> > wrote:
> >
> > > Hi Folks,
> > >
> > > Array-oriented scientific data (such as satellite remote sensing data)
> is
> > > commonly encoded using NetCDF [0] and HDF [1] data formats as these
> formats
> > > have been designed and developed to offer amongst other things,
> some/all of
> > > the following features
> > >  * Self-Describing. A netCDF file includes information about the data
> it
> > > contains.
> > > * Portable. A netCDF file can be accessed by computers with different
> ways
> > > of storing integers, characters, and floating-point numbers.
> > >  * Scalable. A small subset of a large dataset may be accessed
> efficiently.
> > >  * Appendable. Data may be appended to a properly structured netCDF
> file
> > > without copying the dataset or redefining its structure.
> > >  * Sharable. One writer and multiple readers may simultaneously access
> the
> > > same netCDF file.
> > >  * Archivable. Access to all earlier forms of netCDF data will be
> > > supported by current and future versions of the software.
> > >
> > > I am currently toying with the idea of exploring and hopefully
> > > benchmarking use of storage-class memory hardware combined with Arrow
> as a
> > > mechanism for improving both fast and flexible data access and possibly
> > > analysis.
> > >
> > > Very first question, has anyone attempted to/are currently using Arrow
> to
> > > store N-Dim array-based data?
> > >
> > > Thanks in advance,
> > > Lewis
> > >
> > > [0] http://www.unidata.ucar.edu/software/netcdf/
> > > [1] https://www.hdfgroup.org/solutions/hdf5/
> > >
> >
>


Re: Modeling N-Dim Arrays in Arrow

2017-11-16 Thread Robert Nishihara
Yes definitely! You can do this through high level Python APIs, e.g.,
something like
https://github.com/apache/arrow/blob/ca3acdc138b1ac27c9111b236d33593988689a20/python/pyarrow/tests/test_serialization.py#L214-L216
.

You can also share the numpy arrays using shared memory, e.g.,
https://issues.apache.org/jira/browse/ARROW-1792?focusedCommentId=16252940=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16252940

You can also do this through C++.

Some benchmarks at
https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html
.
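
A minimal sketch of the first route (round-tripping an n-dimensional numpy array
through pa.serialize / pa.deserialize):

    import numpy as np
    import pyarrow as pa

    arr = np.random.rand(4, 5, 6)

    buf = pa.serialize(arr).to_buffer()
    restored = pa.deserialize(buf)

    assert restored.shape == (4, 5, 6)
    assert np.array_equal(arr, restored)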
On Thu, Nov 16, 2017 at 10:49 AM Lewis John McGibbney 
wrote:

> Hi Folks,
>
> Array-oriented scientific data (such as satellite remote sensing data) is
> commonly encoded using NetCDF [0] and HDF [1] data formats as these formats
> have been designed and developed to offer amongst other things, some/all of
> the following features
>  * Self-Describing. A netCDF file includes information about the data it
> contains.
> * Portable. A netCDF file can be accessed by computers with different ways
> of storing integers, characters, and floating-point numbers.
>  * Scalable. A small subset of a large dataset may be accessed efficiently.
>  * Appendable. Data may be appended to a properly structured netCDF file
> without copying the dataset or redefining its structure.
>  * Sharable. One writer and multiple readers may simultaneously access the
> same netCDF file.
>  * Archivable. Access to all earlier forms of netCDF data will be
> supported by current and future versions of the software.
>
> I am currently toying with the idea of exploring and hopefully
> benchmarking use of storage-class memory hardware combined with Arrow as a
> mechanism for improving both fast and flexible data access and possibly
> analysis.
>
> Very first question, has anyone attempted to/are currently using Arrow to
> store N-Dim array-based data?
>
> Thanks in advance,
> Lewis
>
> [0] http://www.unidata.ucar.edu/software/netcdf/
> [1] https://www.hdfgroup.org/solutions/hdf5/
>


[jira] [Created] (ARROW-1745) Compilation failure on Mac OS in plasma tests

2017-10-28 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1745:
---

 Summary: Compilation failure on Mac OS in plasma tests
 Key: ARROW-1745
 URL: https://issues.apache.org/jira/browse/ARROW-1745
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1700) Implement Node.js client for Plasma store

2017-10-20 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1700:
---

 Summary: Implement Node.js client for Plasma store
 Key: ARROW-1700
 URL: https://issues.apache.org/jira/browse/ARROW-1700
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript, Plasma (C++)
Reporter: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1653) [Plasma] Use static cast to avoid compiler warning.

2017-10-05 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1653:
---

 Summary: [Plasma] Use static cast to avoid compiler warning.
 Key: ARROW-1653
 URL: https://issues.apache.org/jira/browse/ARROW-1653
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


See https://github.com/apache/arrow/pull/1172#discussion_r142931449.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1647) [Plasma] Potential bug when reading/writing messages.

2017-10-04 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1647:
---

 Summary: [Plasma] Potential bug when reading/writing messages.
 Key: ARROW-1647
 URL: https://issues.apache.org/jira/browse/ARROW-1647
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara


When we write the "length" field, it is an {{int64_t}}. See 
https://github.com/apache/arrow/blob/dc129d60fbffbf3a5b71b1f7987f7dab948b3d61/cpp/src/plasma/io.cc#L63.

However, when we read the "length" field, it is a {{size_t}}. See 
https://github.com/apache/arrow/blob/dc129d60fbffbf3a5b71b1f7987f7dab948b3d61/cpp/src/plasma/io.cc#L98.

I'm not sure if this is a bug, but it looks like it might be. And I suspect 
there is an issue somewhere in this area because a couple Ray users on 
non-standard platforms have reported issues with a check failure at 
https://github.com/apache/arrow/blob/dc129d60fbffbf3a5b71b1f7987f7dab948b3d61/cpp/src/plasma/io.cc#L94.

See https://github.com/ray-project/ray/issues/1008.
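
As a toy illustration of why the mismatch can matter (this is not the Plasma code, just the general hazard): a length written as a 64-bit integer but read back through a narrower size_t desynchronizes the stream on platforms where size_t is 4 bytes.

{code}
import struct

length = 300
wire = struct.pack("<q", length)        # writer: 8 bytes on the wire, as int64_t

ok = struct.unpack("<q", wire)[0]       # 64-bit reader: consumes all 8 bytes -> 300
bad = struct.unpack("<i", wire[:4])[0]  # 32-bit "size_t" reader: also 300 here, but
                                        # 4 unread bytes are left behind in the stream
print(ok, bad, len(wire) - 4)
{code}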



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1628) Incorrect pyarrow serialization of numpy datetimes.

2017-09-30 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1628:
---

 Summary: Incorrect pyarrow serialization of numpy datetimes.
 Key: ARROW-1628
 URL: https://issues.apache.org/jira/browse/ARROW-1628
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara


See https://github.com/ray-project/ray/issues/1041.

The issue can be reproduced as follows.

{code}
import datetime

import pyarrow as pa
import numpy as np

t = np.datetime64(datetime.datetime.now())

print(type(t), t)  # <class 'numpy.datetime64'> 2017-09-30T09:50:46.089952

t_new = pa.deserialize(pa.serialize(t).to_buffer())

print(type(t_new), t_new)  #  0
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1410) Plasma object store occasionally pauses for a long time

2017-08-24 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1410:
---

 Summary: Plasma object store occasionally pauses for a long time
 Key: ARROW-1410
 URL: https://issues.apache.org/jira/browse/ARROW-1410
 Project: Apache Arrow
  Issue Type: Improvement
 Environment: Ubuntu 16.04
Reporter: Robert Nishihara


The problem can be reproduced as follows. First start a plasma store with

{code}
plasma_store -s /tmp/s1 -m 5000
{code}

Then continuously put in objects using a script like the following.

{code}
import pyarrow.plasma as plasma
import numpy as np

client = plasma.connect('/tmp/s1', '', 0)

for i in range(2):
print(i)
object_id = plasma.ObjectID(np.random.bytes(20))
client.create(object_id, np.random.randint(0, 1))
client.seal(object_id)
{code}

As the loop counters are being printed, you will see long pauses. The problem 
is the fact that we are mmapping pages with the MAP_POPULATE flag. Though this 
can be used to improve performance of subsequent object creations, it isn't 
worth the long pauses. We may want to find a way to populate the pages in the 
background.
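
The pause can be reproduced in isolation with a plain mmap; a rough sketch (Linux only, and mmap.MAP_POPULATE requires Python 3.8+, so treat this as illustrative rather than what the store itself does):

{code}
import mmap
import os
import time

size = 500 * 1024 * 1024
fd = os.open("/tmp/populate_demo", os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, size)

start = time.time()
# MAP_POPULATE pre-faults every page up front, which is where the pause comes from.
m = mmap.mmap(fd, size, flags=mmap.MAP_SHARED | mmap.MAP_POPULATE)
print("mmap with MAP_POPULATE:", time.time() - start, "seconds")

m.close()
os.close(fd)
{code}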



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1382) Python objects containing multiple copies of the same object are serialized incorrectly

2017-08-19 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1382:
---

 Summary: Python objects containing multiple copies of the same 
object are serialized incorrectly
 Key: ARROW-1382
 URL: https://issues.apache.org/jira/browse/ARROW-1382
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara


If a Python object appears multiple times within a list/tuple/dictionary, then 
when pyarrow serializes the object, it will duplicate the object many times. 
This leads to a potentially huge expansion in the size of the object (e.g., the 
serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100 times bigger 
than it needs to be).

{code}
import pyarrow as pa

l = [0]
original_object = [l, l]

# Serialize and deserialize the object.
buf = pa.serialize(original_object).to_buffer()
new_object = pa.deserialize(buf)

# This works.
assert original_object[0] is original_object[1]

# This fails.
assert new_object[0] is new_object[1]
{code}

One potential way to address this is to use the Arrow dictionary encoding.
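
The general fix is identity memoization (the same trick pickle uses): remember objects already serialized and emit a back-reference instead of a second copy. A small sketch of the idea, independent of how Arrow would actually encode it:

{code}
def serialize_with_memo(obj, memo=None):
    # Toy serializer: lists become nested tuples, repeated objects become ("ref", index).
    if memo is None:
        memo = {}
    if id(obj) in memo:
        return ("ref", memo[id(obj)])
    index = memo[id(obj)] = len(memo)
    if isinstance(obj, list):
        return ("list", index, [serialize_with_memo(x, memo) for x in obj])
    return ("value", index, obj)

l = [0]
print(serialize_with_memo([l, l]))
# ('list', 0, [('list', 1, [('value', 2, 0)]), ('ref', 1)])
{code}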



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1368) libarrow.a is not linked against boost libraries when compiled with -DARROW_BOOST_USE_SHARED=off

2017-08-16 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1368:
---

 Summary: libarrow.a is not linked against boost libraries when 
compiled with -DARROW_BOOST_USE_SHARED=off
 Key: ARROW-1368
 URL: https://issues.apache.org/jira/browse/ARROW-1368
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
 Environment: Ubuntu 16.04
Reporter: Robert Nishihara


When I build arrow with {{-DARROW_BOOST_USE_SHARED=off}} and then inspect 
{{libarrow.a}} with {{nm -g libarrow.a}}, some boost symbols are undefined.

The problem can be reproduced on Ubuntu 16.04 as follows.

First compile boost with -fPIC.

{code}
cd ~
wget --no-check-certificate 
http://downloads.sourceforge.net/project/boost/boost/1.60.0/boost_1_60_0.tar.gz 
-O ~/boost_1_60_0.tar.gz
tar xf boost_1_60_0.tar.gz
cd boost_1_60_0/
./bootstrap.sh
./bjam cxxflags=-fPIC cflags=-fPIC --prefix=../boost --with-filesystem 
--with-date_time --with-system --with-regex install
cd ..
{code}

Then compile Arrow.

{code}
cd ~
git clone https://github.com/apache/arrow
mkdir -p ~/arrow/cpp/build
cd ~/arrow/cpp/build
BOOST_ROOT=~/boost \
cmake -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_FLAGS="-g -O3" \
  -DCMAKE_CXX_FLAGS="-g -O3" \
  -DARROW_BUILD_TESTS=off \
  -DARROW_HDFS=on \
  -DARROW_BOOST_USE_SHARED=off \
  -DARROW_PYTHON=on \
  -DARROW_PLASMA=on \
  -DPLASMA_PYTHON=on \
  -DARROW_JEMALLOC=off \
  -DARROW_WITH_BROTLI=off \
  -DARROW_WITH_LZ4=off \
  -DARROW_WITH_SNAPPY=off \
  -DARROW_WITH_ZLIB=off \
  -DARROW_WITH_ZSTD=off \
  ..

make VERBOSE=1 -j8
{code}

The cmake command finds the recently compiled boost and prints the following.

{code}
-- [ /usr/share/cmake-3.5/Modules/FindBoost.cmake:1516 ] Boost_FOUND = 1
-- Boost version: 1.60.0
-- Found the following Boost libraries:
--   system
--   filesystem
-- Boost include dir: /home/ubuntu/boost/include
-- Boost libraries: 
/home/ubuntu/boost/lib/libboost_system.a/home/ubuntu/boost/lib/libboost_filesystem.a
Added static library dependency boost_system: 
/home/ubuntu/boost/lib/libboost_system.a
Added static library dependency boost_filesystem: 
/home/ubuntu/boost/lib/libboost_filesystem.a
{code}

Compilation does not appear to link {{libarrow.a}} against the boost libraries 
(though {{libarrow.so}} is handled properly).

For {{libarrow.a}} (not linked against boost)

{code}
/usr/bin/ar qc release/libarrow.a  
CMakeFiles/arrow_objlib.dir/src/arrow/array.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/buffer.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/builder.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/compare.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/memory_pool.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/pretty_print.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/status.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/table.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/tensor.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/type.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/visitor.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/io/file.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/io/interfaces.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/io/memory.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/util/bit-util.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/util/compression.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/util/cpu-info.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/util/decimal.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/util/key_value_metadata.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/io/hdfs.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/io/hdfs-internal.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/ipc/feather.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/ipc/json.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/ipc/json-internal.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/ipc/metadata.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/ipc/reader.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/ipc/writer.cc.o
{code}

For {{libarrow.so}} (linked against boost)

{code}
/usr/bin/c++  -fPIC -g -O3 -O3 -DNDEBUG -Wall -std=c++11 -msse3  -O3 -DNDEBUG 
-Wl,--version-script=/home/ubuntu/arrow/cpp/src/arrow/symbols.map -shared 
-Wl,-soname,libarrow.so.0 -o release/libarrow.so.0.0.0 
CMakeFiles/arrow_objlib.dir/src/arrow/array.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/buffer.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/builder.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/compare.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/memory_pool.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/pretty_print.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/status.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/table.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/tensor.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/type.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/visitor.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/io/file.cc.o 
CMakeFiles/arrow_objlib.dir/src/arrow/io/interfaces.cc.o 
CMakeFiles/

Re: [DISCUSS] Apache Arrow and the GPU Open Analytics Initiative

2017-08-16 Thread Robert Nishihara
That makes a lot of sense. In some contexts it could make sense to run
multiple Plasma stores per machine (possibly for different devices or
different NUMA zones), though that could make it slightly harder to take
advantage of faster GPU-to-GPU communication.

On Wed, Aug 16, 2017 at 2:01 PM Philipp Moritz  wrote:

> One observation here is that as far as I know shared memory is not
> typically used between multiple gpus and on a single gpu there is already a
> unified shared address space that each cuda thread can access.
>
> One reasonable extension of the APIs and facilities given these limitations
> would be the following:
>
> 1.) Extend plasma::Create to take an optional flag (CPU/HOST/SHARED, GPU0,
> GPU1, etc.) which allocates the object on the desired device (host shared
> memory,  gpu 0, gpu 1, etc.)
>
> 2.) Extend plasma::Get to take the same flag and will transparently copy
> the data to the desired device as neccessary and return a pointer that is
> valid on the specified device.
>
> 3.) Extend the status and notification APIs to account for these changes
> and also the object lifetime tracking.
>
> I wonder if people would find that useful, let me know about your thoughts!
> Ideally we would also have some integration into say TensorFlow or other
> deep learning frameworks that can make use of these capabilities (the way
> we typically use gpus in Ray at the moment is mostly through TensorFlow by
> feeding data through placeholders, which has some performance bottlenecks
> but so far we mostly managed to work around them).
>
>
>
> On Wed, Aug 16, 2017 at 1:01 PM, Wes McKinney  wrote:
>
> > One idea is whether the Plasma object store could be extended to
> > support devices other than POSIX shared memory, like GPU device memory
> > (or multiple GPUs on a single host).
> >
> > Philipp or Robert or any of the people who know the Plasma code best,
> > any idea how this might be approached? It would have to be developed
> > as an optional extension so that users without e.g. a CUDA
> > installation don't have to bother with nvcc (which is proprietary) or
> > the CUDA runtime libraries.
> >
> > - Wes
> >
> > On Mon, Aug 7, 2017 at 2:15 PM, Wes McKinney 
> wrote:
> > > hi all,
> > >
> > > A group of companies have created a project called the GPU Open
> > > Analytics Initiative (GOAI), with the purpose of creating open source
> > > software and specifications for analytics on GPU.
> > >
> > > So far, they have focused on building a "GPU Data Frame", which is
> > > effectively putting Arrow data on the GPU:
> > >
> > > https://github.com/gpuopenanalytics/libgdf/wiki/Technical-Overview
> > > http://gpuopenanalytics.com/
> > >
> > > Shared memory IPC and analytics on Arrow data beyond the CPU are
> > > definitely in scope for the Arrow project, so we should look for ways
> > > to collaborate and help each other. I am sure this will not be the
> > > last time that someone needs to use Arrow memory with GPUs, so it
> > > would be useful for the community to develop memory management and
> > > utility code to assist with using Arrow in a mixed-device setting.
> > >
> > > I am not sure how to best proceed but wanted to make everyone aware of
> > > GOAI and look for opportunities to grow the Arrow community.
> > >
> > > Thanks,
> > > Wes
> >
>


Re: Arrow Plasma Object Store - IP clearance

2017-08-07 Thread Robert Nishihara
Thanks! This is great!

On Mon, Aug 7, 2017 at 11:30 AM Wes McKinney  wrote:

> Thanks to the Plasma developers for their code contribution and
> efforts integrating it with the Arrow codebase! It's a powerful and
> useful tool that will help the project grow.
>
> - Wes
>
> On Mon, Aug 7, 2017 at 2:24 PM, Philipp Moritz  wrote:
> > Great to hear! Thanks a lot to everybody involved with this for their
> help.
> >
> > On Mon, Aug 7, 2017 at 11:19 AM, Julian Hyde  wrote:
> >
> >> The vote for IP clearance of the Plasma Object Store on the Incubator
> list
> >> has passed[1].
> >>
> >> We can now proceed with a release.
> >>
> >> Julian
> >>
> >> [1] https://s.apache.org/arrow-plasma-object-store-clearance-result
> >>
> >>
> >>
>


[jira] [Created] (ARROW-1194) Trouble deserializing a pandas DataFrame from a PyArrow buffer.

2017-07-07 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1194:
---

 Summary: Trouble deserializing a pandas DataFrame from a PyArrow 
buffer.
 Key: ARROW-1194
 URL: https://issues.apache.org/jira/browse/ARROW-1194
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.5.0
 Environment: Ubuntu 16.04
Python 3.6
Reporter: Robert Nishihara


I'm running into the following problem.

Suppose I create a dataframe and serialize it.

{code:language=python}
import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
record_batch = pa.RecordBatch.from_pandas(df)
{code}

Its size is 352 according to

{code:language=python}
pa.get_record_batch_size(record_batch)  # This is 352.
{code}

However, if I write it using a stream_writer and then attempt to read it, the 
resulting buffer has size 928.

{code:language=python}
sink = pa.BufferOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, record_batch.schema)
stream_writer.write_batch(record_batch)
new_buf = sink.get_result()
new_buf.size  # This is 928.
{code}

I'm running into this problem because I'm attempting to write the pandas 
DataFrame to the Plasma object store as follows (after Plasma has been started 
and a client has been created), so I need to know the size ahead of time.

{code:language=python}
data_size = pa.get_record_batch_size(record_batch)
object_id = plasma.ObjectID(np.random.bytes(20))

buf = client.create(object_id, data_size)  # Note that if I replace "data_size" 
by "data_size + 1000" then it works.
stream = plasma.FixedSizeBufferOutputStream(buf)
stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema)
stream_writer.write_batch(record_batch)
{code}

The above fails because the buffer allocated in Plasma only has size 352, but 
928 bytes are needed.

So my question is, am I determining the size of the record batch incorrectly? 
Or could there be a bug in {{pa.get_record_batch_size}}?
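
One way to get the right number (assuming {{pa.MockOutputStream}} is available in the pyarrow build at hand) is to measure the full stream, schema and end-of-stream marker included, rather than only the record batch:

{code:language=python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})
record_batch = pa.RecordBatch.from_pandas(df)

mock = pa.MockOutputStream()  # counts bytes without allocating anything
stream_writer = pa.RecordBatchStreamWriter(mock, record_batch.schema)
stream_writer.write_batch(record_batch)
stream_writer.close()

data_size = mock.size()  # size to pass to client.create for the Plasma buffer
print(data_size)
{code}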



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-826) Compilation error on Mac with -DARROW_PYTHON=on

2017-04-14 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-826:
--

 Summary: Compilation error on Mac with -DARROW_PYTHON=on
 Key: ARROW-826
 URL: https://issues.apache.org/jira/browse/ARROW-826
 Project: Apache Arrow
  Issue Type: Bug
 Environment: > python --version
Python 2.7.12 :: Anaconda custom (x86_64)

> system_profiler SPSoftwareDataType
Software:

System Software Overview:

  System Version: macOS 10.12.4 (16E195)
  Kernel Version: Darwin 16.5.0
  Boot Volume: Macintosh HD

> conda install libgcc
Fetching package metadata .
Solving package specifications: .

# All requested packages already installed.
libgcc4.8.5
Reporter: Robert Nishihara
 Attachments: Without-Python-ON-success.txt, With-Python-ON-failure.txt

It looks like compiling Arrow with {{-DARROW_PYTHON=on}} failed for a Ray user on 
macOS.

The logs are attached: one build failed with {{-DARROW_PYTHON=on}} and one 
succeeded without that flag.

The issue was originally reported at 
https://github.com/ray-project/ray/issues/461.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (ARROW-739) Parallel build fails non-deterministically.

2017-03-29 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-739:
--

 Summary: Parallel build fails non-deterministically.
 Key: ARROW-739
 URL: https://issues.apache.org/jira/browse/ARROW-739
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
 Environment: OS X 10.12.1
Reporter: Robert Nishihara
 Attachments: arrow_build_make_j1.txt, arrow_build_make_j8.txt

The following script fails non-deterministically (most of the time) on my 
machine.

With `make -j1` it seems to work (though I only tried a few times). With `make 
-j8` it fails most of the time (though I have seen it succeed).

```
git clone https://github.com/apache/arrow.git
cd arrow/cpp
git checkout 8f386374eca26d0eebe562beac52fc75459f352c
mkdir release
cd release
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-g" -DCMAKE_CXX_FLAGS="-g" 
-DARROW_BUILD_TESTS=OFF ..
make VERBOSE=1 -j8
cd ../../..
```

The output from both a successful run (arrow_build_make_j1.txt) and an 
unsuccessful run (arrow_build_make_j8.txt) are attached, but the error may be 
the following.

```
install: mkdir 
/Users/rkn/Workspace/testingarrow/arrow/cpp/release/jemalloc_ep-prefix/src/jemalloc_ep/dist:
 File exists
make[3]: *** [install_include] Error 71
make[3]: *** Waiting for unfinished jobs
```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ARROW-607) [C++] Speed up bitsetting in ArrayBuilder::UnsafeSetNotNull

2017-03-09 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903668#comment-15903668
 ] 

Robert Nishihara commented on ARROW-607:


I believe this is already done in ARROW-553 
https://github.com/apache/arrow/commit/ad0157547a4f5e6e51fa2f712c2ed9477489a20c.

We cherry-picked some components of ARROW-553 in our fork because we're 
pretty far behind the Arrow master at the moment.

> [C++] Speed up bitsetting in ArrayBuilder::UnsafeSetNotNull
> ---
>
> Key: ARROW-607
> URL: https://issues.apache.org/jira/browse/ARROW-607
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>
> see, e.g. https://github.com/pcmoritz/arrow/pull/3



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)