[jira] [Resolved] (ARROW-3399) [Python] Cannot serialize numpy matrix object

2019-04-16 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-3399.
-
Resolution: Fixed

Issue resolved by pull request 4096
[https://github.com/apache/arrow/pull/4096]

> [Python] Cannot serialize numpy matrix object
> -
>
> Key: ARROW-3399
> URL: https://issues.apache.org/jira/browse/ARROW-3399
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> This is a regression from 0.9.0 and happens with 0.10.0 with Python 3.6.5 on 
> Linux.
> {code:java}
> from pyarrow import plasma
> import numpy
> import time
> import subprocess
> import os
> import signal
> m = numpy.matrix(numpy.array([[1, 2], [3, 4]]))
> process = subprocess.Popen(['plasma_store', '-m', '100', '-s', 
> '/tmp/plasma', '-d', '/dev/shm'], stdout=subprocess.DEVNULL, 
> stderr=subprocess.DEVNULL, encoding='utf8', preexec_fn=os.setpgrp)
> time.sleep(5)
> client = plasma.connect('/tmp/plasma', '', 0)
> try:
>     client.put(m)
> finally:
>     client.disconnect()
>     os.killpg(os.getpgid(process.pid), signal.SIGTERM)
> {code}
> Error:
> {noformat}
>   File "pyarrow/_plasma.pyx", line 397, in pyarrow._plasma.PlasmaClient.put
>   File "pyarrow/serialization.pxi", line 338, in pyarrow.lib.serialize
>   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: This object exceeds the maximum 
> recursion depth. It may contain itself recursively.{noformat}
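A hedged workaround until the fix from the pull request is released: plain ndarrays serialize fine in 0.10.0, so downcast the matrix before put() and re-wrap it after get(). The conversion below is illustrative only, not the upstream fix:

```python
import numpy

# Hypothetical workaround (not the fix from PR 4096): downcast the matrix
# to a plain ndarray before client.put(), re-wrap after client.get().
m = numpy.matrix([[1, 2], [3, 4]])

payload = numpy.asarray(m)          # plain ndarray: serializes without error
restored = numpy.asmatrix(payload)  # consumer side: restore the matrix view

assert isinstance(restored, numpy.matrix)
assert (restored == m).all()
```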



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5099) Compiling Plasma TensorFlow op has Python 2 bug.

2019-04-03 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-5099:
---

 Summary: Compiling Plasma TensorFlow op has Python 2 bug.
 Key: ARROW-5099
 URL: https://issues.apache.org/jira/browse/ARROW-5099
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma, Python
Reporter: Robert Nishihara


I've seen the following error when compiling the Plasma TensorFlow op.
TensorFlow version: 1.13.1
Compiling Plasma TensorFlow Op...
Traceback (most recent call last):
  File "/ray/python/ray/experimental/sgd/test_sgd.py", line 48, in <module>
all_reduce_alg=args.all_reduce_alg)
  File "/ray/python/ray/experimental/sgd/sgd.py", line 110, in __init__
shard_shapes = ray.get(self.workers[0].shard_shapes.remote())
  File "/ray/python/ray/worker.py", line 2307, in get
raise value
ray.exceptions.RayTaskError: ray_worker (pid=81, host=629a7997c823)
NameError: global name 'FileNotFoundError' is not defined
{{FileNotFoundError}} doesn't exist in Python 2.
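A common compatibility shim for the missing builtin is sketched below; the actual fix in the Arrow build script may differ, and the helper name is illustrative:

```python
# Python 2 has no FileNotFoundError builtin; alias the closest equivalent
# so the same except clause works under both interpreters.
try:
    FileNotFoundError          # Python 3: already defined
except NameError:              # Python 2: fall back to IOError
    FileNotFoundError = IOError

def read_if_exists(path):
    # Hypothetical helper using the (possibly aliased) exception.
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None

assert read_if_exists('/no/such/file') is None
```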



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3399) [Python] Cannot serialize numpy matrix object

2019-04-01 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807314#comment-16807314
 ] 

Robert Nishihara commented on ARROW-3399:
-

Hi Mitar, can you submit your change as a PR? I suspect some minor 
modifications will make it work and it will be easier to comment there. Also 
please add a test.

> [Python] Cannot serialize numpy matrix object
> -
>
> Key: ARROW-3399
> URL: https://issues.apache.org/jira/browse/ARROW-3399
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Mitar
>Priority: Major
> Fix For: 0.14.0
>
>
> This is a regression from 0.9.0 and happens with 0.10.0 with Python 3.6.5 on 
> Linux.
> {code:java}
> from pyarrow import plasma
> import numpy
> import time
> import subprocess
> import os
> import signal
> m = numpy.matrix(numpy.array([[1, 2], [3, 4]]))
> process = subprocess.Popen(['plasma_store', '-m', '100', '-s', 
> '/tmp/plasma', '-d', '/dev/shm'], stdout=subprocess.DEVNULL, 
> stderr=subprocess.DEVNULL, encoding='utf8', preexec_fn=os.setpgrp)
> time.sleep(5)
> client = plasma.connect('/tmp/plasma', '', 0)
> try:
>     client.put(m)
> finally:
>     client.disconnect()
>     os.killpg(os.getpgid(process.pid), signal.SIGTERM)
> {code}
> Error:
> {noformat}
>   File "pyarrow/_plasma.pyx", line 397, in pyarrow._plasma.PlasmaClient.put
>   File "pyarrow/serialization.pxi", line 338, in pyarrow.lib.serialize
>   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: This object exceeds the maximum 
> recursion depth. It may contain itself recursively.{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4983) [Plasma] Unmap memory when the client is destroyed

2019-03-21 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-4983.
-
   Resolution: Fixed
Fix Version/s: 0.13.0

Issue resolved by pull request 4001
[https://github.com/apache/arrow/pull/4001]

> [Plasma] Unmap memory when the client is destroyed
> --
>
> Key: ARROW-4983
> URL: https://issues.apache.org/jira/browse/ARROW-4983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Affects Versions: 0.12.1
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Currently the plasma memory mapped into the client is not unmapped upon 
> destruction of the client, which can cause memory mapped files to be kept 
> around longer than necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4690) [Python] Building TensorFlow compatible wheels for Arrow

2019-02-28 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-4690.
-
Resolution: Fixed

Issue resolved by pull request 3766
[https://github.com/apache/arrow/pull/3766]

> [Python] Building TensorFlow compatible wheels for Arrow
> 
>
> Key: ARROW-4690
> URL: https://issues.apache.org/jira/browse/ARROW-4690
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Since the inclusion of LLVM, arrow wheels stopped working with TensorFlow 
> again (on some configurations at least).
> While we are continuing to discuss a more permanent solution in 
> https://groups.google.com/a/tensorflow.org/d/topic/developers/TMqRaT-H2bI/discussion,
>  I made some progress in creating tensorflow compatible wheels for an 
> unmodified pyarrow.
> They won't adhere to the manylinux1 standard, but they should be as 
> compatible as the TensorFlow wheels because they use the same build 
> environment (ubuntu 14.04).
> I'll create a PR with the necessary changes. I don't propose to ship these 
> wheels, but it might be a good idea to include the docker image and 
> instructions for building them in the tree, for organizations that want to 
> use tensorflow with pyarrow on top of pip. For now, the official 
> recommendation for the average user should probably be to use conda.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1628) [Python] Incorrect serialization of numpy datetimes.

2019-02-05 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761477#comment-16761477
 ] 

Robert Nishihara commented on ARROW-1628:
-

[~wesmckinn] It'd still be good to fix, so I think we should leave the issue 
open, but I don't think it needs to be prioritized at the moment.

> [Python] Incorrect serialization of numpy datetimes.
> 
>
> Key: ARROW-1628
> URL: https://issues.apache.org/jira/browse/ARROW-1628
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>
> See https://github.com/ray-project/ray/issues/1041.
> The issue can be reproduced as follows.
> {code}
> import datetime
> import pyarrow as pa
> import numpy as np
> t = np.datetime64(datetime.datetime.now())
> print(type(t), t)  #  2017-09-30T09:50:46.089952
> t_new = pa.deserialize(pa.serialize(t).to_buffer())
> print(type(t_new), t_new)  #  0
> {code}
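Until this is fixed, one hedged workaround is to bypass pyarrow for datetime64 values by shipping the integer representation together with the dtype string and reconstructing on the other side. This encoding is illustrative, not part of pyarrow:

```python
import datetime
import numpy as np

# Encode a datetime64 as (integer ticks, dtype string)...
t = np.datetime64(datetime.datetime(2017, 9, 30, 9, 50, 46, 89952))
encoded = (int(t.astype('int64')), str(t.dtype))  # dtype is e.g. 'datetime64[us]'

# ...and decode by reinterpreting the ticks with the same dtype.
decoded = np.int64(encoded[0]).astype(np.dtype(encoded[1]))

assert decoded == t
```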



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4475) [Python] Serializing objects that contain themselves

2019-02-04 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-4475.
-
   Resolution: Fixed
Fix Version/s: 0.13.0

Issue resolved by pull request 3556
[https://github.com/apache/arrow/pull/3556]

> [Python] Serializing objects that contain themselves
> 
>
> Key: ARROW-4475
> URL: https://issues.apache.org/jira/browse/ARROW-4475
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a regression from [https://github.com/apache/arrow/pull/3423]
> The following segfaults:
> {code:java}
> import pyarrow as pa
> lst = []
> lst.append(lst)
> pa.serialize(lst){code}
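A minimal sketch of the kind of recursion guard a serializer can use to reject self-referential containers with a clear error instead of a segfault. This is an illustration of the technique, not the code from the pull request:

```python
# Track ids of containers on the current traversal path; revisiting an id
# on the same path means the object contains itself.
def check_no_cycles(obj, _path=None):
    path = _path if _path is not None else set()
    if isinstance(obj, (list, tuple, dict)):
        if id(obj) in path:
            raise ValueError("object contains itself recursively")
        path.add(id(obj))
        children = obj.values() if isinstance(obj, dict) else obj
        for child in children:
            check_no_cycles(child, path)
        path.discard(id(obj))   # leaving this container: pop it off the path

lst = []
lst.append(lst)
try:
    check_no_cycles(lst)
    raised = False
except ValueError:
    raised = True

assert raised
assert check_no_cycles([1, [2, {'a': 3}]]) is None  # acyclic input passes
```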



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-02-04 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760130#comment-16760130
 ] 

Robert Nishihara commented on ARROW-4418:
-

[~zhijunfu] [~suquark] note that we will need to provide an alternative to UNIX 
domain sockets to make it work on Windows.

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Zhijun Fu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-30 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756791#comment-16756791
 ] 

Robert Nishihara commented on ARROW-4418:
-

[~zhijunfu], I'd suggest doing a straightforward translation from the AE event 
loop to boost::asio. The multithreaded architecture sounds pretty complex; I 
think it's worth exploring if it delivers substantial performance benefits, 
but it may not be worthwhile for modest ones.

Maybe we should use {{asio}} instead of {{boost::asio}} because {{asio}} looks 
like it's header only. http://think-async.com/Asio/AsioAndBoostAsio

[~pitrou], the non-boost {{asio}} looks like it's header only, so that may be 
preferable. What do you think about the overall tradeoff?

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755381#comment-16755381
 ] 

Robert Nishihara commented on ARROW-4418:
-

If preferable, there is also a non-boost version of asio: 
https://think-async.com/Asio/AsioAndBoostAsio.html

 

I also remember thinking that asio is moving into the C++ standard library, 
though I can't seem to find a reference for that at the moment.

 

The benefits of using asio are pretty big (Windows support for the Plasma store 
as well as using a more standard C++ approach than what we are currently doing).

 

In terms of alternatives, I know that Philipp has looked into gRPC. Maybe he 
could elaborate on the pros/cons there?

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4379) Register pyarrow serializers for collections.Counter and collections.deque.

2019-01-25 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-4379:
---

 Summary: Register pyarrow serializers for collections.Counter and 
collections.deque.
 Key: ARROW-4379
 URL: https://issues.apache.org/jira/browse/ARROW-4379
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara
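The (de)serialization callbacks such a registration would need can be sketched standalone; in pyarrow they would be passed to SerializationContext.register_type as custom_serializer/custom_deserializer. The function names below are illustrative:

```python
import collections

# Counter round-trips through a plain dict of counts.
def serialize_counter(obj):
    return dict(obj)

def deserialize_counter(data):
    return collections.Counter(data)

# deque must also preserve its maxlen bound, if any.
def serialize_deque(obj):
    return (list(obj), obj.maxlen)

def deserialize_deque(data):
    items, maxlen = data
    return collections.deque(items, maxlen=maxlen)

c = collections.Counter('abracadabra')
assert deserialize_counter(serialize_counter(c)) == c

d = collections.deque([1, 2, 3], maxlen=5)
d2 = deserialize_deque(serialize_deque(d))
assert list(d2) == [1, 2, 3] and d2.maxlen == 5
```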






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-4269) [Python] AttributeError: module 'pandas.core' has no attribute 'arrays'

2019-01-15 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara reassigned ARROW-4269:
---

Assignee: Philipp Moritz

> [Python] AttributeError: module 'pandas.core' has no attribute 'arrays'
> ---
>
> Key: ARROW-4269
> URL: https://issues.apache.org/jira/browse/ARROW-4269
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> This happens with pandas 0.22:
> ```
> In [1]: import pyarrow
> ---
> AttributeError Traceback (most recent call last)
>  in ()
> > 1 import pyarrow
> ~/arrow/python/pyarrow/__init__.py in ()
>  174 localfs = LocalFileSystem.get_instance()
>  175 
> --> 176 from pyarrow.serialization import (default_serialization_context,
>  177 register_default_serialization_handlers,
>  178 register_torch_serialization_handlers)
> ~/arrow/python/pyarrow/serialization.py in ()
>  303 
>  304 
> --> 305 
> register_default_serialization_handlers(_default_serialization_context)
> ~/arrow/python/pyarrow/serialization.py in 
> register_default_serialization_handlers(serialization_context)
>  294 custom_deserializer=_deserialize_pyarrow_table)
>  295 
> --> 296 _register_custom_pandas_handlers(serialization_context)
>  297 
>  298
> ~/arrow/python/pyarrow/serialization.py in 
> _register_custom_pandas_handlers(context)
>  175 custom_deserializer=_load_pickle_from_buffer)
>  176 
> --> 177 if hasattr(pd.core.arrays, 'interval'):
>  178 context.register_type(
>  179 pd.core.arrays.interval.IntervalArray,
> AttributeError: module 'pandas.core' has no attribute 'arrays'
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4269) [Python] AttributeError: module 'pandas.core' has no attribute 'arrays'

2019-01-15 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-4269.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3410
[https://github.com/apache/arrow/pull/3410]

> [Python] AttributeError: module 'pandas.core' has no attribute 'arrays'
> ---
>
> Key: ARROW-4269
> URL: https://issues.apache.org/jira/browse/ARROW-4269
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This happens with pandas 0.22:
> ```
> In [1]: import pyarrow
> ---
> AttributeError Traceback (most recent call last)
>  in ()
> > 1 import pyarrow
> ~/arrow/python/pyarrow/__init__.py in ()
>  174 localfs = LocalFileSystem.get_instance()
>  175 
> --> 176 from pyarrow.serialization import (default_serialization_context,
>  177 register_default_serialization_handlers,
>  178 register_torch_serialization_handlers)
> ~/arrow/python/pyarrow/serialization.py in ()
>  303 
>  304 
> --> 305 
> register_default_serialization_handlers(_default_serialization_context)
> ~/arrow/python/pyarrow/serialization.py in 
> register_default_serialization_handlers(serialization_context)
>  294 custom_deserializer=_deserialize_pyarrow_table)
>  295 
> --> 296 _register_custom_pandas_handlers(serialization_context)
>  297 
>  298
> ~/arrow/python/pyarrow/serialization.py in 
> _register_custom_pandas_handlers(context)
>  175 custom_deserializer=_load_pickle_from_buffer)
>  176 
> --> 177 if hasattr(pd.core.arrays, 'interval'):
>  178 context.register_type(
>  179 pd.core.arrays.interval.IntervalArray,
> AttributeError: module 'pandas.core' has no attribute 'arrays'
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4249) [Plasma] Remove reference to logging.h from plasma/common.h

2019-01-13 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-4249.
-
   Resolution: Fixed
Fix Version/s: (was: 0.13.0)
   0.12.0

Issue resolved by pull request 3392
[https://github.com/apache/arrow/pull/3392]

> [Plasma] Remove reference to logging.h from plasma/common.h
> ---
>
> Key: ARROW-4249
> URL: https://issues.apache.org/jira/browse/ARROW-4249
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Affects Versions: 0.11.1
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It is not needed there, and it pollutes the namespace of applications that 
> use the plasma client with arrow's DCHECK macros (DCHECK is a name widely 
> used in other projects).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4217) [Plasma] Remove custom object metadata

2019-01-09 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738761#comment-16738761
 ] 

Robert Nishihara commented on ARROW-4217:
-

We've started using the metadata to specify the object serialization format 
(e.g., Arrow, raw bytes, pickle, ...). We can remove it, but we should have a 
good alternative for how to do this. E.g., some discussion in 
https://github.com/apache/arrow/pull/2788.

> [Plasma] Remove custom object metadata
> --
>
> Key: ARROW-4217
> URL: https://issues.apache.org/jira/browse/ARROW-4217
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Affects Versions: 0.11.1
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
> Fix For: 0.13.0
>
>
> Currently, Plasma supports custom metadata for objects. This doesn't seem to 
> be used at the moment, and removing it will simplify the interface and 
> implementation of plasma. Removing the custom metadata will also make 
> eviction to other blob stores easier (most other stores don't support custom 
> metadata).
> My personal use case was to store arrow schemata in there, but they are now 
> stored as part of the object itself.
> If nobody else is using this, I'd suggest removing it. If people really want 
> metadata, they can always store it as a separate object.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3958) [Plasma] Reduce number of IPCs

2018-12-13 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-3958.
-
Resolution: Fixed

Issue resolved by pull request 3124
[https://github.com/apache/arrow/pull/3124]

> [Plasma] Reduce number of IPCs
> --
>
> Key: ARROW-3958
> URL: https://issues.apache.org/jira/browse/ARROW-3958
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Affects Versions: 0.11.1
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Currently we ship file descriptors of objects from the store to the client 
> every time an object is created or gotten. There are relatively few distinct 
> file descriptors, so caching them can eliminate one IPC in the majority of 
> cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3920) Plasma reference counting not properly done in TensorFlow custom operator.

2018-11-30 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3920:
---

 Summary: Plasma reference counting not properly done in TensorFlow 
custom operator.
 Key: ARROW-3920
 URL: https://issues.apache.org/jira/browse/ARROW-3920
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


We never call {{Release}} in the custom op code.
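The discipline the fix needs is that every Get is paired with a Release. A context manager makes the pairing automatic; the sketch below is illustrative Python (the real custom op is C++), and FakeClient is a stand-in:

```python
import contextlib

# Pair Get/Release so the reference count is decremented even when the
# body raises; this is the call the custom op was missing.
@contextlib.contextmanager
def plasma_buffer(client, object_id):
    buf = client.get(object_id)
    try:
        yield buf
    finally:
        client.release(object_id)

class FakeClient(object):
    def __init__(self):
        self.refcount = 0
    def get(self, object_id):
        self.refcount += 1
        return b'data'
    def release(self, object_id):
        self.refcount -= 1

client = FakeClient()
with plasma_buffer(client, 'oid') as buf:
    assert buf == b'data' and client.refcount == 1
assert client.refcount == 0   # released even after the block exits
```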



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3607) [Java] delete() method via JNI for plasma

2018-11-23 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-3607.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 2829
[https://github.com/apache/arrow/pull/2829]

> [Java] delete() method via JNI for plasma
> -
>
> Key: ARROW-3607
> URL: https://issues.apache.org/jira/browse/ARROW-3607
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java, Plasma (C++)
>Affects Versions: 0.11.1
>Reporter: Shubham Chaurasia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently the delete(objectId) method is not exposed via the JNI interface. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3648) [Plasma] Add API to get metadata and data at the same time

2018-11-03 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-3648.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 2862
[https://github.com/apache/arrow/pull/2862]

> [Plasma] Add API to get metadata and data at the same time
> --
>
> Key: ARROW-3648
> URL: https://issues.apache.org/jira/browse/ARROW-3648
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Yuhong Guo
>Assignee: Yuhong Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Arrow Java Plasma client currently has no API to get the metadata and 
> data together in one call. If we split this into two API calls, the object's 
> status can change between them: we have observed the first call coming back 
> empty (object not stored yet) while the second call succeeds, so the 
> metadata and data do not match.
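The essence of the requested API is a single lookup that returns both fields from one consistent object state. The dict-backed store below is a stand-in for the real client, purely for illustration:

```python
# One atomic lookup returns a (data, metadata) pair that cannot come from
# two different object versions.
def get_with_metadata(store, object_id):
    entry = store.get(object_id)
    if entry is None:
        return None, None        # object not stored yet
    return entry['data'], entry['metadata']

store = {'obj': {'data': b'payload', 'metadata': b'fmt:arrow'}}
assert get_with_metadata(store, 'obj') == (b'payload', b'fmt:arrow')
assert get_with_metadata(store, 'missing') == (None, None)
```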



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3555) [Plasma] Unify plasma client get function using metadata.

2018-10-29 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-3555.
-
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 2788
[https://github.com/apache/arrow/pull/2788]

> [Plasma] Unify plasma client get function using metadata.
> -
>
> Key: ARROW-3555
> URL: https://issues.apache.org/jira/browse/ARROW-3555
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Yuhong Guo
>Assignee: Yuhong Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Sometimes it is very hard for the data consumer to know whether an object is 
> a buffer or some other object. Using try-catch to catch the pyarrow 
> deserialization exception and then falling back to `plasma_client.get_buffer` 
> makes for unclean code.
> We could leverage the metadata, which is currently unused, to mark buffer 
> data. This would also be simple to implement in the clients for other 
> languages. 
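The metadata-as-format-tag idea can be sketched as follows: the producer stamps each object, and the consumer dispatches on the tag instead of wrapping deserialization in try/except. The tags, store layout, and use of json as the serializer are all illustrative:

```python
import json

FORMAT_RAW = b'raw'            # hypothetical tag: hand the buffer back as-is
FORMAT_SERIALIZED = b'ser'     # hypothetical tag: run the deserializer

def get_object(store, object_id, deserialize):
    entry = store[object_id]
    if entry['metadata'] == FORMAT_RAW:
        return entry['data']
    return deserialize(entry['data'])   # e.g. pyarrow deserialization

store = {
    'a': {'data': b'\x00\x01', 'metadata': FORMAT_RAW},
    'b': {'data': b'[1, 2]', 'metadata': FORMAT_SERIALIZED},
}
loads = lambda d: json.loads(d.decode())
assert get_object(store, 'a', loads) == b'\x00\x01'
assert get_object(store, 'b', loads) == [1, 2]
```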



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3611) Give error more quickly when pyarrow serialization context is used incorrectly.

2018-10-24 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3611:
---

 Summary: Give error more quickly when pyarrow serialization 
context is used incorrectly.
 Key: ARROW-3611
 URL: https://issues.apache.org/jira/browse/ARROW-3611
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara
Assignee: Robert Nishihara


When {{type_id}} is not a string or can't be cast to a string, 
{{register_type}} will succeed, but {{_deserialize_callback}} can fail.
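The eager check being proposed can be sketched with a stand-in registry (this is not pyarrow's implementation): reject a bad type_id at registration time instead of failing much later in the deserialization callback.

```python
# Validate type_id up front so misuse surfaces immediately.
def register_type(registry, cls, type_id):
    if not isinstance(type_id, str):
        raise TypeError("type_id must be a string, got %r" % (type_id,))
    registry[type_id] = cls

registry = {}
register_type(registry, list, 'list')

try:
    register_type(registry, dict, 123)   # bad type_id: rejected immediately
    rejected = False
except TypeError:
    rejected = True

assert rejected and registry == {'list': list}
```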



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3605) Remove AE library from plasma header files.

2018-10-24 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3605:
---

 Summary: Remove AE library from plasma header files.
 Key: ARROW-3605
 URL: https://issues.apache.org/jira/browse/ARROW-3605
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Users of {{plasma/events.h}} won't have access to {{ae.h}} so we need to remove 
it from the header file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3574) Fix remaining bug with plasma static versus shared libraries.

2018-10-20 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3574:
---

 Summary: Fix remaining bug with plasma static versus shared 
libraries.
 Key: ARROW-3574
 URL: https://issues.apache.org/jira/browse/ARROW-3574
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Address a few missing pieces in https://github.com/apache/arrow/pull/2792. On 
Mac, moving the {{plasma_store_server}} executable around and then executing it 
leads to

 
{code:java}
dyld: Library not loaded: @rpath/libarrow.12.dylib

  Referenced from: 
/Users/rkn/Workspace/ray/./python/ray/core/src/plasma/plasma_store_server

  Reason: image not found

Abort trap: 6{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3559) Statically link libraries for plasma_store_server executable.

2018-10-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3559:
---

 Summary: Statically link libraries for plasma_store_server 
executable.
 Key: ARROW-3559
 URL: https://issues.apache.org/jira/browse/ARROW-3559
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


{code:java}
cd ~
git clone https://github.com/apache/arrow
cd arrow/cpp
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DARROW_PYTHON=on -DARROW_PLASMA=on ..
make -j16
sudo make install

cd ~
cp arrow/cpp/build/release/plasma_store_server .
mv arrow arrow-temp

# Try to start the store
./plasma_store_server -s /tmp/store -m 10{code}
The last line crashes with
{code:java}
./plasma_store_server: error while loading shared libraries: libplasma.so.12: 
cannot open shared object file: No such file or directory{code}
For usability, it's important that people can copy around the plasma store 
executable and run it.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3558) Remove fatal error when plasma client calls get on an unsealed object that it created.

2018-10-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3558:
---

 Summary: Remove fatal error when plasma client calls get on an 
unsealed object that it created.
 Key: ARROW-3558
 URL: https://issues.apache.org/jira/browse/ARROW-3558
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


In the case when Get is called with a timeout, this should simply behave as if 
the object hasn't been created yet.
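The proposed semantics can be sketched with a dict-backed stand-in store: a Get on an object that exists but is unsealed behaves exactly like a Get on an object that does not exist yet, rather than raising a fatal error.

```python
# "Not created yet" and "created but not sealed" look identical to Get.
def get(store, object_id):
    entry = store.get(object_id)
    if entry is None or not entry['sealed']:
        return None
    return entry['data']

store = {'obj': {'sealed': False, 'data': b'partial'}}
assert get(store, 'obj') is None        # no fatal error on unsealed object
store['obj']['sealed'] = True
assert get(store, 'obj') == b'partial'  # visible once sealed
```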



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3548) Speed up storing small objects in the object store.

2018-10-17 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara updated ARROW-3548:

Issue Type: Improvement  (was: Bug)

> Speed up storing small objects in the object store.
> ---
>
> Key: ARROW-3548
> URL: https://issues.apache.org/jira/browse/ARROW-3548
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>
> Currently, to store an object in the plasma object store, there are a lot of 
> IPCs. We first call "Create", which does an IPC round trip. Then we call 
> "Seal", which is one IPC. Then we call "Release", which is another IPC.
> For small objects, we can just inline the object and metadata directly into 
> the message to the store, and wait for the response (the response tells us if 
> the object was successfully created). This is just a single IPC round trip, 
> which can be much faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2363) [Plasma] Have an automatic object-releasing Create() variant

2018-10-17 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara updated ARROW-2363:

Affects Version/s: (was: 0.8.0)

> [Plasma] Have an automatic object-releasing Create() variant
> 
>
> Key: ARROW-2363
> URL: https://issues.apache.org/jira/browse/ARROW-2363
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Antoine Pitrou
>Priority: Major
>
> Like ARROW-2195, but for Create() instead of Get(). This requires creating a 
> new C++ API and using it on the Python side.
>  * Create() currently increments the reference count twice
>  * Both Seal() and Release() decrement the reference count
>  * The returned buffer must also handle the case where Seal() wasn't called: 
> first Release(), then Abort()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3548) Speed up storing small objects in the object store.

2018-10-17 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3548:
---

 Summary: Speed up storing small objects in the object store.
 Key: ARROW-3548
 URL: https://issues.apache.org/jira/browse/ARROW-3548
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Currently, to store an object in the plasma object store, there are a lot of 
IPCs. We first call "Create", which does an IPC round trip. Then we call 
"Seal", which is one IPC. Then we call "Release", which is another IPC.

For small objects, we can just inline the object and metadata directly into the 
message to the store, and wait for the response (the response tells us if the 
object was successfully created). This is just a single IPC round trip, which 
can be much faster.
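The IPC arithmetic above can be sketched with a toy model (FakeStore and its method names are hypothetical; only the message counting mirrors the issue): the current path costs three messages per object, the inlined path one.

```python
# Toy model of the IPC counts described above. FakeStore is hypothetical,
# not Plasma's real client API.

class FakeStore:
    def __init__(self):
        self.ipc_count = 0
        self.objects = {}

    def create(self, object_id, size):
        self.ipc_count += 1              # round trip: Create request/response
        self.objects[object_id] = bytearray(size)

    def seal(self, object_id):
        self.ipc_count += 1              # Seal message

    def release(self, object_id):
        self.ipc_count += 1              # Release message

    def create_and_seal(self, object_id, data):
        self.ipc_count += 1              # object + metadata inlined: one round trip
        self.objects[object_id] = bytes(data)

store = FakeStore()
store.create("small-object", 64)
store.seal("small-object")
store.release("small-object")
slow_path = store.ipc_count              # three messages per object

store.ipc_count = 0
store.create_and_seal("small-object-2", b"\x00" * 64)
fast_path = store.ipc_count              # one message per object
print(slow_path, fast_path)
```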



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3454) Tab complete doesn't work for plasma client.

2018-10-06 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3454:
---

 Summary: Tab complete doesn't work for plasma client.
 Key: ARROW-3454
 URL: https://issues.apache.org/jira/browse/ARROW-3454
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara


In IPython, tab completion on a plasma client object should reveal the client's 
methods. I think this is the same thing as making sure {{dir(client)}} returns 
all of the relevant methods/fields.
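The {{dir(client)}} point can be illustrated with a plain-Python sketch (DelegatingClient and _RawClient are hypothetical, not pyarrow's actual classes): IPython tab completion is driven by dir(), so a wrapper that forwards attribute lookups must also forward __dir__.

```python
# Sketch of the idea in this issue: make dir() on a wrapper object expose
# the wrapped client's methods so tab completion can find them.

class _RawClient:
    def put(self, value):
        return "object-id"

    def get(self, object_id):
        return None

class DelegatingClient:
    def __init__(self):
        self._raw = _RawClient()

    def __getattr__(self, name):
        # Forward unknown attributes to the wrapped client.
        return getattr(self._raw, name)

    def __dir__(self):
        # Merge our attributes with the wrapped client's so that
        # dir(client) -- and hence tab completion -- sees put/get.
        return sorted(set(object.__dir__(self)) | set(dir(self._raw)))

client = DelegatingClient()
print("put" in dir(client), "get" in dir(client))
```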



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3373) Fix bug in which plasma store can die when client gets multiple objects and object becomes available.

2018-09-30 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3373:
---

 Summary: Fix bug in which plasma store can die when client gets 
multiple objects and object becomes available.
 Key: ARROW-3373
 URL: https://issues.apache.org/jira/browse/ARROW-3373
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara
 Fix For: 0.11.0


This bug was recently introduced in 
[https://github.com/apache/arrow/pull/2650]. The store can die when a client 
calls "get" on multiple object IDs and then the first object ID becomes 
available.

Will have a patch momentarily.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3348) Plasma store dies when an object that a dead client is waiting for gets created.

2018-09-29 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-3348.
-
   Resolution: Fixed
Fix Version/s: 0.11.0

Issue resolved by pull request 2650
[https://github.com/apache/arrow/pull/2650]

> Plasma store dies when an object that a dead client is waiting for gets 
> created.
> 
>
> Key: ARROW-3348
> URL: https://issues.apache.org/jira/browse/ARROW-3348
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I will have a patch for this soon.
> To reproduce the bug do the following:
>  # Start plasma store
>  # Create client 1 and have it call {{get(object_id)}}
>  # Kill client 1
>  # Create client 2 and have it create an object with ID {{object_id}}
> This will cause the plasma store to crash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3348) Plasma store dies when an object that a dead client is waiting for gets created.

2018-09-27 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3348:
---

 Summary: Plasma store dies when an object that a dead client is 
waiting for gets created.
 Key: ARROW-3348
 URL: https://issues.apache.org/jira/browse/ARROW-3348
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


I will have a patch for this soon.

To reproduce the bug do the following:
 # Start plasma store
 # Create client 1 and have it call {{get(object_id)}}
 # Kill client 1
 # Create client 2 and have it create an object with ID {{object_id}}

This will cause the plasma store to crash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2975) [Plasma] TensorFlow op: Compilation only working if arrow found by pkg-config

2018-08-08 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2975.
-
   Resolution: Fixed
Fix Version/s: JS-0.4.0

Issue resolved by pull request 2368
[https://github.com/apache/arrow/pull/2368]

> [Plasma] TensorFlow op: Compilation only working if arrow found by pkg-config
> -
>
> Key: ARROW-2975
> URL: https://issues.apache.org/jira/browse/ARROW-2975
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently the pyarrow/tensorflow/build.sh script uses pkg-config to discover 
> the arrow libraries to link against. However, this does not work with the pip 
> package of pyarrow (since the .pc files are not shipped with it).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3012) Installation crashes on Python 3.7

2018-08-07 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572121#comment-16572121
 ] 

Robert Nishihara commented on ARROW-3012:
-

I've seen this issue as well even without Python 3.7. In the past I've fixed it 
by changing the version of {{setuptools_scm}} or {{setuptools}}. E.g., 
https://github.com/ray-project/ray/pull/2477/files

> Installation crashes on Python 3.7
> --
>
> Key: ARROW-3012
> URL: https://issues.apache.org/jira/browse/ARROW-3012
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
> Environment: OS: MacOS High Sierra (10.13.5)
> Python: 3.7.0
> Cython: 0.28.5
>Reporter: Diego Argueta
>Priority: Major
>
> To reproduce, on Python 3.7.0: 
> {code:none}
> pip3.7 install pyarrow==0.10.0
> {code}
>  
> The result is a crash:
> {code:none}
> Collecting pyarrow
> Using cached 
> https://files.pythonhosted.org/packages/c0/a0/f7e9dfd8988d94f4952f9b50eb04e14a80fbe39218520725aab53daab57c/pyarrow-0.10.0.tar.gz
> Complete output from command python setup.py egg_info:
> Traceback (most recent call last):
> File "<string>", line 1, in <module>
> File 
> "/private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/setup.py",
>  line 545, in 
> url="https://arrow.apache.org/"
> File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/setuptools/__init__.py",
>  line 131, in setup
> return distutils.core.setup(**attrs)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/distutils/core.py", line 
> 108, in setup
> _setup_distribution = dist = klass(attrs)
> File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/setuptools/dist.py",
>  line 370, in __init__
> k: v for k, v in attrs.items()
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/distutils/dist.py", line 
> 292, in __init__
> self.finalize_options()
> File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/setuptools/dist.py",
>  line 529, in finalize_options
> ep.load()(self, ep.name, value)
> File 
> "/private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/.eggs/setuptools_scm-3.0.6-py3.7.egg/setuptools_scm/integration.py",
>  line 23, in version_keyword
> dist.metadata.version = get_version(**value)
> File 
> "/private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/.eggs/setuptools_scm-3.0.6-py3.7.egg/setuptools_scm/__init__.py",
>  line 135, in get_version
> parsed_version = _do_parse(config)
> File 
> "/private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/.eggs/setuptools_scm-3.0.6-py3.7.egg/setuptools_scm/__init__.py",
>  line 77, in _do_parse
> parse_result = _call_entrypoint_fn(config, config.parse)
> File 
> "/private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/.eggs/setuptools_scm-3.0.6-py3.7.egg/setuptools_scm/__init__.py",
>  line 40, in _call_entrypoint_fn
> return fn(config.absolute_root)
> File 
> "/private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/setup.py",
>  line 498, in parse_version
> return version_from_scm(root)
> File 
> "/private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/.eggs/setuptools_scm-3.0.6-py3.7.egg/setuptools_scm/__init__.py",
>  line 28, in version_from_scm
> return _version_from_entrypoint(root, "setuptools_scm.parse_scm")
> File 
> "/private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/.eggs/setuptools_scm-3.0.6-py3.7.egg/setuptools_scm/__init__.py",
>  line 44, in _version_from_entrypoint
> for ep in iter_matching_entrypoints(config.absolute_root, entrypoint):
> AttributeError: 'str' object has no attribute 'absolute_root'
> 
> Command "python setup.py egg_info" failed with error code 1 in 
> /private/var/folders/6r/dy0_bd2x2kn735d8kymc4qt0gn/T/pip-install-w77cbide/pyarrow/
> {code}
>  
> I suspect this is because {{setuptools_scm}} isn't being used correctly. The 
> function takes one argument, {{root}}, but judging from the code that uses 
> it, it appears to expect a {{setuptools_scm.config.Configuration}} object 
> rather than a file path.
> All the documentation says to use {{get_version()}} and the package author 
> doesn't seem to be sure that {{version_from_scm()}} should even be a public 
> function (see 
> [here|https://github.com/pypa/setuptools_scm/blob/master/src/setuptools_scm/__init__.py#L27]).
>  Perhaps going with that would be best.
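For reference, the documented usage the description alludes to is the {{use_scm_version}} keyword, which routes through {{get_version()}}; a minimal, hypothetical setup.py fragment (not Arrow's actual setup.py):

```python
# Hypothetical setup.py fragment: let setuptools_scm compute the version
# through its documented entry point instead of calling version_from_scm().
from setuptools import setup

setup(
    name="example-package",            # placeholder project name
    use_scm_version=True,              # setuptools_scm derives the version from SCM tags
    setup_requires=["setuptools_scm"],
)
```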



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2956) [Python]Arrow plasma throws ArrowIOError and process crashed

2018-08-02 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566448#comment-16566448
 ] 

Robert Nishihara commented on ARROW-2956:
-

I'm not sure we support memory sizes this small. However, if we don't support 
it, then we need to fail when starting the plasma store in the first place 
instead of letting it fail later.

 

This issue probably has to do with the default malloc sizes in this file 

https://github.com/apache/arrow/blob/d48dce2cfebdbd044a8260d0a77f5fe3d89a4a2d/cpp/src/plasma/malloc.cc#L47

> [Python]Arrow plasma throws ArrowIOError and process crashed
> 
>
> Key: ARROW-2956
> URL: https://issues.apache.org/jira/browse/ARROW-2956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: He Kaisheng
>Priority: Major
>
> hello,
> We start a plasma store with 100k of memory. When storage is full, it throws 
> ArrowIOError and the *process crashes*, not the expected PlasmaStoreFull 
> error.
> code:
> {code:java}
> import pyarrow.plasma as plasma
> import numpy as np
> plasma_client = plasma.connect(plasma_socket, '', 0)
> ref = []
> for _ in range(1000):
> obj_id = plasma_client.put(np.random.randint(100, size=(100, 100), 
> dtype=np.int16))
> data = plasma_client.get(obj_id)
> ref.append(data)
> {code}
> error:
> {noformat}
> ---
> ArrowIOError  Traceback (most recent call last)
>  in ()
>   2 ref = []
>   3 for _ in range(1000):
> > 4 obj_id = plasma_client.put(np.random.randint(100, size=(100, 
> 100), dtype=np.int16))
>   5 data = plasma_client.get(obj_id)
>   6 ref.append(data)
> plasma.pyx in pyarrow.plasma.PlasmaClient.put()
> plasma.pyx in pyarrow.plasma.PlasmaClient.create()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Encountered unexpected EOF{noformat}
> This problem doesn't exist when the dtype is np.int64 or when shared memory is 
> larger (more than 100M, say), which seems strange. Does anybody know the 
> reason? Thanks a lot.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2953) [Plasma] Store memory usage

2018-08-01 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara reassigned ARROW-2953:
---

Assignee: Philipp Moritz

> [Plasma] Store memory usage
> ---
>
> Key: ARROW-2953
> URL: https://issues.apache.org/jira/browse/ARROW-2953
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While doing some memory profiling on the store, it became clear that at the 
> moment the metadata of the objects takes up much more space than it should. 
> In particular, for each object:
>  * The object id (20 bytes) is stored three times
>  * The object checksum (8 bytes) is stored twice
> We can therefore significantly reduce the metadata overhead with some 
> refactoring.
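The numbers quoted in the issue imply roughly the following per-object saving (back-of-the-envelope only; the actual metadata layout may differ):

```python
# Per-object metadata bytes, using the sizes quoted in the issue.
OBJECT_ID_BYTES = 20      # currently stored three times
CHECKSUM_BYTES = 8        # currently stored twice

current = 3 * OBJECT_ID_BYTES + 2 * CHECKSUM_BYTES
deduplicated = OBJECT_ID_BYTES + CHECKSUM_BYTES
saved = current - deduplicated
print(current, deduplicated, saved)
```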



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2953) [Plasma] Store memory usage

2018-08-01 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2953.
-
   Resolution: Fixed
Fix Version/s: JS-0.4.0

Issue resolved by pull request 2359
[https://github.com/apache/arrow/pull/2359]

> [Plasma] Store memory usage
> ---
>
> Key: ARROW-2953
> URL: https://issues.apache.org/jira/browse/ARROW-2953
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While doing some memory profiling on the store, it became clear that at the 
> moment the metadata of the objects takes up much more space than it should. 
> In particular, for each object:
>  * The object id (20 bytes) is stored three times
>  * The object checksum (8 bytes) is stored twice
> We can therefore significantly reduce the metadata overhead with some 
> refactoring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2940) [Python] Import error with pytorch 0.3

2018-07-30 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2940.
-
   Resolution: Fixed
Fix Version/s: JS-0.4.0

Issue resolved by pull request 2342
[https://github.com/apache/arrow/pull/2342]

> [Python] Import error with pytorch 0.3
> --
>
> Key: ARROW-2940
> URL: https://issues.apache.org/jira/browse/ARROW-2940
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The fix in ARROW-2920 doesn't work in versions strictly before pytorch 0.4:
> {code:java}
> >>> import pyarrow
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/ubuntu/arrow/python/pyarrow/__init__.py", line 57, in 
>     compat.import_pytorch_extension()
>   File "/home/ubuntu/arrow/python/pyarrow/compat.py", line 249, in 
> import_pytorch_extension
>     ctypes.CDLL(os.path.join(path, "lib/libcaffe2.so"))
>   File 
> "/home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/ctypes/__init__.py", 
> line 351, in __init__
>     self._handle = _dlopen(self._name, mode)
> OSError: 
> /home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/site-packages/torch/lib/libcaffe2.so:
>  cannot open shared object file: No such file or directory{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2920) [Python] Segfault with pytorch 0.4

2018-07-26 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559017#comment-16559017
 ] 

Robert Nishihara commented on ARROW-2920:
-

Does the problem only occur when TensorFlow is *not* installed?

> [Python] Segfault with pytorch 0.4
> --
>
> Key: ARROW-2920
> URL: https://issues.apache.org/jira/browse/ARROW-2920
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
>Priority: Major
>
> See also [https://github.com/ray-project/ray/issues/2447]
> How to reproduce:
>  * Start the Ubuntu Deep Learning AMI (version 12.0) on EC2
>  * Create a new env with {{conda create -y -n breaking-env python=3.5}}
>  * Install pytorch with {{source activate breaking-env && conda install 
> pytorch torchvision cuda91 -c pytorch}}
>  * Compile and install manylinux1 pyarrow wheels from latest arrow master as 
> described here: 
> https://github.com/apache/arrow/blob/2876a3fdd1fb9ef6918b7214d6e8d1e3017b42ad/python/manylinux1/README.md
>  * In the breaking-env just created, run the following:
>  
> {code:java}
> Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow
> >>> import torch
> >>> torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, 
> >>> bias=False).cuda()
> Segmentation fault (core dumped){code}
>  
> Backtrace:
> {code:java}
> >>> torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, 
> >>> bias=False).cuda()
> Program received signal SIGSEGV, Segmentation fault.
> 0x in ?? ()
> (gdb) bt
> #0  0x in ?? ()
> #1  0x77bc8a99 in __pthread_once_slow (once_control=0x7fffdb791e50 
> , init_routine=0x7fffe46aafe1 
> )
>     at pthread_once.c:116
> #2  0x7fffda95c302 in at::Type::toBackend(at::Backend) const () from 
> /home/ubuntu/anaconda3/envs/breaking-env2/lib/python3.5/site-packages/torch/lib/libcaffe2.so
> #3  0x7fffdc59b231 in torch::autograd::VariableType::toBackend 
> (this=, b=) at 
> torch/csrc/autograd/generated/VariableType.cpp:145
> #4  0x7fffdc8dbe8a in torch::autograd::THPVariable_cuda 
> (self=0x76dbff78, args=0x76daf828, kwargs=0x0) at 
> torch/csrc/autograd/generated/python_variable_methods.cpp:333
> #5  0x5569f4e8 in PyCFunction_Call ()
> #6  0x556f67cc in PyEval_EvalFrameEx ()
> #7  0x556fbe08 in PyEval_EvalFrameEx ()
> #8  0x556f6e90 in PyEval_EvalFrameEx ()
> #9  0x556fbe08 in PyEval_EvalFrameEx ()
> #10 0x5570103d in PyEval_EvalCodeEx ()
> #11 0x55701f5c in PyEval_EvalCode ()
> #12 0x5575e454 in run_mod ()
> #13 0x5562ab5e in PyRun_InteractiveOneObject ()
> #14 0x5562ad01 in PyRun_InteractiveLoopFlags ()
> #15 0x5562ad62 in PyRun_AnyFileExFlags.cold.2784 ()
> #16 0x5562b080 in Py_Main.cold.2785 ()
> #17 0x5562b871 in main (){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2916) [C++] Plasma Seal is slow due to hashing

2018-07-25 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556353#comment-16556353
 ] 

Robert Nishihara commented on ARROW-2916:
-

In my experience, profilers often show the hashing taking up a bunch of time, 
but when I actually comment out the hashing, things don't get faster.
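One way to double-check such profiles is to time the hash step in isolation; a rough sketch (hashlib.sha1 is only a stand-in here, not the hash Plasma's Seal actually uses):

```python
import hashlib
import time

payload = b"\x00" * (1 << 20)    # 1 MB, like the tensor in this report

start = time.perf_counter()
digest = hashlib.sha1(payload).hexdigest()   # stand-in for Seal's hash
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"hashing 1 MB took {elapsed_ms:.3f} ms")
```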

> [C++] Plasma Seal is slow due to hashing
> 
>
> Key: ARROW-2916
> URL: https://issues.apache.org/jira/browse/ARROW-2916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Plasma (C++)
>Reporter: Simon Mo
>Priority: Minor
>
> When I write a 1MB tensor into plasma, it takes about 2ms. 50% of the time is 
> spent sealing the object. Can we add a flag to the Seal API to disable the 
> hashing when needed?
>  
> For example:
> {code:java}
> Status PlasmaClient::Seal(const ObjectID& object_id, bool hash_flag)
> {code}
>
> cc [~pcmoritz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1744) [Plasma] Provide TensorFlow operator to read tensors from plasma

2018-07-17 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-1744.
-
   Resolution: Fixed
Fix Version/s: (was: 0.11.0)
   JS-0.4.0

Issue resolved by pull request 2104
[https://github.com/apache/arrow/pull/2104]

> [Plasma] Provide TensorFlow operator to read tensors from plasma
> 
>
> Key: ARROW-1744
> URL: https://issues.apache.org/jira/browse/ARROW-1744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>  Time Spent: 17h
>  Remaining Estimate: 0h
>
> see https://www.tensorflow.org/extend/adding_an_op



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2805) [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed

2018-07-06 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2805.
-
   Resolution: Fixed
Fix Version/s: JS-0.4.0

Issue resolved by pull request 2224
[https://github.com/apache/arrow/pull/2224]

> [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA 
> is not installed
> --
>
> Key: ARROW-2805
> URL: https://issues.apache.org/jira/browse/ARROW-2805
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available, tensorflow
> Fix For: JS-0.4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> TensorFlow version: 1.7 (GPU enabled but CUDA is not installed)
> tensorflow-gpu was installed via pip install
> ```
> import ray
>  File "/home/eric/Desktop/ray-private/python/ray/__init__.py", line 28, in 
> 
>  import pyarrow # noqa: F401
>  File 
> "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/__init__.py",
>  line 55, in 
>  compat.import_tensorflow_extension()
>  File 
> "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/compat.py", 
> line 193, in import_tensorflow_extension
>  ctypes.CDLL(ext)
>  File "/usr/lib/python3.5/ctypes/__init__.py", line 347, in __init__
>  self._handle = _dlopen(self._name, mode)
> OSError: libcublas.so.9.0: cannot open shared object file: No such file or 
> directory
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2657) Segfault when importing TensorFlow after Pyarrow

2018-05-31 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16497286#comment-16497286
 ] 

Robert Nishihara commented on ARROW-2657:
-

Note that the issue does not appear when building arrow from source. It occurs 
when building a pyarrow wheel and then installing the wheel.

> Segfault when importing TensorFlow after Pyarrow
> 
>
> Key: ARROW-2657
> URL: https://issues.apache.org/jira/browse/ARROW-2657
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Robert Nishihara
>Priority: Major
>
> The following will segfault when pyarrow wheels are built using the 
> instructions in 
> [https://github.com/apache/arrow/tree/master/python/manylinux1#build-instructions].
> {code:java}
> import pyarrow
> import tensorflow
> {code}
> Searching over commits, this was introduced in 
> https://github.com/apache/arrow/commit/2093f6ec5c628ef983194a3fb3d0a621dd58c600.
> Running in gdb shows
> {code:java}
> $ gdb python
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from python...done.
> (gdb) run
> Starting program: /home/ubuntu/anaconda3/bin/python 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow
> >>> import tensorflow
> Program received signal SIGSEGV, Segmentation fault.
> 0x in ?? ()
> (gdb) bt
> #0 0x in ?? ()
> #1 0x77bc8a99 in __pthread_once_slow (
>  once_control=0x7fffd95561c8  namespace)::cpuid_once_flag>, init_routine=0x717e6fe1 
> )
>  at pthread_once.c:116
> #2 0x7fffd8df6faa in void std::call_once(std::once_flag&, 
> void (&)()) ()
>  from 
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
> #3 0x7fffd8df6fde in 
> tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) ()
>  from 
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
> #4 0x7fffd8df6f11 in tensorflow::port::(anonymous 
> namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string 
> const&) ()
>  from 
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
> #5 0x7fffd8c38394 in _GLOBAL__sub_I_cpu_feature_guard.cc ()
>  from 
> /home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
> ---Type  to continue, or q  to quit---
> #6 0x77de76ba in call_init (l=, argc=argc@entry=1, 
>  argv=argv@entry=0x7fffe598, env=env@entry=0x55c628f0)
>  at dl-init.c:72
> #7 0x77de77cb in call_init (env=0x55c628f0, argv=0x7fffe598, 
>  argc=1, l=) at dl-init.c:30
> #8 _dl_init (main_map=main_map@entry=0x565d9640, argc=1, 
>  argv=0x7fffe598, env=0x55c628f0) at dl-init.c:120
> #9 0x77dec8e2 in dl_open_worker (a=a@entry=0x7fff8810)
>  at dl-open.c:575
> #10 0x77de7564 in _dl_catch_error (
>  objname=objname@entry=0x7fff8800, 
>  errstring=errstring@entry=0x7fff8808, 
>  mallocedp=mallocedp@entry=0x7fff87ff, 
>  operate=operate@entry=0x77dec4d0 , 
>  args=args@entry=0x7fff8810) at dl-error.c:187
> #11 0x77debda9 in _dl_open (
>  file=0x7fffde1edc00 
> "/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
>  mode=-2147483646, 
>  caller_dlopen=0x55742bfa <_PyImport_FindSharedFuncptr+138>, nsid=-2, 
>  argc=, argv=, env=0x55c628f0)
> ---Type  to continue, or q  to quit---
>  at dl-open.c:660
> #12 0x775ecf09 in dlopen_doit (a=a@entry=0x7fff8a40) at 
> dlopen.c:66
> #13 0x77de7564 in _dl_catch_error (objname=0x55b35d00, 
>  errstring=0x55b35d08, mallocedp=0x55b35cf8, 
>  operate=0x775eceb0 , args=0x7fff8a40)
>  at dl-error.c:187
> #14 0x775ed571 in _dlerror_run (
>  

[jira] [Updated] (ARROW-2657) Segfault when importing TensorFlow after Pyarrow

2018-05-31 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara updated ARROW-2657:

Description: 
The following will segfault when pyarrow wheels are built using the 
instructions in 
[https://github.com/apache/arrow/tree/master/python/manylinux1#build-instructions].
{code:java}
import pyarrow
import tensorflow
{code}
Searching over commits, this was introduced in 
https://github.com/apache/arrow/commit/2093f6ec5c628ef983194a3fb3d0a621dd58c600.

Running in gdb shows
{code:java}
$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) run
Starting program: /home/ubuntu/anaconda3/bin/python 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import tensorflow

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) bt
#0 0x in ?? ()
#1 0x77bc8a99 in __pthread_once_slow (
 once_control=0x7fffd95561c8 , init_routine=0x717e6fe1 )
 at pthread_once.c:116
#2 0x7fffd8df6faa in void std::call_once(std::once_flag&, void 
(&)()) ()
 from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#3 0x7fffd8df6fde in 
tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) ()
 from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#4 0x7fffd8df6f11 in tensorflow::port::(anonymous 
namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string const&) 
()
 from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#5 0x7fffd8c38394 in _GLOBAL__sub_I_cpu_feature_guard.cc ()
 from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
---Type  to continue, or q  to quit---
#6 0x77de76ba in call_init (l=, argc=argc@entry=1, 
 argv=argv@entry=0x7fffe598, env=env@entry=0x55c628f0)
 at dl-init.c:72
#7 0x77de77cb in call_init (env=0x55c628f0, argv=0x7fffe598, 
 argc=1, l=) at dl-init.c:30
#8 _dl_init (main_map=main_map@entry=0x565d9640, argc=1, 
 argv=0x7fffe598, env=0x55c628f0) at dl-init.c:120
#9 0x77dec8e2 in dl_open_worker (a=a@entry=0x7fff8810)
 at dl-open.c:575
#10 0x77de7564 in _dl_catch_error (
 objname=objname@entry=0x7fff8800, 
 errstring=errstring@entry=0x7fff8808, 
 mallocedp=mallocedp@entry=0x7fff87ff, 
 operate=operate@entry=0x77dec4d0 , 
 args=args@entry=0x7fff8810) at dl-error.c:187
#11 0x77debda9 in _dl_open (
 file=0x7fffde1edc00 
"/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
 mode=-2147483646, 
 caller_dlopen=0x55742bfa <_PyImport_FindSharedFuncptr+138>, nsid=-2, 
 argc=, argv=, env=0x55c628f0)
---Type  to continue, or q  to quit---
 at dl-open.c:660
#12 0x775ecf09 in dlopen_doit (a=a@entry=0x7fff8a40) at dlopen.c:66
#13 0x77de7564 in _dl_catch_error (objname=0x55b35d00, 
 errstring=0x55b35d08, mallocedp=0x55b35cf8, 
 operate=0x775eceb0 , args=0x7fff8a40)
 at dl-error.c:187
#14 0x775ed571 in _dlerror_run (
 operate=operate@entry=0x775eceb0 , 
 args=args@entry=0x7fff8a40) at dlerror.c:163
#15 0x775ecfa1 in __dlopen (file=, mode=)
 at dlopen.c:87
#16 0x55742bfa in _PyImport_FindSharedFuncptr ()
#17 0x557698a0 in _PyImport_LoadDynamicModuleWithSpec ()
#18 0x55769ae5 in _imp_create_dynamic ()
#19 0x55665a61 in PyCFunction_Call ()
#20 0x55719fdb in _PyEval_EvalFrameDefault ()
#21 0x556eba94 in _PyEval_EvalCodeWithName ()
#22 0x556ec941 in fast_function ()
#23 0x556f2755 in call_function ()
#24 0x55714cba in _PyEval_EvalFrameDefault ()
#25 0x556ec70b in fast_function ()
#26 0x556f2755 in call_function ()
#27 

[jira] [Updated] (ARROW-2657) Segfault when importing TensorFlow after Pyarrow

2018-05-31 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara updated ARROW-2657:

Description: 
The following will segfault when pyarrow wheels are built using the 
instructions in 
https://github.com/apache/arrow/tree/master/python/manylinux1#build-instructions.

{code}
import pyarrow
import tensorflow
{code}

Searching over commits, this was introduced in 
https://github.com/apache/arrow/commits/master?after=94409a6a3f5d9203acdccb7fce2c88400939e589+34.

Running in gdb shows

{code}
$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) run
Starting program: /home/ubuntu/anaconda3/bin/python 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import tensorflow

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) bt
#0 0x in ?? ()
#1 0x77bc8a99 in __pthread_once_slow (
 once_control=0x7fffd95561c8 , init_routine=0x717e6fe1 )
 at pthread_once.c:116
#2 0x7fffd8df6faa in void std::call_once(std::once_flag&, void 
(&)()) ()
 from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#3 0x7fffd8df6fde in 
tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) ()
 from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#4 0x7fffd8df6f11 in tensorflow::port::(anonymous 
namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string const&) 
()
 from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#5 0x7fffd8c38394 in _GLOBAL__sub_I_cpu_feature_guard.cc ()
 from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#6 0x77de76ba in call_init (l=, argc=argc@entry=1, 
 argv=argv@entry=0x7fffe598, env=env@entry=0x55c628f0)
 at dl-init.c:72
#7 0x77de77cb in call_init (env=0x55c628f0, argv=0x7fffe598, 
 argc=1, l=) at dl-init.c:30
#8 _dl_init (main_map=main_map@entry=0x565d9640, argc=1, 
 argv=0x7fffe598, env=0x55c628f0) at dl-init.c:120
#9 0x77dec8e2 in dl_open_worker (a=a@entry=0x7fff8810)
 at dl-open.c:575
#10 0x77de7564 in _dl_catch_error (
 objname=objname@entry=0x7fff8800, 
 errstring=errstring@entry=0x7fff8808, 
 mallocedp=mallocedp@entry=0x7fff87ff, 
 operate=operate@entry=0x77dec4d0 , 
 args=args@entry=0x7fff8810) at dl-error.c:187
#11 0x77debda9 in _dl_open (
 file=0x7fffde1edc00 
"/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
 mode=-2147483646, 
 caller_dlopen=0x55742bfa <_PyImport_FindSharedFuncptr+138>, nsid=-2, 
 argc=, argv=, env=0x55c628f0)
 at dl-open.c:660
#12 0x775ecf09 in dlopen_doit (a=a@entry=0x7fff8a40) at dlopen.c:66
#13 0x77de7564 in _dl_catch_error (objname=0x55b35d00, 
 errstring=0x55b35d08, mallocedp=0x55b35cf8, 
 operate=0x775eceb0 , args=0x7fff8a40)
 at dl-error.c:187
#14 0x775ed571 in _dlerror_run (
 operate=operate@entry=0x775eceb0 , 
 args=args@entry=0x7fff8a40) at dlerror.c:163
#15 0x775ecfa1 in __dlopen (file=, mode=)
 at dlopen.c:87
#16 0x55742bfa in _PyImport_FindSharedFuncptr ()
#17 0x557698a0 in _PyImport_LoadDynamicModuleWithSpec ()
#18 0x55769ae5 in _imp_create_dynamic ()
#19 0x55665a61 in PyCFunction_Call ()
#20 0x55719fdb in _PyEval_EvalFrameDefault ()
#21 0x556eba94 in _PyEval_EvalCodeWithName ()
#22 0x556ec941 in fast_function ()
#23 0x556f2755 in call_function ()
#24 0x55714cba in _PyEval_EvalFrameDefault ()
#25 0x556ec70b in fast_function ()
#26 0x556f2755 in call_function ()
#27 

[jira] [Updated] (ARROW-2657) Segfault when importing TensorFlow after Pyarrow

2018-05-31 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara updated ARROW-2657:

Docs Text:   (was: The following will segfault when pyarrow wheels are 
built using the instructions in 
https://github.com/apache/arrow/tree/master/python/manylinux1#build-instructions.

{code}
import pyarrow
import tensorflow
{code}

Searching over commits, this was introduced in 
https://github.com/apache/arrow/commits/master?after=94409a6a3f5d9203acdccb7fce2c88400939e589+34.

Running in gdb shows

{code}
$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) run
Starting program: /home/ubuntu/anaconda3/bin/python 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import tensorflow

Program received signal SIGSEGV, Segmentation fault.
0x in ?? ()
(gdb) bt
#0  0x in ?? ()
#1  0x77bc8a99 in __pthread_once_slow (
once_control=0x7fffd95561c8 , init_routine=0x717e6fe1 )
at pthread_once.c:116
#2  0x7fffd8df6faa in void std::call_once(std::once_flag&, void 
(&)()) ()
   from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#3  0x7fffd8df6fde in 
tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) ()
   from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#4  0x7fffd8df6f11 in tensorflow::port::(anonymous 
namespace)::CheckFeatureOrDie(tensorflow::port::CPUFeature, std::string const&) 
()
   from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#5  0x7fffd8c38394 in _GLOBAL__sub_I_cpu_feature_guard.cc ()
   from 
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#6  0x77de76ba in call_init (l=, argc=argc@entry=1, 
argv=argv@entry=0x7fffe598, env=env@entry=0x55c628f0)
at dl-init.c:72
#7  0x77de77cb in call_init (env=0x55c628f0, argv=0x7fffe598, 
argc=1, l=) at dl-init.c:30
#8  _dl_init (main_map=main_map@entry=0x565d9640, argc=1, 
argv=0x7fffe598, env=0x55c628f0) at dl-init.c:120
#9  0x77dec8e2 in dl_open_worker (a=a@entry=0x7fff8810)
at dl-open.c:575
#10 0x77de7564 in _dl_catch_error (
objname=objname@entry=0x7fff8800, 
errstring=errstring@entry=0x7fff8808, 
mallocedp=mallocedp@entry=0x7fff87ff, 
operate=operate@entry=0x77dec4d0 , 
args=args@entry=0x7fff8810) at dl-error.c:187
#11 0x77debda9 in _dl_open (
file=0x7fffde1edc00 
"/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so",
 mode=-2147483646, 
caller_dlopen=0x55742bfa <_PyImport_FindSharedFuncptr+138>, nsid=-2, 
argc=, argv=, env=0x55c628f0)
at dl-open.c:660
#12 0x775ecf09 in dlopen_doit (a=a@entry=0x7fff8a40) at dlopen.c:66
#13 0x77de7564 in _dl_catch_error (objname=0x55b35d00, 
errstring=0x55b35d08, mallocedp=0x55b35cf8, 
operate=0x775eceb0 , args=0x7fff8a40)
at dl-error.c:187
#14 0x775ed571 in _dlerror_run (
operate=operate@entry=0x775eceb0 , 
args=args@entry=0x7fff8a40) at dlerror.c:163
#15 0x775ecfa1 in __dlopen (file=, mode=)
at dlopen.c:87
#16 0x55742bfa in _PyImport_FindSharedFuncptr ()
#17 0x557698a0 in _PyImport_LoadDynamicModuleWithSpec ()
#18 0x55769ae5 in _imp_create_dynamic ()
#19 0x55665a61 in PyCFunction_Call ()
#20 0x55719fdb in _PyEval_EvalFrameDefault ()
#21 0x556eba94 in _PyEval_EvalCodeWithName ()
#22 0x556ec941 in fast_function ()
#23 0x556f2755 in call_function ()
#24 0x55714cba in _PyEval_EvalFrameDefault ()
#25 

[jira] [Created] (ARROW-2657) Segfault when importing TensorFlow after Pyarrow

2018-05-31 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2657:
---

 Summary: Segfault when importing TensorFlow after Pyarrow
 Key: ARROW-2657
 URL: https://issues.apache.org/jira/browse/ARROW-2657
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2641) [C++] Investigate spurious memset() calls

2018-05-28 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492909#comment-16492909
 ] 

Robert Nishihara commented on ARROW-2641:
-

Most likely unrelated, but we memset some unused padding bytes to 0 in the C++ 
implementation to make the serialization deterministic. See, e.g., 
https://github.com/apache/arrow/pull/405
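The determinism point can be illustrated with a small sketch (Python, with a hypothetical helper name): if padding bytes are left uninitialized, two serializations of the same value can differ byte-for-byte, so zero-filling the padding is what makes the output reproducible.

```python
def pad_to_multiple(payload: bytes, alignment: int = 8) -> bytes:
    # Zero-fill the padding so the serialized bytes are deterministic;
    # uninitialized padding would make equal payloads compare (and hash)
    # differently even though the logical values are identical.
    padding = (-len(payload)) % alignment
    return payload + b"\x00" * padding

a = pad_to_multiple(b"arrow")
b = pad_to_multiple(b"arrow")
assert a == b and len(a) % 8 == 0
```

The valgrind complaint mentioned in the TODO is the flip side of the same issue: reading bytes that were never written is exactly what zero-initialization papers over.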

> [C++] Investigate spurious memset() calls
> -
>
> Key: ARROW-2641
> URL: https://issues.apache.org/jira/browse/ARROW-2641
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> {{builder.cc}} has TODO statements of the form:
> {code:c++}
>   // TODO(emkornfield) valgrind complains without this
>   memset(data_->mutable_data(), 0, static_cast(nbytes));
> {code}
> Ideally we shouldn't have to zero-initialize a data buffer before writing to 
> it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2578) [Plasma] Valgrind errors related to std::random_device

2018-05-13 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473737#comment-16473737
 ] 

Robert Nishihara commented on ARROW-2578:
-

This is suspiciously similar to 
[https://github.com/ray-project/ray/issues/1423], so I suspect it's a known 
valgrind bug and that upgrading to valgrind 3.13.0 would solve the issue.

Did the underlying valgrind version change or something?
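For context, the frames in the valgrind report show {{plasma::UniqueID::from_random()}} constructing a {{std::random_device}}, which older valgrind versions mishandle. A minimal Python sketch of the same idea, assuming plasma's 20-byte ID width and using {{os.urandom}} as the analogous OS-backed entropy source:

```python
import os

def random_object_id() -> bytes:
    # Sketch of UniqueID::from_random: draw 20 random bytes for an ID.
    # os.urandom reads from the OS entropy pool, the Python analogue of
    # the std::random_device call that valgrind flags in the C++ code.
    return os.urandom(20)

oid = random_object_id()
assert len(oid) == 20
```
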

> [Plasma] Valgrind errors related to std::random_device
> --
>
> Key: ARROW-2578
> URL: https://issues.apache.org/jira/browse/ARROW-2578
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> These have started popping up very recently: 
> [https://travis-ci.org/apache/arrow/jobs/378526493]
> e.g.
> {code:java}
> [ RUN ] PlasmaSerialization.SealRequest
> ==19147== Conditional jump or move depends on uninitialised value(s)
> ==19147== at 0x510FFD8: std::random_device::_M_init(std::string const&) 
> (cow-string-inst.cc:56)
> ==19147== by 0x4E3B7C: std::random_device::random_device(std::string const&) 
> (random.h:1588)
> ==19147== by 0x4E2E6F: plasma::UniqueID::from_random() (common.cc:31)
> ==19147== by 0x4871D6: 
> plasma::PlasmaSerialization_SealRequest_Test::TestBody() 
> (serialization_tests.cc:120)
> ==19147== by 0x4D6589: void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2402)
> ==19147== by 0x4D0317: void 
> testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2438)
> ==19147== by 0x4B57D8: testing::Test::Run() (gtest.cc:2475)
> ==19147== by 0x4B607F: testing::TestInfo::Run() (gtest.cc:2656)
> ==19147== by 0x4B6743: testing::TestCase::Run() (gtest.cc:2774)
> ==19147== by 0x4BD113: testing::internal::UnitTestImpl::RunAllTests() 
> (gtest.cc:4649)
> ==19147== by 0x4D7891: bool 
> testing::internal::HandleSehExceptionsInMethodIfSupported  bool>(testing::internal::UnitTestImpl*, bool 
> (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2402)
> ==19147== by 0x4D103F: bool 
> testing::internal::HandleExceptionsInMethodIfSupported  bool>(testing::internal::UnitTestImpl*, bool 
> (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2438)
> ==19147=={code}
>  Any ideas on how to fix this are appreciated!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2506) [Plasma] Build error on macOS

2018-04-24 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450461#comment-16450461
 ] 

Robert Nishihara commented on ARROW-2506:
-

I think it makes sense to just remove the {{= 1}}.

> [Plasma] Build error on macOS
> -
>
> Key: ARROW-2506
> URL: https://issues.apache.org/jira/browse/ARROW-2506
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
>
> Since the upgrade to flatbuffers 1.9.0, I'm seeing this error on the Ray CI:
> arrow/cpp/src/plasma/format/plasma.fbs:234:0: error: default value of 0 for 
> field status is not part of enum ObjectStatus
> I'm planning to just remove the '= 1' from 'Local = 1'. This will break the 
> protocol however, so if we prefer to just put in a 'Dummy = 0' object at the 
> beginning of the enum, that would also be fine with me. However, the 
> ObjectStatus API is not stable yet and not even exposed to Python, so I think 
> breaking it is fine.
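The schema change being discussed can be sketched directly in flatbuffers syntax. The member names besides {{Local}} are illustrative; the rule flatbuffers 1.9.0 enforces is that a field's implicit default value 0 must be a member of its enum:

```
// Rejected by flatbuffers >= 1.9.0: the implicit default is 0,
// but 0 is not a member of the enum.
enum ObjectStatus:int { Local = 1, Remote, Nonexistent }

// Option discussed above: drop the "= 1" so Local becomes 0 (breaks
// the wire protocol, since every member's value shifts down by one).
enum ObjectStatus:int { Local, Remote, Nonexistent }

// Alternative that keeps the existing wire values: add a placeholder at 0.
// enum ObjectStatus:int { Dummy = 0, Local = 1, Remote, Nonexistent }
```
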



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2264) [Python] Efficiently serialize numpy arrays with dtype of unicode fixed length string

2018-04-21 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2264.
-
Resolution: Fixed

> [Python] Efficiently serialize numpy arrays with dtype of unicode fixed 
> length string
> -
>
> Key: ARROW-2264
> URL: https://issues.apache.org/jira/browse/ARROW-2264
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Mitar
>Assignee: Robert Nishihara
>Priority: Major
>
> Looking at the numpy array serialization code it seems that if I have a dtype 
> like "<U3", the non-efficient code path is taken instead of the efficient one.
> {{Example:}}{{>>> np.array(['aaa', 'bbb'])}}
> {{array(['aaa', 'bbb'], dtype='<U3')}}
> This should be able to work, no? It has fixed offsets and memory layout.
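The fixed-layout claim in the issue is easy to verify: a fixed-width unicode array stores every element in a slot of the same size, so element offsets are computable without a per-element offsets table.

```python
import numpy as np

arr = np.array(['aaa', 'bbb'])
assert arr.dtype.kind == 'U'      # numpy's fixed-width unicode dtype ('<U3')
assert arr.dtype.itemsize == 12   # 3 code points x 4 bytes (UCS-4) each
# Fixed itemsize means element i starts at byte offset i * 12 in the
# underlying buffer, which is what makes a fast serialization path possible.
```
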



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2264) [Python] Efficiently serialize numpy arrays with dtype of unicode fixed length string

2018-04-21 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara reassigned ARROW-2264:
---

Assignee: Robert Nishihara

> [Python] Efficiently serialize numpy arrays with dtype of unicode fixed 
> length string
> -
>
> Key: ARROW-2264
> URL: https://issues.apache.org/jira/browse/ARROW-2264
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Mitar
>Assignee: Robert Nishihara
>Priority: Major
>
> Looking at the numpy array serialization code it seems that if I have a dtype 
> like "<U3", the non-efficient code path is taken instead of the efficient one.
> {{Example:}}{{>>> np.array(['aaa', 'bbb'])}}
> {{array(['aaa', 'bbb'], dtype='<U3')}}
> This should be able to work, no? It has fixed offsets and memory layout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2264) [Python] Efficiently serialize numpy arrays with dtype of unicode fixed length string

2018-04-21 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447038#comment-16447038
 ] 

Robert Nishihara commented on ARROW-2264:
-

Fixed by https://github.com/apache/arrow/pull/1887.

> [Python] Efficiently serialize numpy arrays with dtype of unicode fixed 
> length string
> -
>
> Key: ARROW-2264
> URL: https://issues.apache.org/jira/browse/ARROW-2264
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Mitar
>Assignee: Robert Nishihara
>Priority: Major
>
> Looking at the numpy array serialization code it seems that if I have a dtype 
> like "<U3", the non-efficient code path is taken instead of the efficient one.
> {{Example:}}{{>>> np.array(['aaa', 'bbb'])}}
> {{array(['aaa', 'bbb'], dtype='<U3')}}
> This should be able to work, no? It has fixed offsets and memory layout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2489) [Plasma] test_plasma.py crashes

2018-04-21 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446988#comment-16446988
 ] 

Robert Nishihara commented on ARROW-2489:
-

Ok, that makes sense.

> [Plasma] test_plasma.py crashes
> ---
>
> Key: ARROW-2489
> URL: https://issues.apache.org/jira/browse/ARROW-2489
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GPU, Plasma (C++), Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> This is new here:
> {code}$ py.test   --tb=native pyarrow/tests/test_plasma.py 
> = test 
> session starts 
> ==
> platform linux -- Python 3.6.5, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
> rootdir: /home/antoine/arrow/python, inifile: setup.cfg
> plugins: xdist-1.22.0, timeout-1.2.1, repeat-0.4.1, forked-0.2, 
> faulthandler-1.3.1
> collected 23 items
>   
>
> pyarrow/tests/test_plasma.py *** Error in 
> `/home/antoine/miniconda3/envs/pyarrow/bin/python': double free or corruption 
> (!prev): 0x01699520 ***
> [...]
> Current thread 0x7fe7e8570700 (most recent call first):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 211 in 
> test_connection_failure_raises_exception
> [...]
> {code}
> Here is the C backtrace under gdb:
> {code}
> #0  0x769d0428 in __GI_raise (sig=sig@entry=6) at 
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x769d202a in __GI_abort () at abort.c:89
> #2  0x76a127ea in __libc_message (do_abort=do_abort@entry=2, 
> fmt=fmt@entry=0x76b2bed8 "*** Error in `%s': %s: 0x%s ***\n")
> at ../sysdeps/posix/libc_fatal.c:175
> #3  0x76a1b37a in malloc_printerr (ar_ptr=, 
> ptr=, str=0x76b2c008 "double free or corruption (!prev)", 
> action=3)
> at malloc.c:5006
> #4  _int_free (av=, p=, have_lock=0) at 
> malloc.c:3867
> #5  0x76a1f53c in __GI___libc_free (mem=) at 
> malloc.c:2968
> #6  0x7fffbdfcc504 in std::_Sp_counted_ptr (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x9defb0)
> at /usr/include/c++/4.9/bits/shared_ptr_base.h:373
> #7  0x7fffbdfc903c in 
> std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x9defb0) 
> at /usr/include/c++/4.9/bits/shared_ptr_base.h:149
> #8  0x7fffbdfc82b9 in 
> std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count 
> (this=0x7fffc1214510, __in_chrg=)
> at /usr/include/c++/4.9/bits/shared_ptr_base.h:666
> #9  0x7fffbdfc8276 in std::__shared_ptr (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fffc1214508, 
> __in_chrg=)
> at /usr/include/c++/4.9/bits/shared_ptr_base.h:914
> #10 0x7fffbdfc8fc4 in std::shared_ptr::~shared_ptr 
> (this=0x7fffc1214508, __in_chrg=)
> at /usr/include/c++/4.9/bits/shared_ptr.h:93
> #11 0x7fffbdfc8fde in 
> __Pyx_call_destructor (x=...)
> at /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:281
> #12 0x7fffbdfbc317 in __pyx_tp_dealloc_7pyarrow_6plasma_PlasmaClient 
> (o=0x7fffc12144f0)
> at /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:10383
> #13 0x7fffbdfb8986 in __pyx_pf_7pyarrow_6plasma_2connect (__pyx_self=0x0, 
> __pyx_v_store_socket_name=0x7fffbc922c48, 
> __pyx_v_manager_socket_name=0x77fa0ab0, __pyx_v_release_delay=0, 
> __pyx_v_num_retries=1)
> at /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:9147
> #14 0x7fffbdfb7dec in __pyx_pw_7pyarrow_6plasma_3connect (__pyx_self=0x0, 
> __pyx_args=0x7fffbc4d9688, __pyx_kwds=0x0)
> at /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:8978
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2489) [Plasma] test_plasma.py crashes

2018-04-21 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446980#comment-16446980
 ] 

Robert Nishihara commented on ARROW-2489:
-

Oh wow, nice job tracking that down. I'm all for the PIMPL solution to 
ARROW-2448, but why would PIMPL solve this issue?

> [Plasma] test_plasma.py crashes
> ---
>
> Key: ARROW-2489
> URL: https://issues.apache.org/jira/browse/ARROW-2489
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GPU, Plasma (C++), Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> This is new here:
> {code}$ py.test   --tb=native pyarrow/tests/test_plasma.py 
> = test 
> session starts 
> ==
> platform linux -- Python 3.6.5, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
> rootdir: /home/antoine/arrow/python, inifile: setup.cfg
> plugins: xdist-1.22.0, timeout-1.2.1, repeat-0.4.1, forked-0.2, 
> faulthandler-1.3.1
> collected 23 items
>   
>
> pyarrow/tests/test_plasma.py *** Error in 
> `/home/antoine/miniconda3/envs/pyarrow/bin/python': double free or corruption 
> (!prev): 0x01699520 ***
> [...]
> Current thread 0x7fe7e8570700 (most recent call first):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 211 in 
> test_connection_failure_raises_exception
> [...]
> {code}
> Here is the C backtrace under gdb:
> {code}
> #0  0x769d0428 in __GI_raise (sig=sig@entry=6) at 
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x769d202a in __GI_abort () at abort.c:89
> #2  0x76a127ea in __libc_message (do_abort=do_abort@entry=2, 
> fmt=fmt@entry=0x76b2bed8 "*** Error in `%s': %s: 0x%s ***\n")
> at ../sysdeps/posix/libc_fatal.c:175
> #3  0x76a1b37a in malloc_printerr (ar_ptr=, 
> ptr=, str=0x76b2c008 "double free or corruption (!prev)", 
> action=3)
> at malloc.c:5006
> #4  _int_free (av=, p=, have_lock=0) at 
> malloc.c:3867
> #5  0x76a1f53c in __GI___libc_free (mem=) at 
> malloc.c:2968
> #6  0x7fffbdfcc504 in std::_Sp_counted_ptr (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x9defb0)
> at /usr/include/c++/4.9/bits/shared_ptr_base.h:373
> #7  0x7fffbdfc903c in 
> std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x9defb0) 
> at /usr/include/c++/4.9/bits/shared_ptr_base.h:149
> #8  0x7fffbdfc82b9 in 
> std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count 
> (this=0x7fffc1214510, __in_chrg=)
> at /usr/include/c++/4.9/bits/shared_ptr_base.h:666
> #9  0x7fffbdfc8276 in std::__shared_ptr (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fffc1214508, 
> __in_chrg=)
> at /usr/include/c++/4.9/bits/shared_ptr_base.h:914
> #10 0x7fffbdfc8fc4 in std::shared_ptr::~shared_ptr 
> (this=0x7fffc1214508, __in_chrg=)
> at /usr/include/c++/4.9/bits/shared_ptr.h:93
> #11 0x7fffbdfc8fde in 
> __Pyx_call_destructor (x=...)
> at /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:281
> #12 0x7fffbdfbc317 in __pyx_tp_dealloc_7pyarrow_6plasma_PlasmaClient 
> (o=0x7fffc12144f0)
> at /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:10383
> #13 0x7fffbdfb8986 in __pyx_pf_7pyarrow_6plasma_2connect (__pyx_self=0x0, 
> __pyx_v_store_socket_name=0x7fffbc922c48, 
> __pyx_v_manager_socket_name=0x77fa0ab0, __pyx_v_release_delay=0, 
> __pyx_v_num_retries=1)
> at /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:9147
> #14 0x7fffbdfb7dec in __pyx_pw_7pyarrow_6plasma_3connect (__pyx_self=0x0, 
> __pyx_args=0x7fffbc4d9688, __pyx_kwds=0x0)
> at /home/antoine/arrow/python/build/temp.linux-x86_64-3.6/plasma.cxx:8978
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2458) [Plasma] PlasmaClient uses global variable

2018-04-19 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2458.
-
   Resolution: Fixed
Fix Version/s: JS-0.4.0

Issue resolved by pull request 1893
[https://github.com/apache/arrow/pull/1893]

> [Plasma] PlasmaClient uses global variable
> --
>
> Key: ARROW-2458
> URL: https://issues.apache.org/jira/browse/ARROW-2458
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Affects Versions: 0.9.0
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>
> The threadpool threadpool_ that PlasmaClient is using is global at the 
> moment. This prevents us from using multiple PlasmaClients in the same 
> process (one per thread).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2469) Make out arguments last in ReadMessage API.

2018-04-18 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara updated ARROW-2469:

Component/s: C++

> Make out arguments last in ReadMessage API.
> ---
>
> Key: ARROW-2469
> URL: https://issues.apache.org/jira/browse/ARROW-2469
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2469) Make out arguments last in ReadMessage API.

2018-04-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2469:
---

 Summary: Make out arguments last in ReadMessage API.
 Key: ARROW-2469
 URL: https://issues.apache.org/jira/browse/ARROW-2469
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Robert Nishihara
Assignee: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2397) Document changes in Tensor encoding in IPC.md.

2018-04-14 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2397.
-
   Resolution: Fixed
Fix Version/s: JS-0.4.0

Issue resolved by pull request 1837
[https://github.com/apache/arrow/pull/1837]

> Document changes in Tensor encoding in IPC.md.
> --
>
> Key: ARROW-2397
> URL: https://issues.apache.org/jira/browse/ARROW-2397
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>
> Update IPC.md to reflect the changes in 
> https://github.com/apache/arrow/pull/1802.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2448) Segfault when plasma client goes out of scope before buffer.

2018-04-12 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436675#comment-16436675
 ] 

Robert Nishihara edited comment on ARROW-2448 at 4/13/18 1:55 AM:
--

We may need to wrap the {{PlasmaClient}} in a wrapper class, which the user 
allocates, and which has a shared pointer to the actual {{PlasmaClient}}. 
Similarly, each {{PlasmaBuffer}} needs a shared pointer to the 
{{PlasmaClient}}. What do you think about something like that?

I think something like this could be made to work.

Not sure if I understand your question correctly, but the buffer can still 
point to a valid region of memory after the client is destroyed (since the 
store is still running).


was (Author: robertnishihara):
We may need to wrap the {{PlasmaClient}} in a wrapper class, which the user 
allocates, and which has a shared pointer to the actual {{PlasmaClient}}. 
Similarly, each {{PlasmaBuffer}} needs a shared pointer to the {{PlasmaClient}}.

I think something like this could be made to work.

Not sure if I understand your question correctly, but the buffer can still 
point to a valid region of memory after the client is destroyed (since the 
store is still running).

> Segfault when plasma client goes out of scope before buffer.
> 
>
> Key: ARROW-2448
> URL: https://issues.apache.org/jira/browse/ARROW-2448
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++), Python
>Reporter: Robert Nishihara
>Priority: Major
>
> The following causes a segfault.
>  
> First start a plasma store with
> {code:java}
> plasma_store -s /tmp/store -m 100{code}
> Then run the following in Python.
> {code}
> import pyarrow.plasma as plasma
> import numpy as np
> client = plasma.connect('/tmp/store', '', 0)
> object_id = client.put(np.zeros(3))
> buf = client.get(object_id)
> del client
> del buf  # This segfaults.{code}
> The backtrace is 
> {code:java}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0xfffc)
>   * frame #0: 0x0001056deaee 
> libplasma.0.dylib`plasma::PlasmaClient::Release(plasma::UniqueID const&) + 142
>     frame #1: 0x0001056de9e9 
> libplasma.0.dylib`plasma::PlasmaBuffer::~PlasmaBuffer() + 41
>     frame #2: 0x0001056dec9f libplasma.0.dylib`arrow::Buffer::~Buffer() + 
> 63
>     frame #3: 0x000106206661 
> lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr() 
> [inlined] std::__1::__shared_count::__release_shared(this=0x0001019b7d20) 
> at memory:3444
>     frame #4: 0x000106206617 
> lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr() 
> [inlined] 
> std::__1::__shared_weak_count::__release_shared(this=0x0001019b7d20) at 
> memory:3486
>     frame #5: 0x000106206617 
> lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr(this=0x000100791780)
>  at memory:4412
>     frame #6: 0x000106002b35 
> lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr(this=0x000100791780)
>  at memory:4410
>     frame #7: 0x0001061052c5 lib.cpython-36m-darwin.so`void 
> __Pyx_call_destructor<std::__1::shared_ptr<arrow::Buffer> >(x=std::__1::shared_ptr<arrow::Buffer>::element_type @ 0x0001019b7d38 
> strong=0 weak=1) at lib.cxx:486
>     frame #8: 0x000106104f93 
> lib.cpython-36m-darwin.so`__pyx_tp_dealloc_7pyarrow_3lib_Buffer(o=0x000100791768)
>  at lib.cxx:107704
>     frame #9: 0x0001069fcd54 
> multiarray.cpython-36m-darwin.so`array_dealloc + 292
>     frame #10: 0x0001000e8daf 
> libpython3.6m.dylib`_PyDict_DelItem_KnownHash + 463
>     frame #11: 0x000100171899 
> libpython3.6m.dylib`_PyEval_EvalFrameDefault + 13321
>     frame #12: 0x0001001791ef 
> libpython3.6m.dylib`_PyEval_EvalCodeWithName + 2447
>     frame #13: 0x00010016e3d4 libpython3.6m.dylib`PyEval_EvalCode + 100
>     frame #14: 0x0001001a3bd6 
> libpython3.6m.dylib`PyRun_InteractiveOneObject + 582
>     frame #15: 0x0001001a350e 
> libpython3.6m.dylib`PyRun_InteractiveLoopFlags + 222
>     frame #16: 0x0001001a33fc libpython3.6m.dylib`PyRun_AnyFileExFlags + 
> 60
>     frame #17: 0x0001001bc835 libpython3.6m.dylib`Py_Main + 3829
>     frame #18: 0x00010df8 python`main + 232
>     frame #19: 0x7fff6cd80015 libdyld.dylib`start + 1
>     frame #20: 0x7fff6cd80015 libdyld.dylib`start + 1{code}
> Basically, the issue is that when the buffer goes out of scope, it calls 
> {{Release}} on the plasma client, but the client has already been deallocated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2448) Segfault when plasma client goes out of scope before buffer.

2018-04-12 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436675#comment-16436675
 ] 

Robert Nishihara commented on ARROW-2448:
-

We may need to wrap the {{PlasmaClient}} in a wrapper class, which the user 
allocates, and which has a shared pointer to the actual {{PlasmaClient}}. 
Similarly, each {{PlasmaBuffer}} needs a shared pointer to the {{PlasmaClient}}.

I think something like this could be made to work.

Not sure if I understand your question correctly, but the buffer can still 
point to a valid region of memory after the client is destroyed (since the 
store is still running).
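The wrapper idea can be sketched in Python (all names here are hypothetical; the real fix lives in the C++ client): every buffer co-owns the client that produced it, so calling {{Release}} through the buffer stays safe even after the user's handle to the client is gone.

```python
class PlasmaBuffer:
    def __init__(self, data, client):
        self.data = data
        self._client = client  # shared ownership keeps the client alive

    def release(self):
        # Safe even if the user's client handle is gone: we co-own it.
        self._client.release(self.data)


class PlasmaClientHandle:
    """User-facing wrapper around the real client connection."""
    def __init__(self, real_client):
        self._client = real_client

    def get(self, object_id):
        # Hand the buffer a reference to the client, mirroring the
        # shared_ptr plan on the C++ side.
        return PlasmaBuffer(self._client.get(object_id), self._client)


class FakeClient:
    """Stand-in for the real connection, just for this sketch."""
    def __init__(self):
        self.released = []

    def get(self, object_id):
        return object_id

    def release(self, data):
        self.released.append(data)


handle = PlasmaClientHandle(FakeClient())
buf = handle.get("oid")
real = buf._client  # the connection the buffer co-owns
del handle          # user handle goes away first...
buf.release()       # ...but releasing the buffer is still safe
```

With the ownership reversed this way, destruction order no longer matters, which is exactly the property the segfault above lacks.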

> Segfault when plasma client goes out of scope before buffer.
> 
>
> Key: ARROW-2448
> URL: https://issues.apache.org/jira/browse/ARROW-2448
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++), Python
>Reporter: Robert Nishihara
>Priority: Major
>
> The following causes a segfault.
>  
> First start a plasma store with
> {code:java}
> plasma_store -s /tmp/store -m 100{code}
> Then run the following in Python.
> {code}
> import pyarrow.plasma as plasma
> import numpy as np
> client = plasma.connect('/tmp/store', '', 0)
> object_id = client.put(np.zeros(3))
> buf = client.get(object_id)
> del client
> del buf  # This segfaults.{code}
> The backtrace is 
> {code:java}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0xfffc)
>   * frame #0: 0x0001056deaee 
> libplasma.0.dylib`plasma::PlasmaClient::Release(plasma::UniqueID const&) + 142
>     frame #1: 0x0001056de9e9 
> libplasma.0.dylib`plasma::PlasmaBuffer::~PlasmaBuffer() + 41
>     frame #2: 0x0001056dec9f libplasma.0.dylib`arrow::Buffer::~Buffer() + 
> 63
>     frame #3: 0x000106206661 
> lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr() 
> [inlined] std::__1::__shared_count::__release_shared(this=0x0001019b7d20) 
> at memory:3444
>     frame #4: 0x000106206617 
> lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr() 
> [inlined] 
> std::__1::__shared_weak_count::__release_shared(this=0x0001019b7d20) at 
> memory:3486
>     frame #5: 0x000106206617 
> lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr(this=0x000100791780)
>  at memory:4412
>     frame #6: 0x000106002b35 
> lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr(this=0x000100791780)
>  at memory:4410
>     frame #7: 0x0001061052c5 lib.cpython-36m-darwin.so`void 
> __Pyx_call_destructor<std::__1::shared_ptr<arrow::Buffer> >(x=std::__1::shared_ptr<arrow::Buffer>::element_type @ 0x0001019b7d38 
> strong=0 weak=1) at lib.cxx:486
>     frame #8: 0x000106104f93 
> lib.cpython-36m-darwin.so`__pyx_tp_dealloc_7pyarrow_3lib_Buffer(o=0x000100791768)
>  at lib.cxx:107704
>     frame #9: 0x0001069fcd54 
> multiarray.cpython-36m-darwin.so`array_dealloc + 292
>     frame #10: 0x0001000e8daf 
> libpython3.6m.dylib`_PyDict_DelItem_KnownHash + 463
>     frame #11: 0x000100171899 
> libpython3.6m.dylib`_PyEval_EvalFrameDefault + 13321
>     frame #12: 0x0001001791ef 
> libpython3.6m.dylib`_PyEval_EvalCodeWithName + 2447
>     frame #13: 0x00010016e3d4 libpython3.6m.dylib`PyEval_EvalCode + 100
>     frame #14: 0x0001001a3bd6 
> libpython3.6m.dylib`PyRun_InteractiveOneObject + 582
>     frame #15: 0x0001001a350e 
> libpython3.6m.dylib`PyRun_InteractiveLoopFlags + 222
>     frame #16: 0x0001001a33fc libpython3.6m.dylib`PyRun_AnyFileExFlags + 
> 60
>     frame #17: 0x0001001bc835 libpython3.6m.dylib`Py_Main + 3829
>     frame #18: 0x00010df8 python`main + 232
>     frame #19: 0x7fff6cd80015 libdyld.dylib`start + 1
>     frame #20: 0x7fff6cd80015 libdyld.dylib`start + 1{code}
> Basically, the issue is that when the buffer goes out of scope, it calls 
> {{Release}} on the plasma client, but the client has already been deallocated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2437) [C++] Change of arrow::ipc::ReadMessage signature breaks ABI compatibility

2018-04-12 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2437.
-
   Resolution: Fixed
Fix Version/s: (was: 0.9.1)
   JS-0.4.0

Issue resolved by pull request 1874
[https://github.com/apache/arrow/pull/1874]

> [C++] Change of arrow::ipc::ReadMessage signature breaks ABI compatibility
> 
>
> Key: ARROW-2437
> URL: https://issues.apache.org/jira/browse/ARROW-2437
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>
> We changed the signature of the method from
> {code}
> ReadMessage ( arrow::io::InputStream* file, std::unique_ptr<Message, std::default_delete<Message> >* message ) 
> {code}
> to
> {code}
> ReadMessage ( arrow::io::InputStream* file, std::unique_ptr<Message, std::default_delete<Message> >* message, bool aligned ) 
> {code}
> We should add the old signature so that the 0.9.1 release is ABI compatible 
> with 0.9.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2437) [C++] Change of arrow::ipc::ReadMessage signature breaks ABI compatibility

2018-04-12 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara reassigned ARROW-2437:
---

Assignee: Robert Nishihara

> [C++] Change of arrow::ipc::ReadMessage signature breaks ABI compatibility
> 
>
> Key: ARROW-2437
> URL: https://issues.apache.org/jira/browse/ARROW-2437
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>
> We changed the signature of the method from
> {code}
> ReadMessage ( arrow::io::InputStream* file, std::unique_ptr<Message, std::default_delete<Message> >* message ) 
> {code}
> to
> {code}
> ReadMessage ( arrow::io::InputStream* file, std::unique_ptr<Message, std::default_delete<Message> >* message, bool aligned ) 
> {code}
> We should add the old signature so that the 0.9.1 release is ABI compatible 
> with 0.9.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2451) Handle more dtypes efficiently in custom numpy array serializer.

2018-04-12 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2451.
-
   Resolution: Fixed
Fix Version/s: JS-0.4.0

Issue resolved by pull request 1887
[https://github.com/apache/arrow/pull/1887]

> Handle more dtypes efficiently in custom numpy array serializer.
> 
>
> Key: ARROW-2451
> URL: https://issues.apache.org/jira/browse/ARROW-2451
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>
> Right now certain dtypes like bool or fixed-length strings are serialized as 
> lists, which is inefficient. We can handle these more efficiently by casting 
> them to uint8 and saving the original dtype as additional data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2451) Handle more dtypes efficiently in custom numpy array serializer.

2018-04-12 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara reassigned ARROW-2451:
---

Assignee: Robert Nishihara

> Handle more dtypes efficiently in custom numpy array serializer.
> 
>
> Key: ARROW-2451
> URL: https://issues.apache.org/jira/browse/ARROW-2451
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>
> Right now certain dtypes like bool or fixed-length strings are serialized as 
> lists, which is inefficient. We can handle these more efficiently by casting 
> them to uint8 and saving the original dtype as additional data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2451) Handle more dtypes efficiently in custom numpy array serializer.

2018-04-12 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2451:
---

 Summary: Handle more dtypes efficiently in custom numpy array 
serializer.
 Key: ARROW-2451
 URL: https://issues.apache.org/jira/browse/ARROW-2451
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


Right now certain dtypes like bool or fixed-length strings are serialized as 
lists, which is inefficient. We can handle these more efficiently by casting 
them to uint8 and saving the original dtype as additional data.
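The casting trick can be sketched with plain numpy (an illustration, not pyarrow's actual serializer code): reinterpret the contiguous bytes as uint8 for transport and record the original dtype string alongside, then view the bytes back on deserialization.

```python
import numpy as np

def encode(arr):
    # Reinterpret the raw bytes as uint8 and remember the original
    # dtype, instead of falling back to a slow Python-list form.
    assert arr.flags.c_contiguous  # views require contiguous memory
    return arr.view(np.uint8), str(arr.dtype)

def decode(raw, dtype_str):
    # Restore the original dtype by viewing the same bytes.
    return raw.view(np.dtype(dtype_str))

bools = np.array([True, False, True])
raw, dt = encode(bools)
restored = decode(raw, dt)

names = np.array([b"ab", b"cd"], dtype="S2")  # fixed-length strings
raw2, dt2 = encode(names)
restored2 = decode(raw2, dt2)
```

Because {{view}} never copies, the fast path stays zero-copy: only the dtype string is extra metadata.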



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2448) Segfault when plasma client goes out of scope before buffer.

2018-04-11 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2448:
---

 Summary: Segfault when plasma client goes out of scope before 
buffer.
 Key: ARROW-2448
 URL: https://issues.apache.org/jira/browse/ARROW-2448
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++), Python
Reporter: Robert Nishihara


The following causes a segfault.

 

First start a plasma store with
{code:java}
plasma_store -s /tmp/store -m 100{code}
Then run the following in Python.
{code}
import pyarrow.plasma as plasma
import numpy as np

client = plasma.connect('/tmp/store', '', 0)

object_id = client.put(np.zeros(3))

buf = client.get(object_id)

del client

del buf  # This segfaults.{code}
The backtrace is 
{code:java}
(lldb) bt

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0xfffc)

  * frame #0: 0x0001056deaee 
libplasma.0.dylib`plasma::PlasmaClient::Release(plasma::UniqueID const&) + 142

    frame #1: 0x0001056de9e9 
libplasma.0.dylib`plasma::PlasmaBuffer::~PlasmaBuffer() + 41

    frame #2: 0x0001056dec9f libplasma.0.dylib`arrow::Buffer::~Buffer() + 63

    frame #3: 0x000106206661 
lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr() 
[inlined] std::__1::__shared_count::__release_shared(this=0x0001019b7d20) 
at memory:3444

    frame #4: 0x000106206617 
lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr() 
[inlined] 
std::__1::__shared_weak_count::__release_shared(this=0x0001019b7d20) at 
memory:3486

    frame #5: 0x000106206617 
lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr(this=0x000100791780)
 at memory:4412

    frame #6: 0x000106002b35 
lib.cpython-36m-darwin.so`std::__1::shared_ptr<arrow::Buffer>::~shared_ptr(this=0x000100791780)
 at memory:4410

    frame #7: 0x0001061052c5 lib.cpython-36m-darwin.so`void 
__Pyx_call_destructor<std::__1::shared_ptr<arrow::Buffer> >(x=std::__1::shared_ptr<arrow::Buffer>::element_type @ 0x0001019b7d38 
strong=0 weak=1) at lib.cxx:486

    frame #8: 0x000106104f93 
lib.cpython-36m-darwin.so`__pyx_tp_dealloc_7pyarrow_3lib_Buffer(o=0x000100791768)
 at lib.cxx:107704

    frame #9: 0x0001069fcd54 multiarray.cpython-36m-darwin.so`array_dealloc 
+ 292

    frame #10: 0x0001000e8daf libpython3.6m.dylib`_PyDict_DelItem_KnownHash 
+ 463

    frame #11: 0x000100171899 libpython3.6m.dylib`_PyEval_EvalFrameDefault 
+ 13321

    frame #12: 0x0001001791ef libpython3.6m.dylib`_PyEval_EvalCodeWithName 
+ 2447

    frame #13: 0x00010016e3d4 libpython3.6m.dylib`PyEval_EvalCode + 100

    frame #14: 0x0001001a3bd6 
libpython3.6m.dylib`PyRun_InteractiveOneObject + 582

    frame #15: 0x0001001a350e 
libpython3.6m.dylib`PyRun_InteractiveLoopFlags + 222

    frame #16: 0x0001001a33fc libpython3.6m.dylib`PyRun_AnyFileExFlags + 60

    frame #17: 0x0001001bc835 libpython3.6m.dylib`Py_Main + 3829

    frame #18: 0x00010df8 python`main + 232

    frame #19: 0x7fff6cd80015 libdyld.dylib`start + 1

    frame #20: 0x7fff6cd80015 libdyld.dylib`start + 1{code}
Basically, the issue is that when the buffer goes out of scope, it calls 
{{Release}} on the plasma client, but the client has already been deallocated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2397) Document changes in Tensor encoding in IPC.md.

2018-04-04 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2397:
---

 Summary: Document changes in Tensor encoding in IPC.md.
 Key: ARROW-2397
 URL: https://issues.apache.org/jira/browse/ARROW-2397
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Robert Nishihara


Update IPC.md to reflect the changes in 
https://github.com/apache/arrow/pull/1802.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-14 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398143#comment-16398143
 ] 

Robert Nishihara commented on ARROW-2308:
-

It's probably worth discussing the best way to do this, since it involves 
changing the format a little.

 

cc [~pcmoritz] [~wesmckinn]

> Serialized tensor data should be 64-byte aligned.
> -
>
> Key: ARROW-2308
> URL: https://issues.apache.org/jira/browse/ARROW-2308
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>
> See [https://github.com/ray-project/ray/issues/1658] for an example of this 
> issue. Non-aligned data can trigger a copy when fed into TensorFlow and 
> things like that.
> {code}
> import pyarrow as pa
> import numpy as np
> x = np.zeros(10)
> y = pa.deserialize(pa.serialize(x).to_buffer())
> x.ctypes.data % 64  # 0 (it starts out aligned)
> y.ctypes.data % 64  # 48 (it is no longer aligned)
> {code}
> It should be possible to fix this by calling something like 
> {{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. 
> Note that we already do this before writing the tensor header, but the tensor 
> header is not necessarily a multiple of 64 bytes, so the subsequent data can 
> be unaligned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-14 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398143#comment-16398143
 ] 

Robert Nishihara edited comment on ARROW-2308 at 3/14/18 6:27 AM:
--

It's probably worth discussing the best way to do this, since it involves 
changing the format a little.

cc [~pcmoritz] [~wesmckinn]


was (Author: robertnishihara):
It's probably worth discussing the best way to do this, since it involves 
changing the format a little.

 

cc [~pcmoritz] [~wesmckinn]

> Serialized tensor data should be 64-byte aligned.
> -
>
> Key: ARROW-2308
> URL: https://issues.apache.org/jira/browse/ARROW-2308
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>
> See [https://github.com/ray-project/ray/issues/1658] for an example of this 
> issue. Non-aligned data can trigger a copy when fed into TensorFlow and 
> things like that.
> {code}
> import pyarrow as pa
> import numpy as np
> x = np.zeros(10)
> y = pa.deserialize(pa.serialize(x).to_buffer())
> x.ctypes.data % 64  # 0 (it starts out aligned)
> y.ctypes.data % 64  # 48 (it is no longer aligned)
> {code}
> It should be possible to fix this by calling something like 
> {{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. 
> Note that we already do this before writing the tensor header, but the tensor 
> header is not necessarily a multiple of 64 bytes, so the subsequent data can 
> be unaligned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-13 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2308:
---

 Summary: Serialized tensor data should be 64-byte aligned.
 Key: ARROW-2308
 URL: https://issues.apache.org/jira/browse/ARROW-2308
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


See [https://github.com/ray-project/ray/issues/1658] for an example of this 
issue. Non-aligned data can trigger a copy when fed into TensorFlow and things 
like that.
{code}
import pyarrow as pa
import numpy as np

x = np.zeros(10)
y = pa.deserialize(pa.serialize(x).to_buffer())

x.ctypes.data % 64  # 0 (it starts out aligned)
y.ctypes.data % 64  # 48 (it is no longer aligned)
{code}
It should be possible to fix this by calling something like 
{{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. 
Note that we already do this before writing the tensor header, but the tensor 
header is not necessarily a multiple of 64 bytes, so the subsequent data can be 
unaligned.
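The alignment arithmetic behind a fix like {{AlignStreamPosition}} can be illustrated with numpy alone (a sketch, not Arrow's implementation): compute the padding needed to reach the next 64-byte boundary, and carve an aligned slice out of an over-allocated buffer.

```python
import numpy as np

def pad_to_alignment(offset, alignment=64):
    # Bytes of padding so the next write lands on a 64-byte boundary;
    # this is the arithmetic a stream-alignment call performs.
    return (alignment - offset % alignment) % alignment

def aligned_zeros(n_bytes, alignment=64):
    # Over-allocate, then slice starting at the first aligned address,
    # since numpy itself does not guarantee 64-byte allocations.
    buf = np.zeros(n_bytes + alignment, dtype=np.uint8)
    start = pad_to_alignment(buf.ctypes.data, alignment)
    return buf[start:start + n_bytes]

x = aligned_zeros(80)
```

The same padding computation applied after the tensor header is what keeps the subsequent data at `data_ptr % 64 == 0`.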



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2265) Serializing subclasses of np.ndarray returns a np.ndarray.

2018-03-05 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2265:
---

 Summary: Serializing subclasses of np.ndarray returns a np.ndarray.
 Key: ARROW-2265
 URL: https://issues.apache.org/jira/browse/ARROW-2265
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara
Assignee: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382868#comment-16382868
 ] 

Robert Nishihara commented on ARROW-2237:
-

Interesting, does {{/mnt/hugepages}} exist locally? If not, the test should be 
skipped. If so, then maybe there is some permission error or something.

> [Python] Huge tables test failure
> -
>
> Key: ARROW-2237
> URL: https://issues.apache.org/jira/browse/ARROW-2237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> This is a new failure here (Ubuntu 16.04, x86-64):
> {code}
> _ test_use_huge_pages 
> _
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, 
> in test_use_huge_pages
> create_object(plasma_client, 1)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in 
> create_object
> seal=seal)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in 
> create_object_with_id
> memory_buffer = client.create(object_id, data_size, metadata)
>   File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 
> code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, )
> /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, 
> , buffer)
> Encountered unexpected EOF
>  Captured stderr call 
> -
> Allowing the Plasma store to use up to 0.1GB of memory.
> Starting object store with directory /mnt/hugepages and huge page support 
> enabled
> create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2215) [Plasma] Error when using huge pages

2018-02-28 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-2215.
-
   Resolution: Fixed
Fix Version/s: JS-0.4.0

Issue resolved by pull request 1660
[https://github.com/apache/arrow/pull/1660]

> [Plasma] Error when using huge pages
> 
>
> Key: ARROW-2215
> URL: https://issues.apache.org/jira/browse/ARROW-2215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>
> see https://github.com/ray-project/ray/issues/1592



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2215) [Plasma] Error when using huge pages

2018-02-28 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara reassigned ARROW-2215:
---

Assignee: Philipp Moritz

> [Plasma] Error when using huge pages
> 
>
> Key: ARROW-2215
> URL: https://issues.apache.org/jira/browse/ARROW-2215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.4.0
>
>
> see https://github.com/ray-project/ray/issues/1592



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371973#comment-16371973
 ] 

Robert Nishihara commented on ARROW-2193:
-

Do you know that {{fork}} is being called? Another way this could happen is if 
the tests fail to kill the plasma store and leave a bunch of them running.

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2138) [C++] Have FatalLog abort instead of exiting

2018-02-12 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361134#comment-16361134
 ] 

Robert Nishihara commented on ARROW-2138:
-

We did this in Ray a while back (to generate core dumps), and it was a great 
change.

> [C++] Have FatalLog abort instead of exiting
> 
>
> Key: ARROW-2138
> URL: https://issues.apache.org/jira/browse/ARROW-2138
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Trivial
> Fix For: 0.9.0
>
>
> Not sure this is desirable, since {{util/logging.h}} was taken from glog, but 
> the various debug checks currently {{std::exit(1)}} on failure. This is a clean 
> exit (though with an error code) and therefore doesn't trigger the usual 
> debugging tools such as gdb or Python's faulthandler. By replacing it with 
> something like {{std::abort()}} the exit would be recognized as a process 
> crash.
>  
> Thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2122) Pyarrow fails to serialize dataframe with timestamp.

2018-02-08 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara updated ARROW-2122:

Description: 
The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd

df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 

s = pa.serialize(df).to_buffer()
new_df = pa.deserialize(s) # this fails{code}
The last line fails with
{code:java}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "serialization.pxi", line 441, in pyarrow.lib.deserialize
  File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
  File "serialization.pxi", line 257, in 
pyarrow.lib.SerializedPyObject.deserialize
  File "serialization.pxi", line 174, in 
pyarrow.lib.SerializationContext._deserialize_callback
  File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
_deserialize_pandas_dataframe
    return pdcompat.serialized_dict_to_dataframe(data)
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
serialized_dict_to_dataframe
    for block in data['blocks']]
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in <listcomp>
    for block in data['blocks']]
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
_reconstruct_block
    dtype = _make_datetimetz(item['timezone'])
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
_make_datetimetz
    return DatetimeTZDtype('ns', tz=tz)
  File 
"/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
 line 409, in __new__
    raise ValueError("DatetimeTZDtype constructor must have a tz "
ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
 

  was:
The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd


s = pa.serialize({code}


> Pyarrow fails to serialize dataframe with timestamp.
> 
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> <listcomp>
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2122) Pyarrow fails to serialize dataframe with timestamp.

2018-02-08 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2122:
---

 Summary: Pyarrow fails to serialize dataframe with timestamp.
 Key: ARROW-2122
 URL: https://issues.apache.org/jira/browse/ARROW-2122
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara


The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd


s = pa.serialize({code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2121) Consider special casing object arrays in pandas serializers.

2018-02-08 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2121:
---

 Summary: Consider special casing object arrays in pandas 
serializers.
 Key: ARROW-2121
 URL: https://issues.apache.org/jira/browse/ARROW-2121
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2065) Fix bug in SerializationContext.clone().

2018-01-30 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2065:
---

 Summary: Fix bug in SerializationContext.clone().
 Key: ARROW-2065
 URL: https://issues.apache.org/jira/browse/ARROW-2065
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


We currently fail to copy over one of the fields.
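A minimal sketch of the bug class (field and method names here are hypothetical, not pyarrow's actual attributes): {{clone()}} has to copy every registered mapping, because forgetting one makes custom serializers silently vanish from the cloned context.

```python
class SerializationContext:
    def __init__(self):
        self.type_to_serializer = {}
        self.type_to_deserializer = {}
        self.type_to_type_id = {}  # the kind of field that is easy to miss

    def register(self, typ, type_id, serializer, deserializer):
        self.type_to_type_id[typ] = type_id
        self.type_to_serializer[typ] = serializer
        self.type_to_deserializer[typ] = deserializer

    def clone(self):
        # Shallow-copy *each* dict so later registrations don't leak
        # between the original and the clone, and none are dropped.
        other = SerializationContext()
        other.type_to_serializer = dict(self.type_to_serializer)
        other.type_to_deserializer = dict(self.type_to_deserializer)
        other.type_to_type_id = dict(self.type_to_type_id)
        return other

ctx = SerializationContext()
ctx.register(complex, "py.complex", repr, complex)
clone = ctx.clone()
clone.register(frozenset, "py.frozenset", sorted, frozenset)
```

A regression test should assert every field survives cloning, so a newly added field cannot be forgotten again without a failure.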



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2045) More primitive operations on plasma store

2018-01-27 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342457#comment-16342457
 ] 

Robert Nishihara commented on ARROW-2045:
-

That makes sense. So if I understand correctly, the smallest change that would 
enable this to work for you is to be able to call {{put}} and pass in a flag 
that says not to evict anything if there is not enough space.

Then you could implement a blocking put by calling {{put}}, catching any 
exceptions, and looping until the {{put}} call succeeds.

And the blocking {{get}} is already implemented.

Does that sound right?
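The retry loop described above can be sketched as follows (the {{evict_if_full}} flag and {{PlasmaStoreFull}} error are the *proposed* API, not something pyarrow exposes; the stub client stands in for a store that eventually frees up space):

```python
import time

class PlasmaStoreFull(Exception):
    """Hypothetical error from put(..., evict_if_full=False) when full."""

def blocking_put(client, value, timeout=10.0, poll=0.1):
    # Try a non-evicting put and retry until it succeeds or the
    # timeout expires -- the blocking behavior built from the
    # non-blocking primitive.
    deadline = time.monotonic() + timeout
    while True:
        try:
            return client.put(value, evict_if_full=False)
        except PlasmaStoreFull:
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)

class StubClient:
    """Fails a few times, then succeeds: simulates the store draining."""
    def __init__(self, fail_times):
        self.fail_times = fail_times
        self.calls = 0

    def put(self, value, evict_if_full=True):
        self.calls += 1
        if self.calls <= self.fail_times:
            raise PlasmaStoreFull()
        return "object-id"

client = StubClient(fail_times=2)
result = blocking_put(client, b"data", timeout=5.0, poll=0.0)
```

This keeps the store-side protocol unchanged: only the failure-on-full primitive is new, and the blocking variants are client-side sugar.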

> More primitive operations on plasma store
> -
>
> Key: ARROW-2045
> URL: https://issues.apache.org/jira/browse/ARROW-2045
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Plasma (C++)
>Reporter: Yuxin Wu
>Priority: Minor
>
> Hi Developers,
> I found plasma store very useful – it's fast and simple to use. However, I 
> think there are more operations that can make it a more general IPC/messaging 
> tool and potentially helpful in more scenarios.
> Conceptually, an object store can support the following "put" methods:
>  # Evict when full
>  # Wait for space when full, perhaps with a timeout (i.e. blocking)
>  # Return failure when full (i.e. non-blocking)
> And the following "get" methods:
>  # Wait for the object to appear (i.e. blocking)
>  # Return failure when object doesn't exist (i.e. non-blocking)
>  # Remove the object after get
> Some of the above features can be implemented with others. But some of them 
> are primitives (e.g. return failure when full) that need to be supported.
>  
> My use case: I wanted to use plasma to send/recv large buffers between 
> processes, i.e. build a message passing interface on top of shared memory. 
> Plasma has made it quite easy (only have to send/recv the id) and efficient 
> (faster than unix pipe). But "evict when full" is now the only available 
> "put" method, so that could create many trouble if I want to ensure message 
> delivery.





[jira] [Commented] (ARROW-1880) [Python] Plasma test flakiness in Travis CI

2018-01-25 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340472#comment-16340472
 ] 

Robert Nishihara commented on ARROW-1880:
-

Thanks, please point it out if you do see it.

> [Python] Plasma test flakiness in Travis CI
> ---
>
> Key: ARROW-1880
> URL: https://issues.apache.org/jira/browse/ARROW-1880
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> We've been seeing intermittent flakiness of the variety:
> {code}
>  ERRORS 
> 
> __ ERROR at setup of TestPlasmaClient.test_use_one_memory_mapped_file 
> __
> self = 
> test_method =  of >
> def setup_method(self, test_method):
> use_one_memory_mapped_file = (test_method ==
>   self.test_use_one_memory_mapped_file)
>
> import pyarrow.plasma as plasma
> # Start Plasma store.
> plasma_store_name, self.p = start_plasma_store(
> use_valgrind=os.getenv("PLASMA_VALGRIND") == "1",
> use_one_memory_mapped_file=use_one_memory_mapped_file)
> # Connect to Plasma.
> >   self.plasma_client = plasma.connect(plasma_store_name, "", 64)
> pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_plasma.py:164:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> plasma.pyx:672: in pyarrow.plasma.connect
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   ???
> E   pyarrow.lib.ArrowIOError: Could not connect to socket 
> /tmp/plasma_store43998835
> {code}





[jira] [Created] (ARROW-2024) Remove global SerializationContext variables.

2018-01-23 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2024:
---

 Summary: Remove global SerializationContext variables.
 Key: ARROW-2024
 URL: https://issues.apache.org/jira/browse/ARROW-2024
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


We should get rid of the global variables 
_default_serialization_context and pandas_serialization_context 
and replace them with functions default_serialization_context() and 
pandas_serialization_context().

This will also make it faster to do import pyarrow.
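The lazy-initialization pattern proposed above can be sketched as follows; the cached module-level object and the `type_registry` field are illustrative, not the real context contents:

```python
_default_context = None

def default_serialization_context():
    # Build the context on first use instead of at import time,
    # so "import pyarrow" does not pay the registration cost.
    global _default_context
    if _default_context is None:
        _default_context = {"type_registry": {}}  # stand-in for the real context
    return _default_context

# Repeated calls return the same cached object.
assert default_serialization_context() is default_serialization_context()
```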





[jira] [Comment Edited] (ARROW-2016) [Python] Fix up ASV benchmarking setup and document procedure for use

2018-01-21 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333689#comment-16333689
 ] 

Robert Nishihara edited comment on ARROW-2016 at 1/21/18 9:24 PM:
--

Are all of the benchmarks in 
[https://github.com/apache/arrow/tree/master/python/benchmarks] or are there 
any others?


was (Author: robertnishihara):
Are all of the benchmarks in 
[https://github.com/apache/arrow/tree/f72279b2dbfc663d2217e64075dd731199f12611/python/benchmarks?|https://github.com/apache/arrow/tree/f72279b2dbfc663d2217e64075dd731199f12611/python/benchmarks]
 Any others?

> [Python] Fix up ASV benchmarking setup and document procedure for use
> -
>
> Key: ARROW-2016
> URL: https://issues.apache.org/jira/browse/ARROW-2016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> We need to start writing more microbenchmarks





[jira] [Commented] (ARROW-2016) [Python] Fix up ASV benchmarking setup and document procedure for use

2018-01-21 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333689#comment-16333689
 ] 

Robert Nishihara commented on ARROW-2016:
-

Are all of the benchmarks in 
[https://github.com/apache/arrow/tree/f72279b2dbfc663d2217e64075dd731199f12611/python/benchmarks?|https://github.com/apache/arrow/tree/f72279b2dbfc663d2217e64075dd731199f12611/python/benchmarks]
 Any others?

> [Python] Fix up ASV benchmarking setup and document procedure for use
> -
>
> Key: ARROW-2016
> URL: https://issues.apache.org/jira/browse/ARROW-2016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> We need to start writing more microbenchmarks





[jira] [Created] (ARROW-2011) Allow setting the pickler to use in pyarrow serialization.

2018-01-18 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2011:
---

 Summary: Allow setting the pickler to use in pyarrow serialization.
 Key: ARROW-2011
 URL: https://issues.apache.org/jira/browse/ARROW-2011
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara
Assignee: Robert Nishihara


We currently try to import cloudpickle and failing that fall back to pickle. 
However, given that there are many versions of cloudpickle and they are 
typically incompatible with one another, the caller may want to specify a 
specific version, so we should allow them to set the specific pickler to use.





[jira] [Created] (ARROW-2000) Deduplicate file descriptors when plasma store replies to get request.

2018-01-15 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2000:
---

 Summary: Deduplicate file descriptors when plasma store replies to 
get request.
 Key: ARROW-2000
 URL: https://issues.apache.org/jira/browse/ARROW-2000
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Right now when the plasma store replies to a GetRequest from a client, it sends 
many file descriptors over the relevant socket (by calling {{send_fd}}). 
However, many of these file descriptors are redundant and so we should 
deduplicate them before sending.

 

Note that I often see the error "Failed to send file descriptor, retrying." 
printed when getting around 100 objects from the store. This may alleviate that.
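Deduplicating can be as simple as sending each distinct descriptor once and having every object refer to it by index. A minimal sketch of that mapping (the store-side wire format is an assumption here):

```python
def dedup_fds(fds):
    # Collapse the per-object fd list into (unique_fds, index_per_object):
    # each distinct descriptor is sent over the socket once, and every
    # object references it by position in the unique list.
    unique, index_of, indices = [], {}, []
    for fd in fds:
        if fd not in index_of:
            index_of[fd] = len(unique)
            unique.append(fd)
        indices.append(index_of[fd])
    return unique, indices

# 100 objects often live in just a few memory-mapped files,
# so only a handful of descriptors actually need to be sent.
unique, idx = dedup_fds([7, 7, 9, 7, 9])
assert unique == [7, 9] and idx == [0, 0, 1, 0, 1]
```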





[jira] [Commented] (ARROW-1972) Deserialization of buffer objects (and pandas dataframes) segfaults on different processes.

2018-01-06 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314927#comment-16314927
 ] 

Robert Nishihara commented on ARROW-1972:
-

Note that I encountered this first when using pandas dataframes and simplified 
it to this case.

> Deserialization of buffer objects (and pandas dataframes) segfaults on 
> different processes.
> ---
>
> Key: ARROW-1972
> URL: https://issues.apache.org/jira/browse/ARROW-1972
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>
> To see the issue, first serialize a pyarrow buffer.
> {code}
> import pyarrow as pa
> serialized = pa.serialize(pa.frombuffer(b'hello')).to_buffer().to_pybytes()
> print(serialized)  # b'\x00\x00\x00\x00\x01...'
> {code}
> Deserializing it within the same process succeeds, however deserializing it 
> in a **separate process** causes a segfault. E.g.,
> {code}
> import pyarrow as pa
> pa.deserialize(b'\x00\x00\x00\x00\x01...')  # This segfaults
> {code}
> The backtrace is
> {code}
> (lldb) bt
> * thread #1, queue = ‘com.apple.main-thread’, stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x0)
>   * frame #0: 0x
> frame #1: 0x000105605534 
> libarrow_python.0.dylib`arrow::py::wrap_buffer(buffer=std::__1::shared_ptr::element_type
>  @ 0x00010060c348 strong=1 weak=1) at pyarrow.cc:48
> frame #2: 0x00010554fdee 
> libarrow_python.0.dylib`arrow::py::GetValue(context=0x000108f17818, 
> parent=0x000100645438, arr=0x000100622938, index=0, type=0, 
> base=0x000108f0e528, blobs=0x000108f09588, result=0x7fff5fbfd218) 
> at arrow_to_python.cc:173
> frame #3: 0x00010554d93a 
> libarrow_python.0.dylib`arrow::py::DeserializeList(context=0x000108f17818,
>  array=0x000100645438, start_idx=0, stop_idx=2, base=0x000108f0e528, 
> blobs=0x000108f09588, out=0x7fff5fbfd470) at arrow_to_python.cc:208
> frame #4: 0x00010554d302 
> libarrow_python.0.dylib`arrow::py::DeserializeDict(context=0x000108f17818,
>  array=0x000100645338, start_idx=0, stop_idx=2, base=0x000108f0e528, 
> blobs=0x000108f09588, out=0x7fff5fbfddd8) at arrow_to_python.cc:74
> frame #5: 0x00010554f249 
> libarrow_python.0.dylib`arrow::py::GetValue(context=0x000108f17818, 
> parent=0x0001006377a8, arr=0x000100645298, index=0, type=0, 
> base=0x000108f0e528, blobs=0x000108f09588, result=0x7fff5fbfddd8) 
> at arrow_to_python.cc:158
> frame #6: 0x00010554d93a 
> libarrow_python.0.dylib`arrow::py::DeserializeList(context=0x000108f17818,
>  array=0x0001006377a8, start_idx=0, stop_idx=1, base=0x000108f0e528, 
> blobs=0x000108f09588, out=0x7fff5fbfdfe8) at arrow_to_python.cc:208
> frame #7: 0x000105551fbf 
> libarrow_python.0.dylib`arrow::py::DeserializeObject(context=0x000108f17818,
>  obj=0x000108f09588, base=0x000108f0e528, out=0x7fff5fbfdfe8) at 
> arrow_to_python.cc:287
> frame #8: 0x000104abecae 
> lib.cpython-36m-darwin.so`__pyx_pf_7pyarrow_3lib_18SerializedPyObject_2deserialize(__pyx_v_self=0x000108f09570,
>  __pyx_v_context=0x000108f17818) at lib.cxx:88592
> frame #9: 0x000104abdec4 
> lib.cpython-36m-darwin.so`__pyx_pw_7pyarrow_3lib_18SerializedPyObject_3deserialize(__pyx_v_self=0x000108f09570,
>  __pyx_args=0x00010231f358, __pyx_kwds=0x) at 
> lib.cxx:88514
> frame #10: 0x00010008b5f1 python`PyCFunction_Call + 145
> frame #11: 0x000104941208 
> lib.cpython-36m-darwin.so`__Pyx_PyObject_Call(func=0x000108f302d0, 
> arg=0x00010231f358, kw=0x) at lib.cxx:116108
> frame #12: 0x000104b0e3fa 
> lib.cpython-36m-darwin.so`__Pyx__PyObject_CallOneArg(func=0x000108f302d0, 
> arg=0x000108f17818) at lib.cxx:116147
> frame #13: 0x000104944bc6 
> lib.cpython-36m-darwin.so`__Pyx_PyObject_CallOneArg(func=0x000108f302d0, 
> arg=0x000108f17818) at lib.cxx:116166
> frame #14: 0x000104b09873 
> lib.cpython-36m-darwin.so`__pyx_pf_7pyarrow_3lib_124deserialize_from(__pyx_self=0x,
>  __pyx_v_source=0x000108ddeee8, __pyx_v_base=0x000108f0e528, 
> __pyx_v_context=0x000108f17818) at lib.cxx:90327
> frame #15: 0x000104b09310 
> lib.cpython-36m-darwin.so`__pyx_pw_7pyarrow_3lib_125deserialize_from(__pyx_self=0x,
>  __pyx_args=0x000108f10d38, __pyx_kwds=0x) at 
> lib.cxx:90260
> frame #16: 0x00010008b5f1 python`PyCFunction_Call + 145
> frame #17: 0x000104941208 
> lib.cpython-36m-darwin.so`__Pyx_PyObject_Call(func=0x000108baf1b0, 
> arg=0x000108f10d38, kw=0x) at lib.cxx:116108
> frame #18: 

[jira] [Created] (ARROW-1972) Deserialization of buffer objects (and pandas dataframes) segfaults on different processes.

2018-01-06 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1972:
---

 Summary: Deserialization of buffer objects (and pandas dataframes) 
segfaults on different processes.
 Key: ARROW-1972
 URL: https://issues.apache.org/jira/browse/ARROW-1972
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara


To see the issue, first serialize a pyarrow buffer.

{code}
import pyarrow as pa

serialized = pa.serialize(pa.frombuffer(b'hello')).to_buffer().to_pybytes()

print(serialized)  # 
b'\x00\x00\x00\x00\x01\x00\x00\x00\xcc\x00\x00\x00\x10\x00\x00\x00\x0c\x00\x0e\x00\x06\x00\x05\x00\x08\x00\x00\x00\x0c\x00\x00\x00\x00\x01\x03\x00\x10\x00\x00\x00\x00\x00\n\x00\x08\x00\x00\x00\x04\x00\x00\x00\n\x00\x00\x00\x04\x00\x00\x00\x01\x00\x00\x00\x04\x00\x00\x00\xc6\xff\xff\xff\x00\x00\x01\x0e|\x00\x00\x00\x18\x00\x00\x00\x04\x00\x00\x00\x01\x00\x00\x004\x00\x00\x00\x08\x00\x0c\x00\x06\x00\x08\x00\x08\x00\x00\x00\x00\x00\x01\x00\x04\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12\x00\x14\x00\x08\x00\x06\x00\x07\x00\x0c\x00\x00\x00\x10\x00\x00\x00\x12\x00\x00\x00\x00\x00\x01\x02$\x00\x00\x00\x14\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x08\x00\x0c\x00\x08\x00\x07\x00\x08\x00\x00\x00\x00\x00\x00\x01
 
\x00\x00\x00\x06\x00\x00\x00buffer\x00\x00\x04\x00\x00\x00list\x00\x00\x00\x00\x00\x00\x00\x00\xcc\x00\x00\x00\x14\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x16\x00\x06\x00\x05\x00\x08\x00\x0c\x00\x0c\x00\x00\x00\x00\x03\x03\x00\x18\x00\x00\x00h\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x18\x00\x0c\x00\x04\x00\x08\x00\n\x00\x00\x00l\x00\x00\x00\x10\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\x00\x00\x00\x00 
\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00`\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00hello'
{code}

Deserializing it within the same process succeeds, however deserializing it in 
a **separate process** causes a segfault. E.g.,

{code}
import pyarrow as pa

pa.deserialize(b'\x00\x00\x00\x00\x01...')  # This segfaults
{code}

The backtrace is

{code}
(lldb) bt
* thread #1, queue = ‘com.apple.main-thread’, stop reason = EXC_BAD_ACCESS 
(code=1, address=0x0)
  * frame #0: 0x
frame #1: 0x000105605534 
libarrow_python.0.dylib`arrow::py::wrap_buffer(buffer=std::__1::shared_ptr::element_type
 @ 0x00010060c348 strong=1 weak=1) at pyarrow.cc:48
frame #2: 0x00010554fdee 
libarrow_python.0.dylib`arrow::py::GetValue(context=0x000108f17818, 
parent=0x000100645438, arr=0x000100622938, index=0, type=0, 
base=0x000108f0e528, blobs=0x000108f09588, result=0x7fff5fbfd218) 
at arrow_to_python.cc:173
frame #3: 0x00010554d93a 
libarrow_python.0.dylib`arrow::py::DeserializeList(context=0x000108f17818, 
array=0x000100645438, start_idx=0, stop_idx=2, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfd470) at arrow_to_python.cc:208
frame #4: 0x00010554d302 
libarrow_python.0.dylib`arrow::py::DeserializeDict(context=0x000108f17818, 
array=0x000100645338, start_idx=0, stop_idx=2, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfddd8) at arrow_to_python.cc:74
frame #5: 0x00010554f249 
libarrow_python.0.dylib`arrow::py::GetValue(context=0x000108f17818, 
parent=0x0001006377a8, arr=0x000100645298, index=0, type=0, 
base=0x000108f0e528, blobs=0x000108f09588, result=0x7fff5fbfddd8) 
at arrow_to_python.cc:158
frame #6: 0x00010554d93a 
libarrow_python.0.dylib`arrow::py::DeserializeList(context=0x000108f17818, 
array=0x0001006377a8, start_idx=0, stop_idx=1, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfdfe8) at arrow_to_python.cc:208
frame #7: 0x000105551fbf 
libarrow_python.0.dylib`arrow::py::DeserializeObject(context=0x000108f17818,
 obj=0x000108f09588, base=0x000108f0e528, out=0x7fff5fbfdfe8) at 
arrow_to_python.cc:287
frame #8: 0x000104abecae 

[jira] [Updated] (ARROW-1972) Deserialization of buffer objects (and pandas dataframes) segfaults on different processes.

2018-01-06 Thread Robert Nishihara (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara updated ARROW-1972:

Description: 
To see the issue, first serialize a pyarrow buffer.

{code}
import pyarrow as pa

serialized = pa.serialize(pa.frombuffer(b'hello')).to_buffer().to_pybytes()

print(serialized)  # b'\x00\x00\x00\x00\x01...'
{code}

Deserializing it within the same process succeeds, however deserializing it in 
a **separate process** causes a segfault. E.g.,

{code}
import pyarrow as pa

pa.deserialize(b'\x00\x00\x00\x00\x01...')  # This segfaults
{code}

The backtrace is

{code}
(lldb) bt
* thread #1, queue = ‘com.apple.main-thread’, stop reason = EXC_BAD_ACCESS 
(code=1, address=0x0)
  * frame #0: 0x
frame #1: 0x000105605534 
libarrow_python.0.dylib`arrow::py::wrap_buffer(buffer=std::__1::shared_ptr::element_type
 @ 0x00010060c348 strong=1 weak=1) at pyarrow.cc:48
frame #2: 0x00010554fdee 
libarrow_python.0.dylib`arrow::py::GetValue(context=0x000108f17818, 
parent=0x000100645438, arr=0x000100622938, index=0, type=0, 
base=0x000108f0e528, blobs=0x000108f09588, result=0x7fff5fbfd218) 
at arrow_to_python.cc:173
frame #3: 0x00010554d93a 
libarrow_python.0.dylib`arrow::py::DeserializeList(context=0x000108f17818, 
array=0x000100645438, start_idx=0, stop_idx=2, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfd470) at arrow_to_python.cc:208
frame #4: 0x00010554d302 
libarrow_python.0.dylib`arrow::py::DeserializeDict(context=0x000108f17818, 
array=0x000100645338, start_idx=0, stop_idx=2, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfddd8) at arrow_to_python.cc:74
frame #5: 0x00010554f249 
libarrow_python.0.dylib`arrow::py::GetValue(context=0x000108f17818, 
parent=0x0001006377a8, arr=0x000100645298, index=0, type=0, 
base=0x000108f0e528, blobs=0x000108f09588, result=0x7fff5fbfddd8) 
at arrow_to_python.cc:158
frame #6: 0x00010554d93a 
libarrow_python.0.dylib`arrow::py::DeserializeList(context=0x000108f17818, 
array=0x0001006377a8, start_idx=0, stop_idx=1, base=0x000108f0e528, 
blobs=0x000108f09588, out=0x7fff5fbfdfe8) at arrow_to_python.cc:208
frame #7: 0x000105551fbf 
libarrow_python.0.dylib`arrow::py::DeserializeObject(context=0x000108f17818,
 obj=0x000108f09588, base=0x000108f0e528, out=0x7fff5fbfdfe8) at 
arrow_to_python.cc:287
frame #8: 0x000104abecae 
lib.cpython-36m-darwin.so`__pyx_pf_7pyarrow_3lib_18SerializedPyObject_2deserialize(__pyx_v_self=0x000108f09570,
 __pyx_v_context=0x000108f17818) at lib.cxx:88592
frame #9: 0x000104abdec4 
lib.cpython-36m-darwin.so`__pyx_pw_7pyarrow_3lib_18SerializedPyObject_3deserialize(__pyx_v_self=0x000108f09570,
 __pyx_args=0x00010231f358, __pyx_kwds=0x) at lib.cxx:88514
frame #10: 0x00010008b5f1 python`PyCFunction_Call + 145
frame #11: 0x000104941208 
lib.cpython-36m-darwin.so`__Pyx_PyObject_Call(func=0x000108f302d0, 
arg=0x00010231f358, kw=0x) at lib.cxx:116108
frame #12: 0x000104b0e3fa 
lib.cpython-36m-darwin.so`__Pyx__PyObject_CallOneArg(func=0x000108f302d0, 
arg=0x000108f17818) at lib.cxx:116147
frame #13: 0x000104944bc6 
lib.cpython-36m-darwin.so`__Pyx_PyObject_CallOneArg(func=0x000108f302d0, 
arg=0x000108f17818) at lib.cxx:116166
frame #14: 0x000104b09873 
lib.cpython-36m-darwin.so`__pyx_pf_7pyarrow_3lib_124deserialize_from(__pyx_self=0x,
 __pyx_v_source=0x000108ddeee8, __pyx_v_base=0x000108f0e528, 
__pyx_v_context=0x000108f17818) at lib.cxx:90327
frame #15: 0x000104b09310 
lib.cpython-36m-darwin.so`__pyx_pw_7pyarrow_3lib_125deserialize_from(__pyx_self=0x,
 __pyx_args=0x000108f10d38, __pyx_kwds=0x) at lib.cxx:90260
frame #16: 0x00010008b5f1 python`PyCFunction_Call + 145
frame #17: 0x000104941208 
lib.cpython-36m-darwin.so`__Pyx_PyObject_Call(func=0x000108baf1b0, 
arg=0x000108f10d38, kw=0x) at lib.cxx:116108
frame #18: 0x000104b0bf9d 
lib.cpython-36m-darwin.so`__pyx_pf_7pyarrow_3lib_128deserialize(__pyx_self=0x,
 __pyx_v_obj=0x000108f0e528, __pyx_v_context=0x000108f17818) at 
lib.cxx:90770
frame #19: 0x000104b0b7ec 
lib.cpython-36m-darwin.so`__pyx_pw_7pyarrow_3lib_129deserialize(__pyx_self=0x,
 __pyx_args=0x000108def1c8, __pyx_kwds=0x) at lib.cxx:90680
frame #20: 0x00010008b5f1 python`PyCFunction_Call + 145
frame #21: 0x000108d5c468 
plasma.cpython-36m-darwin.so`__Pyx_PyObject_Call(func=0x000108baf240, 
arg=0x000108def1c8, kw=0x) at plasma.cxx:11200
frame #22: 0x000108d744a7 

[jira] [Commented] (ARROW-1880) [Python] Plasma test flakiness in Travis CI

2017-12-05 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279234#comment-16279234
 ] 

Robert Nishihara commented on ARROW-1880:
-

Thanks for filing the issue, I hope to get to this over the weekend. Traveling 
at the moment.

> [Python] Plasma test flakiness in Travis CI
> ---
>
> Key: ARROW-1880
> URL: https://issues.apache.org/jira/browse/ARROW-1880
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> We've been seeing intermittent flakiness of the variety:
> {code}
>  ERRORS 
> 
> __ ERROR at setup of TestPlasmaClient.test_use_one_memory_mapped_file 
> __
> self = 
> test_method =  of >
> def setup_method(self, test_method):
> use_one_memory_mapped_file = (test_method ==
>   self.test_use_one_memory_mapped_file)
>
> import pyarrow.plasma as plasma
> # Start Plasma store.
> plasma_store_name, self.p = start_plasma_store(
> use_valgrind=os.getenv("PLASMA_VALGRIND") == "1",
> use_one_memory_mapped_file=use_one_memory_mapped_file)
> # Connect to Plasma.
> >   self.plasma_client = plasma.connect(plasma_store_name, "", 64)
> pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_plasma.py:164:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> plasma.pyx:672: in pyarrow.plasma.connect
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   ???
> E   pyarrow.lib.ArrowIOError: Could not connect to socket 
> /tmp/plasma_store43998835
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1854) [Python] Improve performance of serializing object dtype ndarrays

2017-11-25 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265919#comment-16265919
 ] 

Robert Nishihara commented on ARROW-1854:
-

We may run into problems when the numpy array can't be round-tripped with 
pickle but can be with cloudpickle. E.g.,

{code}
import numpy as np
import pickle
import cloudpickle

class Foo(object):
pass

a = np.array([Foo()])
{code}

Pickle will succeed at pickling {{a}}, but it won't be able to unpickle it (in 
a different process). Cloudpickle will succeed but will be much slower. Our 
current approach will succeed and will be faster than cloudpickle.

> [Python] Improve performance of serializing object dtype ndarrays
> -
>
> Key: ARROW-1854
> URL: https://issues.apache.org/jira/browse/ARROW-1854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle)
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> a
> In [4]: arr = np.array(['foo', 'bar', None] * 10, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently





[jira] [Commented] (ARROW-1854) [Python] Improve performance of serializing object dtype ndarrays

2017-11-25 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265914#comment-16265914
 ] 

Robert Nishihara commented on ARROW-1854:
-

That would certainly work. It wouldn't give us any of the benefits of using 
Arrow, but for numpy arrays of general Python objects, we probably shouldn't 
expect that anyway.

It may be as simple as changing the custom serializer/deserializer. I'll take a 
quick look at that.
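Offloading the object-dtype case to pickle through a custom serializer hook can be sketched like this. The registry below is a self-contained stand-in for the real pyarrow SerializationContext (whose hook is `register_type`), not its actual implementation:

```python
import pickle

# Minimal stand-in for a serialization context's custom-serializer registry.
_serializers = {}

def register_type(cls, serializer, deserializer):
    _serializers[cls] = (serializer, deserializer)

def serialize(obj):
    # Tag the payload with the type name so deserialize can dispatch.
    ser, _ = _serializers[type(obj)]
    return type(obj).__name__, ser(obj)

def deserialize(tag_and_payload):
    tag, payload = tag_and_payload
    for cls, (_, deser) in _serializers.items():
        if cls.__name__ == tag:
            return deser(payload)
    raise KeyError(tag)

# Offload the whole object to pickle in one call instead of walking it
# element by element, which is what made the object-dtype path slow.
register_type(list, pickle.dumps, pickle.loads)

payload = serialize(["foo", "bar", None])
assert deserialize(payload) == ["foo", "bar", None]
```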

> [Python] Improve performance of serializing object dtype ndarrays
> -
>
> Key: ARROW-1854
> URL: https://issues.apache.org/jira/browse/ARROW-1854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle)
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> a
> In [4]: arr = np.array(['foo', 'bar', None] * 10, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently





[jira] [Comment Edited] (ARROW-1854) [Python] Improve performance of serializing object dtype ndarrays

2017-11-24 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265525#comment-16265525
 ] 

Robert Nishihara edited comment on ARROW-1854 at 11/24/17 8:43 PM:
---

Your numbers are much better than what I'm seeing. It looks like the poor 
performance comes from our handling of lists. Since pyarrow handles the numpy 
array of objects by first converting it to a list and then serializing it, we 
can't do better than the list case.

EDIT: Actually I'm seeing similar numbers (updated below). I think I had 
compiled without optimizations.

{code}
import pickle
import pyarrow as pa
import numpy as np

print(pa.__version__)  # '0.7.2.dev165+ga446fbd.d20171116'

arr = np.array(['foo', 'bar', None] * 10, dtype=object)
arr_list = arr.tolist()

# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
29.1 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
27.5 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}


was (Author: robertnishihara):
Your numbers are much better than what I'm seeing. It looks like the poor 
performance comes from our handling of lists. Since pyarrow handles the numpy 
array of objects by first converting it to a list and then serializing it, we 
can't do better than the list case.

EDIT: Actually I'm seeing similar numbers (updated below). I think I had 
compiled without optimizations.

{code}
import pickle
import pyarrow as pa
import numpy as np

print(pa.__version__)  # '0.7.2.dev165+ga446fbd.d20171116'

arr = np.array(['foo', 'bar', None] * 10, dtype=object)
arr_list = arr.tolist()

# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
130 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
27.5 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}

> [Python] Improve performance of serializing object dtype ndarrays
> -
>
> Key: ARROW-1854
> URL: https://issues.apache.org/jira/browse/ARROW-1854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle)
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> a
> In [4]: arr = np.array(['foo', 'bar', None] * 10, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently





[jira] [Comment Edited] (ARROW-1854) [Python] Improve performance of serializing object dtype ndarrays

2017-11-24 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265525#comment-16265525
 ] 

Robert Nishihara edited comment on ARROW-1854 at 11/24/17 8:42 PM:
---

Your numbers are much better than what I'm seeing. It looks like the poor 
performance comes from our handling of lists. Since pyarrow handles the numpy 
array of objects by first converting it to a list and then serializing it, we 
can't do better than the list case.

EDIT: Actually I'm seeing similar numbers (updated below). I think I had 
compiled without optimizations.

{code}
import pickle
import pyarrow as pa
import numpy as np

print(pa.__version__)  # '0.7.2.dev165+ga446fbd.d20171116'

arr = np.array(['foo', 'bar', None] * 10, dtype=object)
arr_list = arr.tolist()

# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
130 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
27.5 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}


was (Author: robertnishihara):
Your numbers are much better than what I'm seeing. It looks like the poor 
performance comes from our handling of lists. Since pyarrow handles the numpy 
array of objects by first converting it to a list and then serializing it, we 
can't do better than the list case.

{code}
import pickle
import pyarrow as pa
import numpy as np

print(pa.__version__)  # '0.7.2.dev165+ga446fbd.d20171116'

arr = np.array(['foo', 'bar', None] * 10, dtype=object)
arr_list = arr.tolist()

# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
130 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
127 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}

> [Python] Improve performance of serializing object dtype ndarrays
> -
>
> Key: ARROW-1854
> URL: https://issues.apache.org/jira/browse/ARROW-1854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle)
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> In [4]: arr = np.array(['foo', 'bar', None] * 10, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently





[jira] [Commented] (ARROW-1854) [Python] Improve performance of serializing object dtype ndarrays

2017-11-24 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265525#comment-16265525
 ] 

Robert Nishihara commented on ARROW-1854:
-

Your numbers are much better than what I'm seeing. It looks like the poor 
performance comes from our handling of lists. Since pyarrow handles the numpy 
array of objects by first converting it to a list and then serializing it, we 
can't do better than the list case.

{code}
import pickle
import pyarrow as pa
import numpy as np

print(pa.__version__)  # '0.7.2.dev165+ga446fbd.d20171116'

arr = np.array(['foo', 'bar', None] * 10, dtype=object)
arr_list = arr.tolist()

# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
130 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
127 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}

> [Python] Improve performance of serializing object dtype ndarrays
> -
>
> Key: ARROW-1854
> URL: https://issues.apache.org/jira/browse/ARROW-1854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle)
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> In [4]: arr = np.array(['foo', 'bar', None] * 10, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently





[jira] [Created] (ARROW-1829) [Plasma] Clean up eviction policy bookkeeping

2017-11-16 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1829:
---

 Summary: [Plasma] Clean up eviction policy bookkeeping
 Key: ARROW-1829
 URL: https://issues.apache.org/jira/browse/ARROW-1829
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Currently, the eviction policy has a field {{memory_used_}} which keeps track 
of how much memory the store is currently using. However, this field is only 
updated when {{require_space}} is called, and it should be updated every time 
an object is created.
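The intended bookkeeping can be sketched in Python (a hypothetical illustration only; the actual eviction policy is implemented in C++ inside the plasma store, and the class and method names here are assumptions):

```python
class EvictionPolicy:
    """Hypothetical sketch of the bookkeeping described above."""

    def __init__(self, capacity):
        self.capacity = capacity  # total bytes available in the store
        self.memory_used = 0      # updated on every creation, not lazily
        self.objects = {}         # object_id -> size, insertion-ordered

    def object_created(self, object_id, size):
        # Keep memory_used in sync at creation time, not only when
        # require_space() happens to be called.
        self.objects[object_id] = size
        self.memory_used += size

    def require_space(self, size):
        # Evict the oldest objects until `size` more bytes fit;
        # returns the ids of the evicted objects.
        evicted = []
        while self.memory_used + size > self.capacity and self.objects:
            oid, sz = next(iter(self.objects.items()))
            del self.objects[oid]
            self.memory_used -= sz
            evicted.append(oid)
        return evicted
```

With this in place, memory_used reflects reality immediately after every create, instead of drifting until the next require_space call.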





[jira] [Commented] (ARROW-1792) [Plasma C++] continuous write tensor failed

2017-11-14 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252940#comment-16252940
 ] 

Robert Nishihara commented on ARROW-1792:
-

One natural way to express this would be the following: create only one plasma 
client and use the higher-level client APIs. For example:

First start the store with

{code}
plasma_store -m 8 -s /tmp/plasma
{code}

Then continuously put objects with

{code}
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma", "", 0)

import numpy as np

def write_object(num_bytes):
    object_id = plasma.ObjectID(np.random.bytes(20))
    x = np.ones(num_bytes, dtype=np.uint8)
    client.put(x, object_id=object_id)

for i in range(10):
    print(i)
    write_object(5)
{code}

This works for me (at least after https://github.com/apache/arrow/pull/1317; I 
haven't tried it on master yet).

Would something like this work for you?

> [Plasma C++] continuous write tensor failed
> ---
>
> Key: ARROW-1792
> URL: https://issues.apache.org/jira/browse/ARROW-1792
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
> Environment: ubuntu 14.04 gcc 4.8.4
>Reporter: Lu Qi 
>   Original Estimate: 288h
>  Remaining Estimate: 288h
>
> start plasma using "plasma_store -m 80 -s /tmp/plasma"
> write tensor in python using  
> {code:python}
> for i in range(10):
>     client = plasma.connect("/tmp/plasma", "", 0)
>     x = np.random.rand(1000, 1000, 5*256).astype("float32")  # write 5 GB
>     object_id = pa.plasma.ObjectID(random_object_id())
>     tensor = pa.Tensor.from_numpy(x)
>     data_size = pa.get_tensor_size(tensor)
>     buf = client.create(object_id, data_size)
>     stream = pa.FixedSizeBufferWriter(buf)
>     stream.set_memcopy_threads(6)
>     pa.write_tensor(tensor, stream)
>     client.seal(object_id)
>     # client.release(object_id)
>     # client.disconnect()
>     print(i)
> {code}
> The error is like below:
> pyarrow.lib.PlasmaStoreFull: object does not fit in the plasma store
> If I add "client.release(object_id)" ,the error is:
> /arrow/cpp/src/plasma/client.cc:296 Check failed: object_entry != 
> objects_in_use_.end()
> Also,sometimes error is:
>   buf = client.create(object_id, data_size)
>   File "pyarrow/plasma.pyx", line 301, in pyarrow.plasma.PlasmaClient.create 
> (/arrow/python/build/temp.linux-x86_64-2.7/plasma.cxx:4382)
>   File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:7888)
> pyarrow.lib.ArrowIOError: Broken pipe
> After adding "client.disconnect()" it seems to work , but using the code 
> below will fail:
> {code:python}
> client = plasma.connect("/tmp/plasma", "", 0)
> for i in range(10):
>     x = np.random.rand(1000, 1000, 5*256).astype("float32")  # write 5 GB
>     object_id = pa.plasma.ObjectID(random_object_id())
>     tensor = pa.Tensor.from_numpy(x)
>     data_size = pa.get_tensor_size(tensor)
>     buf = client.create(object_id, data_size)
>     stream = pa.FixedSizeBufferWriter(buf)
>     stream.set_memcopy_threads(6)
>     pa.write_tensor(tensor, stream)
>     client.seal(object_id)
>     # client.release(object_id)
>     # client.disconnect()
>     print(i)
> {code}
> P.S. I have filed another issue about the memory eviction policy: [ARROW-1795]





[jira] [Created] (ARROW-1745) Compilation failure on Mac OS in plasma tests

2017-10-28 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-1745:
---

 Summary: Compilation failure on Mac OS in plasma tests
 Key: ARROW-1745
 URL: https://issues.apache.org/jira/browse/ARROW-1745
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara







