Also to be clear, if someone wants to maintain it, they are more than
welcome to do so.

On Tue, Mar 2, 2021 at 11:49 AM Sam Shleifer <[email protected]> wrote:

> Thanks, had no idea!
>
>
> On Tue, Mar 02, 2021 at 12:00 PM, Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Sam,
>> I think the lack of responses might be because Plasma is not being
>> actively maintained.  The original authors have forked it into the Ray
>> project.
>>
>> I'm sorry I don't have the expertise to answer your questions.
>>
>> -Micah
>>
>> On Mon, Mar 1, 2021 at 6:48 PM Sam Shleifer <[email protected]> wrote:
>>
>>> Partial answers are super helpful!
>>> I'm happy to break this up if it's too much for 1 question @moderators
>>> Sam
>>>
>>>
>>>
>>> On Sat, Feb 27, 2021 at 1:27 PM, Sam Shleifer <[email protected]>
>>> wrote:
>>>
>>>> Hi!
>>>> I am trying to use the plasma store to reduce the memory usage of a pytorch
>>>> dataset/dataloader combination, and have 4 questions. I don’t think any of
>>>> them require pytorch knowledge. If you prefer to comment inline, there is a
>>>> quip with identical content and prettier formatting here:
>>>> https://quip.com/3mwGAJ9KR2HT
>>>>
>>>> *1)* My script starts the plasma store from Python with 200 GB:
>>>>
>>>> import subprocess
>>>>
>>>> nbytes = (1024 ** 3) * 200  # 200 GB
>>>> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])
>>>> where nbytes is chosen arbitrarily. From my experiments it seems that
>>>> one should start the store as large as possible within the limits of
>>>> /dev/shm. I wanted to verify whether this is actually the best practice
>>>> (it would be hard for my app to know its storage needs up front), and
>>>> also whether there is an automated way to figure out how much storage to
>>>> allocate.
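>>>>
>>>> The only automated sizing I could come up with is a sketch like the
>>>> following, which just reserves a fraction of whatever /dev/shm currently
>>>> has free (shutil.disk_usage is stdlib; the 0.9 fraction is an arbitrary
>>>> choice on my part):
>>>>
>>>> import shutil
>>>> import subprocess
>>>>
>>>> # Reserve ~90% of the space currently free in /dev/shm (arbitrary fraction).
>>>> free_bytes = shutil.disk_usage("/dev/shm").free
>>>> nbytes = int(free_bytes * 0.9)
>>>> _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])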
>>>>
>>>> *2)* Does the plasma store support simultaneous reads? My code, which has
>>>> multiple clients all asking for the 6 arrays from the plasma store
>>>> thousands of times, was segfaulting with different errors, e.g.
>>>>
>>>> Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
>>>>
>>>> until I added a lock around my client.get call:
>>>>
>>>> from filelock import FileLock  # cross-process lock, from the filelock package
>>>>
>>>> if self.use_lock:  # Fix segfault
>>>>     with FileLock("/tmp/plasma_lock"):
>>>>         ret = self.client.get(self.object_id)
>>>> else:
>>>>     ret = self.client.get(self.object_id)
>>>>
>>>> which fixes the segfaults.
>>>>
>>>> Here is a full traceback of the failure without the lock:
>>>> https://gist.github.com/sshleifer/75145ba828fcb4e998d5e34c46ce13fc
>>>> Is this expected behavior?
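>>>>
>>>> To make the access pattern concrete, each dataloader worker does roughly
>>>> the following (num_batches, object_ids, and path stand in for my real
>>>> variables):
>>>>
>>>> from pyarrow import plasma
>>>>
>>>> # One client per worker process; every worker then reads the same
>>>> # handful of object ids thousands of times.
>>>> client = plasma.connect(path)
>>>> for _ in range(num_batches):
>>>>     arrays = client.get(object_ids)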
>>>>
>>>> *3)* Is there a simple way to add many objects to the plasma store at
>>>> once? Right now, we are considering changing
>>>>
>>>> oid = client.put(array)
>>>> to
>>>> oids = [client.put(x) for x in array]
>>>>
>>>> so that we can fetch one entry at a time, but the writes are much
>>>> slower.
>>>>
>>>> * 3a) Is there a lower level interface for bulk writes? (a sketch of the
>>>> kind of thing I have in mind is below)
>>>> * 3b) Or is it recommended to chunk the array and have different Python
>>>> processes write simultaneously to make this faster?
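>>>>
>>>> For 3a, the lowest-level write path I'm aware of is create + seal with the
>>>> tensor IPC helpers, i.e. something like the sketch below (put_raw is my own
>>>> name, and I haven't measured whether it is actually faster than client.put):
>>>>
>>>> import numpy as np
>>>> import pyarrow as pa
>>>> from pyarrow import plasma
>>>>
>>>> def put_raw(client, x):
>>>>     # Allocate a buffer of exactly the right size in the store, copy the
>>>>     # tensor in, then seal it so other clients can read it.
>>>>     tensor = pa.Tensor.from_numpy(x)
>>>>     oid = plasma.ObjectID(np.random.bytes(20))
>>>>     buf = client.create(oid, pa.ipc.get_tensor_size(tensor))
>>>>     pa.ipc.write_tensor(tensor, pa.FixedSizeBufferWriter(buf))
>>>>     client.seal(oid)
>>>>     return oid
>>>>
>>>> oids = [put_raw(client, x) for x in array]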
>>>>
>>>> *4)* Is there a way to save/load the contents of the plasma store to
>>>> disk without loading everything into memory and then saving it to some
>>>> other format?
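>>>>
>>>> The closest I can sketch is walking client.list() and writing the objects
>>>> out one buffer at a time, e.g. the following (dump_dir is a placeholder),
>>>> but that still round-trips every byte through this process:
>>>>
>>>> import os
>>>> from pyarrow import plasma
>>>>
>>>> client = plasma.connect(path)
>>>> for oid in client.list():
>>>>     [buf] = client.get_buffers([oid])
>>>>     # One object at a time, so peak extra memory is a single buffer.
>>>>     with open(os.path.join(dump_dir, oid.binary().hex()), "wb") as f:
>>>>         f.write(memoryview(buf))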
>>>>
>>>> Replication
>>>>
>>>> Setup instructions for fairseq and for replicating the segfault:
>>>> https://gist.github.com/sshleifer/bd6982b3f632f1d4bcefc9feceb30b1a
>>>> My code is here: https://github.com/pytorch/fairseq/pull/3287
>>>>
>>>> Thanks!
>>>> Sam
>>>>
>>>
>
