Hi!
I am trying to use the plasma store to reduce the memory usage of a PyTorch
dataset/dataloader combination, and I have 4 questions. I don’t think any of
them require PyTorch knowledge. If you prefer to comment inline, there is a
Quip doc with identical content and prettier formatting here:
https://quip.com/3mwGAJ9KR2HT
*1)* My script starts the plasma store from Python with 200 GiB:
    import subprocess
    nbytes = (1024 ** 3) * 200
    _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])
where nbytes is chosen arbitrarily. From my experiments, it seems that one
should make the store as large as possible within the limits of /dev/shm. I
wanted to verify whether this is actually the best practice (it would be hard
for my app to know its storage needs up front), and also whether there is an
automated way to figure out how much storage to allocate.
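For concreteness, the kind of automated sizing I have in mind would look roughly like the sketch below. The helper name suggest_store_bytes and the 0.8 headroom fraction are my own placeholders, not anything provided by plasma. It only reads the free space on /dev/shm using the standard library, and I don't know whether this heuristic is a sensible one.

```python
import shutil


def suggest_store_bytes(shm_path="/dev/shm", fraction=0.8):
    """Return a store size in bytes: a fraction of the free space at shm_path.

    Hypothetical helper. The default fraction is an arbitrary guess that
    leaves some headroom for other users of the shared-memory filesystem.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    free_bytes = shutil.disk_usage(shm_path).free
    return int(free_bytes * fraction)
```

The result would replace the hard-coded value above, i.e. nbytes = suggest_store_bytes() before launching plasma_store with "-m", str(nbytes).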
*2)* Does the plasma store support simultaneous reads? My code, which has
multiple clients all requesting the same 6 arrays from the plasma store
thousands of times, was segfaulting with different errors, e.g.
    Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
until I added a lock around my client.get call:
    if self.use_lock:  # Fix segfault
        with FileLock("/tmp/plasma_lock"):
            ret = self.client.get(self.object_id)
    else:
        ret = self.client.get(self.object_id)
which fixes the segfault.
Here is a full traceback of the failure without the lock
https://gist.github.com/sshleifer/75145ba828fcb4e998d5e34c46ce13fc
Is this expected behavior?
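The same workaround can be written without duplicating the get call. This is a minimal sketch, assuming the lock is any context manager (FileLock from the filelock package in my case). guarded_get is a hypothetical helper name, and client is anything with a get(object_id) method.

```python
from contextlib import nullcontext


def guarded_get(client, object_id, lock=None):
    """Call client.get(object_id), holding `lock` if one is given."""
    # nullcontext() is a no-op context manager, so the unlocked path
    # runs exactly the same code as the locked one.
    with lock if lock is not None else nullcontext():
        return client.get(object_id)
```

In my dataset this would be called as guarded_get(self.client, self.object_id, FileLock("/tmp/plasma_lock") if self.use_lock else None).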
*3)* Is there a simple way to add many objects to the plasma store at once?
Right now, we are considering changing
    oid = client.put(array)
to
    oids = [client.put(x) for x in array]
so that we can fetch one entry at a time, but the writes are much slower.
* 3a) Is there a lower-level interface for bulk writes?
* 3b) Or is it recommended to chunk the array and have different Python
processes write simultaneously to make this faster?
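To make 3b concrete, the chunking I have in mind is sketched below. chunk_bounds is a hypothetical helper of my own. It only computes contiguous index ranges and does not touch plasma at all.

```python
def chunk_bounds(n_items, n_chunks):
    """Split range(n_items) into at most n_chunks contiguous (start, stop) pairs.

    Chunk sizes differ by at most one item; empty chunks are omitted.
    """
    if n_items < 0 or n_chunks < 1:
        raise ValueError("need n_items >= 0 and n_chunks >= 1")
    base, extra = divmod(n_items, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        bounds.append((start, start + size))
        start += size
    return bounds
```

Each worker process would then open its own client and call client.put on the rows in its (start, stop) range. Whether that is actually faster than a single writer is exactly what I am unsure about.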
*4)* Is there a way to save/load the contents of the plasma-store to disk
without loading everything into memory and then saving it to some other format?
Replication
Setup instructions for fairseq and for replicating the segfault:
https://gist.github.com/sshleifer/bd6982b3f632f1d4bcefc9feceb30b1a
My code is here: https://github.com/pytorch/fairseq/pull/3287
Thanks!
Sam