GitHub user RayZ0rr edited a discussion: Safe way to periodically add arrow 
RecordBatch to a file

I have a use case where I want to save some information, which can consist of 
NumPy nd-arrays of variable shape, NumPy 1D arrays, objects like PyTorch model 
state_dicts and optimizer state_dicts, scalars (floats, ints, strings), and 
custom types like MetricsInfo. These can be encoded as various Arrow 
datatypes: straightforwardly for primitive types, with tricks like 
https://github.com/apache/arrow/discussions/48099 for variable-shape tensors, 
and with the binary type for other Python objects.

I only want to use a single file for all of this. The information is 
generated periodically, e.g. after each epoch when training deep learning 
models, so at each period (say, epoch end) I need to append it to the file. 
This is important because if the run is interrupted, I don't want to lose all 
the information up to the current epoch. Nor do I want to pressure memory by 
buffering everything for a single write.

Can this be done with files in the `arrow` or `parquet` format?

EDIT: after adding the data, I would like random-access reads to the saved 
data.

GitHub link: https://github.com/apache/arrow/discussions/48124
