GitHub user RayZ0rr edited a discussion: Safe way to periodically add arrow RecordBatch to a file
I have a use case where I want to save information that can consist of NumPy nd-arrays of variable shape, NumPy 1D arrays, objects like PyTorch model `state_dict`s and optimizer `state_dict`s, scalars like floats, ints, and strings, and custom types like `MetricsInfo`. These can be encoded as various Arrow datatypes: it is straightforward for primitive types, tricks like https://github.com/apache/arrow/discussions/48099 work for variable-shape tensors, and the binary type covers other Python objects. I want to use only a single file for all of this.

The information will be generated periodically, e.g. after each epoch when training deep learning models, so at the end of each period I need to append it to the file. This matters for two reasons: if the run is interrupted, I don't want to lose all the information up to the current epoch, and I don't want to pressure memory by buffering everything for a single write at the end.

Can this be done with files in the `arrow` or `parquet` format?

EDIT: after adding the data, I would like random-access reads to the saved data.

GitHub link: https://github.com/apache/arrow/discussions/48124