I have JSON files that contain timestamped events. Each event is associated with a user ID.
Now I want to group the events by user ID, converting from

Event1 -> UserIDA; Event2 -> UserIDA; Event3 -> UserIDB; ...

to an intermediate storage keyed by user:

UserIDA -> (Event1, Event2, ...)
UserIDB -> (Event3, ...)

Then I will label positives, featurize each user's event vector in many different ways, and fit each featurization into a logistic regression. I want to persist the intermediate storage permanently, since it will be reused many times, and new events arrive every day, so I need to update it daily. Right now I store the intermediate data as JSON files. Should I use Parquet instead, or is there a better solution for this use case? Thanks a lot!
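For concreteness, here is roughly what the pipeline looks like today, as a simplified pandas sketch; the paths and the user_id / timestamp column names are placeholders, not my real schema:

```python
import glob
import pandas as pd

# Simplified sketch of the current pipeline (placeholder paths/columns).
# Each input file is JSON lines with at least "user_id" and "timestamp"
# plus arbitrary event fields.
frames = [pd.read_json(path, lines=True) for path in glob.glob("events/*.json")]
events = pd.concat(frames, ignore_index=True)

# Group by user so each user's events are contiguous and time-ordered,
# i.e. UserIDA -> (Event1, Event2, ...), UserIDB -> (Event3, ...).
events = events.sort_values(["user_id", "timestamp"])

# Intermediate storage: currently written back out as JSON...
events.to_json("user_events.json", orient="records", lines=True)
# ...the question is whether writing Parquet here (or something else)
# would be a better fit for a store that is re-read many times and
# appended to daily, e.g.:
# events.to_parquet("user_events.parquet", index=False)  # needs pyarrow

# Downstream: label positives and featurize each user's event vector,
# then fit a logistic regression per featurization.
for user_id, user_events in events.groupby("user_id"):
    pass  # featurize user_events and feed it into LogisticRegression
```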