I have JSON files which contain timestamped events.  Each event is associated
with a user ID.

Now I want to group by user ID, so I convert from

Event1 -> UserIDA;
Event2 -> UserIDA;
Event3 -> UserIDB;

to intermediate storage:
UserIDA -> (Event1, Event2...)
UserIDB -> (Event3...)
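
For illustration, the grouping step I have in mind looks roughly like this in
plain Python (field names like user_id and timestamp are just placeholders for
whatever is actually in my JSON):

import json
from collections import defaultdict
from pathlib import Path

# Group events by user id; the field names are placeholders.
events_by_user = defaultdict(list)
for path in Path("events").glob("*.json"):
    with open(path) as f:
        for line in f:                      # assumes one JSON event per line
            event = json.loads(line)
            events_by_user[event["user_id"]].append(event)

# Keep each user's events in time order for later featurization.
for events in events_by_user.values():
    events.sort(key=lambda e: e["timestamp"])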

Then I will label positives and featurize each user's event vector in many
different ways, fitting each variant into a logistic regression.
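
As a concrete scikit-learn sketch of what I mean, where featurize() and
is_positive() are placeholders for my own per-experiment logic:

import numpy as np
from sklearn.linear_model import LogisticRegression

# featurize(events) -> fixed-length feature vector for one user,
# is_positive(events) -> 0/1 label; both are placeholder functions.
X = np.array([featurize(events) for events in events_by_user.values()])
y = np.array([is_positive(events) for events in events_by_user.values()])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)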

I want to save the intermediate storage permanently since it will be used many
times.  New events also arrive every day, so I need to update this
intermediate storage daily.

Right now I store the intermediate storage as JSON files.  Should I use
Parquet instead?  Or are there better solutions for this use case?
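
For concreteness, this is roughly what I picture the Parquet version looking
like with pandas/pyarrow (file and column names are made up, and it assumes
the event objects share a consistent schema):

import pandas as pd

# One row per user; the user's time-ordered events become a nested list
# column, mirroring the UserID -> (Event1, Event2, ...) layout above.
grouped = pd.DataFrame(
    {"user_id": list(events_by_user.keys()),
     "events": list(events_by_user.values())}
)
grouped.to_parquet("user_events.parquet", engine="pyarrow")

# The daily update would then be: read this back, append the new day's
# events to each user's list, and rewrite (or partition the store by day
# to avoid rewriting everything).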

Thanks a lot!
