Here's my current flow: I have a Java program that uses an Avro schema file to generate POJOs. The code reads data from a Postgres table and loads it into a list of those generated POJOs; the table currently holds about 4.5M records. Once the Avro POJOs are populated, it uses the Avro writer to produce Parquet output that gets ingested into our data lake.
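To make that concrete, this is roughly what the relevant code does today (simplified sketch; `UserRecord` stands in for one of the generated Avro POJOs and `mapRow()` for my ResultSet-to-POJO mapping, neither is a real name; `AvroParquetWriter` is from parquet-avro and `Path` from org.apache.hadoop.fs):

```java
// Simplified version of the current flow: buffer everything, then write one Parquet file.
List<UserRecord> records = new ArrayList<>();
try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT * FROM my_table")) {
    while (rs.next()) {
        records.add(mapRow(rs));              // all ~4.5M POJOs end up on the heap at once
    }
}

try (ParquetWriter<UserRecord> writer = AvroParquetWriter
        .<UserRecord>builder(new Path("/tmp/out.parquet"))
        .withSchema(UserRecord.getClassSchema())   // schema of the generated Avro class
        .build()) {
    for (UserRecord record : records) {
        writer.write(record);
    }
}
```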
The problem is that as the table keeps growing, we hit OOM errors. I'll be looking into exactly where in the code the OOM comes from, but continually increasing the heap isn't a feasible solution. What are some common patterns for handling this? I'm thinking of chunking the records: is it possible to process 500k records at a time and then concatenate the resulting Parquet files? I'm pretty new to this, so a rough sketch of what I mean is below.
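This is the sort of chunked version I'm imagining: stream rows with a cursor instead of buffering the list, and roll over to a new Parquet file every 500k records (untested; the fetch size, chunk size, and file names are placeholders, and `mapRow()`/`UserRecord` are the same stand-ins as above):

```java
// Rough sketch of the chunked idea: stream rows and write them out as they arrive.
conn.setAutoCommit(false);                    // the Postgres JDBC driver only streams with autocommit off
ParquetWriter<UserRecord> writer = null;
int rowsInChunk = 0;
int fileIndex = 0;
try (Statement stmt = conn.createStatement()) {
    stmt.setFetchSize(10_000);                // fetch rows in batches via a cursor instead of all at once
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM my_table")) {
        while (rs.next()) {
            if (writer == null) {             // start the next chunk's file
                writer = AvroParquetWriter
                        .<UserRecord>builder(new Path("/tmp/out-" + fileIndex + ".parquet"))
                        .withSchema(UserRecord.getClassSchema())
                        .build();
            }
            writer.write(mapRow(rs));
            if (++rowsInChunk == 500_000) {   // 500k rows written: close this chunk and roll over
                writer.close();
                writer = null;
                rowsInChunk = 0;
                fileIndex++;
            }
        }
    }
} finally {
    if (writer != null) {
        writer.close();                       // close the last, partially filled chunk
    }
}
```

The part I'm least sure about is the last step: whether the chunk files would need to be concatenated into a single Parquet file before ingestion, or whether handing the data lake several smaller files is normal.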
