Here's my current flow. I have a Java program that uses an Avro schema file to generate POJOs. The code reads rows from a Postgres table and transfers them from the db into a list of the generated POJOs. The process is currently reading about 4.5M records from the db. Once the Avro POJOs are populated, it uses the Avro writer to output Parquet, which is ingested into our data lake.
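
A simplified sketch of what the code does (MyRecord stands in for the generated POJO; the jdbc url, table, and column names are made-up placeholders, not the real ones):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class CurrentFlow {
    public static void main(String[] args) throws Exception {
        // step 1: read the whole table into a list of generated pojos
        List<MyRecord> records = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM my_table")) {
            while (rs.next()) {
                // every one of the 4.5M rows ends up on the heap at the same time
                records.add(MyRecord.newBuilder()
                        .setId(rs.getLong("id"))
                        .setName(rs.getString("name"))
                        .build());
            }
        }

        // step 2: only after the full list is built, write it out as parquet
        try (ParquetWriter<MyRecord> writer =
                     AvroParquetWriter.<MyRecord>builder(new Path("/tmp/my_table.parquet"))
                             .withSchema(MyRecord.getClassSchema())
                             .build()) {
            for (MyRecord r : records) {
                writer.write(r);
            }
        }
    }
}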

The problem is that as the table keeps growing, we get OOMs. I'll be looking at exactly where in the code the OOM is coming from, but continually increasing the heap isn't a feasible solution. What are some common patterns for handling this? I'm thinking of chunking the records: is it possible to process 500k records at a time and then concatenate the resulting Parquet files? I'm pretty new to this.
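
To make the question concrete, this is roughly the kind of chunking I have in mind (same placeholders as above, and I'm assuming a monotonically increasing numeric pk named id to page on):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ChunkedExport {
    private static final int CHUNK_SIZE = 500_000;

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")) {
            long lastId = 0L;   // assumes the pk only ever increases
            int fileIndex = 0;
            int rowsInChunk;
            do {
                rowsInChunk = 0;
                try (PreparedStatement stmt = conn.prepareStatement(
                             "SELECT id, name FROM my_table WHERE id > ? ORDER BY id LIMIT " + CHUNK_SIZE);
                     // a separate parquet file per chunk instead of one big file
                     ParquetWriter<MyRecord> writer =
                             AvroParquetWriter.<MyRecord>builder(new Path("/tmp/my_table-" + fileIndex + ".parquet"))
                                     .withSchema(MyRecord.getClassSchema())
                                     .build()) {
                    stmt.setLong(1, lastId);
                    try (ResultSet rs = stmt.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            // write each row as it comes so the 500k chunk never sits in a java list
                            writer.write(MyRecord.newBuilder()
                                    .setId(lastId)
                                    .setName(rs.getString("name"))
                                    .build());
                            rowsInChunk++;
                        }
                    }
                }
                fileIndex++;
            } while (rowsInChunk == CHUNK_SIZE);   // a short chunk means we're done
            // (if the row count is an exact multiple of CHUNK_SIZE this leaves one empty file -- ok for a sketch)
        }
    }
}

The idea is that only one chunk's worth of rows is ever in flight, and each chunk ends up in its own Parquet file. Does that look like a sane direction?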
