Hi,

I think you're hitting something that can be fixed by configuring the Redshift JDBC driver: http://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter

*By default, the JDBC driver collects all the results for a query at one time. As a result, when you attempt to retrieve a large result set over a JDBC connection, you might encounter a client-side out-of-memory error. To enable your client to retrieve result sets in batches instead of in a single all-or-nothing fetch, set the JDBC fetch size parameter in your client application.*
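To make that concrete, here's a minimal sketch in plain JDBC (connection details, table name, and batch size are placeholders I made up for illustration). One caveat worth checking: with the PostgreSQL driver, on which the Redshift driver is based, the fetch size is only honored when autocommit is disabled; otherwise the driver buffers the whole result set anyway.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class FetchSizeExample {
        public static void main(String[] args) throws SQLException {
            // Placeholder connection details -- substitute your own.
            String jdbcUrl = "jdbc:redshift://example.host:5439/dev";
            try (Connection conn =
                     DriverManager.getConnection(jdbcUrl, "user", "password")) {
                // Autocommit must be off for the driver to use a cursor
                // and actually honor the fetch size.
                conn.setAutoCommit(false);
                try (Statement stmt = conn.createStatement()) {
                    stmt.setFetchSize(10_000); // fetch ~10,000 rows per round trip
                    try (ResultSet rs =
                             stmt.executeQuery("SELECT * FROM my_big_table")) {
                        while (rs.next()) {
                            // Process one row at a time; the driver holds only
                            // the current batch in memory, not all 20mm rows.
                        }
                    }
                }
            }
        }
    }

With Beam's JdbcIO you don't create the statement yourself, but you may be able to reach it through the withStatementPreparator hook (calling setFetchSize on the PreparedStatement inside setParameters). I'm not certain JdbcIO disables autocommit on the connections it opens, though, so it's worth verifying the driver actually streams before relying on this.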
On Wed, Nov 29, 2017 at 1:41 PM Chet Aldrich <[email protected]> wrote:

> Hey all,
>
> I’m running a Dataflow job that uses the JDBC IO transform to pull in a
> bunch of data (20mm rows, for reference) from Redshift, and I’m noticing
> that I’m getting an OutOfMemoryError on the Dataflow workers once I reach
> around 4mm rows.
>
> It seems, given the code I’m reading inside JDBC IO and the guide here
> (https://beam.apache.org/documentation/io/authoring-overview/#read-transforms),
> that it’s just pulling the data in from the result set one by one and then
> emitting each output. Considering that this is a limitation of the driver,
> that makes sense, but is there a way I can get around the memory
> limitation somehow? It seems like Dataflow repeatedly tries to create more
> workers to handle the work, but it can’t, which is part of the problem.
>
> If more info is needed to help me sort out what I could do to avoid
> running into the memory limitations, I’m happy to provide it.
>
>
> Thanks,
>
> Chet
>
