Hi,
I think you're hitting something that can be fixed by configuring the
Redshift JDBC driver's fetch size:
http://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter
*By default, the JDBC driver collects all the results for a query at one
time. As a result, when you attempt to retrieve a large result set over a
JDBC connection, you might encounter a client-side out-of-memory error. To
enable your client to retrieve result sets in batches instead of in a
single all-or-nothing fetch, set the JDBC fetch size parameter in your
client application.*
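
For reference, here's a minimal sketch (not from the thread) of what setting the fetch size looks like in plain JDBC; the endpoint, credentials, and table name are placeholders. One caveat worth knowing: the Redshift/PostgreSQL drivers only honor the fetch size when autocommit is off.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and credentials -- replace with your own.
        Connection conn = DriverManager.getConnection(
                "jdbc:redshift://example-cluster:5439/dev", "user", "password");

        // The driver streams results only outside autocommit mode,
        // so turn autocommit off before creating the statement.
        conn.setAutoCommit(false);

        Statement stmt = conn.createStatement();
        stmt.setFetchSize(10_000);  // retrieve rows in batches of 10,000

        ResultSet rs = stmt.executeQuery("SELECT * FROM big_table");
        while (rs.next()) {
            // process one row at a time without buffering the full result set
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```

Whatever reads the ResultSet (your own DoFn or the IO transform) would need to apply these settings to its statement for the batching to take effect.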

On Wed, Nov 29, 2017 at 1:41 PM Chet Aldrich <[email protected]>
wrote:

> Hey all,
>
> I’m running a Dataflow job that uses the JDBC IO transform to pull in a
> bunch of data (20mm rows, for reference) from Redshift, and I’m noticing
> that I’m getting an OutOfMemoryError on the Dataflow workers once I reach
> around 4mm rows.
>
> It seems like given the code that I’m reading inside JDBC IO and the guide
> here (
> https://beam.apache.org/documentation/io/authoring-overview/#read-transforms)
> that it’s just pulling the data in from the result set row by row and then
> emitting each output. Considering that this is sort of a limitation of the
> driver, this makes sense, but is there a way I can get around the memory
> limitation somehow? It seems like Dataflow repeatedly tries to create more
> workers to handle the work, but it can’t, which is part of the problem.
>
> If more info is needed in order to help me sort out what I could do to not
> run into the memory limitations I’m happy to provide it.
>
>
> Thanks,
>
> Chet
>
