Re: Using JDBC IO read transform, running out of memory on DataflowRunner.

2017-11-30 Thread Chet Aldrich
Hey Eugene, 

Thanks for this, didn’t realize this was a parameter I could tune. Fixed my 
problems straight away. 

Chet

> On Nov 29, 2017, at 2:14 PM, Eugene Kirpichov wrote:
> 
> Hi,
> I think you're hitting something that can be fixed by configuring Redshift 
> driver:
> http://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter
>  
> 
> By default, the JDBC driver collects all the results for a query at one time. 
> As a result, when you attempt to retrieve a large result set over a JDBC 
> connection, you might encounter a client-side out-of-memory error. To enable 
> your client to retrieve result sets in batches instead of in a single 
> all-or-nothing fetch, set the JDBC fetch size parameter in your client 
> application.
> 
> On Wed, Nov 29, 2017 at 1:41 PM Chet Aldrich wrote:
> Hey all, 
> 
> I’m running a Dataflow job that uses the JDBC IO transform to pull in a bunch 
> of data (20mm rows, for reference) from Redshift, and I’m noticing that I’m 
> getting an OutOfMemoryError on the Dataflow workers once I reach around 4mm 
> rows. 
> 
> Given the code I’m reading inside JDBC IO and the guide here 
> (https://beam.apache.org/documentation/io/authoring-overview/#read-transforms), 
> it seems like it’s just pulling the data from the result set one row at a time 
> and then emitting each output. Considering that this is sort of a limitation of 
> the driver, this makes sense, but is there a way I can get around the memory 
> limitation somehow? It seems like Dataflow repeatedly tries to create more 
> workers to handle the work, but it can’t, which is part of the problem. 
> 
> If more info is needed in order to help me sort out what I could do to not 
> run into the memory limitations, I’m happy to provide it. 
> 
> 
> Thanks,
> 
> Chet 



Re: Using JDBC IO read transform, running out of memory on DataflowRunner.

2017-11-29 Thread Eugene Kirpichov
Hi,
I think you're hitting something that can be fixed by configuring Redshift
driver:
http://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter
*By default, the JDBC driver collects all the results for a query at one
time. As a result, when you attempt to retrieve a large result set over a
JDBC connection, you might encounter a client-side out-of-memory error. To
enable your client to retrieve result sets in batches instead of in a
single all-or-nothing fetch, set the JDBC fetch size parameter in your
client application.*
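
For reference, a minimal sketch of what setting the fetch size looks like with
plain JDBC against a Redshift/Postgres-style driver (the endpoint, credentials,
query, and the 10,000-row batch size below are placeholders, not values from
your job):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class FetchSizeExample {
      public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:redshift://your-cluster:5439/dev", "user", "password");

        // Many Postgres-family drivers only honor the fetch size when
        // autocommit is off; otherwise the whole result set is buffered.
        conn.setAutoCommit(false);

        PreparedStatement stmt =
            conn.prepareStatement("SELECT id, payload FROM big_table");
        // Stream results in batches of 10,000 rows instead of materializing
        // the entire result set in client memory.
        stmt.setFetchSize(10_000);

        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            // process each row
          }
        }
        conn.close();
      }
    }

Depending on the JdbcIO version you're on, you may also be able to reach the
PreparedStatement through withStatementPreparator and call setFetchSize there;
check the JdbcIO javadoc for your release.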

On Wed, Nov 29, 2017 at 1:41 PM Chet Aldrich 
wrote:

> Hey all,
>
> I’m running a Dataflow job that uses the JDBC IO transform to pull in a
> bunch of data (20mm rows, for reference) from Redshift, and I’m noticing
> that I’m getting an OutOfMemoryError on the Dataflow workers once I reach
> around 4mm rows.
>
> Given the code I’m reading inside JDBC IO and the guide here (
> https://beam.apache.org/documentation/io/authoring-overview/#read-transforms),
> it seems like it’s just pulling the data from the result set one row at a time
> and then emitting each output. Considering that this is sort of a limitation of
> the driver, this makes sense, but is there a way I can get around the memory
> limitation somehow? It seems like Dataflow repeatedly tries to create more
> workers to handle the work, but it can’t, which is part of the problem.
>
> If more info is needed in order to help me sort out what I could do to not
> run into the memory limitations, I’m happy to provide it.
>
>
> Thanks,
>
> Chet
>


Using JDBC IO read transform, running out of memory on DataflowRunner.

2017-11-29 Thread Chet Aldrich
Hey all, 

I’m running a Dataflow job that uses the JDBC IO transform to pull in a bunch 
of data (20mm rows, for reference) from Redshift, and I’m noticing that I’m 
getting an OutOfMemoryError on the Dataflow workers once I reach around 4mm 
rows. 

Given the code I’m reading inside JDBC IO and the guide here 
(https://beam.apache.org/documentation/io/authoring-overview/#read-transforms), 
it seems like it’s just pulling the data from the result set one row at a time and 
then emitting each output. Considering that this is sort of a limitation of the 
driver, this makes sense, but is there a way I can get around the memory limitation 
somehow? It seems like Dataflow repeatedly tries to create more workers to handle 
the work, but it can’t, which is part of the problem. 
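
For reference, the read is roughly of this shape; the driver class, connection 
string, credentials, and query below are placeholders rather than the actual 
job code:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RedshiftReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromRedshift", JdbcIO.<String>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "com.amazon.redshift.jdbc.Driver",       // placeholder driver class
                    "jdbc:redshift://your-cluster:5439/dev") // placeholder endpoint
                .withUsername("user")
                .withPassword("password"))
            .withQuery("SELECT payload FROM big_table")      // ~20mm rows
            .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString(1))
            .withCoder(StringUtf8Coder.of()));

        p.run();
      }
    }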

If more info is needed in order to help me sort out what I could do to not run 
into the memory limitations, I’m happy to provide it. 


Thanks,

Chet