Hello,
We have some analytics data in AWS Redshift, and the data is constantly being
updated.
I'd like to be able to run a query against Redshift that returns a subset of
the data, and then run a Spark job (PySpark) to do some analysis on it.
I could not find an RDD that would let me do this out of the box in Python, so
I tried writing my own. For example, I tried a combination of a generator (via
yield) with parallelize. It appears, though, that "parallelize" reads all the
data into memory first, since I get either an OOM error or Python starts
swapping as soon as I increase the number of rows beyond trivial limits.
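For reference, here is a minimal sketch of the kind of thing I tried (the
connection string and query are placeholders; psycopg2 is assumed, since
Redshift speaks the PostgreSQL protocol):

    import psycopg2
    from pyspark import SparkContext

    sc = SparkContext(appName="redshift-subset")

    def redshift_rows():
        # Named (server-side) cursor so the driver doesn't pull the whole
        # result set from Redshift in one go.
        conn = psycopg2.connect(
            "host=... port=5439 dbname=... user=... password=...")
        cur = conn.cursor(name="redshift_cursor")
        cur.execute("SELECT ...")  # placeholder query returning the subset
        for row in cur:
            yield row
        cur.close()
        conn.close()

    # This is where it falls over: parallelize() materializes the generator
    # on the driver before distributing it, so memory use grows with the
    # number of rows and I end up with OOM / swapping.
    rdd = sc.parallelize(redshift_rows())

As far as I can tell, parallelize() converts the generator to a list on the
driver before slicing it into partitions, which would match the behaviour I'm
seeing.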
I've also looked at the Java RDDs (there is an example of a MySQL-backed RDD),
but it seems that they also read all the data into memory.
So my question is: how do I correctly feed Spark with huge datasets that don't
initially reside in HDFS/S3 (ideally from PySpark, but I would appreciate any
tips)?
Thanks.
Denis