I believe Databricks provides an RDD interface to Redshift. Did you check spark-packages.org?

On Sat, Jan 24, 2015 at 6:45 AM Denis Mikhalkin <[email protected]> wrote:
> Hello,
>
> We've got some analytics data in AWS Redshift. The data is being
> constantly updated.
>
> I'd like to be able to write a query against Redshift which would return a
> subset of the data, and then run a Spark job (PySpark) to do some analysis.
>
> I could not find an RDD which would let me do it out of the box (Python),
> so I tried writing my own. For example, I tried a combination of a
> generator (via yield) with parallelize. It appears, though, that
> "parallelize" reads all the data into memory first, as I get either an OOM
> error or Python starts swapping as soon as I increase the number of rows
> beyond trivial limits.
>
> I've also looked at Java RDDs (there is an example of a MySQL RDD), but it
> seems that it also reads all the data into memory.
>
> So my question is: how do I correctly feed Spark with huge datasets which
> don't initially reside in HDFS/S3 (ideally for PySpark, but I would
> appreciate any tips)?
>
> Thanks.
>
> Denis
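
On the question of feeding Spark from a source that is not in HDFS/S3: one common pattern is to parallelize only a small list of key ranges and let each task query its own slice, so the rows never pass through the driver. Below is a minimal sketch of that idea, not the API of any Spark package. It assumes a hypothetical table `events` with an integer column `id` and a `payload` column, placeholder connection settings, and that psycopg2 is installed on the executors (Redshift accepts PostgreSQL client connections). The helper names `split_key_range`, `fetch_rows`, `redshift_connect` and `build_rdd` are made up for illustration.

```python
from typing import Callable, Iterator, List, Tuple


def split_key_range(lo: int, hi: int, num_splits: int) -> List[Tuple[int, int]]:
    """Split the half-open range [lo, hi) into at most num_splits contiguous sub-ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if hi <= lo:
        return []
    step = -(-(hi - lo) // num_splits)  # ceiling division
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


def fetch_rows(bounds: Tuple[int, int], connect: Callable[[], object]) -> Iterator[tuple]:
    """Generator that runs inside a Spark task and yields the rows of one key range.

    `connect` is any zero-argument callable returning a DB-API connection.
    The table and column names are assumptions, not part of the original question.
    """
    lo, hi = bounds
    conn = connect()
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT id, payload FROM events WHERE id >= %s AND id < %s",
            (lo, hi),
        )
        for row in cur:
            yield row
    finally:
        conn.close()


def redshift_connect():
    """Hypothetical connection factory; host, database and credentials are placeholders."""
    import psycopg2  # imported lazily so the import happens on the executor

    return psycopg2.connect(
        host="example-cluster.example.com",
        port=5439,
        dbname="analytics",
        user="spark",
        password="CHANGE_ME",
    )


def build_rdd(sc, lo: int, hi: int, num_splits: int, connect=redshift_connect):
    """Build an RDD whose partitions each pull one key range from the database."""
    ranges = split_key_range(lo, hi, num_splits)
    # Only the (lo, hi) tuples are shipped from the driver; the rows are
    # fetched by the tasks when the RDD is evaluated.
    return sc.parallelize(ranges, max(len(ranges), 1)).flatMap(
        lambda b: fetch_rows(b, connect)
    )
```

With an existing SparkContext `sc`, something like `build_rdd(sc, 0, 10000000, 200).count()` would run 200 range queries across the executors. Each task still holds one range's result set at a time, so the number of splits should be chosen so that a single range fits comfortably in executor memory. This only works if the key is reasonably evenly distributed; with skewed keys some partitions will be much larger than others.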
