I believe Databricks provides an RDD interface to Redshift. Did you check
spark-packages.org?
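
If nothing there fits, here is a minimal sketch of one way around the
parallelize problem, under stated assumptions. As far as I know, PySpark's
parallelize turns its input into a list on the driver, so a generator does
not help. The idea below is to parallelize only a small list of key ranges
and let each task run its own range query on the executor. The table
"events", its integer column "id", the "payload" column, and CONN_PARAMS are
all hypothetical placeholders; psycopg2 is assumed to be installed on the
workers (Redshift speaks the PostgreSQL wire protocol).

```python
# Hypothetical connection settings; replace with your own cluster details.
CONN_PARAMS = {
    "host": "example-cluster.redshift.amazonaws.com",
    "port": 5439,
    "dbname": "analytics",
    "user": "spark",
    "password": "secret",
}


def split_key_range(lo, hi, num_splits):
    """Split the inclusive integer range [lo, hi] into at most num_splits
    contiguous (start, end) pairs."""
    total = hi - lo + 1
    if total <= 0 or num_splits <= 0:
        return []
    step = -(-total // num_splits)  # ceiling division
    return [(start, min(start + step - 1, hi))
            for start in range(lo, hi + 1, step)]


def fetch_range(bounds):
    """Runs on an executor: query one key range and yield its rows."""
    import psycopg2  # imported here so it is resolved on the worker

    start, end = bounds
    conn = psycopg2.connect(**CONN_PARAMS)
    try:
        cur = conn.cursor()
        # Hypothetical table and columns; only this range is fetched.
        cur.execute(
            "SELECT id, payload FROM events WHERE id BETWEEN %s AND %s",
            (start, end),
        )
        for row in cur:
            yield row
    finally:
        conn.close()


def build_rdd(sc, lo, hi, num_splits):
    """Build an RDD whose partitions each pull one key range."""
    ranges = split_key_range(lo, hi, num_splits)
    # Only the small list of (start, end) pairs lives on the driver.
    return sc.parallelize(ranges, max(len(ranges), 1)).flatMap(fetch_range)
```

Usage would be something like build_rdd(sc, 1, 100000000, 200).count().
Pick enough splits that a single range fits comfortably in one task's
memory, since a plain psycopg2 cursor buffers its result set on the client.
Because the data is constantly updated, different ranges may see slightly
different snapshots of the table. If the subset is very large, running
UNLOAD to S3 from Redshift and reading the files with sc.textFile may be
simpler.
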
On Sat, Jan 24, 2015 at 6:45 AM Denis Mikhalkin <[email protected]>
wrote:

> Hello,
>
> We've got some analytics data in AWS Redshift. The data is being
> constantly updated.
>
> I'd like to be able to write a query against Redshift which would return a
> subset of data, and then run a Spark job (Pyspark) to do some analysis.
>
> I could not find an RDD which would let me do it OOB (Python), so I tried
> writing my own. For example, I tried a combination of a generator (via
> yield) with parallelize. It appears, though, that "parallelize" reads all
> the data into memory first, as I get either OOM or Python swapping as soon
> as I increase the number of rows beyond trivial limits.
>
> I've also looked at Java RDDs (there is an example of MySQL RDD) but it
> seems that it also reads all the data into memory.
>
> So my question is: how do I correctly feed Spark with huge datasets which
> don't initially reside in HDFS/S3 (ideally for Pyspark, but I would
> appreciate any tips)?
>
> Thanks.
>
> Denis
>
>
>
