Hi guys,
We ultimately needed to add 8 EC2 xlarge instances to get better performance. As was
suspected, we could not fit all the data into RAM.
This worked great with files around 100-350MB in size, as produced by our initial
export task. Unfortunately, for the partition settings that we
were able to ge
I'll try this out and follow up with what I find.
On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng wrote:
> For each node, if the CSV reader is implemented efficiently, you should be
> able to hit at least half of the theoretical network bandwidth, which is
> about 60MB/second/node. So if you just
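(As a rough back-of-the-envelope using the figures elsewhere in this thread: at
60MB/second/node across 12 nodes, the aggregate is about 720MB/s, so a full scan
of 100GB of text should take on the order of 100,000MB / 720MB/s ≈ 140 seconds.)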
Hmm, we actually read the CSV data from S3 now and were looking to avoid
that. Unfortunately, we've experienced dreadful performance reading 100GB
of text data for a job directly from S3; our hope had been that connecting
directly to Redshift would provide some boost.
We had been using 12 m3.xlarges, b
I'd guess that it's an s3n://key:secret_key@bucket/path from the UNLOAD
command used to produce the data. Xiangrui can correct me if I'm wrong
though.
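In case it helps, a rough sketch of how that path would get fed to the reader.
The import path and the key/value classes below are my guesses from the
project's README and should be treated as assumptions to verify; only
sc.newAPIHadoopFile itself is standard Spark:

    import com.databricks.examples.redshift.input.RedshiftInputFormat

    // Path in the shape the UNLOAD produced; key, secret_key, bucket and path
    // are placeholders.
    val path = "s3n://key:secret_key@bucket/path"

    // Each record comes back keyed by its position in the file, with the row
    // already split into fields (key/value classes assumed from the README).
    val records = sc.newAPIHadoopFile(
      path,
      classOf[RedshiftInputFormat],
      classOf[java.lang.Long],
      classOf[Array[String]])

    println(records.count())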
On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf wrote:
> We have a bunch of data in RedShift tables that we'd like to pull in
> during job runs to S
We have a bunch of data in Redshift tables that we'd like to pull into Spark
during job runs. What is the path/URL format one uses to pull data from
there? (This is in reference to using
https://github.com/mengxr/redshift-input-format)
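For the other half of the workflow, here is a hedged sketch of the unload step
that produces those S3 files. The endpoint, database, table and credential
strings are placeholders, and the ESCAPE option is something I believe the
input format expects (worth confirming in the README):

    import java.sql.DriverManager

    // Redshift speaks the PostgreSQL wire protocol, so the stock Postgres JDBC
    // driver can issue the UNLOAD.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://your-cluster.redshift.amazonaws.com:5439/yourdb",
      "user", "password")
    val stmt = conn.createStatement()

    // UNLOAD writes the query result to S3 as delimited text; ESCAPE keeps
    // embedded delimiters and newlines parseable downstream.
    stmt.execute("""
      UNLOAD ('select * from your_table')
      TO 's3://bucket/path/part_'
      CREDENTIALS 'aws_access_key_id=key;aws_secret_access_key=secret_key'
      ESCAPE
    """)

    conn.close()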