You can parallelize the collection of S3 keys and then pass that to your map function so that the files are read in parallel.
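Something like this rough PySpark sketch (untested; it assumes an existing SparkContext sc, a bucket name, your in-memory list of keys, and boto3 available on the executors; the read_and_transform helper is just a placeholder for your own extract/transform logic):

import json
import boto3

bucket = "my-bucket"                      # assumed bucket name
keys = ["raw/a.json", "raw/b.json"]       # your in-memory list of S3 keys

def read_and_transform(key):
    # Runs on the executors: fetch one object, parse JSON lines, apply the common transform.
    s3 = boto3.client("s3")               # create the client inside the task, not on the driver
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return [json.loads(line) for line in body.splitlines() if line]

# Distribute the key list across the cluster; each partition reads its keys in parallel.
records = sc.parallelize(keys, numSlices=len(keys)).flatMap(read_and_transform)

One caveat: spark.read.json can only be called on the driver, so inside the map each task needs a plain S3 client (boto3 here) and you lose the automatic schema inference.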
Sent from my iPhone

> On Feb 12, 2017, at 9:41 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Thanks Ayan, but I was hoping to remove the dependency on a file and just use an in-memory list or dictionary.
>
> So from the reading I've done today, it seems the concept of a bespoke async method doesn't really apply in Spark, since the cluster deals with distributing the workload.
>
> Am I mistaken?
>
> Regards
> Sam
>
> On Sun, 12 Feb 2017 at 12:13, ayan guha <guha.a...@gmail.com> wrote:
> You can store the list of keys (I believe you use them in the source file path, right?) in a file, one key per line. Then you can read that file using sc.textFile (so you will get an RDD of file paths) and then apply your function as a map.
>
> r = sc.textFile(list_file).map(your_function)
>
> HTH
>
> On Sun, Feb 12, 2017 at 10:04 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
> Hey folks
>
> Really simple question here. I currently have an ETL pipeline that reads from S3 and saves the data to an end store.
>
> I have to read from a list of keys in S3, but I am doing a raw extract and then saving. Only some of the extracts have a simple transformation, but overall the code looks the same.
>
> I abstracted this logic away into a method that takes in an S3 path, does the common transformations, and saves to source.
>
> But the job takes about 10 minutes or so because I'm iteratively going down a list of keys.
>
> Is it possible to do this asynchronously?
>
> FYI, I'm using spark.read.json to read from S3 because it infers my schema.
>
> Regards
> Sam
>
> --
> Best Regards,
> Ayan Guha
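For reference, the file-of-keys variant Ayan suggested in the quoted thread could be fleshed out roughly like this (also untested; it assumes the list file holds one full s3://bucket/key path per line, boto3 on the executors, and that your_function is again just a stand-in for your own per-file logic):

import json
import boto3
from urllib.parse import urlparse

def your_function(path):
    # Runs on the executors: fetch one object by its s3:// path and parse JSON lines.
    parsed = urlparse(path)
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))["Body"].read()
    return [json.loads(line) for line in body.splitlines() if line]

list_file = "s3://my-bucket/config/keys.txt"   # assumed location of the key list, one path per line
r = sc.textFile(list_file).flatMap(your_function)

The only change from the quoted one-liner is flatMap instead of map, so you end up with a single RDD of records rather than an RDD of per-file lists.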