If you start with an RDD, you do have to collect it to the driver and broadcast it to do this. Between the two options you listed, I think this one (load as an RDD, then collect and broadcast) is simpler to implement, and there won't be a huge difference in performance, so you can go for it. Opening InputStreams to a distributed file system by hand can be a lot of code.
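To make that concrete, here is a minimal sketch of option 2 in Scala (parse the small files as an RDD, collect the result as a map, broadcast it, then do the map-side join with flatMap). The paths, delimiter, and (key, value) field layout below are made-up placeholders, not anything from your setup:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair RDD functions (collectAsMap)

object MapSideJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-side-join-sketch"))

    // Option 2: parse the small files in parallel as an RDD, collect to the driver,
    // and broadcast the resulting hash table to every node.
    val smallRdd = sc.textFile("hdfs:///path/to/small/part-*")   // hypothetical path
      .map(_.split(","))
      .map(f => (f(0), f(1)))                                    // assumed (key, value) layout
    val lookup = sc.broadcast(smallRdd.collectAsMap())

    // Map-side join: probe the broadcast hash table for each record of the large input.
    val bigRdd = sc.textFile("hdfs:///path/to/big")              // hypothetical path
      .map(_.split(","))
      .map(f => (f(0), f(1)))
    val joined = bigRdd.flatMap { case (k, v) =>
      lookup.value.get(k).map(other => (k, (v, other)))          // drop records with no match
    }

    joined.saveAsTextFile("hdfs:///path/to/output")              // hypothetical path
    sc.stop()
  }
}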
Matei

> On Nov 5, 2014, at 12:37 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:
>
> And another similar case:
>
> If I get an RDD from a previous step, but the next step should be a map-side
> join (so I need to broadcast this RDD to every node), what is the best way
> to do that? Collect the RDD in the driver first and create a broadcast? Or is
> there any shortcut in Spark for this?
>
> Thanks!
>
> -----Original Message-----
> From: Shuai Zheng [mailto:szheng.c...@gmail.com]
> Sent: Wednesday, November 05, 2014 3:32 PM
> To: 'Matei Zaharia'
> Cc: 'user@spark.apache.org'
> Subject: RE: Any "Replicated" RDD in Spark?
>
> Nice.
>
> Then I have another question. If I have a file (or a set of files: part-0,
> part-1, ..., perhaps a few hundred MB of CSV up to 1-2 GB, created by another
> program) and need to build a hashtable from it, then broadcast it to each
> node to allow lookups (map-side join), I have two options:
>
> 1. Load the file in plain code (open an InputStream, etc.), parse the
> content, and create the broadcast from it.
> 2. Create an RDD from these files the standard way, run a map to parse them,
> collect the result as a map, and wrap it as a broadcast to push to all nodes
> again.
>
> I think option 2 might be more consistent with Spark's concepts (and less
> code?), but how about performance? The gain is that the load and parse are
> parallelized; the penalty is that after loading we need to collect and
> broadcast the result again. Please share your opinion. I am not sure what the
> best practice is here (in theory either way works, but in the real world,
> which one is better?).
>
> Regards,
>
> Shuai
>
> -----Original Message-----
> From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
> Sent: Monday, November 03, 2014 4:15 PM
> To: Shuai Zheng
> Cc: user@spark.apache.org
> Subject: Re: Any "Replicated" RDD in Spark?
>
> You need to use broadcast followed by flatMap or mapPartitions to do
> map-side joins (in your map function, you can look at the hash table you
> broadcast and see which records match it). Spark SQL also does this by
> default for tables smaller than the spark.sql.autoBroadcastJoinThreshold
> setting (by default 10 KB, which is really small, but you can bump this up
> with, for example, set spark.sql.autoBroadcastJoinThreshold=1000000).
>
> Matei
>
>> On Nov 3, 2014, at 1:03 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I have spent the last two years on Hadoop but am new to Spark.
>> I am planning to move one of my existing systems to Spark to get some
>> enhanced features.
>>
>> My question is:
>>
>> If I want to do a map-side join (something similar to the "Replicated"
>> keyword in Pig), how can I do it? Is there any way to declare an RDD as
>> "replicated" (meaning it is distributed to all nodes and each node has a
>> full copy)?
>>
>> I know I can use an accumulator to get this feature, but I am not sure what
>> the best practice is. And if I use an accumulator to broadcast the data set,
>> can I then (after the broadcast) convert it into an RDD and do the join?
>>
>> Regards,
>>
>> Shuai
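For the Spark SQL route mentioned in the quoted reply above, here is a minimal sketch, assuming the Spark 1.1-era SQLContext/SchemaRDD APIs; the table names, schema, and paths are made up for illustration:

import org.apache.spark.sql.SQLContext

case class Record(key: String, value: String)     // hypothetical schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                  // implicit RDD -> SchemaRDD conversion

// Register the two inputs as temp tables (paths and layout are placeholders).
sc.textFile("hdfs:///path/to/small").map(_.split(",")).map(f => Record(f(0), f(1)))
  .registerTempTable("small")
sc.textFile("hdfs:///path/to/big").map(_.split(",")).map(f => Record(f(0), f(1)))
  .registerTempTable("big")

// Raise the threshold (in bytes) so the "small" table is broadcast and joined map-side.
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=1000000")
val joined = sqlContext.sql(
  "SELECT b.key, b.value AS big_value, s.value AS small_value " +
  "FROM big b JOIN small s ON b.key = s.key")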