What you should be doing is a join, something like this:

//Create a key-value pair, the key being column1
val rdd1 = sc.textFile(file1).map(x => (x.split(",")(0), x.split(",")))

//Create a key-value pair, the key being column2
val rdd2 = sc.textFile(file2).map(x => (x.split(",")(1), x.split(",")))

//Now join the two datasets
val joined = rdd1.join(rdd2)

//Now do the replacement
val replaced = joined.map(...)
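One possible expansion of the final map(...) above, as a sketch only (the exact fields to carry through are an assumption about your schema): after the join, each element is (key, (row1, row2)), where row1 is the split line from file1 and row2 the split line from file2.

//Sketch: replace column1 of the file1 row with column1 of the file2 row,
//then rebuild the csv line (adjust the indices to your actual schema)
val replaced = joined.map { case (_, (row1, row2)) =>
  (row2(0) +: row1.drop(1)).mkString(",")
}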
Thanks
Best Regards

On Mon, Mar 21, 2016 at 10:31 AM, Shishir Anshuman <shishiranshu...@gmail.com> wrote:

> I have stored the contents of two csv files in separate RDDs.
>
> file1.csv format: (column1, column2, column3)
> file2.csv format: (column1, column2)
>
> column1 of file1 and column2 of file2 contain similar data. I want to
> compare the two columns, and if a match is found:
>
>    - Replace the data at column1(file1) with column1(file2)
>
> For this reason, I am not using a normal RDD.
>
> I am still new to Apache Spark, so any suggestion will be greatly
> appreciated.
>
> On Mon, Mar 21, 2016 at 10:09 AM, Prem Sure <premsure...@gmail.com> wrote:
>
>> Any specific reason you would like to use collectAsMap only? You could
>> probably move to a normal RDD instead of a pair RDD.
>>
>> On Monday, March 21, 2016, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> You're not getting what Ted is telling you. Your `dict` is an
>>> RDD[String] -- i.e. it is a collection of a single value type, String.
>>> But `collectAsMap` is only defined for PairRDDs that have key-value
>>> pairs for their data elements. Both a key and a value are needed to
>>> collect into a Map[K, V].
>>>
>>> On Sun, Mar 20, 2016 at 8:19 PM, Shishir Anshuman <shishiranshu...@gmail.com> wrote:
>>>
>>>> Yes, I have included that class in my code.
>>>> I guess it's something to do with the RDD format. I am not able to
>>>> figure out the exact reason.
>>>>
>>>> On Fri, Mar 18, 2016 at 9:27 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> It is defined in:
>>>>> core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
>>>>>
>>>>> On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman <shishiranshu...@gmail.com> wrote:
>>>>>
>>>>>> I am using the following code snippet in Scala:
>>>>>>
>>>>>> val dict: RDD[String] = sc.textFile("path/to/csv/file")
>>>>>> val dict_broadcast=sc.broadcast(dict.collectAsMap())
>>>>>>
>>>>>> On compiling, it generates this error:
>>>>>>
>>>>>> scala:42: value collectAsMap is not a member of
>>>>>> org.apache.spark.rdd.RDD[String]
>>>>>>
>>>>>> val dict_broadcast=sc.broadcast(dict.collectAsMap())
>>>>>>                                      ^
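For the compile error that started this thread, a minimal sketch of the fix Mark and Ted point to: collectAsMap comes from PairRDDFunctions, so it becomes available once each line is keyed. Using the first comma-separated field as the key is only an assumption for illustration.

import org.apache.spark.rdd.RDD

//Key each line by its first field so we get an RDD of (key, value) pairs;
//the key choice is an assumption about the csv layout
val dict: RDD[(String, String)] = sc.textFile("path/to/csv/file").map { line =>
  val cols = line.split(",")
  (cols(0), line)
}

//collectAsMap now compiles, since dict is a pair RDD
val dict_broadcast = sc.broadcast(dict.collectAsMap())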