sc.textFile already returns just one RDD for all of your files, so the sc.union is unnecessary, although I don't know whether it adds any overhead. The data is certainly processed in parallel; how it is parallelized depends on where the data lives, i.e. how many InputSplits Hadoop produces for the files.
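For example, reading the directory directly is enough. This is only a rough, untested sketch; the path and the minPartitions value are illustrative, and it assumes a Spark version where textFile takes minPartitions as its second argument:

    // One RDD over every file matched by the glob; Hadoop decides the actual splits.
    // minPartitions is only a lower bound, handy if the files are small and you
    // want more parallelism than the default splits would give you.
    val lines = sc.textFile("/directory/*", minPartitions = 200)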
If you're willing to tolerate a little bit of approximation, use countApproxDistinctByKey instead of a groupBy and map. You can set relativeSD to trade off speed and accuracy. If not, you can probably do better than collecting all of the values for each key and then making a set: you can use aggregateByKey to build up a Set in the first place. (A sketch of both follows the quoted message below.)

On Tue, Aug 19, 2014 at 2:14 AM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I have a piece of code that reads all the (csv) files in a folder. For each
> file, it parses each line, extracts the first 2 elements from each row of
> the file, groups the tuples by the key and finally outputs the number of
> unique values for each key.
>
>     val conf = new SparkConf().setAppName("App")
>     val sc = new SparkContext(conf)
>
>     val user_time = sc.union(sc.textFile("/directory/*"))  // union of all files in the directory
>       .map(line => {
>         val fields = line.split(",")
>         (fields(1), fields(0))                  // extract the first 2 elements
>       })
>       .groupByKey                               // group by timestamp
>       .map(g => (g._1, g._2.toSet.size))        // number of unique ids per timestamp
>
> I have a lot of files in the directory (several hundred). The program takes
> a long time. I am not sure if the union operation is preventing the files
> from being processed in parallel. Is there a better way to parallelize the
> above code? For example, the first two operations (reading each file and
> extracting the first 2 columns from each file) can be done in parallel, but
> I am not sure if that is how Spark schedules the above code.
>
> thanks
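Here is a rough, untested sketch of both suggestions. It assumes `pairs` is the (timestamp, id) pair RDD built by the map in your code; the relativeSD value of 0.01 is just an example:

    // pairs: RDD[(String, String)] of (timestamp, id), as produced by your map step.

    // Approximate: estimates distinct ids per key without materializing the value lists.
    val approxCounts = pairs.countApproxDistinctByKey(relativeSD = 0.01)

    // Exact: build the Set while aggregating, instead of groupByKey followed by toSet.
    val exactCounts = pairs
      .aggregateByKey(Set.empty[String])(
        (set, id) => set + id,    // fold each id into the partition-local set
        (s1, s2) => s1 ++ s2)     // merge the sets from different partitions
      .mapValues(_.size)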