sc.textFile already returns just one RDD for all of your files, so the
sc.union is unnecessary, though I don't know whether it adds any
overhead. The data is certainly processed in parallel; how it is
parallelized depends on where the data lives -- that is, on how many
InputSplits Hadoop produces for it.
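
In other words, you should be able to drop the union entirely -- something
like this untested sketch, based on the map step in your code below:

    // sc.textFile accepts a glob and returns a single RDD spanning all matching files
    val lines = sc.textFile("/directory/*")
    val user_time = lines.map { line =>
      val fields = line.split(",")
      (fields(1), fields(0))   // (timestamp, id)
    }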

If you're willing to tolerate a little bit of approximation, use
countApproxDistinctByKey instead of the groupByKey and map. You can set
relativeSD to trade off speed and accuracy.
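
Roughly (untested sketch; 0.01 is just an example relativeSD, and I think
the default is 0.05):

    // Approximate count of distinct ids per timestamp
    val approxUniques = user_time.countApproxDistinctByKey(relativeSD = 0.01)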

If not, you can probably do better than collecting all of the values for
a key and then making a set. You can use aggregateByKey to build up the
Set in the first place.
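
Something along these lines (untested; assumes the (timestamp, id) pairs
produced by your map step):

    val uniquesPerKey = user_time
      .aggregateByKey(Set.empty[String])(
        (set, id) => set + id,   // add each id to the per-partition set
        (s1, s2) => s1 ++ s2)    // merge per-partition sets
      .mapValues(_.size)         // number of unique ids per timestamp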

On Tue, Aug 19, 2014 at 2:14 AM, SK <skrishna...@gmail.com> wrote:
>
> Hi,
>
> I have a piece of code that reads all the (csv) files in a folder. For each
> file, it parses each line, extracts the first 2 elements from each row,
> groups the tuples by key, and finally outputs the number of unique values
> for each key.
>
>     val conf = new SparkConf().setAppName("App")
>     val sc = new SparkContext(conf)
>
>     val user_time = sc.union(sc.textFile("/directory/*"))  // union of all files in the directory
>                       .map(line => {
>                           val fields = line.split(",")
>                           (fields(1), fields(0))            // extract first 2 elements
>                       })
>                       .groupByKey                           // group by timestamp
>                       .map(g => (g._1, g._2.toSet.size))    // number of unique ids per timestamp
>
> I have a lot of files in the directory (several hundred). The program takes
> a long time. I am not sure if the union operation is preventing the files
> from being processed in parallel. Is there a better way to parallelize the
> above code? For example, the first two operations (reading each file and
> extracting the first 2 columns) can be done in parallel, but I am not sure
> if that is how Spark schedules the above code.
>
> thanks

