Any input/suggestions on parallelizing the operations below using Spark rather than Java thread pooling?

- reading 100 thousand JSON files from the local file system
- processing each file's content and submitting it to Solr as an input document (see the sketch below)
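A minimal sketch of what I have in mind, assuming SolrJ 6+ (for HttpSolrClient.Builder) and a hypothetical endpoint http://localhost:8983/solr/collection1; the input path, the field names (id, content), the batch size of 500, and the 8 partitions are all illustrative, not tuned values. The key idea is foreachPartition: each partition shares one SolrJ client and posts documents in batches instead of making one call per file.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class JsonToSolr {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("json-to-solr").setMaster("local[8]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // (filePath, fileContent) pairs; repartition so several tasks post in parallel
        JavaPairRDD<String, String> rdd =
                sc.wholeTextFiles("/path/to/json/dir").repartition(8);

        rdd.foreachPartition(files -> {
            // one client and one buffer per partition, created on the executor
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/collection1").build(); // hypothetical endpoint
            List<SolrInputDocument> batch = new ArrayList<>();
            while (files.hasNext()) {
                Tuple2<String, String> file = files.next();
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file._1());      // file path as the unique key
                doc.addField("content", file._2()); // raw JSON body
                batch.add(doc);
                if (batch.size() >= 500) {          // post in batches, not per file
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
            solr.close();
        });

        sc.stop();
    }
}

Batching per partition keeps the number of Solr round trips down, and building the client inside the closure means nothing connection-related has to be shipped from the driver.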
Thanks,
Susheel

On Mon, Nov 16, 2015 at 5:44 PM, Susheel Kumar <susheel2...@gmail.com> wrote:

> Hello Spark Users,
>
> This is my first email to the Spark mailing list, and I am looking forward
> to being part of it. I have been working on Solr and in the past have used
> Java thread pooling to parallelize Solr indexing with SolrJ.
>
> Now I am again indexing data, this time from JSON files (about 100
> thousand of them). Before I try parallelizing the operations using Spark
> (read each JSON file, post its content to Solr), I wanted to confirm my
> understanding:
>
> - Would reading the JSON files using wholeTextFiles and then posting the
> content to Solr be similar to what I achieve with Java multi-threading /
> thread pooling using the Executor framework?
> - What additional advantages would I get by using Spark (less code, ...)?
> - How can we parallelize/batch this further? For example, my Java
> multi-threaded version parallelizes not only the reading / data
> acquisition but also the posting, in batches.
>
> Below is a code snippet to give you an idea of what I am thinking of
> starting with. Please feel free to suggest corrections to my understanding
> and to the code structure.
>
> SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]");
>
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> JavaPairRDD<String, String> rdd = sc.wholeTextFiles("/../*.json");
>
> rdd.foreach(new VoidFunction<Tuple2<String, String>>() {
>
>     @Override
>     public void call(Tuple2<String, String> arg0) throws Exception {
>         // post the file content (arg0._2) to Solr
>         ...
>     }
> });
>
> Thanks,
>
> Susheel
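For comparison, a minimal sketch of the plain-Java baseline described in the quoted message (fixed thread pool plus batched posting), using the same hypothetical Solr endpoint; the directory path, pool size, batch size, and field names are again illustrative.

import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ThreadPoolIndexer {

    static final int BATCH_SIZE = 500;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);

        // collect *.json paths and submit one indexing task per batch
        List<Path> batch = new ArrayList<>();
        try (DirectoryStream<Path> dir =
                 Files.newDirectoryStream(Paths.get("/path/to/json/dir"), "*.json")) {
            for (Path p : dir) {
                batch.add(p);
                if (batch.size() == BATCH_SIZE) {
                    final List<Path> task = new ArrayList<>(batch);
                    pool.submit(() -> indexBatch(task));
                    batch.clear();
                }
            }
        }
        if (!batch.isEmpty()) {
            final List<Path> task = new ArrayList<>(batch);
            pool.submit(() -> indexBatch(task));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // each task reads its files and posts them to Solr as a single add() call
    static void indexBatch(List<Path> paths) {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/collection1").build()) {
            List<SolrInputDocument> docs = new ArrayList<>();
            for (Path p : paths) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", p.toString());
                doc.addField("content",
                        new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                docs.add(doc);
            }
            solr.add(docs);
        } catch (Exception e) {
            e.printStackTrace(); // real code would log and retry
        }
    }
}

This is roughly what wholeTextFiles plus foreachPartition replaces: Spark takes over the file listing, task scheduling, and failed-task retries, while the batching logic itself stays essentially the same.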