Re: Parallelizing operations using Spark
You should try passing your Solr writer into rdd.foreachPartition() for max parallelism: each partition on each executor will execute the function passed in.

HTH,
Duc
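A minimal sketch of that approach, assuming SolrJ 5.x; the solrUrl value, the 1000-document batch size, and the toSolrDoc() JSON-to-SolrInputDocument helper are placeholders, not anything specified in this thread:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.spark.api.java.function.VoidFunction;
    import scala.Tuple2;

    final String solrUrl = "http://localhost:8983/solr/collection1"; // placeholder URL

    rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {
        @Override
        public void call(Iterator<Tuple2<String, String>> files) throws Exception {
            // One client per partition, created on the executor rather than the driver.
            SolrClient client = new HttpSolrClient(solrUrl);
            try {
                List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                while (files.hasNext()) {
                    Tuple2<String, String> file = files.next(); // (path, content)
                    batch.add(toSolrDoc(file._2())); // toSolrDoc: your JSON-to-document conversion
                    if (batch.size() >= 1000) { // batch size is a tuning knob, not a given
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                client.commit();
            } finally {
                client.close();
            }
        }
    });

This both parallelizes across executors (one task per partition) and batches the posts within each partition, which a plain rdd.foreach() cannot do since it hands you one document at a time.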
Re: Parallelizing operations using Spark
Any input/suggestions on parallelizing the operations below using Spark rather than Java thread pooling?

- reading 100 thousand JSON files from the local file system
- processing each file's content and submitting it to Solr as an input document

Thanks,
Susheel
Parallelizing operations using Spark
Hello Spark Users,

This is my first email to the Spark mailing list, and I am looking forward to it. I have been working on Solr and in the past have used Java thread pooling to parallelize Solr indexing using SolrJ.

Now I am again working on indexing data, this time from JSON files (in the 100 thousands). Before I try parallelizing the operations using Spark (reading each JSON file, posting its content to Solr), I wanted to confirm my understanding.

Reading the JSON files using wholeTextFiles and then posting the content to Solr:

- would this be similar to what I would achieve using Java multi-threading / thread pooling with the Executor framework?
- what additional advantages would I get by using Spark (less code, ...)?
- how can we parallelize/batch this further? For example, in my Java multi-threaded version I parallelize not only the reading / data acquisition but also the posting, which happens in batches, in parallel.

Below is a code snippet to give you an idea of what I am thinking of starting with. Please feel free to suggest/correct my understanding and the code structure below.

    SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]");

    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaPairRDD<String, String> rdd = sc.wholeTextFiles("/../*.json");

    rdd.foreach(new VoidFunction<Tuple2<String, String>>() {
        @Override
        public void call(Tuple2<String, String> arg0) throws Exception {
            // post content to Solr
            // arg0._2() holds the file content
            ...
            ...
        }
    });

Thanks,

Susheel
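On the "parallelize/batch this further" question above, one knob the snippet leaves at its default is the number of partitions: wholeTextFiles accepts an optional minPartitions argument, and an RDD can also be repartitioned before the foreach so that the work splits into more tasks. A small sketch; the value 16 is an arbitrary example, not a recommendation:

    // ask for at least 16 partitions when loading, so up to 16 tasks can run in parallel
    JavaPairRDD<String, String> rdd = sc.wholeTextFiles("/../*.json", 16);

    // or redistribute an already-loaded RDD across 16 partitions
    rdd = rdd.repartition(16);

With local[8], at most 8 tasks execute concurrently, but finer partitioning keeps all cores busy and limits how many whole files each task handles at a time.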