Any input/suggestions on parallelizing the operations below using Spark rather than Java thread pooling?

- reading 100 thousand JSON files from the local file system
- processing each file's content and submitting it to Solr as an input document (see the sketch below)
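A minimal sketch of what I have in mind, assuming SolrJ 6+ (for HttpSolrClient.Builder) and a hypothetical endpoint http://localhost:8983/solr/collection1; the input path, the field names (id, content), the batch size of 500, and the 8 partitions are all illustrative, not tuned values. The key idea is foreachPartition: each partition shares one SolrJ client and posts documents in batches instead of making one call per file.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class JsonToSolr {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("json-to-solr").setMaster("local[8]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // (filePath, fileContent) pairs; repartition so several tasks post in parallel
        JavaPairRDD<String, String> rdd =
                sc.wholeTextFiles("/path/to/json/dir").repartition(8);

        rdd.foreachPartition(files -> {
            // one client and one buffer per partition, created on the executor
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/collection1").build(); // hypothetical endpoint
            List<SolrInputDocument> batch = new ArrayList<>();
            while (files.hasNext()) {
                Tuple2<String, String> file = files.next();
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file._1());      // file path as the unique key
                doc.addField("content", file._2()); // raw JSON body
                batch.add(doc);
                if (batch.size() >= 500) {          // post in batches, not per file
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
            solr.close();
        });

        sc.stop();
    }
}

Batching per partition keeps the number of Solr round trips down, and building the client inside the closure means nothing connection-related has to be shipped from the driver.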
Thanks,
Susheel

On Mon, Nov 16, 2015 at 5:44 PM, Susheel Kumar <susheel2...@gmail.com> wrote:

> Hello Spark Users,
>
> This is my first email to the Spark mailing list, and I am looking forward
> to being part of it. I have been working on Solr and in the past have used
> Java thread pooling to parallelize Solr indexing with SolrJ.
>
> Now I am again indexing data, this time from JSON files (about 100
> thousand of them). Before I try parallelizing the operations using Spark
> (read each JSON file, post its content to Solr), I wanted to confirm my
> understanding:
>
> - Would reading the JSON files using wholeTextFiles and then posting the
> content to Solr be similar to what I achieve with Java multi-threading /
> thread pooling using the Executor framework?
> - What additional advantages would I get by using Spark (less code, ...)?
> - How can we parallelize/batch this further? For example, my Java
> multi-threaded version parallelizes not only the reading / data
> acquisition but also the posting, in batches.
>
> Below is a code snippet to give you an idea of what I am thinking of
> starting with. Please feel free to suggest corrections to my understanding
> and to the code structure.
>
> SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]");
>
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> JavaPairRDD<String, String> rdd = sc.wholeTextFiles("/../*.json");
>
> rdd.foreach(new VoidFunction<Tuple2<String, String>>() {
>
>     @Override
>     public void call(Tuple2<String, String> arg0) throws Exception {
>         // post the file content (arg0._2) to Solr
>         ...
>     }
> });
>
> Thanks,
>
> Susheel
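For comparison, a minimal sketch of the plain-Java baseline described in the quoted message (fixed thread pool plus batched posting), using the same hypothetical Solr endpoint; the directory path, pool size, batch size, and field names are again illustrative.

import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ThreadPoolIndexer {

    static final int BATCH_SIZE = 500;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);

        // collect *.json paths and submit one indexing task per batch
        List<Path> batch = new ArrayList<>();
        try (DirectoryStream<Path> dir =
                 Files.newDirectoryStream(Paths.get("/path/to/json/dir"), "*.json")) {
            for (Path p : dir) {
                batch.add(p);
                if (batch.size() == BATCH_SIZE) {
                    final List<Path> task = new ArrayList<>(batch);
                    pool.submit(() -> indexBatch(task));
                    batch.clear();
                }
            }
        }
        if (!batch.isEmpty()) {
            final List<Path> task = new ArrayList<>(batch);
            pool.submit(() -> indexBatch(task));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // each task reads its files and posts them to Solr as a single add() call
    static void indexBatch(List<Path> paths) {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/collection1").build()) {
            List<SolrInputDocument> docs = new ArrayList<>();
            for (Path p : paths) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", p.toString());
                doc.addField("content",
                        new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                docs.add(doc);
            }
            solr.add(docs);
        } catch (Exception e) {
            e.printStackTrace(); // real code would log and retry
        }
    }
}

This is roughly what wholeTextFiles plus foreachPartition replaces: Spark takes over the file listing, task scheduling, and failed-task retries, while the batching logic itself stays essentially the same.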