Hello Spark Users,
This is my first email to the Spark mailing list, and I am looking forward
to the discussion. I have been working with Solr and have previously used
Java thread pooling to parallelize Solr indexing with SolrJ.
Now I am again working on indexing data, this time from JSON files
(hundreds of thousands of them), and before I try parallelizing the
operations with Spark (reading each JSON file and posting its content to
Solr) I wanted to confirm my understanding.
If I read the JSON files using wholeTextFiles and then post the content to
Solr:
- Would this be similar to what I would achieve using Java multi-threading /
thread pooling with the Executor framework?
- What additional advantages would I get by using Spark (less code...)?
- How can we parallelize/batch this further? For example, in my Java
multi-threaded code I parallelize not only the reading / data acquisition
but also the posting, which is done in batches in parallel.
Below is a code snippet to give you an idea of how I am thinking of
starting. Please feel free to suggest changes or correct my understanding
and the code structure below.
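import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]");
JavaSparkContext sc = new JavaSparkContext(conf);
// Each element is a (file path, file content) pair
JavaPairRDD<String,String> rdd = sc.wholeTextFiles("/../*.json");
rdd.foreach(new VoidFunction<Tuple2<String,String>>() {
    @Override
    public void call(Tuple2<String, String> arg0) throws Exception {
        // post content to Solr
        String content = arg0._2();
        ...
        ...
    }
});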
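To make the batching question more concrete, here is a rough sketch of the
per-partition batching I have in mind. It is plain Java with no Spark or
SolrJ dependency so that it runs on its own. BatchingSketch, postInBatches
and the sender callback are hypothetical names used only for illustration.
My assumption is that in a real job the iterator would come from
foreachPartition and the callback would send each batch to Solr through
SolrJ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Spark or SolrJ.
class BatchingSketch {

    // Drains the iterator and hands the items to `sender` in groups of at
    // most `batchSize`. Returns the number of batches that were sent.
    static <T> int postInBatches(Iterator<T> docs, int batchSize,
                                 Consumer<List<T>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        while (docs.hasNext()) {
            batch.add(docs.next());
            if (batch.size() == batchSize) {
                // Pass a copy so the sender may keep the list safely.
                sender.accept(new ArrayList<>(batch));
                batch.clear();
                batches++;
            }
        }
        // Send whatever is left over as a final, smaller batch.
        if (!batch.isEmpty()) {
            sender.accept(new ArrayList<>(batch));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in data; in the real job these would be file contents.
        List<String> docs = Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5");
        int sent = postInBatches(docs.iterator(), 2,
                batch -> System.out.println("would post " + batch));
        System.out.println(sent + " batches");
    }
}
```

With a batch size of 2 and five items, this sends three batches, the last
one holding a single item. Whether this is the right way to batch inside
Spark is part of what I would like to confirm.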
Thanks,
Susheel