Re: Parallelizing operations using Spark

2015-11-17 Thread PhuDuc Nguyen
You should try passing your Solr writer into rdd.foreachPartition() for maximum
parallelism - the function you pass in runs once per partition, and the partitions
are processed in parallel across the executors. That also lets you create one Solr
client per partition and post the documents in batches. A rough sketch is below.

HTH,
Duc


Re: Parallelizing operations using Spark

2015-11-17 Thread Susheel Kumar
Any input/suggestions on parallelizing the operations below using Spark rather than
Java thread pooling?
- reading hundreds of thousands of JSON files from the local file system
- processing each file's content and submitting it to Solr as an input document

Thanks,
Susheel


Parallelizing operations using Spark

2015-11-16 Thread Susheel Kumar
Hello Spark Users,

This is my first email to the Spark mailing list, and I am looking forward to being
part of it. I have been working with Solr and in the past used Java thread pooling to
parallelize Solr indexing with SolrJ.

I am now indexing data again, this time from JSON files (hundreds of thousands of
them). Before I try parallelizing the operations using Spark (read each JSON file,
post its content to Solr), I wanted to confirm my understanding.


If I read the JSON files using wholeTextFiles and then post the content to Solr:

- would that be similar to what I achieve today with Java multi-threading / thread
pooling via the Executor framework, and
- what additional advantages would I get by using Spark (less code, ...)?
- How can we parallelize/batch this further? For example, in my Java multi-threaded
version I parallelize not only the reading / data acquisition but also the posting,
which happens in parallel batches (a rough sketch of that version follows this list).
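
Roughly, the thread-pool version I have in mind looks like this (simplified;
jsonFiles is the list of files to index, partitionIntoChunks is a hypothetical
helper, and the pool size, batch size, URL, and field names are only illustrative):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

ExecutorService pool = Executors.newFixedThreadPool(8);
final SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection");

for (final List<File> chunk : partitionIntoChunks(jsonFiles, 500)) {  // hypothetical helper
    pool.submit(new Runnable() {
        @Override
        public void run() {
            try {
                List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                for (File f : chunk) {
                    String json = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", f.getAbsolutePath());
                    doc.addField("content", json);
                    batch.add(doc);
                }
                solr.add(batch);  // one batched update per chunk, submitted in parallel
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
}

pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);
solr.commit();
solr.close();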


Below is the code snippet I am thinking of starting with. Please feel free to
suggest corrections to my understanding and to the code structure.

SparkConf conf = new SparkConf().setAppName(appName).setMaster("local[8]");

JavaSparkContext sc = new JavaSparkContext(conf);

// wholeTextFiles returns (file path, file content) pairs
JavaPairRDD<String, String> rdd = sc.wholeTextFiles("/../*.json");

rdd.foreach(new VoidFunction<Tuple2<String, String>>() {

    @Override
    public void call(Tuple2<String, String> arg0) throws Exception {
        // post the file content (arg0._2) to Solr
        ...
    }
});


Thanks,

Susheel