What kind of cluster? How many cores on each worker? Is there any config on
the HTTP Solr client? I remember the standard HttpClient has a limit on
connections per route/host.
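If that per-route limit is the bottleneck, something like the sketch below
should raise the caps through SolrJ's HttpClientUtil. Untested sketch,
assuming SolrJ 5.x; the base URL and the limit values are placeholders:

    import org.apache.solr.client.solrj.impl.{HttpClientUtil, HttpSolrClient}
    import org.apache.solr.common.params.ModifiableSolrParams

    // Raise the HttpClient pool limits before building the Solr client.
    val params = new ModifiableSolrParams()
    params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 128)         // total pool
    params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 32) // per route/host
    val httpClient = HttpClientUtil.createClient(params)

    // Placeholder URL -- point it at the real core/collection.
    val solr = new HttpSolrClient("http://solr-host:8983/solr/collection1", httpClient)

HttpSolrClient is thread-safe, so a client built this way can also be shared
across tasks in one JVM instead of building one per partition.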
On Aug 2, 2015 8:17 PM, "Sujit Pal" <sujitatgt...@gmail.com> wrote:

> No one has any ideas?
>
> Is there some more information I should provide?
>
> I am looking for ways to increase the parallelism among workers. Currently
> the number of simultaneous connections to Solr I see equals the number of
> workers. My number of partitions is larger (2.5x) than the number of
> workers, and the workers seem to be large enough to handle more than one
> task at a time.
>
> I am creating a single client per partition in my mapPartitions call. Not
> sure if that is what is gating things? Perhaps I should use a pool of
> clients instead?
>
> Would really appreciate some pointers.
>
> Thanks in advance for any help you can provide.
>
> -sujit
>
> On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to run a Spark job that hits an external webservice to get
>> back some information. The cluster is 1 master + 4 workers; each worker
>> has 60GB RAM and 4 CPUs. The external webservice is a standalone Solr
>> server, and is accessed using code similar to that shown below.
>>
>> def getResults(keyValues: Iterator[(String, Array[String])]):
>>     Iterator[(String, String)] = {
>>   val solr = new HttpSolrClient()
>>   initializeSolrParameters(solr)
>>   keyValues.map(keyValue => (keyValue._1, process(solr, keyValue)))
>> }
>>
>> myRDD.repartition(10)
>>   .mapPartitions(keyValues => getResults(keyValues))
>>
>> The mapPartitions call does some per-partition initialization of the
>> SolrJ client and then hits it for each record in the partition via the
>> getResults() call.
>>
>> I repartitioned in the hope that this would result in 10 clients hitting
>> Solr simultaneously (I would like to go up to maybe 30-40 simultaneous
>> clients if I can). However, I counted the number of open connections
>> using netstat -anp | grep ":8983.*ESTABLISHED" in a loop on the Solr box
>> and observed that Solr has a constant 4 clients (i.e., equal to the
>> number of workers) over the lifetime of the run.
>>
>> My observation leads me to believe that each worker processes a single
>> stream of work sequentially. However, from what I understand about how
>> Spark works, each worker should be able to process multiple tasks in
>> parallel, and repartition() is a hint for it to do so.
>>
>> Is there some SparkConf environment variable I should set to increase
>> parallelism in these workers, or should I just configure a cluster with
>> multiple workers per machine? Or is there something I am doing wrong?
>>
>> Thank you in advance for any pointers you can provide.
>>
>> -sujit
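On the SparkConf question above: on a standalone cluster, per-executor task
concurrency is normally governed by spark.executor.cores and spark.cores.max
(plus SPARK_WORKER_CORES in spark-env.sh on each worker). A minimal sketch
with illustrative values, not a tuned recommendation:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("solr-lookup")        // placeholder app name
      .set("spark.executor.cores", "4") // concurrent task slots per executor
      .set("spark.cores.max", "16")     // cap across the whole app (standalone)
    val sc = new SparkContext(conf)

    // With 4 workers x 4 cores and these settings, up to 16 tasks -- and so
    // up to 16 partition-local Solr clients -- could be in flight at once.

The standalone master web UI shows the cores granted per worker, which should
confirm whether each executor really is running only one task at a time.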