Uff.... hello fine people.

So the cause of the above issue was, unsurprisingly, human error. I found a
local[*] spark master config which gazumped my own one.... so mystery
solved. But I have another question, that is still the crux of this problem:

Here's a bit of trimmed code, that I'm currently testing with. I
deliberately stuck in a repartition(50), just to force it to, what I
believe was chunk it up and distribute it. Which is all good.

override def run(): Unit = {
    ...

    val rdd = new MemexCrawlDbRDD(sc, job, maxGroups = topG, topN = topN)
    val f = rdd.map(r => (r.getGroup, r))
      .groupByKey().repartition(50);

    val c = f.getNumPartitions
      val fetchedRdd = f.flatMap({ case (grp, rs) => new
FairFetcher(job, rs.iterator, localFetchDelay,
        FetchFunction, ParseFunction, OutLinkFilterFunction,
StatusUpdateSolrTransformer) })
      .persist()

    val d = fetchedRdd.getNumPartitions

    ...

    val scoredRdd = score(fetchedRdd)

    ...

}

def score(fetchedRdd: RDD[CrawlData]): RDD[CrawlData] = {
  val job = this.job.asInstanceOf[SparklerJob]

  val scoredRdd = fetchedRdd.map(d => ScoreFunction(job, d))

  val scoreUpdateRdd: RDD[SolrInputDocument] = scoredRdd.map(d =>
ScoreUpdateSolrTransformer(d))
  val scoreUpdateFunc = new SolrStatusUpdate(job)
  sc.runJob(scoreUpdateRdd, scoreUpdateFunc)

}


Basically for anyone new to this, the business logic lives inside the
FairFetcher and I need that distributed over all the nodes in spark cluster.

Here's a quick illustration of what I'm seeing:
https://pasteboard.co/K7VovBO.png

It chunks up to code and distributes the tasks across the cluster, but that
occurs _prior_ to the business logic  in the FlatMap being executed.

So specifically, has anyone got any ideas about how to split that flatmap
operation up so the RDD processing runs across the nodes, not limited to a
single node?

Thanks for all your help so far,

Tom

On Wed, Jun 9, 2021 at 8:08 PM Tom Barber <t...@spicule.co.uk> wrote:

> Ah no sorry, so in the load image, the crawl has just kicked off on the
> driver node which is why its flagged red and the load is spiking.
> https://pasteboard.co/K5QHOJN.png here's the cluster now its been running
> a while. The red node is still (and is always every time I tested it) the
> driver node.
>
> Tom
>
>
>
> On Wed, Jun 9, 2021 at 8:03 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Where do you see that ... I see 3 executors busy at first. If that's the
>> crawl then ?
>>
>> On Wed, Jun 9, 2021 at 1:59 PM Tom Barber <t...@spicule.co.uk> wrote:
>>
>>> Yeah :)
>>>
>>> But it's all running through the same node. So I can run multiple tasks
>>> of the same type on the same node(the driver), but I can't run multiple
>>> tasks on multiple nodes.
>>>
>>> On Wed, Jun 9, 2021 at 7:57 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> Wait. Isn't that what you were trying to parallelize in the first place?
>>>>
>>>> On Wed, Jun 9, 2021 at 1:49 PM Tom Barber <t...@spicule.co.uk> wrote:
>>>>
>>>>> Yeah but that something else is the crawl being run, which is
>>>>> triggered from inside the RDDs, because the log output is slowly 
>>>>> outputting
>>>>> crawl data.
>>>>>
>>>>>
>>> Spicule Limited is registered in England & Wales. Company Number:
>>> 09954122. Registered office: First Floor, Telecom House, 125-135 Preston
>>> Road, Brighton, England, BN1 6AF. VAT No. 251478891.
>>>
>>>
>>> All engagements are subject to Spicule Terms and Conditions of Business.
>>> This email and its contents are intended solely for the individual to whom
>>> it is addressed and may contain information that is confidential,
>>> privileged or otherwise protected from disclosure, distributing or copying.
>>> Any views or opinions presented in this email are solely those of the
>>> author and do not necessarily represent those of Spicule Limited. The
>>> company accepts no liability for any damage caused by any virus transmitted
>>> by this email. If you have received this message in error, please notify us
>>> immediately by reply email before deleting it from your system. Service of
>>> legal notice cannot be effected on Spicule Limited by email.
>>>
>>

-- 


Spicule Limited is registered in England & Wales. Company Number: 
09954122. Registered office: First Floor, Telecom House, 125-135 Preston 
Road, Brighton, England, BN1 6AF. VAT No. 251478891.




All engagements 
are subject to Spicule Terms and Conditions of Business. This email and its 
contents are intended solely for the individual to whom it is addressed and 
may contain information that is confidential, privileged or otherwise 
protected from disclosure, distributing or copying. Any views or opinions 
presented in this email are solely those of the author and do not 
necessarily represent those of Spicule Limited. The company accepts no 
liability for any damage caused by any virus transmitted by this email. If 
you have received this message in error, please notify us immediately by 
reply email before deleting it from your system. Service of legal notice 
cannot be effected on Spicule Limited by email.

Reply via email to