I mean getResults is called only after foo has been called on all records.

This is useful when foo is an asynchronous call to an external service,
such as a REST API (an IO operation), that returns a Future carrying some
additional data.
If such an API has a latency of 100 ms, sending all the requests (say, for
1000 records) before waiting on the first result gives a total latency of
roughly 100 ms, assuming the service can handle that many concurrent
requests.
If you instead invoke foo sequentially (hit the REST API), wait for its
result, and only then process the next record, you lose around 100 ms on
each record.
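
To make the difference concrete, here is a minimal sketch in plain Scala
(no Spark needed to run it). The foo below is a hypothetical stand-in that
simulates the remote call with a sleep, and the 10-second timeout is an
arbitrary choice; neither comes from the real service.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object EagerFutures {
  // Hypothetical stand-in for the external call: sleeps ~100 ms, then
  // returns a value. A real client would return its own Future.
  def foo(x: Int): Future[Int] = Future {
    blocking { Thread.sleep(100) }
    x * 2
  }

  // Chained lazily: Iterator.map is lazy, so each request is started and
  // awaited before the next one is even issued.
  def sequential(it: Iterator[Int]): Iterator[Int] =
    it.map(x => foo(x)).map(f => Await.result(f, 10.seconds))

  // Eager: toList forces every foo call first, so all requests are in
  // flight before we start waiting on the first result.
  def eager(it: Iterator[Int]): Iterator[Int] = {
    val pending = it.map(x => foo(x)).toList
    pending.iterator.map(f => Await.result(f, 10.seconds))
  }
}
```

Inside Spark this would be used as
rdd.mapPartitions(it => EagerFutures.eager(it)), at the cost of holding one
partition's pending futures in memory. How close the eager version gets to
a single round trip depends on how many requests the client and the service
can really run at the same time.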

Ideally (from the external service's point of view), the requests would be
not only asynchronous but also batched.
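
A rough sketch of the batched variant, assuming the service offered a bulk
endpoint. fooBatch and the batch size are hypothetical; Iterator.grouped is
the standard-library call that splits the partition into chunks.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BatchedFutures {
  // Hypothetical bulk call: one simulated round trip for a whole batch.
  def fooBatch(xs: Seq[Int]): Future[Seq[Int]] = Future {
    blocking { Thread.sleep(100) }
    xs.map(_ * 2)
  }

  // Split the partition into batches, fire all batch requests eagerly
  // (toList), then wait for the results in their original order.
  def batched(it: Iterator[Int], batchSize: Int): Iterator[Int] = {
    val pending = it.grouped(batchSize).map(xs => fooBatch(xs)).toList
    pending.iterator.flatMap(f => Await.result(f, 10.seconds))
  }
}
```

With 1000 records and a batch size of 100 this issues 10 requests instead
of 1000, which is usually kinder to the remote service.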

2015-10-16 12:08 GMT+02:00 Sean Owen <so...@cloudera.com>:

> If you mean, getResult is called on the result of foo for each record,
> then that already happens. If you mean getResults is called only after foo
> has been called on all records, then you have to collect to a list, yes.
>
> Why does it help with foo being slow in either case though?
> You can try to consume the iterator in parallel with ".par" if that's what
> you're getting at.
>
> On Fri, Oct 16, 2015 at 10:47 AM, alberskib <albers...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am wondering whether there is a way to ensure that two consecutive maps
>> inside mapPartitions will not be chained together.
>>
>> To illustrate my question I prepared short example:
>>
>> rdd.mapPartitions(it => {
>>     it.map(x => foo(x)).map(y => y.getResult)
>> })
>>
>> I would like to ensure that the foo method is applied to all records (from
>> the partition) and only after that is getResult invoked on each record.
>> This could be beneficial when foo is some kind of time-consuming IO
>> operation, e.g. a request to an external service for data (data that
>> couldn't be prefetched).
>>
>> I know that converting the iterator into a list will do the job, but maybe
>> there is a cleverer way of doing it.
>>
>
