Additional observation: map and mapValues are pipelined and, as expected,
executed in pairs. This means there is a simple sequence of steps for each
value of K: first the read from Cassandra, then the processing. This is
exactly the behaviour of a plain Java loop with these two steps inside. I
understand that this eliminates an up-front batch load and the pile-up of
massive text arrays.
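To make the pipelining concrete, here is a minimal plain-Java sketch (not the actual job code; all class and method names are assumptions) of what "read then process, key by key" looks like. A log records the event order, showing that each key's processing runs before the next key's read, so only one large value is alive at a time:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java analogue of the pipelined map -> mapValues pair: for each
// key, the read and the processing run back to back.
public class PipelineSketch {

    // Stand-in for the Cassandra read: returns one large value per key.
    static List<String> readLargeValue(String key, List<String> log) {
        log.add("read:" + key);
        return Arrays.asList(key + "-block");   // placeholder for the ~50K text
    }

    // Stand-in for the processing step: reduces the large value
    // to a very small list.
    static List<Integer> process(List<String> value, List<String> log) {
        log.add("process");
        List<Integer> out = new ArrayList<>();
        for (String s : value) out.add(s.length());
        return out;
    }

    // The pipelined loop: read then process, key by key, so large
    // values never accumulate.
    static List<String> run(List<String> keys) {
        List<String> log = new ArrayList<>();
        for (String k : keys) {
            process(readLargeValue(k, log), log);
        }
        return log;
    }
}
```

The event log comes out interleaved (read, process, read, process, ...) rather than all reads followed by all processing, which is the batch-loading pattern the pipelining avoids.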

Also, the keys are distributed relatively evenly across the Executors.

The question is: why is this still so slow? I would appreciate any
suggestions on where to focus my search.

Thank you,
Oleg



On 6 June 2014 16:24, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:

> Hi All,
>
> I am passing Java static methods into the RDD transformations map and
> mapValues. The first map goes from a simple string K to a (K,V) pair, where
> V is a Java ArrayList of large text strings, 50K each, read from Cassandra.
> mapValues then processes these text blocks into very small ArrayLists.
>
> The code runs quite slowly compared to running it in parallel on the same
> servers from plain Java.
>
> I gave the same heap to the Executors and to Java. Does Java run slower
> under Spark, or am I suffering from excess heap pressure, or am I missing
> something?
>
> Thank you for any insight,
> Oleg
>
>
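For reference, a plain-Java analogue (not the Spark API, and not the actual job code; the fetch and summarize helpers are assumptions for illustration) of the two transformations described in the quoted message: a map from key K to (K, V) with V a list of large strings, followed by a mapValues-style step that shrinks each V to a very small list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the two-step structure: K -> (K, V), then V -> small list.
public class TwoStepSketch {

    // Stand-in for the Cassandra read of large text blocks for a key.
    static ArrayList<String> fetchBlocks(String key) {
        return new ArrayList<>(Arrays.asList(key + "-text-1", key + "-text-2"));
    }

    // Stand-in for the per-value processing into a very small list.
    static ArrayList<Integer> summarize(ArrayList<String> blocks) {
        ArrayList<Integer> out = new ArrayList<>();
        for (String b : blocks) out.add(b.length());
        return out;
    }

    static Map<String, ArrayList<Integer>> run(List<String> keys) {
        // Step 1, like the first map: K -> (K, V) with V the large values.
        Map<String, ArrayList<String>> withValues = new LinkedHashMap<>();
        for (String k : keys) withValues.put(k, fetchBlocks(k));
        // Step 2, like mapValues: process each V, keeping the key unchanged.
        Map<String, ArrayList<Integer>> result = new LinkedHashMap<>();
        withValues.forEach((k, v) -> result.put(k, summarize(v)));
        return result;
    }
}
```

In Spark the same shape would be a pair RDD built from the first map, followed by mapValues on it; the sketch just mirrors that structure with ordinary collections.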


-- 
Kind regards,

Oleg
