An additional observation: map and mapValues are pipelined and, as expected, executed in pairs. For each key K there is a simple sequence of two steps: first the read from Cassandra, then the processing of the values that were read. This is exactly the behaviour of a normal Java loop with these two steps inside. As I understand it, this avoids loading everything as a batch first and piling up massive arrays of text.
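The pipelining described above can be sketched in plain Java. This is only an illustration, not Spark code: it assumes a sequential java.util.stream pipeline as a stand-in for the map/mapValues chain, and loadPair and processPair are hypothetical placeholders for the Cassandra read and the text processing. The trace list records the order in which the two steps run for each key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PipelineSketch {
    // Records the order in which the two steps are executed.
    static final List<String> trace = new ArrayList<>();

    // Hypothetical stand-in for the Cassandra read: K -> (K, list of text blocks).
    static Map.Entry<String, List<String>> loadPair(String key) {
        trace.add("load:" + key);
        List<String> blocks = new ArrayList<>();
        blocks.add(key + "-text-1");
        blocks.add(key + "-text-2");
        return new SimpleEntry<String, List<String>>(key, blocks);
    }

    // Hypothetical stand-in for the processing step: reduces each text block
    // to something small (here, just its length).
    static Map.Entry<String, List<Integer>> processPair(Map.Entry<String, List<String>> pair) {
        trace.add("process:" + pair.getKey());
        List<Integer> small = new ArrayList<>();
        for (String block : pair.getValue()) {
            small.add(block.length());
        }
        return new SimpleEntry<String, List<Integer>>(pair.getKey(), small);
    }

    // Chains the two static methods. A sequential stream is lazy, so each key
    // passes through both steps before the next key is read.
    static List<Map.Entry<String, List<Integer>>> run(List<String> keys) {
        trace.clear();
        return keys.stream()
                .map(PipelineSketch::loadPair)
                .map(PipelineSketch::processPair)
                .collect(Collectors.toList());
    }
}
```

Calling PipelineSketch.run with the keys "a" and "b" leaves the trace as load:a, process:a, load:b, process:b, which is the same interleaving as a loop that reads and then processes one key at a time. Only one key's text blocks are live at any moment.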
Also, the keys are relatively evenly distributed across Executors.

The question is: why is this still so slow?

I would appreciate any suggestions on where to focus my search.

Thank you,
Oleg

On 6 June 2014 16:24, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:

> Hi All,
>
> I am passing Java static methods into RDD transformations map and
> mapValues. The first map is from a simple string K into a (K,V) where V is
> a Java ArrayList of large text strings, 50K each, read from Cassandra.
> MapValues does processing of these text blocks into very small ArrayLists.
>
> The code runs quite slow compared to running it in parallel on the same
> servers from plain Java.
>
> I gave the same heap to Executors and Java. Does java run slower under
> Spark or do I suffer from excess heap pressure or am I missing something?
>
> Thank you for any insight,
> Oleg
>
>

--
Kind regards,

Oleg