subject:"Re\: Kafka \+ Spark streaming, RDD partitions not processed in parallel"

RE: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-14 Thread Mukul Gupta

Thanks, Behavior is now clear to me.

I tried with "foreachRDD" and indeed all partitions are being processed in 
parallel.
I also tried using "saveAsTextFile" instead of print and  again all partitions 
were processed in parallel.

-Original Message-
From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: Monday, March 14, 2016 9:39 PM
To: Mukul Gupta 
Cc: user@spark.apache.org
Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel

So what's happening here is that print() uses take().  Take() will try to 
satisfy the request using only the first partition of the rdd, then use other 
partitions if necessary.

If you change to using something like foreach

processed.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD it) {
  it.foreach(new VoidFunction() {
  @Override
  public void call(String s) {
System.err.println(s);
  }
});
}
  });

you'll see 3 or 5 or however many partitions being processed simultaneously.

As an aside, if you don't have much experience with or investment in java, I'd 
highly recommend you use scala for interacting with spark.
Most of the code is written in scala, it's more concise, and it's easier to use 
the spark shell to experiment when you're learning.


On Sun, Mar 13, 2016 at 7:03 AM, Mukul Gupta  wrote:
> Sorry for the late reply. I am new to Java and it took me a while to set 
> things up.
>
> Yes, you are correct that kafka client libs need not be specifically added. I 
> didn't realized that . I removed the same and code still compiled. However, 
> upon execution, I encountered the same issue as before.
>
> Following is the link to repository:
> https://github.com/guptamukul/sparktest.git
>
> 
> From: Cody Koeninger 
> Sent: 11 March 2016 23:04
> To: Mukul Gupta
> Cc: user@spark.apache.org
> Subject: Re: Kafka + Spark streaming, RDD partitions not processed in
> parallel
>
> Why are you including a specific dependency on Kafka?  Spark's
> external streaming kafka module already depends on kafka.
>
> Can you link to an actual repo with build file etc?
>
> On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta  wrote:
>> Please note that while building jar of code below, i used spark 1.6.0
>> + kafka 0.9.0.0 libraries I also tried spark 1.5.0 + kafka 0.9.0.1 
>> combination, but encountered the same issue.
>>
>> I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
>> matches with spark and kafka versions installed on my machine) because while 
>> doing so, i get the following error at run time:
>> Exception in thread "main" java.lang.ClassCastException:
>> kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
>>
>> package sparktest;
>>
>> import java.util.Arrays;
>> import java.util.HashMap;
>> import java.util.HashSet;
>>
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.streaming.api.java.JavaDStream;
>> import org.apache.spark.api.java.function.Function;
>> import org.apache.spark.streaming.Durations;
>> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>> import org.apache.spark.streaming.kafka.KafkaUtils;
>> import kafka.serializer.StringDecoder; import scala.Tuple2;
>>
>> package sparktest;
>>
>> import java.util.Arrays;
>> import java.util.HashMap;
>> import java.util.HashSet;
>>
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.streaming.api.java.JavaDStream;
>> import org.apache.spark.api.java.function.Function;
>> import org.apache.spark.streaming.Durations;
>> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>> import org.apache.spark.streaming.kafka.KafkaUtils;
>> import kafka.serializer.StringDecoder; import scala.Tuple2;
>>
>> public class SparkTest {
>>
>> public static void main(String[] args) {
>>
>> if (args.length < 5) {
>> System.err.println("Usage: SparkTest  
>>   "); System.exit(1); }
>>
>> String kafkaBroker = args[0];
>> String sparkMaster = args[1];
>> String topics = args[2];
>> String consumerGroupID = args[3];
>> String durationSec = args[4];
>>
>> int duration = 0;
>>
>> try {
>> duration = Integ

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-14 Thread Cody Koeninger

So what's happening here is that print() uses take().  Take() will try
to satisfy the request using only the first partition of the rdd, then
use other partitions if necessary.

If you change to using something like foreach

processed.foreachRDD(new VoidFunction>() {
@Override
public void call(JavaRDD it) {
  it.foreach(new VoidFunction() {
  @Override
  public void call(String s) {
System.err.println(s);
  }
});
}
  });

you'll see 3 or 5 or however many partitions being processed simultaneously.

As an aside, if you don't have much experience with or investment in
java, I'd highly recommend you use scala for interacting with spark.
Most of the code is written in scala, it's more concise, and it's
easier to use the spark shell to experiment when you're learning.


On Sun, Mar 13, 2016 at 7:03 AM, Mukul Gupta  wrote:
> Sorry for the late reply. I am new to Java and it took me a while to set 
> things up.
>
> Yes, you are correct that kafka client libs need not be specifically added. I 
> didn't realized that . I removed the same and code still compiled. However, 
> upon execution, I encountered the same issue as before.
>
> Following is the link to repository:
> https://github.com/guptamukul/sparktest.git
>
> 
> From: Cody Koeninger 
> Sent: 11 March 2016 23:04
> To: Mukul Gupta
> Cc: user@spark.apache.org
> Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel
>
> Why are you including a specific dependency on Kafka?  Spark's
> external streaming kafka module already depends on kafka.
>
> Can you link to an actual repo with build file etc?
>
> On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta  wrote:
>> Please note that while building jar of code below, i used spark 1.6.0 + 
>> kafka 0.9.0.0 libraries
>> I also tried spark 1.5.0 + kafka 0.9.0.1 combination, but encountered the 
>> same issue.
>>
>> I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
>> matches with spark and kafka versions installed on my machine) because while 
>> doing so, i get the following error at run time:
>> Exception in thread "main" java.lang.ClassCastException: 
>> kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
>>
>> package sparktest;
>>
>> import java.util.Arrays;
>> import java.util.HashMap;
>> import java.util.HashSet;
>>
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.streaming.api.java.JavaDStream;
>> import org.apache.spark.api.java.function.Function;
>> import org.apache.spark.streaming.Durations;
>> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>> import org.apache.spark.streaming.kafka.KafkaUtils;
>> import kafka.serializer.StringDecoder;
>> import scala.Tuple2;
>>
>> package sparktest;
>>
>> import java.util.Arrays;
>> import java.util.HashMap;
>> import java.util.HashSet;
>>
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.streaming.api.java.JavaDStream;
>> import org.apache.spark.api.java.function.Function;
>> import org.apache.spark.streaming.Durations;
>> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
>> import org.apache.spark.streaming.api.java.JavaStreamingContext;
>> import org.apache.spark.streaming.kafka.KafkaUtils;
>> import kafka.serializer.StringDecoder;
>> import scala.Tuple2;
>>
>> public class SparkTest {
>>
>> public static void main(String[] args) {
>>
>> if (args.length < 5) {
>> System.err.println("Usage: SparkTest
>>  ");
>> System.exit(1);
>> }
>>
>> String kafkaBroker = args[0];
>> String sparkMaster = args[1];
>> String topics = args[2];
>> String consumerGroupID = args[3];
>> String durationSec = args[4];
>>
>> int duration = 0;
>>
>> try {
>> duration = Integer.parseInt(durationSec);
>> } catch (Exception e) {
>> System.err.println("Illegal duration");
>> System.exit(1);
>> }
>>
>> HashSet topicsSet = new 
>> HashSet(Arrays.asList(topics.split(",")));
>>
>> SparkConf  conf = new 
>> SparkConf().setMaster(sparkMaster).setAppName("DirectStreamDemo");
>>
>> JavaStreamingContext jssc = new JavaStr

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-13 Thread Mukul Gupta

Sorry for the late reply. I am new to Java and it took me a while to set things 
up.

Yes, you are correct that kafka client libs need not be specifically added. I 
didn't realized that . I removed the same and code still compiled. However, 
upon execution, I encountered the same issue as before.

Following is the link to repository:
https://github.com/guptamukul/sparktest.git


From: Cody Koeninger 
Sent: 11 March 2016 23:04
To: Mukul Gupta
Cc: user@spark.apache.org
Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel

Why are you including a specific dependency on Kafka?  Spark's
external streaming kafka module already depends on kafka.

Can you link to an actual repo with build file etc?

On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta  wrote:
> Please note that while building jar of code below, i used spark 1.6.0 + kafka 
> 0.9.0.0 libraries
> I also tried spark 1.5.0 + kafka 0.9.0.1 combination, but encountered the 
> same issue.
>
> I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
> matches with spark and kafka versions installed on my machine) because while 
> doing so, i get the following error at run time:
> Exception in thread "main" java.lang.ClassCastException: 
> kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
>
> package sparktest;
>
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.HashSet;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.streaming.api.java.JavaDStream;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.kafka.KafkaUtils;
> import kafka.serializer.StringDecoder;
> import scala.Tuple2;
>
> package sparktest;
>
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.HashSet;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.streaming.api.java.JavaDStream;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.kafka.KafkaUtils;
> import kafka.serializer.StringDecoder;
> import scala.Tuple2;
>
> public class SparkTest {
>
> public static void main(String[] args) {
>
> if (args.length < 5) {
> System.err.println("Usage: SparkTest
>  ");
> System.exit(1);
> }
>
> String kafkaBroker = args[0];
> String sparkMaster = args[1];
> String topics = args[2];
> String consumerGroupID = args[3];
> String durationSec = args[4];
>
> int duration = 0;
>
> try {
> duration = Integer.parseInt(durationSec);
> } catch (Exception e) {
> System.err.println("Illegal duration");
> System.exit(1);
> }
>
> HashSet topicsSet = new 
> HashSet(Arrays.asList(topics.split(",")));
>
> SparkConf  conf = new 
> SparkConf().setMaster(sparkMaster).setAppName("DirectStreamDemo");
>
> JavaStreamingContext jssc = new JavaStreamingContext(conf, 
> Durations.seconds(duration));
>
> HashMap kafkaParams = new HashMap();
> kafkaParams.put("metadata.broker.list", kafkaBroker);
> kafkaParams.put("group.id", consumerGroupID);
>
> JavaPairInputDStream messages = 
> KafkaUtils.createDirectStream(jssc, String.class, String.class,
> StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);
>
> JavaDStream processed = messages.map(new Function String>, String>() {
>
> @Override
> public String call(Tuple2 arg0) throws Exception {
>
> Thread.sleep(7000);
> return arg0._2;
> }
> });
>
> processed.print(90);
>
> try {
> jssc.start();
> jssc.awaitTermination();
> } catch (Exception e) {
>
> } finally {
> jssc.close();
> }
> }
> }
>
>
> 
> From: Cody Koeninger 
> Sent: 11 March 2016 20:42
> To: Mukul Gupta
> Cc: user@spark.apache.org
> Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel
>
> Can you post your actual code?
>
> On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta  wrote:
>> Hi All, I was running the following test: Setup 9 VM runing spark workers
>> with 1 spark executor each. 1 VM running kafka and spark master. Spark
>> version is 1.6.0 Kafka version is 0.9.0.1 Spark is using its own resource
>> manager and is not running over YARN. Test I created a kafka topic with 3
>&

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger

Why are you including a specific dependency on Kafka?  Spark's
external streaming kafka module already depends on kafka.

Can you link to an actual repo with build file etc?

On Fri, Mar 11, 2016 at 11:21 AM, Mukul Gupta  wrote:
> Please note that while building jar of code below, i used spark 1.6.0 + kafka 
> 0.9.0.0 libraries
> I also tried spark 1.5.0 + kafka 0.9.0.1 combination, but encountered the 
> same issue.
>
> I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
> matches with spark and kafka versions installed on my machine) because while 
> doing so, i get the following error at run time:
> Exception in thread "main" java.lang.ClassCastException: 
> kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
>
> package sparktest;
>
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.HashSet;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.streaming.api.java.JavaDStream;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.kafka.KafkaUtils;
> import kafka.serializer.StringDecoder;
> import scala.Tuple2;
>
> package sparktest;
>
> import java.util.Arrays;
> import java.util.HashMap;
> import java.util.HashSet;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.streaming.api.java.JavaDStream;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.api.java.JavaPairInputDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.kafka.KafkaUtils;
> import kafka.serializer.StringDecoder;
> import scala.Tuple2;
>
> public class SparkTest {
>
> public static void main(String[] args) {
>
> if (args.length < 5) {
> System.err.println("Usage: SparkTest
>  ");
> System.exit(1);
> }
>
> String kafkaBroker = args[0];
> String sparkMaster = args[1];
> String topics = args[2];
> String consumerGroupID = args[3];
> String durationSec = args[4];
>
> int duration = 0;
>
> try {
> duration = Integer.parseInt(durationSec);
> } catch (Exception e) {
> System.err.println("Illegal duration");
> System.exit(1);
> }
>
> HashSet topicsSet = new 
> HashSet(Arrays.asList(topics.split(",")));
>
> SparkConf  conf = new 
> SparkConf().setMaster(sparkMaster).setAppName("DirectStreamDemo");
>
> JavaStreamingContext jssc = new JavaStreamingContext(conf, 
> Durations.seconds(duration));
>
> HashMap kafkaParams = new HashMap();
> kafkaParams.put("metadata.broker.list", kafkaBroker);
> kafkaParams.put("group.id", consumerGroupID);
>
> JavaPairInputDStream messages = 
> KafkaUtils.createDirectStream(jssc, String.class, String.class,
> StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);
>
> JavaDStream processed = messages.map(new Function String>, String>() {
>
> @Override
> public String call(Tuple2 arg0) throws Exception {
>
> Thread.sleep(7000);
> return arg0._2;
> }
> });
>
> processed.print(90);
>
> try {
> jssc.start();
> jssc.awaitTermination();
> } catch (Exception e) {
>
> } finally {
> jssc.close();
> }
> }
> }
>
>
> 
> From: Cody Koeninger 
> Sent: 11 March 2016 20:42
> To: Mukul Gupta
> Cc: user@spark.apache.org
> Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel
>
> Can you post your actual code?
>
> On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta  wrote:
>> Hi All, I was running the following test: Setup 9 VM runing spark workers
>> with 1 spark executor each. 1 VM running kafka and spark master. Spark
>> version is 1.6.0 Kafka version is 0.9.0.1 Spark is using its own resource
>> manager and is not running over YARN. Test I created a kafka topic with 3
>> partition. next I used "KafkaUtils.createDirectStream" to get a DStream.
>> JavaPairInputDStream stream =
>> KafkaUtils.createDirectStream(…); JavaDStream stream1 = stream.map(func1);
>> stream1.print(); where func1 just contains a sleep followed by returning of
>> value. Observation First RDD partition corresponding to partition 1 of kafka
>> was processed on one of the spark executor. Once processing is finished,
>> then RDD partitions corresponding to remaining two kafka partitions were
>> processed in parallel on different spark

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Mukul Gupta

Please note that while building jar of code below, i used spark 1.6.0 + kafka 
0.9.0.0 libraries
I also tried spark 1.5.0 + kafka 0.9.0.1 combination, but encountered the same 
issue.

I could not use the ideal combination spark 1.6.0 + kafka 0.9.0.1 (which 
matches with spark and kafka versions installed on my machine) because while 
doing so, i get the following error at run time:
Exception in thread "main" java.lang.ClassCastException: 
kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker

package sparktest;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.StringDecoder;
import scala.Tuple2;

package sparktest;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.StringDecoder;
import scala.Tuple2;

public class SparkTest {

public static void main(String[] args) {

if (args.length < 5) {
System.err.println("Usage: SparkTest
 ");
System.exit(1);
}

String kafkaBroker = args[0];
String sparkMaster = args[1];
String topics = args[2];
String consumerGroupID = args[3];
String durationSec = args[4];

int duration = 0;

try {
duration = Integer.parseInt(durationSec);
} catch (Exception e) {
System.err.println("Illegal duration");
System.exit(1);
}

HashSet topicsSet = new 
HashSet(Arrays.asList(topics.split(",")));

SparkConf  conf = new 
SparkConf().setMaster(sparkMaster).setAppName("DirectStreamDemo");

JavaStreamingContext jssc = new JavaStreamingContext(conf, 
Durations.seconds(duration));

HashMap kafkaParams = new HashMap();
kafkaParams.put("metadata.broker.list", kafkaBroker);
kafkaParams.put("group.id", consumerGroupID);

JavaPairInputDStream messages = 
KafkaUtils.createDirectStream(jssc, String.class, String.class,
StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);

JavaDStream processed = messages.map(new Function, String>() {

@Override
public String call(Tuple2 arg0) throws Exception {

Thread.sleep(7000);
return arg0._2;
}
});

processed.print(90);

try {
jssc.start();
jssc.awaitTermination();
} catch (Exception e) {

} finally {
jssc.close();
}
}
}



From: Cody Koeninger 
Sent: 11 March 2016 20:42
To: Mukul Gupta
Cc: user@spark.apache.org
Subject: Re: Kafka + Spark streaming, RDD partitions not processed in parallel

Can you post your actual code?

On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta  wrote:
> Hi All, I was running the following test: Setup 9 VM runing spark workers
> with 1 spark executor each. 1 VM running kafka and spark master. Spark
> version is 1.6.0 Kafka version is 0.9.0.1 Spark is using its own resource
> manager and is not running over YARN. Test I created a kafka topic with 3
> partition. next I used "KafkaUtils.createDirectStream" to get a DStream.
> JavaPairInputDStream stream =
> KafkaUtils.createDirectStream(…); JavaDStream stream1 = stream.map(func1);
> stream1.print(); where func1 just contains a sleep followed by returning of
> value. Observation First RDD partition corresponding to partition 1 of kafka
> was processed on one of the spark executor. Once processing is finished,
> then RDD partitions corresponding to remaining two kafka partitions were
> processed in parallel on different spark executors. I expected that all
> three RDD partitions should have been processed in parallel as there were
> spark executors available which were lying idle. I re-ran the test after
> increasing the partitions of kafka topic to 5. This time also RDD partition
> corresponding to partition 1 of kafka was processed on one of the spark
> executor. Once processing is finished for this RDD partition, then RDD
> partitions corresponding to remaining four kafka partitions were processed
> in parallel on different spark executors. I am not clear about why spark is
> waiting for operations on first RDD partition to finish, while it could
> process remaining partitions in parallel? Am I missing any configuration?
> Any help is appreciated. Thanks, Mukul
> 
> View this message in context: Ka

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-11 Thread Cody Koeninger

Can you post your actual code?

On Thu, Mar 10, 2016 at 9:55 PM, Mukul Gupta  wrote:
> Hi All, I was running the following test: Setup 9 VM runing spark workers
> with 1 spark executor each. 1 VM running kafka and spark master. Spark
> version is 1.6.0 Kafka version is 0.9.0.1 Spark is using its own resource
> manager and is not running over YARN. Test I created a kafka topic with 3
> partition. next I used "KafkaUtils.createDirectStream" to get a DStream.
> JavaPairInputDStream stream =
> KafkaUtils.createDirectStream(…); JavaDStream stream1 = stream.map(func1);
> stream1.print(); where func1 just contains a sleep followed by returning of
> value. Observation First RDD partition corresponding to partition 1 of kafka
> was processed on one of the spark executor. Once processing is finished,
> then RDD partitions corresponding to remaining two kafka partitions were
> processed in parallel on different spark executors. I expected that all
> three RDD partitions should have been processed in parallel as there were
> spark executors available which were lying idle. I re-ran the test after
> increasing the partitions of kafka topic to 5. This time also RDD partition
> corresponding to partition 1 of kafka was processed on one of the spark
> executor. Once processing is finished for this RDD partition, then RDD
> partitions corresponding to remaining four kafka partitions were processed
> in parallel on different spark executors. I am not clear about why spark is
> waiting for operations on first RDD partition to finish, while it could
> process remaining partitions in parallel? Am I missing any configuration?
> Any help is appreciated. Thanks, Mukul
> 
> View this message in context: Kafka + Spark streaming, RDD partitions not
> processed in parallel
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Kafka + Spark streaming, RDD partitions not processed in parallel

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

Re: Kafka + Spark streaming, RDD partitions not processed in parallel

6 matches

Site Navigation

Mail list logo

Footer information