I am a PhD student at Ohio State working on a study to evaluate streaming
frameworks (Spark Streaming, Storm, Flink) using the Intel HiBench
benchmarks, but I think I am running into a problem with Spark. I have a
Spark Streaming application which I am trying to run on a 5-node cluster
(including the master), with 2 ZooKeeper nodes and 4 Kafka brokers.
However, whenever I run the Spark Streaming application I encounter the
following error:

java.lang.IllegalArgumentException: requirement failed: numRecords must not be negative
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.streaming.scheduler.StreamInputInfo.<init>(InputInfoTracker.scala:38)
        at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
        at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
        at scala.Option.orElse(Option.scala:289)

The application starts fine, but as soon as the Kafka producers start
emitting data, I begin receiving the above error repeatedly.
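
For context, the consumer side is wired up roughly as follows. This is a
minimal sketch, and the app name, batch interval, broker list, and topic
name are placeholders rather than my exact HiBench configuration:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamBench {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HiBenchStreaming")  // placeholder app name
    val ssc  = new StreamingContext(conf, Seconds(1))          // placeholder batch interval

    // Direct (receiver-less) Kafka stream; this is the code path that raises
    // the error above (DirectKafkaInputDStream.compute -> StreamInputInfo).
    val kafkaParams = Map("metadata.broker.list" ->
      "kafka1:9092,kafka2:9092,kafka3:9092,kafka4:9092")       // placeholder brokers
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("identity"))                       // placeholder topic

    stream.map(_._2).count().print()  // trivial downstream stage for illustration

    ssc.start()
    ssc.awaitTermination()
  }
}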

I have tried removing the Spark Streaming checkpoint files, as suggested
in similar posts on the internet. However, the problem persists even when
I create a Kafka topic and start its consumer Spark Streaming application
for the first time, so I don't believe the issue is offset related either.
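
To rule out checkpoint recovery as a source of stale offsets, I also clear
the checkpoint directory between runs. The context is created with the
usual getOrCreate pattern, roughly like this (again a sketch; the
checkpoint path and app name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// If this directory still holds state from an earlier run, getOrCreate
// restores the old offsets instead of building a fresh context, which is
// why I delete it before retesting.
val checkpointDir = "hdfs:///user/hibench/spark-checkpoint"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("HiBenchStreaming")  // placeholder app name
  val ssc  = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // DStream definitions (as in the sketch above) would go here.
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
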
The application does seem to process the stream correctly, as far as I can
tell from the benchmark numbers it generates. However, those numbers are
way off from what I got for Storm and Flink, which leads me to believe
that something is wrong with the pipeline and Spark is not processing the
stream as cleanly as it should. Any help in this regard would be really
appreciated.
