Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
Exception comes when client has so many connections to some another
external server also.
So I think Exception is coming because of client side issue only- server
side there is no issue.


Want to understand is executor(simple consumer) not making new connection
to kafka broker at start of each task ? Or is it created once only and that
is getting closed somehow ?

On Sat, Aug 22, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com
wrote:

 it comes at start of each tasks when there is new data inserted in kafka.(
 data inserted is very few)
 kafka topic has 300 partitions - data inserted is ~10 MB.

 Tasks gets failed and it retries which succeed and after certain no of
 fail tasks it kills the job.




 On Sat, Aug 22, 2015 at 2:08 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 That looks like you are choking your kafka machine. Do a top on the kafka
 machines and see the workload, it may happen that you are spending too much
 time on disk io etc.
 On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote:

 Sounds like that's happening consistently, not an occasional network
 problem?

 Look at the Kafka broker logs

 Make sure you've configured the correct kafka broker hosts / ports (note
 that direct stream does not use zookeeper host / port).

 Make sure that host / port is reachable from your driver and worker
 nodes, ie telnet or netcat to it.  It looks like your driver can reach it
 (since there's partition info in the logs), but that doesn't mean the
 worker can.

 Use lsof / netstat to see what's going on with those ports while the job
 is running, or tcpdump if you need to.

 If you can't figure out what's going on from a networking point of view,
 post a minimal reproducible code sample that demonstrates the issue, so it
 can be tested in a different environment.





 On Fri, Aug 21, 2015 at 4:06 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 Hi


 Getting below error in spark streaming 1.3 while consuming from kafka 
 using directkafka stream. Few of tasks are getting failed in each run.


 What is the reason /solution of this error?


 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in 
 stage 130.0 (TID 16332)
 java.io.EOFException: Received -1 when reading from channel, socket has 
 likely been closed.
at kafka.utils.Utils$.read(Utils.scala:376)
at 
 kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at 
 kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:81)
at 
 kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)
at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
 15/08/21 08:54:54 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 16348
 15/08/21 08:54:54 INFO executor.Executor: Running task 260.1 in stage 
 130.0 (TID 16348)
 15/08/21 08:54:54 INFO kafka.KafkaRDD: Computing topic 
 test_hbrealtimeevents, partition 75 

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-22 Thread satish chandra j
HI All,
Currently using DSE 4.7 and Spark 1.2.2 version

Regards,
Satish

On Fri, Aug 21, 2015 at 7:30 PM, java8964 java8...@hotmail.com wrote:

 What version of Spark you are using, or comes with DSE 4.7?

 We just cannot reproduce it in Spark.

 yzhang@localhost$ more test.spark
 val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
 pairs.reduceByKey((x,y) = x + y).collect
 yzhang@localhost$ ~/spark/bin/spark-shell --master local -i test.spark
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.3.1
   /_/

 Using Scala version 2.10.4
 Spark context available as sc.
 SQL context available as sqlContext.
 Loading test.spark...
 pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at
 makeRDD at console:21
 15/08/21 09:58:51 WARN SizeEstimator: Failed to check whether
 UseCompressedOops is set; assuming yes
 res0: Array[(Int, Int)] = Array((0,3), (1,50), (2,40))

 Yong


 --
 Date: Fri, 21 Aug 2015 19:24:09 +0530
 Subject: Re: Transformation not happening for reduceByKey or GroupByKey
 From: jsatishchan...@gmail.com
 To: abhis...@tetrationanalytics.com
 CC: user@spark.apache.org


 HI Abhishek,

 I have even tried that but rdd2 is empty

 Regards,
 Satish

 On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh 
 abhis...@tetrationanalytics.com wrote:

 You had:

  RDD.reduceByKey((x,y) = x+y)
  RDD.take(3)

 Maybe try:

  rdd2 = RDD.reduceByKey((x,y) = x+y)
  rdd2.take(3)

 -Abhishek-

 On Aug 20, 2015, at 3:05 AM, satish chandra j jsatishchan...@gmail.com
 wrote:

  HI All,
  I have data in RDD as mentioned below:
 
  RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40))
 
 
  I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function
 on Values for each key
 
  Code:
  RDD.reduceByKey((x,y) = x+y)
  RDD.take(3)
 
  Result in console:
  RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey
 at console:73
  res:Array[(Int,Int)] = Array()
 
  Command as mentioned
 
  dse spark --master local --jars postgresql-9.4-1201.jar -i  ScriptFile
 
 
  Please let me know what is missing in my code, as my resultant Array is
 empty
 
 
 
  Regards,
  Satish
 





Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Akhil Das
Hmm for a singl core VM you will have to run it in local mode(specifying
master= local[4]). The flag is available in all the versions of spark i
guess.
On Aug 22, 2015 5:04 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote:

 Thanks Akhil. Does this mean that the executor running in the VM can spawn
 two concurrent jobs on the same core? If this is the case, this is what we
 are looking for. Also, which version of Spark is this flag in?

 Thanks,
 Sateesh

 On Sat, Aug 22, 2015 at 1:44 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 You can look at the spark.streaming.concurrentJobs by default it runs a
 single job. If set it to 2 then it can run 2 jobs parallely. Its an
 experimental flag, but go ahead and give it a try.
 On Aug 21, 2015 3:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Hi,

 My scenario goes like this:
 I have an algorithm running in Spark streaming mode on a 4 core virtual
 machine. Majority of the time, the algorithm does disk I/O and database
 I/O. Question is, during the I/O, where the CPU is not considerably loaded,
 is it possible to run any other task/thread so as to efficiently utilize
 the CPU?

 Note that one DStream of the algorithm runs completely on a single CPU

 Thank you,
 Sateesh





sparkStreaming how to work with partitions,how tp create partition

2015-08-22 Thread Gaurav Agarwal
1. how to work with partition in spark streaming from kafka
2. how to create partition in spark streaming from kafka



when i send the message from kafka topic having three partitions.

Spark will listen the message when i say kafkautils.createStream or
createDirectstSream have local[4]
Now i want to see if spark will create partitions when it receive
message from kafka using dstream, how and where ,prwhich method of
spark api i have to see to find out


Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Hi Rishitesh,

We are not using any RDD's to parallelize the processing and all of the
algorithm runs on a single core (and in a single thread). The parallelism
is done at the user level

The disk can be started in a separate IO, but then the executor will not be
able to take up more jobs, since thats how I believe Spark is designed by
default

On Sat, Aug 22, 2015 at 12:51 AM, Rishitesh Mishra rishi80.mis...@gmail.com
 wrote:

 Hi Sateesh,
 It is interesting to know , how did you determine that the Dstream runs on
 a single core. Did you mean receivers?

 Coming back to your question, could you not start disk io in a separate
 thread, so that the sceduler can go ahead and assign other tasks ?
 On 21 Aug 2015 16:06, Sateesh Kavuri sateesh.kav...@gmail.com wrote:

 Hi,

 My scenario goes like this:
 I have an algorithm running in Spark streaming mode on a 4 core virtual
 machine. Majority of the time, the algorithm does disk I/O and database
 I/O. Question is, during the I/O, where the CPU is not considerably loaded,
 is it possible to run any other task/thread so as to efficiently utilize
 the CPU?

 Note that one DStream of the algorithm runs completely on a single CPU

 Thank you,
 Sateesh




Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
On trying the consumer without external connections  or with low number of
external conections its working fine -

so doubt is how  socket got closed -

java.io.EOFException: Received -1 when reading from channel, socket
has likely been closed.



On Sat, Aug 22, 2015 at 7:24 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Can you try some other consumer and see if the issue still exists?
 On Aug 22, 2015 12:47 AM, Shushant Arora shushantaror...@gmail.com
 wrote:

 Exception comes when client has so many connections to some another
 external server also.
 So I think Exception is coming because of client side issue only- server
 side there is no issue.


 Want to understand is executor(simple consumer) not making new connection
 to kafka broker at start of each task ? Or is it created once only and that
 is getting closed somehow ?

 On Sat, Aug 22, 2015 at 9:41 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 it comes at start of each tasks when there is new data inserted in
 kafka.( data inserted is very few)
 kafka topic has 300 partitions - data inserted is ~10 MB.

 Tasks gets failed and it retries which succeed and after certain no of
 fail tasks it kills the job.




 On Sat, Aug 22, 2015 at 2:08 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 That looks like you are choking your kafka machine. Do a top on the
 kafka machines and see the workload, it may happen that you are spending
 too much time on disk io etc.
 On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote:

 Sounds like that's happening consistently, not an occasional network
 problem?

 Look at the Kafka broker logs

 Make sure you've configured the correct kafka broker hosts / ports
 (note that direct stream does not use zookeeper host / port).

 Make sure that host / port is reachable from your driver and worker
 nodes, ie telnet or netcat to it.  It looks like your driver can reach it
 (since there's partition info in the logs), but that doesn't mean the
 worker can.

 Use lsof / netstat to see what's going on with those ports while the
 job is running, or tcpdump if you need to.

 If you can't figure out what's going on from a networking point of
 view, post a minimal reproducible code sample that demonstrates the issue,
 so it can be tested in a different environment.





 On Fri, Aug 21, 2015 at 4:06 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 Hi


 Getting below error in spark streaming 1.3 while consuming from kafka 
 using directkafka stream. Few of tasks are getting failed in each run.


 What is the reason /solution of this error?


 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in 
 stage 130.0 (TID 16332)
 java.io.EOFException: Received -1 when reading from channel, socket has 
 likely been closed.
  at kafka.utils.Utils$.read(Utils.scala:376)
  at 
 kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
  at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
  at 
 kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
  at kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
  at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:81)
  at 
 kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
  at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
  at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
  at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
  at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
  at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
  at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
  at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
  at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
  at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
  at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)
  at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
  at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at 
 

subscribe

2015-08-22 Thread Lars Hermes

subscribe

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



pickling error with PySpark and Elasticsearch-py analyzer

2015-08-22 Thread pkphlam
Reposting my question from SO:
http://stackoverflow.com/questions/32161865/elasticsearch-analyze-not-compatible-with-spark-in-python

I'm using the elasticsearch-py client within PySpark using Python 3 and I'm
running into a problem using the analyze() function with ES in conjunction
with an RDD. In particular, each record in my RDD is a string of text and
I'm trying to analyze it to get out the token information, but I'm getting
an error when trying to use it within a map function in Spark.

For example, this works perfectly fine:

 from elasticsearch import Elasticsearch
 es = Elasticsearch()
 t = 'the quick brown fox'
 es.indices.analyze(text=t)['tokens'][0]

{'end_offset': 3,
'position': 1,
'start_offset': 0,
'token': 'the',
'type': 'ALPHANUM'}

However, when I try this:

 trdd = sc.parallelize(['the quick brown fox'])
 trdd.map(lambda x: es.indices.analyze(text=x)['tokens'][0]).collect()

I get a really really long error message related to pickling (Here's the end
of it):

(self, obj)109if'recursion'in.[0]:110=Could not pickle object as
excessively deep recursion required.-- 111 
picklePicklingErrormsg

  save_memoryviewself obj

: Could not pickle object as excessively deep recursion required.

raise.()112113def(,):PicklingError


I'm not sure what the error means. Am I doing something wrong? Is there a
way to map the ES analyze function onto records of an RDD?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pickling-error-with-PySpark-and-Elasticsearch-py-analyzer-tp24402.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Hi Akhil,

Think of the scenario as running a piece of code in normal Java with
multiple threads. Lets say there are 4 threads spawned by a Java process to
handle reading from database, some processing and storing to database. In
this process, while a thread is performing a database I/O, the CPU could
allow another thread to perform the processing, thus efficiently using the
resources.

Incase of Spark, while a node executor is running the same read from DB =
process data = store to DB, during the read from DB and store to DB
phase, the CPU is not given to other requests in queue, since the executor
will allocate the resources completely to the current ongoing request.

Does not flag spark.streaming.concurrentJobs enable this kind of scenario
or is there any other way to achieve what I am looking for

Thanks,
Sateesh

On Sat, Aug 22, 2015 at 7:26 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Hmm for a singl core VM you will have to run it in local mode(specifying
 master= local[4]). The flag is available in all the versions of spark i
 guess.
 On Aug 22, 2015 5:04 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Thanks Akhil. Does this mean that the executor running in the VM can
 spawn two concurrent jobs on the same core? If this is the case, this is
 what we are looking for. Also, which version of Spark is this flag in?

 Thanks,
 Sateesh

 On Sat, Aug 22, 2015 at 1:44 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 You can look at the spark.streaming.concurrentJobs by default it runs a
 single job. If set it to 2 then it can run 2 jobs parallely. Its an
 experimental flag, but go ahead and give it a try.
 On Aug 21, 2015 3:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Hi,

 My scenario goes like this:
 I have an algorithm running in Spark streaming mode on a 4 core virtual
 machine. Majority of the time, the algorithm does disk I/O and database
 I/O. Question is, during the I/O, where the CPU is not considerably loaded,
 is it possible to run any other task/thread so as to efficiently utilize
 the CPU?

 Note that one DStream of the algorithm runs completely on a single CPU

 Thank you,
 Sateesh





Re: Using spark streaming to load data from Kafka to HDFS

2015-08-22 Thread Xu (Simon) Chen
Last time I checked, Camus doesn't support storing data as parquet, which
is a deal breaker for me. Otherwise it works well for my Kafka topics with
low data volume.

I am currently using spark streaming to ingest data, generate semi-realtime
stats and publish to a dashboard, and dump full dataset into hdfs in
parquet at a longer interval. One problem is that storing parquet is
sometimes time consuming, and that cause delay of my regular
stats-generating tasks. I am thinking of splitting my streaming job into
two, one for parquet output and one for stats generation, but obviously
this would consume data from Kafka twice.

-Simon

On Wednesday, May 6, 2015, Rendy Bambang Junior rendy.b.jun...@gmail.com
wrote:

 Because using spark streaming looks like a lot simpler. Whats the
 difference between Camus and Kafka Streaming for this case? Why Camus excel?

 Rendy

 On Wed, May 6, 2015 at 2:15 PM, Saisai Shao sai.sai.s...@gmail.com
 javascript:_e(%7B%7D,'cvml','sai.sai.s...@gmail.com'); wrote:

 Also Kafka has a Hadoop consumer API for doing such things, please refer
 to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi


 2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com
 javascript:_e(%7B%7D,'cvml','afsan...@gmail.com');:

 why not try https://github.com/linkedin/camus - camus is kafka to HDFS
 pipeline

 On Tue, May 5, 2015 at 11:13 PM, Rendy Bambang Junior 
 rendy.b.jun...@gmail.com
 javascript:_e(%7B%7D,'cvml','rendy.b.jun...@gmail.com'); wrote:

 Hi all,

 I am planning to load data from Kafka to HDFS. Is it normal to use
 spark streaming to load data from Kafka to HDFS? What are concerns on doing
 this?

 There are no processing to be done by Spark, only to store data to HDFS
 from Kafka for storage and for further Spark processing

 Rendy







Re: subscribe

2015-08-22 Thread Brandon White
https://www.youtube.com/watch?v=umDr0mPuyQc

On Sat, Aug 22, 2015 at 8:01 AM, Ted Yu yuzhih...@gmail.com wrote:

 See http://spark.apache.org/community.html

 Cheers

 On Sat, Aug 22, 2015 at 2:51 AM, Lars Hermes 
 li...@hermes-it-consulting.de wrote:

 subscribe

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





spark 1.4.1 - LZFException

2015-08-22 Thread Yadid Ayzenberg



Hi All,

We have a spark standalone cluster running 1.4.1 and we are setting 
spark.io.compression.codec to lzf.
I have a long running interactive application which behaves as normal, 
but after a few days I get the following exception in multiple jobs. Any 
ideas on what could be causing this ?


Yadid



Job aborted due to stage failure: Task 27 in stage 286.0 failed 4 times, most 
recent failure: Lost task 27.3 in stage 286.0 (TID 516817, xx.yy.zz.ww): 
com.esotericsoftware.kryo.KryoException: com.ning.compress.lzf.LZFException: 
Corrupt input data, block did not start with 2 byte signature ('ZV') followed 
by type byte, 2-byte length)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:182)
at 
org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
at 
org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:200)
at 
org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:197)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:127)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.ning.compress.lzf.LZFException: Corrupt input data, block did 
not start with 2 byte signature ('ZV') followed by type byte, 2-byte length)
at 
com.ning.compress.lzf.ChunkDecoder._reportCorruptHeader(ChunkDecoder.java:267)
at 
com.ning.compress.lzf.impl.UnsafeChunkDecoder.decodeChunk(UnsafeChunkDecoder.java:55)
at 
com.ning.compress.lzf.LZFInputStream.readyBuffer(LZFInputStream.java:363)
at com.ning.compress.lzf.LZFInputStream.read(LZFInputStream.java:193)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:140)
... 37 more





how to migrate from spark 0.9 to spark 1.4

2015-08-22 Thread sai rakesh
currently i am using spark 0.9 on my data i wrote code in java for
sparksql.now i want to use spark 1.4 so how to do and what changes i have to
do for tables.i ahve .sql file,pom file,.py file. iam using s3 for storage



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-migrate-from-spark-0-9-to-spark-1-4-tp24403.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Worker Machine running out of disk for Long running Streaming process

2015-08-22 Thread Ashish Rangole
Interesting. TD, can you please throw some light on why this is and point
to  the relevant code in Spark repo. It will help in a better understanding
of things that can affect a long running streaming job.
On Aug 21, 2015 1:44 PM, Tathagata Das t...@databricks.com wrote:

 Could you periodically (say every 10 mins) run System.gc() on the driver.
 The cleaning up shuffles is tied to the garbage collection.


 On Fri, Aug 21, 2015 at 2:59 AM, gaurav sharma sharmagaura...@gmail.com
 wrote:

 Hi All,


 I have a 24x7 running Streaming Process, which runs on 2 hour windowed
 data

 The issue i am facing is my worker machines are running OUT OF DISK space

 I checked that the SHUFFLE FILES are not getting cleaned up.


 /log/spark-2b875d98-1101-4e61-86b4-67c9e71954cc/executor-5bbb53c1-cee9-4438-87a2-b0f2becfac6f/blockmgr-c905b93b-c817-4124-a774-be1e706768c1//00/shuffle_2739_5_0.data

 Ultimately the machines runs out of Disk Spac


 i read about *spark.cleaner.ttl *config param which what i can
 understand from the documentation, says cleans up all the metadata beyond
 the time limit.

 I went through https://issues.apache.org/jira/browse/SPARK-5836
 it says resolved, but there is no code commit

 Can anyone please throw some light on the issue.






Re: Spark streaming multi-tasking during I/O

2015-08-22 Thread Sateesh Kavuri
Thanks Akhil. Does this mean that the executor running in the VM can spawn
two concurrent jobs on the same core? If this is the case, this is what we
are looking for. Also, which version of Spark is this flag in?

Thanks,
Sateesh

On Sat, Aug 22, 2015 at 1:44 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 You can look at the spark.streaming.concurrentJobs by default it runs a
 single job. If set it to 2 then it can run 2 jobs parallely. Its an
 experimental flag, but go ahead and give it a try.
 On Aug 21, 2015 3:36 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Hi,

 My scenario goes like this:
 I have an algorithm running in Spark streaming mode on a 4 core virtual
 machine. Majority of the time, the algorithm does disk I/O and database
 I/O. Question is, during the I/O, where the CPU is not considerably loaded,
 is it possible to run any other task/thread so as to efficiently utilize
 the CPU?

 Note that one DStream of the algorithm runs completely on a single CPU

 Thank you,
 Sateesh




Re: spark streaming 1.3 kafka error

2015-08-22 Thread Shushant Arora
On trying the consumer without external connections  or with low
number of external conections its working fine -

so doubt is how  socket got closed -


15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in
stage 130.0 (TID 16332)
java.io.EOFException: Received -1 when reading from channel, socket
has likely been closed.
at kafka.utils.Utils$.read(Utils.scala:376)


is executor(simple consumer) not making new connection to kafka broker at
start of each task ? Or is it created once only and that is getting closed
somehow ?

On Sat, Aug 22, 2015 at 7:35 PM, Shushant Arora shushantaror...@gmail.com
wrote:

 On trying the consumer without external connections  or with low number of
 external conections its working fine -

 so doubt is how  socket got closed -

 java.io.EOFException: Received -1 when reading from channel, socket has 
 likely been closed.



 On Sat, Aug 22, 2015 at 7:24 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try some other consumer and see if the issue still exists?
 On Aug 22, 2015 12:47 AM, Shushant Arora shushantaror...@gmail.com
 wrote:

 Exception comes when client has so many connections to some another
 external server also.
 So I think Exception is coming because of client side issue only- server
 side there is no issue.


 Want to understand is executor(simple consumer) not making new
 connection to kafka broker at start of each task ? Or is it created once
 only and that is getting closed somehow ?

 On Sat, Aug 22, 2015 at 9:41 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 it comes at start of each tasks when there is new data inserted in
 kafka.( data inserted is very few)
 kafka topic has 300 partitions - data inserted is ~10 MB.

 Tasks gets failed and it retries which succeed and after certain no of
 fail tasks it kills the job.




 On Sat, Aug 22, 2015 at 2:08 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 That looks like you are choking your kafka machine. Do a top on the
 kafka machines and see the workload, it may happen that you are spending
 too much time on disk io etc.
 On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote:

 Sounds like that's happening consistently, not an occasional network
 problem?

 Look at the Kafka broker logs

 Make sure you've configured the correct kafka broker hosts / ports
 (note that direct stream does not use zookeeper host / port).

 Make sure that host / port is reachable from your driver and worker
 nodes, ie telnet or netcat to it.  It looks like your driver can reach it
 (since there's partition info in the logs), but that doesn't mean the
 worker can.

 Use lsof / netstat to see what's going on with those ports while the
 job is running, or tcpdump if you need to.

 If you can't figure out what's going on from a networking point of
 view, post a minimal reproducible code sample that demonstrates the 
 issue,
 so it can be tested in a different environment.





 On Fri, Aug 21, 2015 at 4:06 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 Hi


 Getting below error in spark streaming 1.3 while consuming from kafka 
 using directkafka stream. Few of tasks are getting failed in each run.


 What is the reason /solution of this error?


 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in 
 stage 130.0 (TID 16332)
 java.io.EOFException: Received -1 when reading from channel, socket has 
 likely been closed.
 at kafka.utils.Utils$.read(Utils.scala:376)
 at 
 kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
 at 
 kafka.network.Receive$class.readCompletely(Transmission.scala:56)
 at 
 kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
 at 
 kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
 at 
 kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:81)
 at 
 kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
 at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
 at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
 at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
 at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)
 at 

Re: subscribe

2015-08-22 Thread Ted Yu
See http://spark.apache.org/community.html

Cheers

On Sat, Aug 22, 2015 at 2:51 AM, Lars Hermes li...@hermes-it-consulting.de
wrote:

 subscribe

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How can I save the RDD result as Orcfile with spark1.3?

2015-08-22 Thread Ted Yu
In Spark 1.4, there was considerable refactoring around interaction with
Hive, such as SPARK-7491.

It would not be straight forward to port ORC support to 1.3

FYI

On Fri, Aug 21, 2015 at 10:21 PM, dong.yajun dongt...@gmail.com wrote:

 hi Ted,

 thanks for your reply, are there any other way to do this with spark 1.3?
 such as write the orcfile manually in foreachPartition method?

 On Sat, Aug 22, 2015 at 12:19 PM, Ted Yu yuzhih...@gmail.com wrote:

 ORC support was added in Spark 1.4
 See SPARK-2883

 On Fri, Aug 21, 2015 at 7:36 PM, dong.yajun dongt...@gmail.com wrote:

 Hi list,

 Is there a way to save the RDD result as Orcfile in spark1.3?  due to
 some reasons we can't upgrade our spark version to 1.4 now.

 --
 *Ric Dong*





 --
 *Ric Dong*




Re: spark streaming 1.3 kafka error

2015-08-22 Thread Akhil Das
Can you try some other consumer and see if the issue still exists?
On Aug 22, 2015 12:47 AM, Shushant Arora shushantaror...@gmail.com
wrote:

 Exception comes when client has so many connections to some another
 external server also.
 So I think Exception is coming because of client side issue only- server
 side there is no issue.


 Want to understand is executor(simple consumer) not making new connection
 to kafka broker at start of each task ? Or is it created once only and that
 is getting closed somehow ?

 On Sat, Aug 22, 2015 at 9:41 AM, Shushant Arora shushantaror...@gmail.com
  wrote:

 it comes at start of each tasks when there is new data inserted in
 kafka.( data inserted is very few)
 kafka topic has 300 partitions - data inserted is ~10 MB.

 Tasks gets failed and it retries which succeed and after certain no of
 fail tasks it kills the job.




 On Sat, Aug 22, 2015 at 2:08 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 That looks like you are choking your kafka machine. Do a top on the
 kafka machines and see the workload, it may happen that you are spending
 too much time on disk io etc.
 On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote:

 Sounds like that's happening consistently, not an occasional network
 problem?

 Look at the Kafka broker logs

 Make sure you've configured the correct kafka broker hosts / ports
 (note that direct stream does not use zookeeper host / port).

 Make sure that host / port is reachable from your driver and worker
 nodes, ie telnet or netcat to it.  It looks like your driver can reach it
 (since there's partition info in the logs), but that doesn't mean the
 worker can.

 Use lsof / netstat to see what's going on with those ports while the
 job is running, or tcpdump if you need to.

 If you can't figure out what's going on from a networking point of
 view, post a minimal reproducible code sample that demonstrates the issue,
 so it can be tested in a different environment.





 On Fri, Aug 21, 2015 at 4:06 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 Hi


 Getting below error in spark streaming 1.3 while consuming from kafka 
 using directkafka stream. Few of tasks are getting failed in each run.


 What is the reason /solution of this error?


 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in 
 stage 130.0 (TID 16332)
 java.io.EOFException: Received -1 when reading from channel, socket has 
 likely been closed.
   at kafka.utils.Utils$.read(Utils.scala:376)
   at 
 kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
   at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
   at 
 kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
   at kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
   at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:81)
   at 
 kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
   at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
   at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
   at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
   at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
   at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
   at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
   at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
   at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
   at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
   at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)
   at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162)
   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
   at 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 15/08/21 08:54:54 INFO executor.CoarseGrainedExecutorBackend: Got 
 assigned task 16348
 15/08/21 08:54:54 INFO executor.Executor: Running task 260.1 in 

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Dibyendu Bhattacharya
I think you also can give a try to this consumer :
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer in your
environment. This has been running fine for topic with large number of
Kafka partition (  200 ) like yours without any issue.. no issue with
connection as this consumer re-use kafka connection , and also can recover
from any failures ( network loss , Kafka leader goes down, ZK down etc ..).


Regards,
Dibyendu

On Sat, Aug 22, 2015 at 7:35 PM, Shushant Arora shushantaror...@gmail.com
wrote:

 On trying the consumer without external connections  or with low number of
 external conections its working fine -

 so doubt is how  socket got closed -

 java.io.EOFException: Received -1 when reading from channel, socket has 
 likely been closed.



 On Sat, Aug 22, 2015 at 7:24 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try some other consumer and see if the issue still exists?
 On Aug 22, 2015 12:47 AM, Shushant Arora shushantaror...@gmail.com
 wrote:

 Exception comes when client has so many connections to some another
 external server also.
 So I think Exception is coming because of client side issue only- server
 side there is no issue.


 Want to understand is executor(simple consumer) not making new
 connection to kafka broker at start of each task ? Or is it created once
 only and that is getting closed somehow ?

 On Sat, Aug 22, 2015 at 9:41 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 it comes at start of each tasks when there is new data inserted in
 kafka.( data inserted is very few)
 kafka topic has 300 partitions - data inserted is ~10 MB.

 Tasks gets failed and it retries which succeed and after certain no of
 fail tasks it kills the job.




 On Sat, Aug 22, 2015 at 2:08 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 That looks like you are choking your kafka machine. Do a top on the
 kafka machines and see the workload, it may happen that you are spending
 too much time on disk io etc.
 On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote:

 Sounds like that's happening consistently, not an occasional network
 problem?

 Look at the Kafka broker logs

 Make sure you've configured the correct kafka broker hosts / ports
 (note that direct stream does not use zookeeper host / port).

 Make sure that host / port is reachable from your driver and worker
 nodes, ie telnet or netcat to it.  It looks like your driver can reach it
 (since there's partition info in the logs), but that doesn't mean the
 worker can.

 Use lsof / netstat to see what's going on with those ports while the
 job is running, or tcpdump if you need to.

 If you can't figure out what's going on from a networking point of
 view, post a minimal reproducible code sample that demonstrates the 
 issue,
 so it can be tested in a different environment.





 On Fri, Aug 21, 2015 at 4:06 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 Hi


 Getting below error in spark streaming 1.3 while consuming from kafka 
 using directkafka stream. Few of tasks are getting failed in each run.


 What is the reason /solution of this error?


 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in 
 stage 130.0 (TID 16332)
 java.io.EOFException: Received -1 when reading from channel, socket has 
 likely been closed.
 at kafka.utils.Utils$.read(Utils.scala:376)
 at 
 kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
 at 
 kafka.network.Receive$class.readCompletely(Transmission.scala:56)
 at 
 kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
 at 
 kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
 at 
 kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:81)
 at 
 kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
 at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:108)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
 at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:108)
 at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
 at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:107)
 at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:150)
 at 
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162)
 at 
 

Re: spark streaming 1.3 kafka error

2015-08-22 Thread Cody Koeninger
To be perfectly clear, the direct kafka stream will also recover from any
failures, because it does the simplest thing possible - fail the task and
let spark retry it.

If you're consistently having socket closed problems on one task after
another, there's probably something else going on in your environment.
Shushant, none of your responses have indicated whether you've tried any of
the system level troubleshooting suggestions that have been made by various
people.

Also, if you have 300 partitions, and only 10mb of data, that is completely
unnecessary.  You're probably going to have lots of empty partitions, which
will have a negative effect on your runtime.

On Sat, Aug 22, 2015 at 11:28 AM, Dibyendu Bhattacharya 
dibyendu.bhattach...@gmail.com wrote:

 I think you also can give a try to this consumer :
 http://spark-packages.org/package/dibbhatt/kafka-spark-consumer in your
 environment. This has been running fine for topic with large number of
 Kafka partition (  200 ) like yours without any issue.. no issue with
 connection as this consumer re-use kafka connection , and also can recover
 from any failures ( network loss , Kafka leader goes down, ZK down etc ..).


 Regards,
 Dibyendu

 On Sat, Aug 22, 2015 at 7:35 PM, Shushant Arora shushantaror...@gmail.com
  wrote:

 On trying the consumer without external connections  or with low number
 of external conections its working fine -

 so doubt is how  socket got closed -

 java.io.EOFException: Received -1 when reading from channel, socket has 
 likely been closed.



 On Sat, Aug 22, 2015 at 7:24 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try some other consumer and see if the issue still exists?
 On Aug 22, 2015 12:47 AM, Shushant Arora shushantaror...@gmail.com
 wrote:

 Exception comes when client has so many connections to some another
 external server also.
 So I think Exception is coming because of client side issue only-
 server side there is no issue.


 Want to understand is executor(simple consumer) not making new
 connection to kafka broker at start of each task ? Or is it created once
 only and that is getting closed somehow ?

 On Sat, Aug 22, 2015 at 9:41 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 it comes at start of each tasks when there is new data inserted in
 kafka.( data inserted is very few)
 kafka topic has 300 partitions - data inserted is ~10 MB.

 Tasks gets failed and it retries which succeed and after certain no of
 fail tasks it kills the job.




 On Sat, Aug 22, 2015 at 2:08 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:

 That looks like you are choking your kafka machine. Do a top on the
 kafka machines and see the workload, it may happen that you are spending
 too much time on disk io etc.
 On Aug 21, 2015 7:32 AM, Cody Koeninger c...@koeninger.org wrote:

 Sounds like that's happening consistently, not an occasional network
 problem?

 Look at the Kafka broker logs

 Make sure you've configured the correct kafka broker hosts / ports
 (note that direct stream does not use zookeeper host / port).

 Make sure that host / port is reachable from your driver and worker
 nodes, ie telnet or netcat to it.  It looks like your driver can reach 
 it
 (since there's partition info in the logs), but that doesn't mean the
 worker can.

 Use lsof / netstat to see what's going on with those ports while the
 job is running, or tcpdump if you need to.

 If you can't figure out what's going on from a networking point of
 view, post a minimal reproducible code sample that demonstrates the 
 issue,
 so it can be tested in a different environment.





 On Fri, Aug 21, 2015 at 4:06 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 Hi


 Getting below error in spark streaming 1.3 while consuming from kafka 
 using directkafka stream. Few of tasks are getting failed in each run.


 What is the reason /solution of this error?


 15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in 
 stage 130.0 (TID 16332)
 java.io.EOFException: Received -1 when reading from channel, socket 
 has likely been closed.
at kafka.utils.Utils$.read(Utils.scala:376)
at 
 kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at 
 kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at 
 kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at 
 kafka.network.BlockingChannel.receive(BlockingChannel.scala:100)
at 
 kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:81)
at 
 kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:71)
at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:109)
at 
 kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:109)
at